proposal for handling xml -> chado mappings #15

Closed
bradfordcondon opened this issue Nov 27, 2018 · 4 comments

@bradfordcondon
Contributor

bradfordcondon commented Nov 27, 2018

child of #8

Here's the idea.

The XML parser is split into creating the base record, dbxrefs, linked records, and props. (and whatever other stuff we need).

The base record stuff is hard coded. We look for a hardcoded attribute for each column of the base record, with advanced logic to check for all possible attributes and use the "best" one.

dbxrefs are also hard-coded.

Linked records: I'm not there yet. Let's ignore them for now.

For everything else, the parser looks up the tag in an API. The API returns whether the tag should be ignored, added as a prop, or handled some other way.

We have a schema that stores:

ALL encountered tags. It keeps the tag name, the NCBI db type for that tag, and whether the tag is assigned to a term or not. If it's assigned, we store just the cvterm_id for easy lookup. We also keep a list of all the possible matching cvterms that aren't necessarily assigned (probably in a separate, mview-type table).
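A minimal sketch of what that schema and lookup could look like, in Python for illustration only (the module itself would not be written this way). The names `TagRecord`, `TagSchema`, and `Action` are hypothetical, and the in-memory dict stands in for the real tables.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Action(Enum):
    IGNORE = "ignore"
    ADD_PROP = "add_prop"


@dataclass
class TagRecord:
    tag_name: str                     # tag exactly as it appears in the XML
    ncbi_db: str                      # NCBI db type the tag was seen in
    cvterm_id: Optional[int] = None   # set once an admin assigns a term
    # Possible matches that are not (yet) assigned; the proposal suggests
    # keeping these in a separate table.
    candidate_cvterm_ids: List[int] = field(default_factory=list)


class TagSchema:
    """In-memory stand-in for the tag schema (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], TagRecord] = {}

    def register(self, tag_name: str, ncbi_db: str) -> TagRecord:
        # New tags start out unassigned; existing records are returned as-is.
        key = (ncbi_db, tag_name)
        return self._records.setdefault(key, TagRecord(tag_name, ncbi_db))

    def assign(self, tag_name: str, ncbi_db: str, cvterm_id: int) -> None:
        self.register(tag_name, ncbi_db).cvterm_id = cvterm_id

    def lookup(self, tag_name: str, ncbi_db: str) -> Tuple[Action, Optional[int]]:
        # Unknown or unassigned tags are ignored; assigned tags become props.
        record = self._records.get((ncbi_db, tag_name))
        if record is None or record.cvterm_id is None:
            return Action.IGNORE, None
        return Action.ADD_PROP, record.cvterm_id
```

The "something else" outcome is left out here because it isn't defined yet.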

How does the schema get populated? Read on...

schema population

We have a job that reads an XML file and compiles all the attribute tags: each tag is stored in the schema as unassigned. It then looks each one up in your chado.cvterm. All exact and "close enough" matches go in the possible matches schema. The admin then goes to an admin area and sees a list of all XML terms with matches. From there they can "assign" the attribute, which means that when the XML gets parsed for real, it will create a property. If an attribute is not assigned a term, it gets ignored. If no terms match an attribute, the admin is instructed to find one, with a button to automatically create a local term instead.

Furthermore, on install, we can hardcode some suggested attribute -> cvterm mappings. This is tricky because everyone's site is different, but maybe there are some attributes we would expect in ALL biosamples across plants, animals, fungi, etc.
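A rough sketch of the population job, in Python for illustration. It assumes attribute tags appear as `<Attribute attribute_name="...">` elements (an assumption about the XML, not something settled in this issue), that the cvterms arrive as a plain name -> cvterm_id dict standing in for a query against chado.cvterm, and that "close enough" means a `difflib` similarity cutoff, which is just one possible definition.

```python
import difflib
import xml.etree.ElementTree as ET


def collect_attribute_tags(xml_text):
    """Return the distinct attribute tag names found in the XML."""
    root = ET.fromstring(xml_text)
    names = {el.get("attribute_name") for el in root.iter("Attribute")}
    names.discard(None)  # elements without the attribute are skipped
    return sorted(names)


def possible_matches(tag, cvterms, cutoff=0.8):
    """Return cvterm ids whose names match the tag exactly or closely.

    cvterms maps cvterm name -> cvterm_id. Names are compared
    case-insensitively, so names differing only by case collapse to one.
    """
    lowered = {name.lower(): cvterm_id for name, cvterm_id in cvterms.items()}
    close = difflib.get_close_matches(tag.lower(), list(lowered), n=5, cutoff=cutoff)
    # get_close_matches returns the best match first, so an exact match leads.
    return [lowered[name] for name in close]
```

Every tag from `collect_attribute_tags` would be stored as unassigned, and the ids from `possible_matches` would go into the possible-matches table for the admin to review.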

When someone imports a new XML, it can be configured to ignore new attribute tags (but add them to its schema as an unmatched, ignored attribute) OR to abort the load -> the admin can then assign a term and re-attempt the load.
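The two import modes could be sketched like this (Python, illustrative only). `resolve_tags`, `UnassignedTagError`, and the `on_new` parameter are hypothetical names. Recording new tags before aborting is an assumption, made so that the admin has something to assign before re-attempting the load.

```python
class UnassignedTagError(Exception):
    """Raised in abort mode when the XML contains tags never seen before."""


def resolve_tags(tags, assigned, on_new="ignore"):
    """Decide which tags become props during an import.

    tags:     tag names found in the XML being imported
    assigned: dict of known tag -> cvterm_id (None means unassigned/ignored);
              updated in place with any newly encountered tags
    on_new:   "ignore" to skip new tags, "abort" to stop the load
    """
    props = {}
    new_tags = []
    for tag in tags:
        if tag not in assigned:
            new_tags.append(tag)
        elif assigned[tag] is not None:
            props[tag] = assigned[tag]

    # Assumption: new tags are recorded as unmatched in both modes.
    for tag in new_tags:
        assigned[tag] = None

    if new_tags and on_new == "abort":
        raise UnassignedTagError("new attribute tags: " + ", ".join(new_tags))
    return props
```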

@bradfordcondon
Contributor Author

eutils module goals

Assembly

Things to keep in mind about the analysis table:

program and programversion won't be available, and they are not nullable.

program + programversion + sourcename must be unique.

With this in mind, we'll make the sourcename the ACCESSION, and set program/programversion to eutils v 1.0.

Base:

name: assemblyname
description: assemblydescription
timeexecuted: we get this from either asmreleasedate_genbank or asmreleasedate_refseq. Use the earlier of the two?
sourcename: the accession, since it must be unique. Therefore, the AssemblyAccession tag.

missing

program - won't be available. -- use eutils
programversion - won't be available. -- use eutils
algorithm - null
sourceversion - null
sourceuri - null
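Putting the base and missing fields together, a sketch of the record we'd build (Python, illustrative only). It assumes the values have already been pulled out of the XML into a dict keyed by the tag names used above, that the release dates look like "2014/05/16 00:00" (an assumed format), and it takes the earlier of the two dates, which is still an open question.

```python
from datetime import datetime

DATE_FORMAT = "%Y/%m/%d %H:%M"  # assumed format of the release date tags


def build_analysis_record(summary):
    """Map extracted assembly values onto chado analysis columns."""
    dates = [
        datetime.strptime(summary[key], DATE_FORMAT)
        for key in ("asmreleasedate_genbank", "asmreleasedate_refseq")
        if summary.get(key)
    ]
    return {
        "name": summary["assemblyname"],
        "description": summary.get("assemblydescription"),
        # program/programversion are not nullable, so use fixed placeholders.
        "program": "eutils",
        "programversion": "1.0",
        # The accession keeps program + programversion + sourcename unique.
        "sourcename": summary["AssemblyAccession"],
        "timeexecuted": min(dates) if dates else None,
        "algorithm": None,
        "sourceversion": None,
        "sourceuri": None,
    }
```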

standard metadata

Encoded in the <Stats> tag inside <Meta>. For example:

<Stats>
  <Stat category="alt_loci_count" sequence_tag="all">0</Stat>
  <Stat category="total_length" sequence_tag="all">1373527118</Stat>
  <Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>

So the category is the term to look for. Note that some of these props can be happily ignored.
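A quick sketch of pulling the stats out (Python, illustrative only). Keying on both category and sequence_tag is my assumption so nothing is dropped if the same category shows up under different sequence tags. The `ignored` set is a hypothetical stand-in for whatever the tag schema says to skip.

```python
import xml.etree.ElementTree as ET

STATS_XML = """
<Stats>
  <Stat category="alt_loci_count" sequence_tag="all">0</Stat>
  <Stat category="total_length" sequence_tag="all">1373527118</Stat>
  <Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>
"""


def stats_to_props(stats_xml, ignored=frozenset()):
    """Return {(category, sequence_tag): value} for each <Stat> element."""
    root = ET.fromstring(stats_xml)
    props = {}
    for stat in root.iter("Stat"):
        category = stat.get("category")
        if category is None or category in ignored:
            continue  # no term to look up, or explicitly ignored
        props[(category, stat.get("sequence_tag"))] = (stat.text or "").strip()
    return props


props = stats_to_props(STATS_XML, ignored={"alt_loci_count"})
```

Values are kept as strings here; converting them is left to whatever writes the props.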

Additional metadata

There's a lot; this is a big one to tackle...

Project

The project table just has a name and a description, and the name is unique.

name - do we use Name or Title?
description - Description tag

Additional metadata

Some interesting ones are in the ProjectDescr tag:

<Relevance>
  <Agricultural>yes</Agricultural>
  <Evolution>yes</Evolution>
</Relevance>
<AnnotationSource>
  <Name>NCBI annotation pipeline</Name>
</AnnotationSource>

Others are in the ProjectType tag:

<RepliconSet>
  <Replicon order="1">
    <Type location="ePlastid">eChromosome</Type>
    <Name>CHL</Name>
    <Size units="Mb">0.155691</Size>
  </Replicon>
  <Count repliconType="eOther">1</Count>
</RepliconSet>
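One possible way to see what would need a mapping in nested metadata like this is to flatten it into path/value rows (Python, illustrative only). The `flatten` helper and the path notation are hypothetical, not something the module does today.

```python
import xml.etree.ElementTree as ET

REPLICON_XML = """
<RepliconSet>
  <Replicon order="1">
    <Type location="ePlastid">eChromosome</Type>
    <Name>CHL</Name>
    <Size units="Mb">0.155691</Size>
  </Replicon>
  <Count repliconType="eOther">1</Count>
</RepliconSet>
"""


def flatten(element, prefix=""):
    """Return (path, value) rows for an element's text, attributes, and children."""
    path = prefix + "/" + element.tag if prefix else element.tag
    rows = []
    text = (element.text or "").strip()
    if text:
        rows.append((path, text))
    for name, value in element.attrib.items():
        rows.append((path + "@" + name, value))
    for child in element:
        rows.extend(flatten(child, path))
    return rows


rows = flatten(ET.fromstring(REPLICON_XML))
```

Each distinct path would then be a candidate tag for the schema described in the first comment.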

Biosample

Look to the analysis expression loader for base mappings.

@bradfordcondon
Contributor Author

bradfordcondon commented Nov 28, 2018

@mpoelchau I think you'll want to talk about this with me, particularly the issue with mapping assemblies to analyses.

To clarify:

We don't have clear mappings for the following chado analysis fields:
https://github.com/NAL-i5K/tripal_eutils/tree/master/examples/assembly

I guess the alternative is to require these fields to be provided by the user when running the importer.

program - won't be available. -- use eutils
programversion - won't be available. -- use eutils
algorithm - null
sourceversion - null
sourceuri - null

Obviously the program should be the assembly software used, but that isn't reliably found in the XML (I don't see it in any of the examples I've assembled here:

@mpoelchau
Contributor

mpoelchau commented Nov 29, 2018

Right, I remember that we had to do some gymnastics to get the program from the NCBI ftp site. I have no idea why it's there and not in the eutils-supplied metadata.

If you look into our internal issue on this and search for ftp, you'll find the comments on this and our workaround:

https://gitlab.com/i5k_Workspace/workspace_roadmap/issues/533

@bradfordcondon
Contributor Author

It did not occur to me that we would have to do something so awful. OK, I'm making a child issue for this.

2 participants