Add nativeID output in mzIdentML/pepXML[/mzML/PIN] #324

Open
chambm opened this issue Jul 11, 2024 · 9 comments

@chambm

chambm commented Jul 11, 2024

To preserve Waters and Sciex source spectrum links, writing nativeID in the output pepXML/mzIdentML is necessary. Please read in the nativeID when reading spectra and pass it through when writing pepXML/mzIdentML elements for those spectra. It would be good to preserve it in mzML as well, but that's not as important. If the format allows, it might be helpful to write it in Percolator PIN format as well so it's simple to map PIN lines to the mzIdentML/pepXML equivalent.

Thanks!

@fcyu fcyu self-assigned this Jul 11, 2024
@fcyu
Member

fcyu commented Aug 17, 2024

Hi Matt,

Sorry for the long delay. I finally got a chance to implement this feature. If it is not too much trouble, could you share some typical Waters and Sciex mzML files with me for testing?

> If the format allows, it might be helpful to write it in Percolator PIN format as well so it's simple to map PIN lines to the mzIdentML/pepXML equivalent.

I don't want to change the SpecId column because many downstream tools parse that column. As far as I know, there is no additional column that can be used for the native ID. Let me know if the latest Percolator supports a native ID column.

Also, is there any harm in making the "index" not start from 0 and not be contiguous? I would like to use scan number - 1 as the index so that it stays consistent when the mzML file contains only a subset of the scans.

Thanks,

Fengchao

@chambm
Author

chambm commented Aug 20, 2024

I'm glad to hear this is almost done!

Unfortunately AFAIK the mzML index must be 0-based and contiguous:
https://peptideatlas.org/tmp/mzML1.1.0.html#spectrum
Usually if you make an mzML from a format where you don't have a nativeID, only a scan number, you would just make the nativeID like "scan=123" or "index=122". But you already have a real nativeID. The problem here is to map the mzML/pepXML to the PIN TSV, right?
https://github.com/percolator/percolator/wiki/Interface#pintsv-tab-delimited-file-format

As far as I can tell from that, there should be a string PsmId column and a numeric ScanNr column. It seems pretty typical for the ScanNr column to be missing though. I can understand not wanting to change the PsmId format you've been using, but that's really the only column suitable for the nativeID. :(

Maybe easiest would just be to guarantee that the number and order of lines in the pepXML is the same in the PIN?
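As a rough sketch of that idea: if the pepXML and the PIN were guaranteed to list PSMs in the same order, a downstream tool could join them by position. Everything below is hypothetical illustration data (the SpecId strings, the nativeIDs, and the two-column PIN), and it assumes exactly one PIN line per pepXML spectrum query.

```python
import csv
import io

# Hypothetical PIN content; only the SpecId and Label columns are shown.
pin_text = (
    "SpecId\tLabel\n"
    "run.00001.00001.2_1\t1\n"
    "run.00002.00002.3_1\t-1\n"
)

# Hypothetical nativeIDs, already read in document order from the pepXML.
pepxml_native_ids = [
    "function=1 process=0 scan=1",
    "function=1 process=0 scan=2",
]

pin_rows = list(csv.DictReader(io.StringIO(pin_text), delimiter="\t"))

# The positional join is only valid if both files have the same number of entries.
if len(pin_rows) != len(pepxml_native_ids):
    raise ValueError("PIN and pepXML do not contain the same number of PSMs")

# Pair the i-th PIN line with the i-th pepXML nativeID.
spec_id_to_native_id = {
    row["SpecId"]: native_id
    for row, native_id in zip(pin_rows, pepxml_native_ids)
}
```

The length check is the only safeguard here; a positional join cannot detect two files that have the same length but a different order.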

@chambm
Author

chambm commented Aug 20, 2024

Here's an example Waters DDA file.
010208_ecoli_003-dda2.zip

@fcyu
Member

fcyu commented Aug 20, 2024

Mapping the mzML/pepXML to the PIN file is actually OK as long as we have a consistent way to extract the scan number (from the native ID if it encodes a 1-D scan number, as Thermo's does, or index + 1 if it does not, as with Waters' and Sciex's). I asked because you wanted it. If it is OK not to have the native ID in the PIN file, I can leave it out.
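A minimal sketch of the extraction rule described above, not MSFragger's actual code. The function name and the regular expression are assumptions; the nativeID shapes in the comments follow the usual Thermo, Waters, and Sciex formats.

```python
import re

# Assumption: only the Thermo-style nativeID is treated as carrying a
# 1-D scan number, e.g. "controllerType=0 controllerNumber=1 scan=123".
_THERMO_NATIVE_ID = re.compile(
    r"^controllerType=\d+ controllerNumber=\d+ scan=(\d+)$"
)


def scan_number(native_id: str, index: int) -> int:
    """Return a 1-D scan number for one mzML spectrum.

    native_id is the spectrum's id attribute; index is its 0-based mzML index.
    """
    match = _THERMO_NATIVE_ID.match(native_id)
    if match:
        return int(match.group(1))
    # Waters ("function=... process=... scan=...") and Sciex
    # ("sample=... period=... cycle=... experiment=...") nativeIDs are
    # multi-dimensional, so fall back to index + 1.
    return index + 1
```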

The problem arises when an mzML file is a subset of the original mzML file and its native ID does not encode a 1-D scan number, as is the case for Waters and Sciex. Since scan number = index + 1, the scan numbers in the subset mzML differ from those in the original mzML, and it is hard to map across different tools. The options I can think of are not generating the subset mzML file at all, or making index = scan number - 1 (which would not start at 0 and would not be contiguous).

Maybe in the future the mzML schema could have a scan_number field where tools can put their own scan numbers. But then those tools would still need to support it.
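To make the subset problem concrete, here is a small sketch with hypothetical Waters-style nativeIDs. It applies the index + 1 rule to a four-spectrum file and to a subset that keeps only the function=2 spectra.

```python
# Hypothetical nativeIDs for a four-spectrum Waters-style file.
original = [
    "function=1 process=0 scan=1",
    "function=2 process=0 scan=1",
    "function=1 process=0 scan=2",
    "function=2 process=0 scan=2",
]

# A subset file that keeps only the function=2 spectra.
subset = [nid for nid in original if nid.startswith("function=2 ")]


def index_based_scan_numbers(native_ids):
    """Assign scan number = index + 1, as when the nativeID is not 1-D."""
    return {nid: i + 1 for i, nid in enumerate(native_ids)}


orig_scans = index_based_scan_numbers(original)
sub_scans = index_based_scan_numbers(subset)

# For each spectrum in the subset: (scan number in original, scan number in subset).
changed = {
    nid: (orig_scans[nid], sub_scans[nid])
    for nid in subset
    if orig_scans[nid] != sub_scans[nid]
}
```

Both surviving spectra end up in `changed`: the same nativeID gets a different scan number in each file, while the nativeID itself stays stable.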

Best,

Fengchao

@chambm
Author

chambm commented Aug 20, 2024

You could use a userParam. Those are arbitrary and basically unlimited.

But I think nativeID is specifically intended and useful for mapping across different tools, and for remaining valid when files are filtered or subsetted. It's why I started putting spectrumNativeID in my pepXML output, even though that wasn't an official attribute. :)
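For illustration, the userParam suggestion could look roughly like the following. This is a sketch only: the parameter name `scan_number` and its value are made up, and a real mzML spectrum element carries more children than shown here.

```python
import xml.etree.ElementTree as ET

# Skeleton of an mzML <spectrum> element; binary data and cvParams are omitted.
spectrum = ET.Element(
    "spectrum",
    {
        "index": "0",
        "id": "function=2 process=0 scan=2",  # the real nativeID is kept as-is
        "defaultArrayLength": "0",
    },
)

# Hypothetical userParam carrying a tool-defined scan number.
ET.SubElement(
    spectrum,
    "userParam",
    {"name": "scan_number", "type": "xsd:int", "value": "4"},
)

xml_text = ET.tostring(spectrum, encoding="unicode")
```

With this layout the index stays 0-based and contiguous, and the tool-specific number travels alongside the nativeID instead of replacing it.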

@fcyu
Member

fcyu commented Aug 20, 2024

> But I think nativeID is specifically intended and useful for mapping across different tools, and for remaining valid when files are filtered or subsetted. It's why I started putting spectrumNativeID in my pepXML output, even though that wasn't an official attribute. :)

Yes, but then I need to maintain native ID -> scan number and scan number -> native ID maps in all the tools that read mzML and raw files, because we index scans by a 1-D scan number.

Best,

Fengchao

@chambm
Author

chambm commented Aug 20, 2024

In my tools I made nativeID a field in the Spectrum class and had a map from nativeID to Spectrum*. When possible, I dropped scan number entirely because it wasn't universally applicable, and when not possible, I parsed it out of the nativeID (or used index if not parseable).
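A Python sketch of that design (the original tools presumably used C++, given the `Spectrum*` map). The class layout, the helper name, and the rule for parsing a scan number are assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Assumption: a scan number is "parseable" when the nativeID ends in scan=<n>.
_TRAILING_SCAN = re.compile(r"(?:^|\s)scan=(\d+)$")


@dataclass
class Spectrum:
    native_id: str  # primary identifier, stable across filtering and subsetting
    index: int      # 0-based position in the file

    @property
    def scan_number(self) -> int:
        """Parse the scan number from the nativeID, else fall back to the index."""
        match = _TRAILING_SCAN.search(self.native_id)
        return int(match.group(1)) if match else self.index


def build_native_id_map(spectra: List[Spectrum]) -> Dict[str, Spectrum]:
    """Map each nativeID to its Spectrum object."""
    return {spectrum.native_id: spectrum for spectrum in spectra}
```

For example, lookups go through the nativeID and the scan number is only derived on demand:

```python
spectra = [
    Spectrum("controllerType=0 controllerNumber=1 scan=123", 0),
    Spectrum("sample=1 period=1 cycle=10 experiment=2", 1),
]
by_native_id = build_native_id_map(spectra)
```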

@chambm
Author

chambm commented Sep 24, 2024

Hi Fengchao, has this change made it into a released MSFragger?

@fcyu
Member

fcyu commented Sep 24, 2024

For Thermo data, the spectrumNativeID is already in the pepXML file. For the others that require changing the scan indexing, I am trying to get it done before the next release.

Best,

Fengchao
