Fail to parse Tide pepXML #42

Open
wsnoble opened this issue Oct 25, 2022 · 9 comments

@wsnoble

wsnoble commented Oct 25, 2022

I tried to parse a Tide pepXML file, but failed. The error is "Failed to parse the PepXML file, please check your file." I suspect that this is because our format has changed since you first evaluated Tide's PepXML back at Crux v3.2. Can you take a look at the attached file and see if it's possible to support it, or if we need to make changes on our end?

plasmo-neighbors.trypsin-p.narrow.tide-search.pep.xml.txt.gz
MSB17171Trypsin030814.mgf.txt.gz

@wenbostar wenbostar added the Crux label Oct 25, 2022
@wenbostar
Owner

wenbostar commented Oct 25, 2022

The log file generated by PDV when loading the files shows there is a problem in spectra mapping between pepXML and the mgf files.

Tue Oct 25 12:14:52 PDT 2022: PDV-1.7.4
java.lang.IndexOutOfBoundsException: Index: 9729, Size: 8815
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at com.compomics.util.experiment.io.massspectrometry.MgfIndex.getSpectrumTitle(MgfIndex.java:239)
        at com.compomics.util.experiment.massspectrometry.SpectrumFactory.getSpectrumTitle(SpectrumFactory.java:994)
        at PDVGUI.fileimport.PepXMLFileImport.parsePepXML(PepXMLFileImport.java:669)
        at PDVGUI.fileimport.PepXMLFileImport.access$000(PepXMLFileImport.java:33)
        at PDVGUI.fileimport.PepXMLFileImport$1.run(PepXMLFileImport.java:185)

For mgf/pepXML input from Crux, we use start_scan from the pepXML file as the spectrum ID to extract MS/MS spectrum data from the mgf file. In the pepXML files we generated with a previous version of Crux, start_scan is the index of the spectrum in the MGF file, not its scan number. In your pepXML file, however, start_scan appears to be the scan number from the mgf file. Has the way start_scan is assigned changed in the latest Crux?
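
The mismatch described above can be sketched in a few lines. This is only an illustration, not PDV's code: the function name, the 1-based convention, and the value 9730 are assumptions chosen to mirror the "Size: 8815" in the log.

```python
from typing import List

def title_by_ordinal(titles: List[str], start_scan: int) -> str:
    """Interpret start_scan as a 1-based position in the MGF file (assumed convention)."""
    if not 1 <= start_scan <= len(titles):
        raise IndexError(f"start_scan {start_scan} is outside 1..{len(titles)}")
    return titles[start_scan - 1]

# 8815 spectra, matching "Size: 8815" in the log; the titles are placeholders.
titles = [f"spectrum {i}" for i in range(1, 8816)]

print(title_by_ordinal(titles, 7))  # a valid ordinal resolves normally

try:
    # Hypothetical scan number that is larger than the number of spectra.
    title_by_ordinal(titles, 9730)
except IndexError as exc:
    print("lookup failed:", exc)
```

A scan number that happens to be smaller than the number of spectra would not raise at all; it would silently resolve to the wrong spectrum, which is arguably the worse failure.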

@wenbostar wenbostar self-assigned this Oct 25, 2022
@wsnoble
Author

wsnoble commented Oct 26, 2022

Unfortunately, I don't know the answer to this. Looking back at the release notes, it could be that these changes were in this update:

May 28, 2020: Added fixes for pepXML schema validation failures.

@wenbostar
Owner

If the MGF file has no scan numbers (SCANS), what does the latest Crux use as start_scan in its pepXML output? MGF files commonly lack scan numbers.

BEGIN IONS
TITLE=controllerType=0 controllerNumber=1 scan=7
SCANS=7
RTINSECONDS=122.229984
PEPMASS=381.409973144531
CHARGE=3+
111.1171341 1240.9335937500
113.8793030 1190.6700439453
115.0367432 1258.7552490234
153.8232880 1130.4077148438
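
To check whether a given MGF file carries SCANS headers at all, a small reader along these lines may help. It is a minimal sketch with a hypothetical function name, it is not how Crux or PDV read MGF, and it assumes SCANS holds a single integer (anything else is treated as missing).

```python
from typing import List, Optional

def read_scans(mgf_text: str) -> List[Optional[int]]:
    """Return one entry per spectrum: the SCANS value, or None if absent."""
    scans: List[Optional[int]] = []
    current: Optional[int] = None
    in_spectrum = False
    for raw in mgf_text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            if in_spectrum:  # previous block was never closed
                scans.append(current)
            in_spectrum, current = True, None
        elif line == "END IONS":
            if in_spectrum:
                scans.append(current)
            in_spectrum = False
        elif in_spectrum and line.startswith("SCANS="):
            value = line[len("SCANS="):].strip()
            # Assumption: only a single integer is handled here.
            current = int(value) if value.isdigit() else None
    if in_spectrum:  # truncated excerpt without END IONS
        scans.append(current)
    return scans

example = """BEGIN IONS
TITLE=controllerType=0 controllerNumber=1 scan=7
SCANS=7
PEPMASS=381.409973144531
111.1171341 1240.9335937500
END IONS
BEGIN IONS
TITLE=no scan header here
PEPMASS=400.0
112.0 10.0
END IONS
"""
print(read_scans(example))  # [7, None]
```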

@wsnoble
Author

wsnoble commented Oct 26, 2022

It uses ordinal numbers instead in that case. Here is the line that gets printed to the log file:

INFO: Parser could not determine scan numbers for this file, using ordinal numbers as scan numbers.

A sample MGF and pepxml file are attached.

plasmo-neighbors.trypsin-p.narrow.tide-search.pep.xml.txt
short.mgf.txt

@wenbostar
Owner

This new example can be imported into PDV successfully:

image

If an MGF file is combined from multiple MGF files (e.g., multiple fractions of the same sample), this MGF file is likely to have spectra with the same scan numbers. In this case, how will Crux set the start_scan in the pepXML output?
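
The concern about merged files can be made concrete with a short sketch. The function name and the sample scan lists are hypothetical; the point is only that scan numbers repeat after concatenation while ordinal positions do not.

```python
from collections import Counter
from typing import List, Optional

def ambiguous_scans(scans: List[Optional[int]]) -> List[int]:
    """Return the scan numbers that occur more than once in a merged MGF file."""
    counts = Counter(s for s in scans if s is not None)
    return sorted(s for s, n in counts.items() if n > 1)

# Two hypothetical fractions, each numbered from 1, concatenated into one file.
fraction_a = [1, 2, 3]
fraction_b = [1, 2]
merged = fraction_a + fraction_b

print(ambiguous_scans(merged))            # [1, 2]
print(list(range(1, len(merged) + 1)))    # ordinals stay unique: [1, 2, 3, 4, 5]
```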

@freejstone

I am really unsure if this is helpful, but for what it's worth, I was able to parse with PDV using the "database searching" feature, with a pepXML containing a single PSM and the complete mgf file. However, no ions are annotated:

Screen Shot 2022-10-27 at 1 58 37 pm

As soon as I reduce the mgf file to the single scan of interest, it does not parse. What does work is using PDV's "one PSM" feature. In that case it will accept the mgf with the single scan.

I have attached the complete mgf, the single mgf, and the pepxml containing the single PSM.

MSB19717Trypsin021915_1910.mgf.txt
MSB19717Trypsin021915.mgf.txt
plasmo-neighbors.trypsin-p.wide.tide-search_single_psm.pep.xml.txt

@kfattila

I am not familiar with the data parsing functions in Crux, but my understanding is that Crux uses ProteoWizard to parse its input files (mgf, etc.). I think there has been a ProteoWizard update. I hope my comment helps.

@wenbostar
Owner

> I am really unsure if this is helpful, but for what its worth I was able to parse PDV using the "database searching" feature using a pepxml containing a single PSM and using the complete mgf file. However no ions are annotated:
>
> Screen Shot 2022-10-27 at 1 58 37 pm
>
> As soon as I reduce the mgf file to the single scan of interest, it does not parse. What does work is using PDV's "one PSM" feature. In that case it will accept the mgf with the single scan.
>
> I have attached the complete mgf, the single mgf, and the pepxml containing the single PSM.
>
> MSB19717Trypsin021915_1910.mgf.txt MSB19717Trypsin021915.mgf.txt plasmo-neighbors.trypsin-p.wide.tide-search_single_psm.pep.xml.txt

For Crux output, the current version of PDV can correctly match PSMs in the pepXML file to spectra in the mgf file only when start_scan in the pepXML is the ordinal number of the spectrum in the mgf file.

@wenbostar
Owner

wenbostar commented Oct 31, 2022

I tested Crux v4.1. The current version of PDV works well with mzML/pepXML, mzML/mzid, mzXML/pepXML, and mzXML/mzid files. I added a few examples generated using Crux v4.1 to the README.

For MGF input, it looks like the spectrum ID mapping for both pepXML and mzid outputs was changed in v4.1, so PDV cannot parse the result successfully in some cases.

So far, I have found that start_scan is assigned differently depending on the header format of the MGF file:

  1. When there is no SCANS in the MGF file and Crux apparently cannot parse a scan number from TITLE, start_scan is the ordinal number of the spectrum in the mgf file. PDV works well with this;
  2. When there is no SCANS in the MGF file and TITLE is in a format like "TITLE=SF_200217_U2OS_TiO2_HCD_OT_rep1.1501.1501.2", start_scan is the scan number parsed from the title (1501 for TITLE=SF_200217_U2OS_TiO2_HCD_OT_rep1.1501.1501.2);
  3. When there is SCANS in the MGF file, start_scan is the scan number from SCANS.
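
The three cases above can be summarized as a small decision function. This is a reconstruction of the observed behavior only, not Crux's actual code: the function name, the argument layout, and the regular expression for the "name.scan.scan.charge" title format are all assumptions.

```python
import re
from typing import Optional

# Assumed shape of a title such as "SF_200217_U2OS_TiO2_HCD_OT_rep1.1501.1501.2".
TITLE_PATTERN = re.compile(r"^.+\.(\d+)\.(\d+)\.(\d+)$")

def observed_start_scan(ordinal: int, title: str, scans: Optional[int]) -> int:
    """Mimic the start_scan values observed in Crux v4.1 output for MGF input."""
    if scans is not None:                 # case 3: SCANS header present
        return scans
    match = TITLE_PATTERN.match(title)
    if match:                             # case 2: scan number embedded in TITLE
        return int(match.group(1))
    return ordinal                        # case 1: fall back to the ordinal

print(observed_start_scan(5, "SF_200217_U2OS_TiO2_HCD_OT_rep1.1501.1501.2", None))  # 1501
print(observed_start_scan(5, "controllerType=0 controllerNumber=1 scan=7", 7))      # 7
print(observed_start_scan(5, "untitled spectrum", None))                            # 5
```

Only the last branch produces the value that PDV currently expects.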

Because different spectra may share the same scan number when an MGF file is combined from multiple MGF files, I would suggest always assigning start_scan as the ordinal number of the spectrum in the mgf file. Mapping spectra consistently for a given MS/MS data format will make the results easier for users to parse.
