Changed regex for calculation of percent hemoglobin genes #229

Open
wants to merge 1 commit into
base: master

@KriBaLin KriBaLin commented Aug 15, 2023

Dear Theis lab,

thank you very much for your helpful book and tutorials.

I am currently performing my first analysis of scRNA-seq data. During step 6.3 (filtering low quality reads) I wanted to understand the regex for filtering hemoglobin genes ("^HB[^(P)]").
I noticed that this regex matches not only hemoglobin genes but also HBEGF and HBS1L. Of the three non-hemoglobin genes starting with "HB" (HBEGF, HBS1L, and HBP1), it excludes only HBP1.

With some help from Stack Overflow, I tried to find a more specific regex that matches only the hemoglobin genes. I'd suggest "^HB(?!EGF|S1L|P1).+", which is what I changed in the Jupyter notebook. An alternative might be "^HB[^(P|S)]($|[^G])".

This applies to human data. However, we briefly confirmed that these regexes are also applicable (with lowercase characters) to mouse data.
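As a rough illustration of the mouse case, the sketch below applies lowercase variants of both suggested patterns to a handful of mouse symbols. The symbol list is an assumption chosen for this example and is not exhaustive, and the lowercase patterns are my reading of "with lowercase characters", so please treat it as a sanity check rather than a validation.

```python
import re

# Illustrative mouse symbols (an assumption for this sketch, not a complete list).
mouse_symbols = ["Hba-a1", "Hba-a2", "Hbb-bs", "Hbb-bt", "Hbb-y",
                 "Hbegf", "Hbs1l", "Hbp1"]

# Lowercase variants of the two suggested patterns.
lookahead = re.compile(r"^Hb(?!egf|s1l|p1).+")
char_class = re.compile(r"^Hb[^(p|s)]($|[^g])")

# Symbols that each pattern would flag as hemoglobin genes.
kept_lookahead = [s for s in mouse_symbols if lookahead.search(s)]
kept_char_class = [s for s in mouse_symbols if char_class.search(s)]

print(kept_lookahead)
print(kept_char_class)
```

On this small sample both patterns keep the five hemoglobin symbols and drop Hbegf, Hbs1l, and Hbp1.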

Please correct me if I am wrong and the original regex behaves as you intended. In that case, I would suggest extending the documentation for clarification.

Best,

Kristina

edit: added code backticks to the suggested regexes for correct display

KriBaLin changed the title from "Changed regex for calculation of percent hemoglobulin genes" to "Changed regex for calculation of percent hemoglobin genes" on Aug 15, 2023

@Zethson
Member

Zethson commented Aug 15, 2023

Dear @KriBaLin ,

thank you!

`^HB(?!EGF|S1L|P1).+` seems a bit specific, and I'm worried that there might be other genes that we would fail to exclude as false positives. Is this fear unjustified?

So `^HB[^(P|S)]` (which I think is equivalent to `^HB[^PS]`?) might be a more appealing option if this is the case. Note that this would still match HBEGF...

What do you think?
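On the equivalence question, a minimal check with Python's `re` module may help. Inside a character class, `(`, `|`, and `)` are literal characters rather than grouping or alternation, so the two patterns should differ only for symbols that contain one of those characters right after "HB". The strings "HB(X" and "HB|X" below are made up for the test and are not real gene symbols.

```python
import re

# "(", "|" and ")" inside [...] are literals, so this class excludes
# the five characters ( P | S ) rather than "P or S" as a group.
with_parens = re.compile(r"^HB[^(P|S)]")
plain = re.compile(r"^HB[^PS]")

# Real symbols from the thread plus two artificial strings (hypothetical).
symbols = ["HBB", "HBA1", "HBEGF", "HBS1L", "HBP1", "HB(X", "HB|X"]

results = {
    s: (bool(with_parens.search(s)), bool(plain.search(s))) for s in symbols
}
for symbol, (a, b) in results.items():
    print(f"{symbol}: with_parens={a}, plain={b}")
```

For ordinary gene symbols the two patterns agree, and both still match HBEGF.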

@KriBaLin
Author

KriBaLin commented Aug 15, 2023

Dear @Zethson,

thank you for your fast reply.

Sorry, there was a formatting mistake in my first post that turned the suggested "^HB[^(P|S)]($|[^G])" into the incorrect "^HB[^(P|S)]". I have now edited the post.

Regarding the expression being too specific, I'm honestly not experienced enough to judge this with respect to future changes of gene annotations or the like. Currently, when I search the 36,601 genes of my human data set for genes starting with "HB", I get 13 hits: HBEGF, HBS1L, HBP1, HBB, HBD, HBG1, HBG2, HBE1, HBZ, HBM, HBA2, HBA1, HBQ1.

The first three don't seem to be hemoglobin genes. The regex "^HB[^(P)]" excludes only HBP1, whilst "^HB(?!EGF|S1L|P1).+" and "^HB[^(P|S)]($|[^G])" exclude all three. The former of these two might be a bit easier to understand.
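The comparison above can be reproduced with a short sketch using Python's `re` module. It assumes the notebook applies the pattern with search semantics (as pandas' `str.contains` does), and it only covers the 13 symbols listed here, not a full annotation.

```python
import re

# The 13 symbols starting with "HB" found in the data set described above.
hb_prefixed = ["HBEGF", "HBS1L", "HBP1", "HBB", "HBD", "HBG1", "HBG2",
               "HBE1", "HBZ", "HBM", "HBA2", "HBA1", "HBQ1"]

# Labels are just names for this sketch.
patterns = {
    "original": r"^HB[^(P)]",
    "lookahead": r"^HB(?!EGF|S1L|P1).+",
    "char_class": r"^HB[^(P|S)]($|[^G])",
}


def excluded(pattern, symbols):
    """Return the symbols that the pattern does NOT match."""
    regex = re.compile(pattern)
    return [s for s in symbols if not regex.search(s)]


for name, pattern in patterns.items():
    print(f"{name}: excludes {excluded(pattern, hb_prefixed)}")
```

On these 13 symbols, the original pattern drops only HBP1, while both suggested patterns drop HBEGF, HBS1L, and HBP1 and keep the remaining ten.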

A (maybe more robust?) option could be to explicitly check against a list of hemoglobin genes, as suggested by Konrad Rudolph on Stack Overflow.
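A minimal sketch of the explicit-list idea, in plain Python. The set below simply reuses the ten hemoglobin symbols from my search above. It is an assumption for illustration, not a curated or authoritative list, and the names `HEMOGLOBIN_GENES` and `flag_hemoglobin` are hypothetical.

```python
# Ten hemoglobin symbols taken from the search results in this thread
# (assumed for illustration; not an authoritative gene list).
HEMOGLOBIN_GENES = frozenset({
    "HBB", "HBD", "HBG1", "HBG2", "HBE1",
    "HBZ", "HBM", "HBA2", "HBA1", "HBQ1",
})


def flag_hemoglobin(var_names):
    """Return one boolean per symbol: True if it is in the explicit list."""
    return [name in HEMOGLOBIN_GENES for name in var_names]


var_names = ["HBB", "HBEGF", "HBA1", "HBS1L", "HBP1", "ACTB"]
flags = dict(zip(var_names, flag_hemoglobin(var_names)))
print(flags)
```

In the notebook this would presumably translate to a membership test on the gene names, for example with pandas' `Index.isin`, instead of `str.contains`. The trade-off is that new or renamed hemoglobin genes would have to be added to the list by hand.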

@Zethson
Member

Zethson commented Aug 15, 2023

I guess one could look at the Ensembl gene symbols to see which genes this regex would match. A list of genes is also possible, but then we'd need the list ^_^

@grst
Contributor

grst commented Aug 29, 2023

I agree with @klmr that an explicit list is preferable to a regex. I'm not sure what a "trusted source" of hemoglobin genes would be, but results 1-10 from this GeneCards search are probably a good start. At the very least, that list certainly doesn't include anything unexpected.

@Zethson
Member

Zethson commented Aug 29, 2023

Thank you very much @grst for the link! We'll make the changes accordingly using the list.
