Phage-NCBI-tails-data-cleaning

This small Python script made my life easier when I had to clean up and extract data from the NCBI protein database for a biochemistry research project.

What it does

For each file in the raw_data directory, the script will:

1. Filter the rows by protein family, using the HMM (Name) column.
2. Count the number of Prot RefSeqs for that family. Record this number as "Total #hits".
3. Filter for E values less than 0.001 and count the Prot RefSeqs that satisfy this condition.
4. Filter for unique Prot RefSeqs and record this number as "Unique hits".
5. Generate a new file containing the total hits, the number of E values less than 0.001, and the unique hits.
6. Generate a new file containing the unique Prot RefSeqs with their smallest E values and protein family identity.
7. Move on to the next file in the raw_data directory and repeat steps 1-6.
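
The per-file steps above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the function name `summarize_hits`, the column names "Prot RefSeq" and "E value", and the output keys are assumptions made for the example. It also assumes that unique hits are counted per family among the rows that pass the E value cutoff, and that each unique Prot RefSeq is reported with the family that gave its smallest E value.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.001  # threshold named in the steps above


def summarize_hits(rows, family_col="HMM (Name)",
                   refseq_col="Prot RefSeq", evalue_col="E value"):
    """Summarize one file's rows (dicts, e.g. from csv.DictReader).

    Column names are assumptions; adjust them to the real export headers.
    Returns (summary, unique_records).
    """
    # Step 1: group rows by protein family.
    by_family = defaultdict(list)
    for row in rows:
        by_family[row[family_col]].append(
            (row[refseq_col], float(row[evalue_col]))
        )

    summary = []
    best = {}  # Prot RefSeq -> (smallest E value, family)
    for family, hits in by_family.items():
        # Step 3: keep hits below the E value cutoff.
        significant = [(ref, e) for ref, e in hits if e < E_VALUE_CUTOFF]
        # Step 4: unique Prot RefSeqs among the significant hits (assumed).
        unique = {ref for ref, _ in significant}
        summary.append({
            "family": family,
            "Total #hits": len(hits),          # step 2
            "E < 0.001": len(significant),
            "Unique hits": len(unique),
        })
        # Step 6: remember the smallest E value seen for each RefSeq.
        for ref, e in significant:
            if ref not in best or e < best[ref][0]:
                best[ref] = (e, family)

    unique_records = [
        {"Prot RefSeq": ref, "E value": e, "family": family}
        for ref, (e, family) in sorted(best.items())
    ]
    return summary, unique_records
```

In practice the rows for each file in raw_data could come from `csv.DictReader`, and the two returned lists could be written out with `csv.DictWriter` to produce the two output files described in steps 5 and 6.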
