Trims .bam from cram files; Adds crai; Trick for G-Actions disk limit #3

Draft
wants to merge 3 commits into
base: main

@cgpu (Contributor) commented on Nov 17, 2021

Overview

Does this

Purpose

To achieve that

Changes

  • Implements X
  • Refactors Y
  • Adds/Removes Z

@cgpu changed the title from "Trims .bam from cram files; Adds crai" to "Trims .bam from cram files; Adds crai; Heuristic for Github Action disk size limit" on Nov 17, 2021
@cgpu changed the title from "Trims .bam from cram files; Adds crai; Heuristic for Github Action disk size limit" to "Trims .bam from cram files; Adds crai; Trick for G-Actions disk limit" on Nov 17, 2021

@@ -18,3 +18,4 @@ jobs:
      - name: Basic workflow tests
        run: |
          nextflow run ${GITHUB_WORKSPACE} --config conf/test.config
          echo "Results tree view:" ; tree -a results; head results/**/*txt

Contributor Author

Here I print the results tree and the head of the text files, because I want to see the CRAM sizes. We cannot inspect them otherwise, since we chose not to store artifacts from the CI.

Additionally, we need to delete the generated data: we hit the disk size limit and the CI fails because of it. See an example here:
https://github.com/lifebit-ai/bam2cram/runs/4233568183?check_suite_focus=true#step:4:201

// delete the actual files to save space in Github Actions
pre_script = "df -h; ls -lh"
post_script = "df -h; ls -lh > metadata.cram.txt; rm *.cram; rm *.crai"
echo = true

Contributor Author

  • Adding echo = true so that the process output is printed in the CI test log.

input = 'testdata/test_input_cloudos.csv'
reference = 's3://eu-west-1-example-data/nihr/testdata/Homo_sapiens_assembly38.fasta'
report_dir = "/opt/bin"
// delete the actual files to save space in Github Actions
pre_script = "df -h; ls -lh"
post_script = "df -h; ls -lh > metadata.cram.txt; rm *.cram; rm *.crai"

Contributor Author

  • Adding a custom cleanup script that runs after the process: it records the sizes of the generated CRAM files in a txt file, then deletes the CRAM and CRAI files to fix the failure caused by the disk size limit.
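
To make the effect of the cleanup concrete, here is a minimal shell sketch that runs the same commands as the post_script value in a throwaway directory. The scratch directory and the sample1 file names are hypothetical stand-ins; in the pipeline the files would be produced by samtools inside the Nextflow task work directory.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory standing in for a Nextflow task work dir.
workdir="$(mktemp -d)"
cd "$workdir"

# Dummy outputs (assumed names); real CRAM/CRAI files come from samtools.
printf 'cram-bytes' > sample1.cram
printf 'crai' > sample1.cram.crai

# Same commands as the post_script string in the config.
df -h .
ls -lh > metadata.cram.txt   # record file sizes before deleting anything
rm *.cram
rm *.crai

# Only the size listing is left behind.
remaining="$(ls)"
echo "$remaining"
cat metadata.cram.txt
```

The listing also contains metadata.cram.txt itself with size 0, because the shell creates the redirect target before ls runs. Note as well that rm *.crai exits with an error when no index file exists, so rm -f may be a more forgiving choice if some runs produce no CRAI.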

output:
file "*.cram"
file "*.cra*"

Contributor Author

Here I am changing the output pattern to also catch the crai files (we need them if we want to use the crams for variant calling).
This also lets us capture any file whose name contains .cra and send it to publishDir, without forcing the suffix to be cram.
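
As a rough illustration of what the wider pattern picks up, the sketch below expands the same glob in a plain shell. This uses shell globbing as a stand-in, on the assumption that Nextflow's output globs treat * the same way for these names; the file names are hypothetical.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory with a mix of file types.
workdir="$(mktemp -d)"
cd "$workdir"
touch sample1.bam sample1.cram sample1.cram.crai notes.txt

# The pattern from the output block: anything containing ".cra".
matched="$(echo *.cra*)"
echo "$matched"
```

Both the CRAM file and its CRAI index match, while the BAM input and unrelated text files do not.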


script:
"""
samtools view -T $reference -o ${bam_file}.cram -O cram,version=3.0 $bam_file
${params.pre_script}

Contributor Author

Adding pre- and post-script sections as handy debugging hooks.

We can use them, for example, to check whether we have enough disk space, to see whether we are using too much of it, to ls the files, and more.
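
As one possible use of the pre_script hook for the disk-space check mentioned above, here is a small sketch that reads the free space of the current directory and compares it with a threshold. MIN_FREE_KB and its default value are assumptions made for illustration; the PR itself only runs df -h and ls -lh.

```shell
#!/bin/sh
set -eu

# Hypothetical threshold: minimum free space (in KiB) wanted before converting.
MIN_FREE_KB="${MIN_FREE_KB:-1}"

# df -P prints one POSIX-format line per filesystem and -k reports
# 1024-byte blocks; column 4 of the second line is the available space.
free_kb="$(df -Pk . | awk 'NR==2 {print $4}')"

if [ "$free_kb" -lt "$MIN_FREE_KB" ]; then
    status="insufficient"
else
    status="ok"
fi
echo "free: ${free_kb} KiB, status: ${status}"
```

A hook like this could exit non-zero when the status is insufficient, so that the task fails early with a clear message instead of failing midway through writing the CRAM.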

@imendes93
Contributor

@cgpu I would say this is ready to merge. What do you say?
