Trims .bam from cram files; Adds crai; Trick for G-Actions disk limit #3

Draft: wants to merge 3 commits into base: main
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -18,3 +18,4 @@ jobs:
- name: Basic workflow tests
run: |
nextflow run ${GITHUB_WORKSPACE} --config conf/test.config
echo "Results tree view:" ; tree -a results; head results/**/*txt
Contributor Author

This adds a step that prints the results tree and the head of each text file, so that the CRAM sizes show up in the CI log. We cannot inspect them any other way because we chose not to store artifacts from the CI.

We also need to delete the generated data: the run hits the disk size limit and the CI fails because of it. See this example:
https://github.com/lifebit-ai/bam2cram/runs/4233568183?check_suite_focus=true#step:4:201
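
The disk-limit failure described above can be spotted early with a free-space check. The following is a minimal sketch, not part of this PR: the function names `free_kib` and `has_free_kib` are hypothetical, and it assumes a POSIX `df` and `awk` are available on the runner.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the free space (in KiB) of the filesystem
# that holds the given directory (defaults to the current directory).
free_kib() {
    # -P forces one output line per filesystem, -k reports 1024-byte blocks;
    # column 4 of the second line is the "Available" figure.
    df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# Hypothetical guard: succeed only when at least $2 KiB are free in directory $1.
has_free_kib() {
    [ "$(free_kib "$1")" -ge "$2" ]
}

if has_free_kib . 1; then
    echo "free space: $(free_kib .) KiB"
fi
```

A guard like this could run before the conversion step to report low disk space explicitly instead of failing halfway through a write.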

5 changes: 4 additions & 1 deletion conf/test.config
@@ -1,8 +1,11 @@
docker.enabled = true

params {

input = 'testdata/test_input_cloudos.csv'
reference = 's3://eu-west-1-example-data/nihr/testdata/Homo_sapiens_assembly38.fasta'
report_dir = "/opt/bin"
// delete the actual files to save space in Github Actions
pre_script = "df -h; ls -lh"
post_script = "df -h; ls -lh > metadata.cram.txt; rm *.cram; rm *.crai"
Contributor Author

  • Adds a custom cleanup script that runs after each process: it writes the sizes of the generated CRAM files to a text file, then deletes the CRAM and CRAI files to fix the failure caused by the disk size limit.
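
The order of operations in the post_script above matters: the listing has to be written before the files are removed. Here is a small sketch of the same idea as a standalone shell function. The name `record_and_clean` and the throwaway files are assumptions for illustration, and `rm -f` is used instead of the plain `rm` in the diff so that the step does not fail when nothing matches.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the post_script above:
# record the file sizes first, then delete the large files.
record_and_clean() {
    local workdir="$1"
    (
        cd "$workdir" || exit 1
        # Write the listing first so the sizes survive the deletion.
        ls -lh > metadata.cram.txt
        # -f is an assumption: it keeps the step from failing when no file matches.
        rm -f -- *.cram *.crai
    )
}

# Usage with empty files standing in for real CRAM output.
demo_dir="$(mktemp -d)"
touch "$demo_dir/sample.cram" "$demo_dir/sample.cram.crai"
record_and_clean "$demo_dir"
ls "$demo_dir"
```

After the call, only `metadata.cram.txt` is left in the directory, and it still lists the deleted files with their sizes.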

echo = true
Contributor Author

  • Sets echo = true so that the process output is printed in the CI test log.

}
114 changes: 75 additions & 39 deletions main.nf
@@ -4,18 +4,18 @@ def helpMessage() {
log.info """
Usage:
nextflow run main.nf --input input.csv --reference reference.fasta [Options]

Inputs Options:
--input Input csv file with bam paths
--reference Reference fasta file

Resource Options:
--cpus Number of CPUs (int)
(default: $params.cpus)
(default: $params.cpus)
--max_cpus Maximum number of CPUs (int)
(default: $params.max_cpus)
--memory Memory (memory unit)
(default: $params.memory)
(default: $params.memory)
--max_memory Maximum memory (memory unit)
(default: $params.max_memory)
--time Time limit (time unit)
@@ -81,13 +81,16 @@ process samtools_default_30 {
input:
file(bam_file) from ch_input_0
each file(reference) from ch_reference_0

output:
file "*.cram"
file "*.cra*"
Contributor Author

The output pattern now also catches the CRAI index files, which we need if we want to use the CRAMs for variant calling.
It also lets us pick up any file whose name contains .cra and send it to publishDir, without forcing the suffix to be cram.
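
To see what the wider pattern picks up, here is a sketch that uses ordinary shell globbing as a stand-in for the Nextflow output glob (an assumption: the two agree for simple names like these). The function name `matches_cra` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the names in a directory that match *.cra*.
matches_cra() {
    (
        cd "$1" || exit 1
        for f in *.cra*; do
            # Skip the literal pattern that the shell leaves behind when nothing matches.
            [ -e "$f" ] && printf '%s\n' "$f"
        done
        true
    )
}

demo_dir="$(mktemp -d)"
touch "$demo_dir/sample.bam" "$demo_dir/sample.cram" \
      "$demo_dir/sample.cram.crai" "$demo_dir/metadata.cram.txt"
matches_cra "$demo_dir"
```

The pattern matches `sample.cram`, `sample.cram.crai` and also `metadata.cram.txt`, but not `sample.bam`. This is why a size listing named `metadata.cram.txt` can still be collected as an output after the CRAM and CRAI files themselves have been deleted.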


script:
"""
samtools view -T $reference -o ${bam_file}.cram -O cram,version=3.0 $bam_file
${params.pre_script}
Contributor Author

Adds pre_script and post_script hooks to each process for debugging.

We can use them, for example, to check whether there is enough disk space left, to see how much disk is being used, to ls the files, and more.
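
The hook idea can be tried outside Nextflow with a small wrapper. This is only a sketch under assumptions: `run_with_hooks`, `PRE_SCRIPT` and `POST_SCRIPT` are hypothetical names, and `eval` stands in for the way the pipeline pastes the params strings into the script text. Unlike a script that aborts on the first error, this wrapper runs the post hook even if the main command fails.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command between two user-supplied shell snippets.
# PRE_SCRIPT and POST_SCRIPT stand in for params.pre_script and params.post_script.
run_with_hooks() {
    local pre="$1" post="$2"
    shift 2
    eval "$pre"          # e.g. "df -h; ls -lh" to inspect the disk before the step
    "$@"                 # the main command, passed as the remaining arguments
    local status=$?      # remember the main command's exit status
    eval "$post"         # runs regardless of the main command's exit status
    return "$status"
}

PRE_SCRIPT='echo "before: $(ls | wc -l) entries"'
POST_SCRIPT='echo "after: $(ls | wc -l) entries"'
run_with_hooks "$PRE_SCRIPT" "$POST_SCRIPT" echo "main step"
```

Keeping the hooks as plain strings means a test config can swap in a cleanup command without touching the process definitions.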

samtools view -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.0 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}
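
The switch from `${bam_file}.cram` to `${bam_file.simpleName}.cram` in the process above is what trims `.bam` from the output name. Nextflow's `simpleName` returns the file name without any extension; the sketch below approximates it in shell with a hypothetical `simple_name` function (it does not try to reproduce edge cases such as hidden files).

```shell
#!/usr/bin/env bash
# Hypothetical helper approximating Nextflow's simpleName:
# drop the directory part, then everything from the first dot onwards.
simple_name() {
    local name
    name="$(basename -- "$1")"
    printf '%s\n' "${name%%.*}"
}

bam="testdata/pb_normal.bam"
echo "old output name: $(basename -- "$bam").cram"
echo "new output name: $(simple_name "$bam").cram"
echo "index name:      $(simple_name "$bam").cram.crai"
```

For `pb_normal.bam` the old naming gives `pb_normal.bam.cram`, while the new one gives `pb_normal.cram`, with `samtools index` writing `pb_normal.cram.crai` next to it.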

@@ -99,13 +102,16 @@ process samtools_default_31 {
input:
file(bam_file) from ch_input_1
each file(reference) from ch_reference_1

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.1 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.1 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -117,13 +123,16 @@ process samtools_normal_30 {
input:
file(bam_file) from ch_input_2
each file(reference) from ch_reference_2

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.0 --output-fmt-option seqs_per_slice=10000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.0 --output-fmt-option seqs_per_slice=10000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -135,13 +144,16 @@ process samtools_normal_31 {
input:
file(bam_file) from ch_input_3
each file(reference) from ch_reference_3

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.1 --output-fmt-option seqs_per_slice=10000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.1 --output-fmt-option seqs_per_slice=10000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -153,13 +165,16 @@ process samtools_fast_30 {
input:
file(bam_file) from ch_input_4
each file(reference) from ch_reference_4

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.0,level=1 --output-fmt-option seqs_per_slice=1000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.0,level=1 --output-fmt-option seqs_per_slice=1000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -171,13 +186,16 @@ process samtools_fast_31 {
input:
file(bam_file) from ch_input_5
each file(reference) from ch_reference_5

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.1,level=1 --output-fmt-option seqs_per_slice=1000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.1,level=1 --output-fmt-option seqs_per_slice=1000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -189,13 +207,16 @@ process samtools_small_30 {
input:
file(bam_file) from ch_input_6
each file(reference) from ch_reference_6

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.0,level=6,use_bzip2=1 --output-fmt-option seqs_per_slice=25000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.0,level=6,use_bzip2=1 --output-fmt-option seqs_per_slice=25000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -207,13 +228,16 @@ process samtools_small_31 {
input:
file(bam_file) from ch_input_7
each file(reference) from ch_reference_7

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.1,level=6,use_bzip2=1,use_fqz=1 --output-fmt-option seqs_per_slice=25000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.1,level=6,use_bzip2=1,use_fqz=1 --output-fmt-option seqs_per_slice=25000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -225,13 +249,16 @@ process samtools_archive_30 {
input:
file(bam_file) from ch_input_8
each file(reference) from ch_reference_8

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.0,level=7,use_bzip2=1 --output-fmt-option seqs_per_slice=100000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.0,level=7,use_bzip2=1 --output-fmt-option seqs_per_slice=100000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -243,13 +270,16 @@ process samtools_archive_31 {
input:
file(bam_file) from ch_input_9
each file(reference) from ch_reference_9

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.1,level=7,use_bzip2=1,use_fqz=1,use_arith=1 --output-fmt-option seqs_per_slice=100000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.1,level=7,use_bzip2=1,use_fqz=1,use_arith=1 --output-fmt-option seqs_per_slice=100000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -261,13 +291,16 @@ process samtools_archive_lzma_30 {
input:
file(bam_file) from ch_input_10
each file(reference) from ch_reference_10

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.0,level=7,use_bzip2=1,use_lzma=1 --output-fmt-option seqs_per_slice=100000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.0,level=7,use_bzip2=1,use_lzma=1 --output-fmt-option seqs_per_slice=100000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}

@@ -279,12 +312,15 @@ process samtools_archive_lzma_31 {
input:
file(bam_file) from ch_input_11
each file(reference) from ch_reference_11

output:
file "*.cram"
file "*.cra*"

script:
"""
samtools view --threads $task.cpus -T $reference -o ${bam_file}.cram -O cram,version=3.1,level=7,use_bzip2=1,use_fqz=1,use_arith=1,use_lzma=1 --output-fmt-option seqs_per_slice=100000 $bam_file
${params.pre_script}
samtools view --threads $task.cpus -T $reference -o ${bam_file.simpleName}.cram -O cram,version=3.1,level=7,use_bzip2=1,use_fqz=1,use_arith=1,use_lzma=1 --output-fmt-option seqs_per_slice=100000 $bam_file
samtools index ${bam_file.simpleName}.cram
${params.post_script}
"""
}
18 changes: 11 additions & 7 deletions nextflow.config
@@ -5,7 +5,7 @@

// 1. Parameters

// NOTE:
// NOTE:
// Initialise the values of the params to the preferred default value or to false
params {
// input options
@@ -20,7 +20,7 @@ params {

// when set to true, prints help and exits
help = false

// container for all processes, excluding those defined with 'withName' (see example below)
container = 'quay.io/lifebitai/samtools:1.14'

@@ -29,12 +29,12 @@ params {
memory = 4.GB
time = 8.h
disk = '30.GB'

// max resources limits defaults
max_cpus = 2
max_memory = 4.GB
max_time = 8.h

// execution related defaults
config = 'conf/standard.config'
echo = false
@@ -50,15 +50,19 @@ params {
zone = 'us-east1-b'
network = 'default'
subnetwork = 'default'

//debugging variables
pre_script = "df -h; ls -lh"
post_script = "df -h; ls -lh"
}


// 2. Profiles


// Do not update the order because the values set in params scope will not be overwritten
// Do not attempt to simplify to
// includeConfig params.config
// Do not attempt to simplify to
// includeConfig params.config
// outside of profiles scope, it will fail to update the values of the params
profiles {
standard {includeConfig params.config}
@@ -80,7 +84,7 @@ process {
maxRetries = params.maxRetries
maxForks = params.maxForks
container = params.container
errorStrategy = params.errorStrategy
errorStrategy = params.errorStrategy
}

// 4. Executor
3 changes: 1 addition & 2 deletions testdata/test_input_cloudos.csv
@@ -1,3 +1,2 @@
bam
s3://eu-west-1-example-data/nihr/testdata/pb_normal.bam
s3://eu-west-1-example-data/nihr/testdata/pb_tumor.bam
https://eu-west-1-example-data.s3-eu-west-1.amazonaws.com/nihr/testdata/pb_normal.bam