
Pipeline options


This section describes the various ways in which the pipeline can be run and the various options available. Usually options are specified in the nextflow.config file (or whichever config file you use). However, you can also pass parameters to the Nextflow script on the command line. Parameters on the command line override any parameters specified in the config file.

Standard Nextflow options

The following Nextflow options may be very useful.

Specifying options

When you run the scripts there are a number of different options that you might want to use. These options are specified using the -flag or --flag notation. Flags with a single hyphen (e.g. -resume) are standard Nextflow options applicable to all Nextflow scripts. Flags with a double hyphen (e.g. --pi_hat) are options specific to our scripts. Take care not to mix these up: it's an easy error to make, and it may cause silent errors to occur.
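
For example, the following command combines a standard Nextflow flag with one of our script-specific parameters (the value 0.11 here is purely illustrative):

nextflow run plink-qc.nf -resume --pi_hat 0.11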

Almost all the workflow options that are in the nextflow.config file can also be passed on the command line, and they will then override anything in the config file. For example

nextflow run plink-qc.nf --cut_miss 0.04

sets the maximum allowable per-SNP missingness to 4%. However, this should only be used when debugging and playing around. Otherwise, keep the options in a config file that you save, because putting options on the command line reduces reproducibility. (Parameters that change the mode of running -- e.g. whether to use Docker or whether to produce a timeline -- only affect the time taken and auxiliary data rather than the substantive results, so they can safely be given on the command line.)
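
If you prefer to keep such a setting in the config file, the equivalent entry would look something like this (a minimal sketch -- the config file that ships with the pipeline may group its parameters differently):

    params {
        cut_miss = 0.04   // maximum allowable per-SNP missingness
    }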

Partial execution and resuming execution

Often a workflow may fail in the middle of execution because there's a problem with the data (perhaps a typo in the name of a file), or you may want to run the workflow with slightly different parameters. Nextflow is very good at detecting which parts of the workflow need to be re-executed -- use the -resume option.
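
For example, after fixing the problem you can rerun the same command with the -resume flag and Nextflow will pick up from where it left off:

nextflow run plink-qc.nf -resume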

Cleaning up

If you want to clean up your work directory, say nextflow clean.
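
For example, to the best of our knowledge the clean command accepts a -n flag for a dry run and a -f flag to force the deletion:

    nextflow clean -n    # show what would be removed
    nextflow clean -f    # actually remove the work files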

Workflow overview, and timing

Nextflow provides several options for visualising and tracing workflows. See the Nextflow documentation for details. Two of the options are:

  • A nice graphic of a run of your workflow

    nextflow run plink-qc.nf -with-dag quality-d.pdf

  • A timeline of your workflow and individual processes (produced as an html file).

    nextflow run <pipeline name> -with-timeline time.html

    This is useful for seeing how long different parts of your process took. Also useful is the peak virtual memory used, which you may need to know if running on very large data to ensure you have a big enough machine and specify the right parameters.
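
    A related standard Nextflow option is -with-trace, which writes a tab-separated trace file with per-process statistics, including peak memory use (the exact column names, such as peak_rss and peak_vmem, may vary between Nextflow versions):

    nextflow run plink-qc.nf -with-trace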

Running the workflow in different environments

In the quick start we gave an overview of running our workflows in different environments. Here we go through all the options in a little more detail.

Running natively on a machine

This option requires that all dependencies have been installed. You run the code by saying

nextflow run plink-qc.nf

You can add any extra parameters at the end.
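
For example, to override a parameter and also produce a timeline in the same run (the value 0.05 is purely illustrative):

nextflow run plink-qc.nf --cut_miss 0.05 -with-timeline time.html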

Running on a local machine with Docker

This requires the user to have docker installed.

Run by saying

nextflow run plink-qc.nf -profile docker
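
The docker profile itself is already defined in the nextflow.config that we distribute; conceptually it is a stanza along the following lines (the image name below is a hypothetical placeholder -- check the actual config for the images we use):

    docker {
        docker.enabled    = true
        process.container = 'some-org/some-image'   // hypothetical image name
    }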

Running on a cluster

Nextflow supports execution on clusters using standard resource managers, including Torque/PBS, SLURM and SGE. Log on to the head node of the cluster and execute the workflow as shown below. Nextflow submits the jobs to the cluster on your behalf, taking care of any dependencies. If your job is likely to run for a long time because you've got really large data sets, use a tool like screen so that you can control your session without it timing out.

To run using Torque/PBS, log into the head node. Edit the nextflow.config file (either directly or using our helper script). If you are doing this manually, look for the stanza (around line 85)

    pbs {
        process.executor = 'pbs'
        process.queue = 'long'
    }

and change long to whatever queue you are using.
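
For example, if your cluster's queue were called batch (a hypothetical name), the edited stanza would read:

    pbs {
        process.executor = 'pbs'
        process.queue = 'batch'   // replace with the name of your queue
    }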

Then you can run by saying

nextflow run plink-qc.nf -profile pbs

If you are using another scheduler, the changes should be straightforward. For example, to run using SLURM, add a stanza like the following within the profiles section of the nextflow.config file

    slurm {
        process.executor = 'slurm'
        process.queue = 'long'
    }

and then use this as the profile.
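
For example:

nextflow run plink-qc.nf -profile slurm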

Running on a cluster with Docker

If you have a cluster which runs Docker, you can get the best of both worlds by editing the queue variable in the pbsDocker stanza, and then running

nextflow run plink-qc.nf -profile pbsDocker
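
As a rough sketch, the pbsDocker stanza combines the two mechanisms above, along these lines (illustrative only -- see the distributed nextflow.config for the real contents):

    pbsDocker {
        process.executor = 'pbs'
        process.queue = 'long'    // your cluster's queue
        docker.enabled = true
    }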

Running on Docker Swarm

We have tested our workflow on different Docker Swarms. How to set up Docker Swarm is beyond the scope of this tutorial, but if you have a Docker Swarm it is easy to run. We assume all the data is visible to all nodes in the swarm. Log into the head node of the Swarm and run your chosen workflow -- for example

nextflow run plink-qc.nf -profile dockerSwarm

Other container services

We hope to support Singularity soon. We are unlikely to support udocker soon for the reasons discussed here.

Running on Amazon EC2

Nextflow supports execution on Amazon EC2. Of course, you can do your own custom thing on Amazon EC2, but there is direct support from Nextflow and we provide an Amazon AMI that allows you to use Amazon very easily. This discussion assumes you are familiar with Amazon and EC2 and shows you how to run the workflow on EC2:

  1. We assume you have an Amazon AWS account and have some familiarity with EC2. The easiest way to run is by building an Amazon Elastic File System (EFS) which persists between runs. Each time you run, you attach the EFS to the cluster you use. We assume you have
  • your Amazon accessKey and secretKey

  • the ID of your EFS

  • the ID of the subnet you will use for your Amazon EC2.

    Edit the nextflow.config file to add your keys to the aws stanza, as well as changing the AMI ID, sharedStorageId, the mount point and subnet ID. BUT see point 8 below for a better way of doing things.

    aws {
       accessKey ='AAAAAAAAAAAAAAAAA'
       secretKey = 'raghdkGAHHGH13hg3hGAH18382GAJHAJHG11'
       region    ='eu-west-1'
    }
    
    cloud {
             ...
             ...
             ... other options
             imageId = "ami-9ca7b2fa"      // AMI which has cloud-init installed
             sharedStorageId   = "fs-XXXXXXXXX"   // The ID of the shared EFS file system
             sharedStorageMount = "/mnt/shared"   // Where the EFS will be mounted
             subnetId = "subnet-XXXXXXX"
    }
    

    Note that the AMI ID given is the H3ABionet AMI, which you should use. The other information, such as the keys, sharedStorageId and subnetId, you have to set to match your own setup.

  2. Create the cloud. For the simple example, you only need one machine. If you have many big files, adjust accordingly.

    nextflow cloud create h3agwascloud -c 1

    The name of the cluster is your choice (h3agwascloud is just an example).

  3. If successful, you will be given the ID of the head node of the cluster to log in to. You should see a message like:

> cluster name: h3agwascloud
> instances count: 1
> Launch configuration:
 - bootStorageSize: '20GB'
 - driver: 'aws'
 - imageId: 'ami-9ca7b2fa'
 - instanceType: 'm4.xlarge'
 - keyFile: /home/user/.ssh/id_rsa.pub
 - sharedStorageId: 'fs-e27f661c'
 - sharedStorageMount: '/mnt/shared'
 - subnetId: 'subnet-a321c9c2'
 - userName: 'scott'
 - autoscale:
   - enabled: true
   - maxInstances: 5
   - terminateWhenIdle: true



Please confirm you really want to launch the cluster with above configuration [y/n] y
Launching master node -- Waiting for `running` status.. ready.
Login in the master node using the following command: 
  ssh -i /home/scott/.ssh/id_rsa [email protected]

  4. ssh into the head node of your Amazon cluster. The EFS is mounted onto /mnt/shared. In our example, we will analyse the files sampleA.{bed,bim,fam} in the /mnt/shared/input directory. The nextflow binary will be found in your home directory. (Note that you can choose to mount the EFS on another mount point by modifying the Nextflow option sharedStorageMount.)

  5. Run the workflow -- you can run directly from GitHub. The AMI doesn't have any of the bioinformatics software installed.

    Specify the docker profile and nextflow will run using Docker, fetching any necessary images.

    nextflow run h3abionet/h3agwas --work_dir /mnt/shared --input_pat sampleA --max_plink_cores 2 -profile docker

  6. The output can be found in /mnt/shared/output. The file sampleA.pdf is a report of the analysis that was done.

  7. Remember to shut down the Amazon cluster to avoid unduly boosting Amazon's share price.

    nextflow cloud shutdown h3agwascloud

  8. Security considerations: your Amazon credentials should be kept confidential. Practically, this means that adding the credentials to your nextflow.config file is a bad idea, especially if you put that file under git control or share your Nextflow scripts. A better way of handling this is to put confidential information in a separate file that you don't share. For example, I have a file called scott.aws which contains the following:

aws {
    accessKey ='APT3YGD76GNbOP1HSTYU4'
    secretKey = 'WHATEVERYOURSECRETKEYISGOESHERE'
    region    ='eu-west-1'
}

cloud {
            sharedStorageId   = "fs-XXXXXX"   
            subnetId = "subnet-XXXXXX"
}

Then, when you create your cloud, you say this on your local machine:

nextflow -c scott.aws cloud create scottcluster -c 5

Note that there are two uses of -c, and the positions of these arguments are crucial. The first is an argument to nextflow itself and gives a configuration file for nextflow to use. The second is an argument to cloud create and specifies how many nodes should be created.

The scott.aws file is not shared or put under git control. The nextflow.config file can be, because it contains no private information.
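
One simple way to make sure the credentials file never ends up in the repository (assuming your project directory is under git and the file is called scott.aws as above) is to add it to .gitignore:

echo "scott.aws" >> .gitignore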