Scatter/Gather in a CWL Workflow
Overview
Teaching: 5 min
Exercises: 0 min
Questions
How can I parallelize running of tools in a workflow?
Objectives
Learn how to scatter steps of a CWL workflow.
This lesson uses a real-life example of open-source CWL CommandLineTools from Dockstore.
CWL workflows can scatter an array input so that a single step runs once per element, and then gather the resulting outputs into an array for use in subsequent steps. This is similar to GNU parallel, which runs a single command over multiple input arguments. The scattered jobs are independent of one another, so a workflow runner is free to execute them in parallel.
Add ScatterFeatureRequirement to the requirements section of a workflow to enable scattering in its steps.
requirements:
  ScatterFeatureRequirement: {}
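To make the mechanism concrete, here is a minimal sketch of a workflow that scatters one step over an array of strings. The tool file `echo.cwl`, its `message` input, and the workflow input `message_array` are hypothetical names chosen for illustration; they are not part of this lesson's materials.

```yaml
cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  message_array: string[]    # hypothetical array input, one element per job

outputs: []

steps:
  echo:
    run: echo.cwl            # hypothetical tool that takes one string input named "message"
    scatter: message         # name of the step input to scatter over
    in:
      message: message_array # the array is split; each job receives one element
    out: []
```

The `scatter` field names a step input, not a workflow input. The tool itself still declares `message` as a single string; only the workflow-level input becomes an array.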
Exercise 1
In this workflow, we have a job that uses GATK Haplotype Caller. Currently, the job is configured to run on a single chromosome. How would you parallelize GATK_HaplotypeCaller to run all chromosomes simultaneously?
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  bam: File
  chromosome: string
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      intervals: chromosome
    out: [vcf]
Solution
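One possible solution, offered as a sketch rather than the lesson's official answer: turn the single `chromosome` string into an array and scatter the step over its `intervals` input. The names `chromosomes` and `HaplotypeCaller_VCFs` are assumptions introduced here, and the sketch assumes `GATK_HaplotypeCaller.cwl` accepts a single string for `intervals` and produces one `vcf` file per run, as in the exercise workflow.

```yaml
cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  bam: File
  chromosomes: string[]        # assumed name; an array replaces the single string

outputs:
  HaplotypeCaller_VCFs:        # assumed name
    type: File[]               # the gathered output is an array, one VCF per chromosome
    outputSource: GATK_HaplotypeCaller/vcf

steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    scatter: intervals         # run one job per element of the array
    in:
      input_bam: bam           # not scattered; every job receives the same BAM
      intervals: chromosomes
    out: [vcf]
```

Because the step is scattered, its `vcf` output is gathered into an array, so the workflow output type changes from `File` to `File[]`.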
Key Points
Use