Shell to CWL workflow conversion
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is the difference between a CWL tool description and a CWL workflow?
How can we create a tool descriptor?
How can we use this in a single-step workflow?
How can we expand to a multi-step workflow?
Objectives
dataflow objectives:
explain the difference between a CWL tool description and a CWL workflow
describe the relationship between a tool and its corresponding CWL document
exercise good practices when naming inputs and outputs
Be able to make understandable and valid names for inputs and outputs (not ‘input3’)
describing requirements objectives:
identify all the requirements of a tool and define them in the tool description
use runtime parameters to access information about the runtime environment
define environment variables necessary for execution
use secondaryFiles or InitialWorkDirRequirement to access files in the same directory as another referenced file
use $(runtime.cores) to define the number of cores to be used
use type: File, instead of a string, to reference a filepath
converting shell to cwl objectives:
identify tasks and data links in a script
recognize loops that can be converted into scatters
find and reuse existing CWL command line tool descriptions
capturing outputs objectives:
explain that only files explicitly mentioned in a description will be included in the output of a step/workflow
implement bulk capturing of all files produced by a step/workflow for debugging purposes
use STDIN and STDOUT as input and output
capture output written to a specific directory, the working directory, or the same directory where input is located
By the end of this episode, learners should be able to explain how a workflow document describes the inputs and outputs of a workflow and the flow of data between tools, describe all the requirements for running a tool, define the files that will be included as output of a workflow, and convert a shell script into a CWL workflow.
Tool Descriptors
Use https://github.com/bcosc/fast_genome_variants/blob/main/README.md to create a CommandLineTool
Exercise 1
Create the baseCommand for running the joint_haplotype caller using the fast_genome_variants README.
Solution
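A minimal sketch of one possible answer, assuming the tool is invoked as fgv joint_haplotype (the wording used in Exercise 2). Check the README for the exact invocation. The empty inputs and outputs are placeholders so the document is a complete CommandLineTool.

```yaml
cwlVersion: v1.0
class: CommandLineTool
# Assumption: the executable is "fgv" and the subcommand is "joint_haplotype".
baseCommand: [fgv, joint_haplotype]
# Placeholders; later exercises fill these in.
inputs: []
outputs: []
```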
Exercise 2:
When working in a cloud environment, you need to specify what machine type you would like to run on. This means the job must have specific parameters describing the RAM, cores, and disk space (for both temporary and output files) it requires.
Create the ResourceRequirement field for running the fgv joint_haplotype command on 2 BAMs.
Solution:
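A sketch of what such a requirement could look like. The numbers are illustrative assumptions, not values taken from the README; size them to your own BAMs. In CWL v1.0, ramMin, tmpdirMin, and outdirMin are given in mebibytes.

```yaml
requirements:
  ResourceRequirement:
    coresMin: 2        # assumed: one core per BAM
    ramMin: 4096       # assumed: 4 GiB of RAM, in MiB
    tmpdirMin: 10240   # assumed: 10 GiB of temporary disk space, in MiB
    outdirMin: 10240   # assumed: 10 GiB of disk space for outputs, in MiB
```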
Exercise 3:
- Create the inputs field for running fgv_joint_haplotype
- Add an optional flag for calling a GVCF output
- Add a string input for intervals, for example chr2:1-10000
- Add an output name.
Solution:
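A sketch of one possible inputs field. Every flag name here (--bam, --gvcf, --intervals, --output) and every input name is a hypothetical placeholder; replace them with the options documented in the README.

```yaml
inputs:
  bam_files:
    type: File[]
    inputBinding:
      prefix: --bam          # hypothetical flag name
  gvcf:
    type: boolean?           # optional: the flag is emitted only when set to true
    inputBinding:
      prefix: --gvcf         # hypothetical flag name
  intervals:
    type: string             # for example chr2:1-10000
    inputBinding:
      prefix: --intervals    # hypothetical flag name
  output_name:
    type: string
    inputBinding:
      prefix: --output       # hypothetical flag name
```

The trailing question mark on boolean? marks the input as optional, so a job file that omits gvcf is still valid.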
Exercise 4:
Create the output variable for the CommandLineTool and name it output_vcf.
Solution:
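A minimal sketch, assuming the tool writes its VCF to the file name passed in through a string input called output_name (a hypothetical input name; use whatever you called the output name input in Exercise 3).

```yaml
outputs:
  output_vcf:
    type: File
    outputBinding:
      # Assumption: inputs.output_name holds the file name the tool writes to.
      glob: $(inputs.output_name)
```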
Exercise 5:
TODO
Solution:
Capturing Output
Exercise 1
Using this CommandLineTool description, what files would be the output of the tool?

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [bwa, mem]
inputs:
  reference:
    type: File
    inputBinding:
      position: 1
  fastq_reads:
    type: File[]
    inputBinding:
      position: 2
stdout: output.sam
outputs:
  output:
    type: File
    outputBinding:
      glob: output.sam

Solution
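One way to read the description: the stdout field redirects the tool's standard output into output.sam, and the only entry under outputs globs that one file, so output.sam is the only file captured. The sketch below shows an equivalent form that uses the stdout output type from CWL v1.0 instead of an explicit glob.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [bwa, mem]
inputs:
  reference:
    type: File
    inputBinding:
      position: 1
  fastq_reads:
    type: File[]
    inputBinding:
      position: 2
# Standard output is written to this file in the output directory.
stdout: output.sam
outputs:
  output:
    # Shorthand for a File output bound to the captured standard output.
    type: stdout
```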
Exercise 2
Your colleague tells you to run fastqc, which creates several files describing the quality of the data. For now, let’s assume the tool creates three files:
final_report_fastqc.html
final_figures_fastqc.zip
supplemental_figures_fastqc.html
Create a CWL outputs field using a File array that captures all the fastqc files in a single output variable.
Solution
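A sketch of one possible answer. The output name fastqc_files is a hypothetical choice, and the wildcard assumes every file of interest contains _fastqc. in its name, as the three listed files do.

```yaml
outputs:
  fastqc_files:
    type: File[]
    outputBinding:
      # Matches final_report_fastqc.html, final_figures_fastqc.zip,
      # and supplemental_figures_fastqc.html.
      glob: "*_fastqc.*"
```

glob also accepts a list of patterns, so the three file names could be listed explicitly if a wildcard would match too much.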
Actually, fastqc may create more than three of these files. Depending on the input parameters you give it, it may create a results directory that contains additional files such as results/fastqc_per_base_content.html and results/fastqc_per_base_gc_content.html.
Create a CWL outputs field that captures results/fastqc_per_base_content.html and results/fastqc_per_base_gc_content.html in separate output variables.
Solution
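A sketch with one output per file. The output names per_base_content and per_base_gc_content are hypothetical; the paths are relative to the tool's output directory.

```yaml
outputs:
  per_base_content:
    type: File
    outputBinding:
      glob: results/fastqc_per_base_content.html
  per_base_gc_content:
    type: File
    outputBinding:
      glob: results/fastqc_per_base_gc_content.html
```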
Finally, instead of explicitly defining each file to be captured, create a CWL outputs field that captures the entire results directory.
Solution
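A sketch that captures the directory as a single Directory output. The output name results_dir is a hypothetical choice.

```yaml
outputs:
  results_dir:
    type: Directory
    outputBinding:
      # The directory name, relative to the tool's output directory.
      glob: results
```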
Exercise 3
Since fastqc can be unpredictable in its outputs and file naming, create a CWL outputs field using a Directory that captures all the files in a single output variable.
Solution
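A sketch of bulk capture, assuming the runner accepts a glob of "." to refer to the tool's output directory itself (a commonly used idiom, but verify it with your runner). The output name all_outputs is hypothetical.

```yaml
outputs:
  all_outputs:
    type: Directory
    outputBinding:
      # Assumption: "." resolves to the output directory, so everything
      # the tool wrote there is captured in one Directory object.
      glob: .
```

This is convenient for debugging, but it hides which files downstream steps actually depend on, so explicit outputs are usually preferable in a finished workflow.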
Exercise 4
Your colleague says that he is running samtools index in CWL, but the index is not being output. Fix the following CWL to output the index along with the bam as a secondaryFile.

cwlVersion: v1.0
class: CommandLineTool
requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.bam)
baseCommand: [samtools, index]
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)
outputs:
  output_bam_and_index:
    type: File
    outputBinding:
      glob: $(inputs.bam.basename)

Solution
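A sketch of one possible fix, assuming samtools index writes its index next to the staged BAM with .bai appended to the full file name (its default for BAM input). Adding secondaryFiles to the output tells CWL to collect that file alongside the primary one.

```yaml
cwlVersion: v1.0
class: CommandLineTool
requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.bam)
baseCommand: [samtools, index]
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)
outputs:
  output_bam_and_index:
    type: File
    # ".bai" is appended to the primary file's name, e.g. sample.bam.bai.
    secondaryFiles:
      - .bai
    outputBinding:
      glob: $(inputs.bam.basename)
```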
Exercise 5
What if InitialWorkDirRequirement was not used, and the index file was created where the input bam was located? How would you capture the output? Create the outputs field using the same CWL as in Exercise 4.
Solution
Key Points
First key point. Brief Answer to questions. (FIXME)