Introduction
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is Common Workflow Language?
How are CWL workflows written?
How do CWL workflows compare to shell workflows?
What are the advantages of using CWL workflows?
Objectives
First learning objective. (FIXME)
FIXME
Key Points
First key point. Brief Answer to questions. (FIXME)
Shell to CWL workflow conversion
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is the difference between a CWL tool description and a CWL workflow?
How can we create a tool descriptor?
How can we use this in a single step workflow?
How can we expand to a multi-step workflow?
Objectives
dataflow objectives:
explain the difference between a CWL tool description and a CWL workflow
describe the relationship between a tool and its corresponding CWL document
exercise good practices when naming inputs and outputs
Be able to make understandable and valid names for inputs and outputs (not ‘input3’)
describing requirements objectives:
identify all the requirements of a tool and define them in the tool description
use runtime parameters to access information about the runtime environment
define environment variables necessary for execution
use secondaryFiles or InitialWorkDirRequirement to access files in the same directory as another referenced file
use $(runtime.cores) to define the number of cores to be used
use type: File, instead of a string, to reference a filepath
converting shell to cwl objectives:
identify tasks, and data links in a script
recognize loops that can be converted into scatters
finding and reusing existing CWL command line tool descriptions
capturing outputs objectives:
explain that only files explicitly mentioned in a description will be included in the output of a step/workflow
implement bulk capturing of all files produced by a step/workflow for debugging purposes
use STDIN and STDOUT as input and output
capture output written to a specific directory, the working directory, or the same directory where input is located
By the end of this episode, learners should be able to explain how a workflow document describes the input and output of a workflow and the flow of data between tools; describe all the requirements for running a tool; define the files that will be included as output of a workflow; and convert a shell script into a CWL workflow.
Tool Descriptors
Use https://github.com/bcosc/fast_genome_variants/blob/main/README.md to create a CommandLineTool
Exercise 1
Create the baseCommand for running the joint_haplotype caller using the fast_genome_variants README.
Solution
The base command should use the path to the binary and the type of variants you’re calling.
baseCommand: [fgv, joint_haplotype]
Exercise 2:
When working in a cloud environment, you need to specify what machine type you would like to run on. Which means the job has to have specific parameters describing the RAM, Cores and Disk space (for both temporary and output files) it requires.
Create the ResourceRequirement field for running 2 BAMs for the fgv joint_haplotype command.
Solution:
requirements:
  ResourceRequirement:
    ramMin: 4000
    coresMin: 2
FGV requires 2 GiB of memory for each BAM input, and the unit for ramMin is MiB, so we need approximately 4000 MiB to meet the requirement (strictly, 2 × 2048 = 4096 MiB). FGV also requires 1 core for each BAM, so here we ask for at least 2 cores.
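The exercise text also mentions disk space for temporary and output files. As a minimal sketch, those could be declared next to RAM and cores with the tmpdirMin and outdirMin fields of ResourceRequirement, which are also expressed in MiB. The disk figures below are hypothetical placeholders, not values taken from the fast_genome_variants README:

```yaml
requirements:
  ResourceRequirement:
    ramMin: 4000       # MiB of memory
    coresMin: 2        # one core per BAM
    tmpdirMin: 10000   # MiB of temporary space; hypothetical value
    outdirMin: 10000   # MiB for output files; hypothetical value
```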
Exercise 3:
- Create the inputs field for running fgv joint_haplotype
- Add an optional flag for calling a GVCF output
- Add a string input for intervals, such as chr2:1-10000
- Add an output name.
Solution:
inputs:
  bam:
    type: File[]
    inputBinding:
      position: 1
      prefix: -bam
    secondaryFiles:
      - .bai
  gvcf:
    type: boolean
    inputBinding:
      position: 2
      prefix: -gvcf
  interval:
    type: string
    inputBinding:
      position: 3
  output_name:
    type: string
    inputBinding:
      position: 4
      prefix: -o
Exercise 4:
Create the output variable for the CommandLineTool and name it output_vcf.
Solution:
outputs:
  output_vcf:
    type: File
    outputBinding:
      glob: $(inputs.output_name)
Exercise 5:
TODO
Solution:
Capturing Output
Exercise 1
Using this CommandLineTool description, what files would be the output of the tool?
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [bwa, mem]
inputs:
  reference:
    type: File
    inputBinding:
      position: 1
  fastq_reads:
    type: File[]
    inputBinding:
      position: 2
stdout: output.sam
outputs:
  output:
    type: File
    outputBinding:
      glob: output.sam
Solution
output.sam will be the only file output. Only files explicitly listed in the outputs field are included in the output of the step.
Exercise 2
Your colleague tells you to run fastqc, which creates several files describing the quality of the data. For now, let’s assume the tool creates three files:
final_report_fastqc.html
final_figures_fastqc.zip
supplemental_figures_fastqc.html
Create a CWL outputs field using a File array that captures all the fastqc files in a single output variable.
Solution
outputs:
  output:
    type: File[]
    outputBinding:
      glob: "*fastqc*"
Actually, fastqc may create more than 3 of these files. Depending on the input parameters you give it, it may create a results directory that contains additional files such as results/fastqc_per_base_content.html and results/fastqc_per_base_gc_content.html.
Create a CWL outputs field that captures results/fastqc_per_base_content.html and results/fastqc_per_base_gc_content.html in separate output variables.
Solution
outputs:
  per_base_content:
    type: File
    outputBinding:
      glob: "results/fastqc_per_base_content.html"
  per_base_gc_content:
    type: File
    outputBinding:
      glob: "results/fastqc_per_base_gc_content.html"
Finally, instead of explicitly defining each file to be captured, create a CWL outputs field that captures the entire results directory.
Solution
outputs:
  results:
    type: Directory
    outputBinding:
      glob: "results"
Exercise 3
Since fastqc can be unpredictable in its outputs and file naming, create a CWL outputs field using a Directory that captures all the files in a single output variable.
Solution
outputs:
  output:
    type: Directory
    outputBinding:
      glob: .
Exercise 4
Your colleague says that he is running samtools index in CWL, but the index is not being output. Fix the following CWL to output the index along with the bam as a secondaryFile.
cwlVersion: v1.0
class: CommandLineTool
requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.bam)
baseCommand: [samtools, index]
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)
outputs:
  output_bam_and_index:
    type: File
    outputBinding:
      glob: $(inputs.bam.basename)
Solution
cwlVersion: v1.0
class: CommandLineTool
requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.bam)
baseCommand: [samtools, index]
inputs:
  bam:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)
outputs:
  output_bam_and_index:
    type: File
    secondaryFiles:
      - .bai
    outputBinding:
      glob: $(inputs.bam.basename)
Exercise 5
What if InitialWorkDirRequirement was not used, and the index file was created where the input bam was located? How would you capture the output? Create the outputs field using the same CWL as in exercise 4.
Solution
outputs:
  output_bam_and_index:
    type: File
    secondaryFiles:
      - .bai
    outputBinding:
      glob: $(inputs.bam)
Key Points
First key point. Brief Answer to questions. (FIXME)
Workflows Design and Development
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Workflows as dependency graphs
How to use sketches for workflow design?
Iterative workflow development
Adding your own script as a step to a workflow
Objectives
graph objectives:
explain that a workflow is a dependency graph
sketch objectives:
use cwlviewer online
generate Graphviz diagram using cwltool
exercise with the printout of a simple workflow; draw arrows on code; hand draw a graph on another sheet of paper
iterate objectives:
recognise that workflow development can be iterative i.e. that it doesn’t have to happen all at once
By the end of this episode, learners should be able to explain that a workflow is a dependency graph; sketch their workflow, both by hand and with an automated visualizer; and recognise that workflow development can be iterative, i.e. that it doesn’t have to happen all at once.
Key Points
First key point. Brief Answer to questions. (FIXME)
Documentation and Citation in Workflows
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How to document your workflow?
How to cite research software in your workflow?
Objectives
Documentation Objectives:
explain the importance of documenting a workflow
use description fields to document purpose, intent, and other factors at multiple levels within their workflow
recognise when it is appropriate to include this documentation
Citation Objectives:
explain the importance of correctly citing research software
give credit for all the tools used in their workflow(s)
By the end of this episode, learners should be able to document their workflows to increase reusability and explain the importance of correctly citing research software.
TODO (CITE): define some specific objectives to capture the skills being taught in this section.
See this page.
Finding an identifier for the tool
(Something about permanent identifiers insert here)
When your workflow uses a pre-existing command line tool, it is good practice to provide a citation for the tool, beyond the command line used to execute it.
The SoftwareRequirement hint can list named packages that should be installed in order to run the tool.
So, for instance, if you installed the tool using the package management system with apt install bamtools, the package bamtools can be cited in CWL as:
hints:
  SoftwareRequirement:
    packages:
      bamtools: {}
Adding version
Q: bamtools --version prints out blablabla 2.3.1 - how would you indicate in CWL that this is the version of BAMTools the workflow was tested against?
A:
hints:
  SoftwareRequirement:
    packages:
      bamtools:
        version: ["2.3.1"]
Adding Permanent identifiers
To help identify the tool across package management systems we can also add permanent identifiers and URLs, for instance to:
- RRID to SciCrunch
- bio.tools registration
- DOI to a publication
- Homepage
- source repository (e.g. GitHub)
These can be added to the specs list:
hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829" ]
        version: [ "5.21-60" ]
How to find a RRID permanent identifier
RRID provides identifiers for many commonly used resources and tools in bioinformatics. For instance, a search for BAMtools finds an entry for BAMtools with identifier RRID:SCR_015987 and additional information.
We can transform the RRID into a Permanent Identifier (PID) for use in CWL using http://identifiers.org/ by appending the RRID to https://identifiers.org/rrid/ - making the PID https://identifiers.org/rrid/RRID:SCR_015987, which we see resolves to the same SciCrunch entry and which we add to our specs list:
hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_015987" ]
Note that, because CWL is based on YAML, we use "quotes" to escape these identifiers, since they include the : character.
Finding bio.tools identifiers
As an alternative to RRID we can add identifiers from the ELIXIR Tools Registry https://bio.tools/ - for instance https://bio.tools/bamtools
hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs:
          - "https://identifiers.org/rrid/RRID:SCR_015987"
          - "https://bio.tools/bamtools"
- How to write a DOI as a PID URI https://www.nature.com/articles/nmeth.1923 -> https://doi.org/ + 10.1038/nmeth.1923 -> https://doi.org/10.1038/nmeth.1923
Package manager identifiers
Q: You have used apt install bamtools in the Linux distribution Debian 10.8 “Buster”. How would you identify the Debian package recipe in a CWL SoftwareRequirement, and with which version?
A:
hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs:
          - "https://identifiers.org/rrid/RRID:SCR_015987"
          - "https://bio.tools/bamtools"
          - "https://packages.debian.org/buster/bamtools"
        version: ["2.5.1", "2.5.1+dfsg-3"]
This package repository has a URI for each installable package, depending on the distribution; here we pick "buster". While the upstream GitHub repository of bamtools has release version v2.5.1, the Debian packaging adds +dfsg-3 to indicate the 3rd repackaging with additional patches, in this case to make the software comply with the Debian Free Software Guidelines (dfsg).
Under the version list in CWL we’ll include 2.5.1, which is the upstream version, ignoring everything after + or - according to semantic versioning rules. As an optional extra you can also include the Debian-specific version "2.5.1+dfsg-3" to indicate which particular packaging we tested the workflow with at the time.
Exercise: There is an “obvious” DOI
Q: You have a workflow using bowtie2, how would you add a citation?
A:
hints:
  SoftwareRequirement:
    packages:
      bowtie2:
        specs: [ "https://doi.org/10.1038/nmeth.1923" ]
        version: [ "1.x.x" ]
RRID for bowtie2
RRID:SCR_005476 -> https://scicrunch.org/resolver/RRID:SCR_005476 #bowtie not bowtie2 https://identifiers.org/rrid/ + RRID -> https://identifiers.org/rrid/RRID:SCR_005476 PID
https://bio.tools/bowtie2
http://bioconda.github.io/recipes/bowtie2/README.html vs. https://anaconda.org/bioconda/bowtie2
Giving clues to reader
Authorship/citation of a tool vs the CWL file itself (particularly of a workflow)
Add identifiers under requirements? https://www.commonwl.org/user_guide/20-software-requirements/index.html
SciCrunch - looking up RRID for Bowtie2 Then bio.tools
hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829",
                 "http://somethingelse"]
        version: [ "5.21-60" ]
Trickier: Only Github and homepage
s:codeRepository:
hints:
  SoftwareRequirement:
    packages:
      bowtie2:
        specs: [ "https://github.com/BenLangmead/bowtie2"]
        version: [ "fb688f7264daa09dd65fdfcb9d0f008a7817350f" ]
No version, add commit ID or date instead as version
–> (How to make Your own tool citable?)
Getting credit for your CWL files
NOTE: Difference between credit for this CWL file vs credit for the tool it calls.
s:author: "Me"
s:dateModified: "2020-10-06"
s:version: "2.4.2"
s:license: https://spdx.org/licenses/GPL-3.0
https://www.commonwl.org/user_guide/17-metadata/index.html
Using s:citation? Something like:
s:citation: https://dx.doi.org/10.1038/nmeth.1923
s:url: http://example.com/tools/
s:codeRepository: https://github.com/BenLangmead/bowtie2
$namespaces:
  s: https://schema.org/
$schemas:
  - http://schema.org/version/9.0/schemaorg-current-http.rdf
—> Need new guidance on how to publish workflows, making DOIs in Zenodo, Dockstore etc. https://docs.bioexcel.eu/cwl-best-practice-guide/devpractice/publishing.html https://guides.github.com/activities/citable-code/
How to do it properly to improve findability.
How to publicize CWL tools
CWL workflow descriptions
About how to wire together CommandLineTool steps in a cwl Workflow file.
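As a rough sketch of that wiring, the workflow below links two CommandLineTool steps. It assumes the bwa mem description from the Capturing Output exercises is saved as bwa_mem.cwl, and that a second tool description, samtools_sort.cwl, exists with an input named input_sam and an output named sorted_bam. Both file names and the second tool's parameter names are hypothetical:

```yaml
cwlVersion: v1.0
class: Workflow

inputs:
  reference: File
  fastq_reads: File[]

outputs:
  sorted_bam:
    type: File
    outputSource: sort/sorted_bam   # workflow output taken from the last step

steps:
  align:
    run: bwa_mem.cwl                # hypothetical file name
    in:
      reference: reference          # step input <- workflow input
      fastq_reads: fastq_reads
    out: [output]

  sort:
    run: samtools_sort.cwl          # hypothetical tool description
    in:
      input_sam: align/output       # step input <- output of the previous step
    out: [sorted_bam]
```

Each entry under in names a step input on the left and its source on the right, either a workflow input or a stepname/outputname reference. These links are what define the dependency graph.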
Key Points
First key point. Brief Answer to questions. (FIXME)
Solving more complex problems with scripts and toolkits
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How to include and run a script in a step at runtime?
Which requirements need to be specified?
How to capture output of a script?
How to find other tools/solutions for awkward problems?
Objectives
script objectives:
Include and run a script in a step at runtime
Capture output of a script
tools objectives:
know good resources for finding solutions to common problems
By the end of this episode, learners should be able to include and run their own script in a step at runtime and be aware of where they can look for CWL recipes and more help for common, but awkward, tasks.
Exercise 1:
Which Requirement from the following options is used to create a script at runtime?
A. InlineJavascriptRequirement
B. InitialWorkDirRequirement
C. ResourceRequirement
D. DockerRequirement
Solution
B. InitialWorkDirRequirement
Exercise 2:
Using the template below, add the missing instructions so that a script named script.sh with the specified contents is created at runtime.
InitialWorkDirRequirement:
  listing:
    - ------: script.sh
      ------: |
        echo "*Documenting input*" && \
        echo "Input received: $(inputs.message)" && \
        echo "Exit"
inputs:
  message:
    type: string
Solution:
InitialWorkDirRequirement:
  listing:
    - entryname: script.sh
      entry: |
        echo "*Documenting input*" && \
        echo "Input received: $(inputs.message)" && \
        echo "Exit"
inputs:
  message:
    type: string
Exercise 3:
Since we are using echo in the script (as shown below), what is the appropriate type in the outputs section of the following code block to capture standard output?
class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"
inputs:
  message:
    type: string
stdout: "message.txt"
outputs:
  message:
    type: ----
Your options are: A. File B. Directory C. stdout D. string
Solution:
C. stdout
Exercise 4:
Fix the baseCommand in the following code snippet to execute the script we have created in previous exercises.
baseCommand: []
Solution:
baseCommand: [ sh, script.sh ]
Exercise 5:
CHALLENGE question. Extend the outputs section of the following CWL tool definition to capture the script we have created along with the tool’s standard output. This will help you inspect the generated script and is useful in more complex situations to troubleshoot related issues.
class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"
inputs:
  message:
    type: string
stdout: "message.txt"
baseCommand: ["sh", "script.sh"]
outputs:
  message:
    type: stdout
Solution:
class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"
inputs:
  message:
    type: string
stdout: "message.txt"
baseCommand: ["sh", "script.sh"]
outputs:
  message:
    type: stdout
  script:
    type: File
    outputBinding:
      glob: "script.sh"
Key Points
First key point. Brief Answer to questions. (FIXME)
Scatter/Gather in a CWL Workflow
Overview
Teaching: 5 min
Exercises: 0 min
Questions
How can I parallelize running of tools in a workflow?
Objectives
Learn how to scatter steps of a CWL workflow.
This workflow uses a real life example of open-source CWL CommandLineTools from Dockstore.
CWL Workflows can scatter array inputs to run in parallel within the same step and gather all the outputs together to be used in subsequent steps. It’s similar to GNU parallel in that it will execute all jobs in parallel using a single script and multiple input arguments.
Add the ScatterFeatureRequirement to the requirements section of a workflow to enable scattering in its steps.
requirements:
  ScatterFeatureRequirement: {}
Exercise 1
In this workflow, we have a job that uses GATK Haplotype Caller. Currently, the job is configured to run on a single chromosome. How would you parallelize GATK_HaplotypeCaller to run all chromosomes simultaneously?
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  bam: File
  chromosome: string
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      intervals: chromosome
    out: [vcf]
Solution
Given that we need to scatter on an array of chromosomes, we changed the chromosome input variable to a string array. Following that, in the steps section, a scatter on the intervals input is added so that GATK HaplotypeCaller will run on each chromosome. Finally, in the outputs section, the final output becomes an array of VCFs.
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  bam: File
  chromosomes: string[]
outputs:
  HaplotypeCaller_VCFs:
    type: File[]
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    scatter: [intervals]
    in:
      input_bam: bam
      intervals: chromosomes
    out: [vcf]
The workflow engine will take the array of chromosome strings and run GATK HaplotypeCaller once per chromosome, passing each string to its intervals input option and creating one VCF per chromosome. These jobs can run in parallel, on as many compute nodes as are available.
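For illustration, a job (input object) file for the scattered workflow might look like the sketch below. The file name, the BAM path and the chromosome names are all hypothetical; only the input names bam and chromosomes come from the solution above:

```yaml
# scatter_job.yml (hypothetical file name)
bam:
  class: File
  path: sample.bam                   # hypothetical path to the input BAM
chromosomes: [chr1, chr2, chr3]      # one HaplotypeCaller job per entry
```

With three entries in chromosomes, the step would be expected to run three times, and HaplotypeCaller_VCFs would collect three VCF files in the same order as the input array.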
TODO: Add diagram of how the job will run
Key Points
Use
Debugging workflows
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
interpret commonly encountered error messages
solve these common issues
By the end of this episode, learners should be able to recognize and fix simple bugs in their workflow code.
(non-exhaustive) list of possible examples:
- YAML errors
- “wiring errors” e.g. where is the output from my step?
- type mismatch
- array vs single-item mismatch
- no formats on input but format is required by workflow
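As one possible illustration of the array versus single-item mismatch, the sketch below wires a File[] workflow input into a step whose bam input is assumed to be a single File. The tool file name samtools_index.cwl is hypothetical, standing in for the samtools index description from the Capturing Output exercises. Without the scatter line, a validator would be expected to reject the link because File[] is not compatible with File; scattering over the step input is one way to resolve it:

```yaml
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}    # required once a step uses scatter
inputs:
  bams: File[]                     # an array of BAM files
outputs:
  indexed:
    type: File[]                   # scattering turns the step output into an array
    outputSource: index/output_bam_and_index
steps:
  index:
    run: samtools_index.cwl        # hypothetical file; its bam input has type File
    scatter: bam                   # the fix: run the step once per array element
    in:
      bam: bams
    out: [output_bam_and_index]
```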
Key Points
First key point. Brief Answer to questions. (FIXME)