This lesson is still being designed and assembled (Pre-Alpha version)

Introduction to Workflows with Common Workflow Language

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is Common Workflow Language?

  • How are CWL workflows written?

  • How do CWL workflows compare to shell workflows?

  • What are the advantages of using CWL workflows?

Objectives
  • First learning objective. (FIXME)

FIXME

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Shell to CWL workflow conversion

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is the difference between a CWL tool description and a CWL workflow?

  • How can we create a tool descriptor?

  • How can we use this in a single step workflow?

  • How can we expand to a multi-step workflow?

Objectives
  • dataflow objectives:

  • explain the difference between a CWL tool description and a CWL workflow

  • describe the relationship between a tool and its corresponding CWL document

  • exercise good practices when naming inputs and outputs, choosing understandable and valid names (not ‘input3’)

  • describing requirements objectives:

  • identify all the requirements of a tool and define them in the tool description

  • use runtime parameters to access information about the runtime environment

  • define environment variables necessary for execution

  • use secondaryFiles or InitialWorkDirRequirement to access files in the same directory as another referenced file

  • use $(runtime.cores) to define the number of cores to be used

  • use type: File, instead of a string, to reference a filepath

  • converting shell to cwl objectives:

  • identify tasks, and data links in a script

  • recognize loops that can be converted into scatters

  • finding and reusing existing CWL command line tool descriptions

  • capturing outputs objectives:

  • explain that only files explicitly mentioned in a description will be included in the output of a step/workflow

  • implement bulk capturing of all files produced by a step/workflow for debugging purposes

  • use STDIN and STDOUT as input and output

  • capture output written to a specific directory, the working directory, or the same directory where input is located

By the end of this episode, learners should be able to explain how a workflow document describes the inputs and outputs of a workflow and the flow of data between tools, describe all the requirements for running a tool, define the files that will be included as output of a workflow, and convert a shell script into a CWL workflow.

Tool Descriptors

Use the fast_genome_variants README (https://github.com/bcosc/fast_genome_variants/blob/main/README.md) to create a CommandLineTool description.
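
Before working through the exercises, it helps to keep the overall shape of a CommandLineTool document in mind. The sketch below is a generic skeleton only; the field values are placeholders, not part of the FGV tool:

cwlVersion: v1.0
class: CommandLineTool

# the command to run, split into its parts
baseCommand: [tool, subcommand]

inputs: {}   # the tool's parameters and how they map onto the command line

outputs: {}  # the files to capture once the command finishes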

Exercise 1

Create the baseCommand for running the joint_haplotype caller using the fast_genome_variants README.

Solution

The base command combines the fgv binary with the subcommand for the type of variant calling you want to perform.

baseCommand: [fgv, joint_haplotype]

Exercise 2:

When working in a cloud environment, you need to specify what machine type you would like to run on, which means the job has to have specific parameters describing the RAM, cores, and disk space (for both temporary and output files) it requires.

Create the ResourceRequirement field for running 2 BAMs through the fgv joint_haplotype command.

Solution:

requirements:
  ResourceRequirement:
    ramMin: 4000
    coresMin: 2

FGV requires 2 GiB of memory for each BAM input, and ramMin is expressed in MiB, so for two BAMs we request approximately 4000 MiB. FGV also requires 1 core for each BAM, so here we ask for at least 2 cores.
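
If the number of BAMs varies between runs, the request can be derived from the input instead of being hard-coded. The following is a sketch only, assuming the BAM input is declared as a File array named bam (as in the next exercise); the arithmetic requires InlineJavascriptRequirement:

requirements:
  InlineJavascriptRequirement: {}
  ResourceRequirement:
    ramMin: $(inputs.bam.length * 2048)   # roughly 2 GiB (2048 MiB) per BAM
    coresMin: $(inputs.bam.length)        # 1 core per BAM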

Exercise 3:

  1. Create the inputs field for running fgv joint_haplotype
  2. Add an optional flag for producing GVCF output
  3. Add a string input for the intervals chr2:1-10000
  4. Add an output name.

Solution:

inputs:
  bam:
    type: File[]
    inputBinding:
      position: 1
      prefix: -bam
    secondaryFiles:
      - .bai
  gvcf:
    type: boolean?
    inputBinding:
      position: 2
      prefix: -gvcf
  interval:
    type: string
    inputBinding:
      position: 3
  output_name:
    type: string
    inputBinding:
      position: 4
      prefix: -o

Exercise 4:

Create the output variable for the CommandLineTool and name it output_vcf.

Solution:

outputs:
  output_vcf:
    type: File
    outputBinding:
      glob: $(inputs.output_name)
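
With the inputs and outputs in place, the tool can be run with a small job file that supplies a value for each input. The following is a sketch with hypothetical file names:

bam:
  - class: File
    path: sample1.bam
  - class: File
    path: sample2.bam
gvcf: true
interval: "chr2:1-10000"
output_name: "joint_calls.vcf"

Because the bam input lists .bai under secondaryFiles, the matching index files (sample1.bam.bai, sample2.bam.bai) are expected to sit next to each BAM.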

Exercise 5:

TODO

Solution:

Capturing Output

Exercise 1

Using this CommandLineTool description, what files would be the output of the tool?

cwlVersion: v1.0
class: CommandLineTool

baseCommand: [bwa, mem]

inputs:
  reference:
    type: File
    inputBinding:
      position: 1
  fastq_reads:
    type: File[]
    inputBinding:
      position: 2

stdout: output.sam

outputs:
  output:
    type: File
    outputBinding:
      glob: output.sam

Solution

output.sam will be the only file captured. Only files explicitly stated in the outputs field will be included in the output of the step.
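
CWL also offers a shorthand for this pattern: declaring the output type as stdout captures whatever the tool writes to standard output without a separate glob. A sketch of the same outputs section using the shortcut:

stdout: output.sam

outputs:
  output:
    type: stdout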

Exercise 2

Your colleague tells you to run fastqc, which creates several files describing the quality of the data. For now, let’s assume the tool creates three files:

  • final_report_fastqc.html
  • final_figures_fastqc.zip
  • supplemental_figures_fastqc.html

Create a CWL outputs field using a File array that captures all the fastqc files in a single output variable.

Solution

outputs:
  output:
    type: File[]
    outputBinding:
      glob: "*fastqc*"

In fact, fastqc may create more than three such files. Depending on the input parameters you give it, it may also create a results directory that contains additional files such as results/fastqc_per_base_content.html and results/fastqc_per_base_gc_content.html.

Create a CWL outputs field that captures the results/fastqc_per_base_content.html and results/fastqc_per_base_gc_content.html in separate output variables.

Solution

outputs:
  per_base_content:
    type: File
    outputBinding:
      glob: "results/fastqc_per_base_content.html"
  per_base_gc_content:
    type: File
    outputBinding:
      glob: "results/fastqc_per_base_gc_content.html"

Finally, instead of explicitly defining each file to be captured, create a CWL outputs field that captures the entire results directory.

Solution

outputs:
  results:
    type: Directory
    outputBinding:
      glob: "results"

Exercise 3

Since fastqc can be unpredictable in its outputs and file naming, create a CWL outputs field using a Directory that captures all the files in a single output variable.

Solution

outputs:
  output:
    type: Directory
    outputBinding:
      glob: .

Exercise 4

Your colleague says that they are running samtools index in CWL, but the index is not appearing in the output. Fix the following CWL so that the index is output along with the BAM as a secondaryFile.

cwlVersion: v1.0
class: CommandLineTool

requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.bam)

baseCommand: [samtools, index]

inputs:
  bam:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)

outputs:
  output_bam_and_index:
    type: File
    outputBinding:
      glob: $(inputs.bam.basename)

Solution

cwlVersion: v1.0
class: CommandLineTool

requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.bam)

baseCommand: [samtools, index]

inputs:
  bam:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)

outputs:
  output_bam_and_index:
    type: File
    secondaryFiles:
      - .bai
    outputBinding:
      glob: $(inputs.bam.basename)

Exercise 5

What if InitialWorkDirRequirement was not used, and the index file was created where the input BAM was located? How would you capture the output? Create the outputs field using the same CWL as in Exercise 4.

Solution

outputs:
  output_bam_and_index:
    type: File
    secondaryFiles:
      - .bai
    outputBinding:
      glob: $(inputs.bam)

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Workflows Design and Development

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How is a workflow a dependency graph?

  • How to use sketches for workflow design?

  • How can workflow development be iterative?

  • How to add your own script as a step in a workflow?

Objectives
  • graph objectives:

  • explain that a workflow is a dependency graph

  • sketch objectives:

  • use cwlviewer online

  • generate Graphviz diagram using cwltool

  • exercise with the printout of a simple workflow; draw arrows on code; hand draw a graph on another sheet of paper

  • iterate objectives:

  • recognise that workflow development can be iterative i.e. that it doesn’t have to happen all at once

By the end of this episode, learners should be able to explain that a workflow is a dependency graph, sketch their workflow both by hand and with an automated visualizer, and recognise that workflow development can be iterative, i.e. that it doesn't have to happen all at once.
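
For example, the sketch below wires two tool descriptions into a small workflow; each in/outputSource reference is an edge in the dependency graph. The file names bwa_mem.cwl and samtools_index.cwl are assumptions based on the tools from the previous episode, and the SAM-to-sorted-BAM conversion step is omitted to keep the graph small:

cwlVersion: v1.0
class: Workflow

inputs:
  reference: File
  fastq_reads: File[]

outputs:
  indexed_bam:
    type: File
    outputSource: index/output_bam_and_index   # edge: index step -> workflow output

steps:
  align:
    run: bwa_mem.cwl
    in:
      reference: reference          # edge: workflow input -> align
      fastq_reads: fastq_reads
    out: [output]
  index:
    run: samtools_index.cwl
    in:
      bam: align/output             # edge: align -> index
    out: [output_bam_and_index]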

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Documentation and Citation in Workflows

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How to document your workflow?

  • How to cite research software in your workflow?

Objectives
  • Documentation Objectives:

  • explain the importance of documenting a workflow

  • use description fields to document purpose, intent, and other factors at multiple levels within their workflow

  • recognise when it is appropriate to include this documentation

  • Citation Objectives:

  • explain the importance of correctly citing research software

  • give credit for all the tools used in their workflow(s)

By the end of this episode, learners should be able to document their workflows to increase reusability and explain the importance of correctly citing research software.
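
As a taste of what this looks like, CWL documents can carry label and doc fields at several levels - on the whole document, on individual inputs and outputs, and on workflow steps. Below is a sketch of part of a tool description with documentation at two of those levels; only a fragment of the tool is shown:

class: CommandLineTool
label: FGV joint haplotype caller
doc: |
  Joint variant calling across a cohort of BAM files using
  fast_genome_variants; see the tool README for details.

inputs:
  bam:
    type: File[]
    doc: Aligned and indexed BAM files, one per sample.
    inputBinding:
      position: 1
      prefix: -bam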

TODO (CITE): define some specific objectives to capture the skills being taught in this section.

See this page.

Finding an identifier for the tool

(Something about permanent identifiers insert here)

When your workflow uses a pre-existing command line tool, it is good practice to provide a citation for the tool, beyond just the command line it is executed with.

The SoftwareRequirement hint can list named packages that should be installed in order to run the tool. For instance, if you installed the tool with the package management system using apt install bamtools, the package bamtools can be cited in CWL as:

hints:
  SoftwareRequirement:
    packages:
      bamtools: {}

Adding version

Q: bamtools --version prints out (among other text) 2.3.1 - how would you indicate in CWL that this is the version of BAMTools the workflow was tested against?

A:

hints:
  SoftwareRequirement:
    packages:
      bamtools:
        version: ["2.3.1"]

Adding Permanent identifiers

To help identify the tool across package management systems we can also add permanent identifiers and URLs, for instance RRIDs resolved through identifiers.org, bio.tools entries, or package repository pages. These can be added to the specs list:

hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829" ]
        version: [ "5.21-60" ]

How to find a RRID permanent identifier

RRID provides identifiers for many commonly used resources and tools in bioinformatics. For instance, a search for BAMtools finds an entry with identifier RRID:SCR_015987 and additional information.

We can transform the RRID into a Permanent Identifier (PID) using http://identifiers.org/ by appending the RRID to https://identifiers.org/rrid/, giving the PID https://identifiers.org/rrid/RRID:SCR_015987, which resolves to the same SciCrunch entry. We can then add it to our specs list:

hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_015987" ]

Note that as CWL is based on YAML, we use "quotes" around these identifiers because they include the : character.

Finding bio.tools identifiers

As an alternative to RRID, we can add identifiers from the ELIXIR Tools Registry https://bio.tools/ - for instance https://bio.tools/bamtools:

hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs:
          - "https://identifiers.org/rrid/RRID:SCR_015987"
          - "https://bio.tools/bamtools"

Package manager identifiers

Q: You have used apt install bamtools in the Linux distribution Debian 10.8 “Buster”. How would you identify the Debian package recipe, and its version, in a CWL SoftwareRequirement?

A:

hints:
  SoftwareRequirement:
    packages:
      bamtools:
        specs:
          - "https://identifiers.org/rrid/RRID:SCR_015987"
          - "https://bio.tools/bamtools"
          - "https://packages.debian.org/buster/bamtools"
        version: ["2.5.1", "2.5.1+dfsg-3"]

The Debian package repository has a URI for each installable package, depending on the distribution; here we pick "buster". While the upstream GitHub repository of bamtools has release version v2.5.1, the Debian packaging adds +dfsg-3 to indicate the third repackaging with additional patches, in this case to make the software comply with the Debian Free Software Guidelines (dfsg).

Under the version list in CWL we include 2.5.1, the upstream version, ignoring everything after + or - according to semantic versioning rules. As an optional extra, you can also include the Debian-specific version "2.5.1+dfsg-3" to indicate which particular packaging the workflow was tested with at the time.

Exercise: There is an “obvious” DOI

Q: You have a workflow using bowtie2, how would you add a citation?

A:

hints:
  SoftwareRequirement:
    packages:
      bowtie2:
        specs: [ "https://doi.org/10.1038/nmeth.1923" ]
        version: [ "1.x.x" ]

RRID for bowtie2

  • RRID:SCR_005476 -> https://scicrunch.org/resolver/RRID:SCR_005476 (note: this entry is for bowtie, not bowtie2)

  • PID: https://identifiers.org/rrid/ + RRID -> https://identifiers.org/rrid/RRID:SCR_005476

  • https://bio.tools/bowtie2

  • http://bioconda.github.io/recipes/bowtie2/README.html vs. https://anaconda.org/bioconda/bowtie2

Giving clues to the reader

  • Authorship/citation of a tool vs. the CWL file itself (particularly of a workflow)

  • Add identifiers under requirements? https://www.commonwl.org/user_guide/20-software-requirements/index.html

  • SciCrunch - looking up the RRID for Bowtie2, then bio.tools

hints:
  SoftwareRequirement:
    packages:
      interproscan:
        specs: [ "https://identifiers.org/rrid/RRID:SCR_005829",
                 "http://somethingelse"]
        version: [ "5.21-60" ]

Trickier: Only GitHub and homepage

Sometimes a tool has no registry entry or DOI, only a GitHub repository and a homepage. In that case the repository URL can be used in specs (or recorded as s:codeRepository, see below):

hints:
  SoftwareRequirement:
    packages:
      bowtie2:
        specs: [ "https://github.com/BenLangmead/bowtie2" ]
        version: [ "fb688f7264daa09dd65fdfcb9d0f008a7817350f" ]

If there is no release version, use the commit ID or a date as the version instead.

-> (How to make your own tool citable?)

Getting credit for your CWL files

NOTE: Difference between credit for this CWL file vs credit for the tool it calls.

s:author "Me"
s:dateModified: "2020-10-6"
s:version: "2.4.2"
s:license: https://spdx.org/licenses/GPL-3.0

https://www.commonwl.org/user_guide/17-metadata/index.html

Using s:citation?

something like:

s:citation: https://dx.doi.org/10.1038/nmeth.1923

s:url: http://example.com/tools/

s:codeRepository: https://github.com/BenLangmead/bowtie2
$namespaces:
  s: https://schema.org/

$schemas:
 - http://schema.org/version/9.0/schemaorg-current-http.rdf
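
Putting these fragments together, the end of a tool description could carry a metadata block like the sketch below (the author name and versions are illustrative only):

s:author: "Jane Doe"             # illustrative author name
s:dateModified: "2020-10-06"
s:version: "2.4.2"
s:license: https://spdx.org/licenses/GPL-3.0
s:citation: https://dx.doi.org/10.1038/nmeth.1923
s:codeRepository: https://github.com/BenLangmead/bowtie2

$namespaces:
  s: https://schema.org/

$schemas:
  - http://schema.org/version/9.0/schemaorg-current-http.rdf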

-> Need new guidance on how to publish workflows, making DOIs in Zenodo, Dockstore, etc. See https://docs.bioexcel.eu/cwl-best-practice-guide/devpractice/publishing.html and https://guides.github.com/activities/citable-code/

How to do it properly to improve findability.

How to publicize CWL tools

CWL workflow descriptions

About how to wire together CommandLineTool steps in a CWL Workflow file.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Solving more complex problems with scripts and toolkits

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How to include and run a script in a step at runtime?

  • Which requirements need to be specified?

  • How to capture output of a script?

  • How to find other tools/solutions for awkward problems?

Objectives
  • script objectives:

  • Include and run a script in a step at runtime

  • Capture output of a script

  • tools objectives:

  • know good resources for finding solutions to common problems

By the end of this episode, learners should be able to include and run their own script in a step at runtime, and be aware of where they can look for CWL recipes and more help for common but awkward tasks.

Exercise 1:

Which Requirement from the following options is used to create a script at runtime?

A. InlineJavascriptRequirement
B. InitialWorkDirRequirement
C. ResourceRequirement
D. DockerRequirement

Solution

B. InitialWorkDirRequirement

Exercise 2:

Using the template below, add the missing instructions so that a script named script.sh with the specified contents is created at runtime.

InitialWorkDirRequirement:
  listing:
    - ------: script.sh
      ------: |
        echo "*Documenting input*" && \
        echo "Input received: $(inputs.message)" && \
        echo "Exit"

inputs:
  message:
    type: string

Solution:

InitialWorkDirRequirement:
  listing:
    - entryname: script.sh
      entry: |
        echo "*Documenting input*" && \
        echo "Input received: $(inputs.message)" && \
        echo "Exit"

inputs:
  message:
    type: string

Exercise 3:

Since we are using echo in the script (as shown below), what is the appropriate type in the outputs section of the following code block to capture standard output?

class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"

inputs:
  message:
    type: string

stdout: "message.txt"

outputs:
 message:
   type: ----

Your options are:

A. File
B. Directory
C. stdout
D. string

Solution:

C. stdout

Exercise 4:

Fix the baseCommand in the following code snippet to execute the script we have created in the previous exercises.

baseCommand: []

Solution:

baseCommand: [ sh, script.sh ]

Exercise 5:

CHALLENGE question. Extend the outputs section of the following CWL tool definition to capture the script we have created, along with the tool's standard output.

This will help you inspect the generated script and is useful in more complex situations to troubleshoot related issues.

class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"

inputs:
  message:
    type: string

stdout: "message.txt"
baseCommand: ["sh", "script.sh"]

outputs:
  message:
    type: stdout

Solution:

class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"

inputs:
  message:
    type: string

stdout: "message.txt"
baseCommand: ["sh", "script.sh"]

outputs:
  message:
    type: stdout
  script:
    type: File
    outputBinding:
      glob: "script.sh"

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Scatter/Gather in a CWL Workflow

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • How can I parallelize running of tools in a workflow?

Objectives
  • Learn how to scatter steps of a CWL workflow.

This episode uses a real-life example of open-source CWL CommandLineTools from Dockstore.

CWL workflows can scatter an array input so that a step runs in parallel over each element, then gather all the outputs together to be used in subsequent steps. This is similar to GNU parallel, which executes jobs in parallel using a single script and multiple input arguments.

Use ScatterFeatureRequirement in the requirements section of a workflow to enable scattering of individual steps.

requirements:
   ScatterFeatureRequirement: {}

Exercise 1

In this workflow, we have a job that uses GATK Haplotype Caller. Currently, the job is configured to run on a single chromosome. How would you parallelize GATK_HaplotypeCaller to run all chromosomes simultaneously?

cwlVersion: v1.0
class: Workflow
requirements:
   ScatterFeatureRequirement: {}
inputs:
   bam: File
   chromosome: string
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      intervals: chromosome
    out: [vcf]

Solution

Given that we need to scatter over an array of chromosomes, we change the chromosome input to a string array (chromosomes). Then, in the steps section, we add scatter on the intervals input so GATK HaplotypeCaller runs once per chromosome. Finally, in the outputs section, the final output becomes an array of VCFs.

cwlVersion: v1.0
class: Workflow
requirements:
   ScatterFeatureRequirement: {}
inputs:
   bam: File
   chromosomes: string[]
outputs:
  HaplotypeCaller_VCFs:
    type: File[]
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    scatter: [intervals]
    in:
      input_bam: bam
      intervals: chromosomes
    out: [vcf]

GATK will take the array of chromosome strings and, using its intervals input option, create multiple VCFs, one for each chromosome. All these jobs will run in parallel, spread across as many compute nodes as are available.
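
The gathered outputs can then feed a later step that accepts the whole array as a single input. The sketch below extends the steps section with a hypothetical merge_vcfs.cwl tool that merges a VCF array:

steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    scatter: [intervals]
    in:
      input_bam: bam
      intervals: chromosomes
    out: [vcf]
  Merge_VCFs:
    run: merge_vcfs.cwl                    # hypothetical merging tool
    in:
      vcfs: GATK_HaplotypeCaller/vcf       # receives the gathered File[] of VCFs
    out: [merged_vcf]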

TODO: Add diagram of how the job will run

Key Points

  • Use ScatterFeatureRequirement and scatter to run a workflow step in parallel over an array input.


Debugging workflows

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • interpret commonly encountered error messages

  • solve these common issues

By the end of this episode, learners should be able to recognize and fix simple bugs in their workflow code.

(non-exhaustive) list of possible examples:

Key Points

  • First key point. Brief Answer to questions. (FIXME)