The tidycwl package takes raw Common Workflow Language (CWL) workflows encoded in JSON or YAML and turns the workflow elements into tidy data frames or structured lists. The package follows the tidyverse design principles and works seamlessly with other packages that share them.
Let’s use a real-world example to see how we can read, parse, and visualize a bioinformatics workflow with tidycwl.
To read a CWL workflow into R, use read_cwl_json(), read_cwl_yaml(), or read_cwl(format = ...), depending on the workflow storage format.
flow <- system.file("cwl/sbg/workflow/gatk4-wgs.json", package = "tidycwl") %>%
read_cwl_json()
flow
Name: Whole Genome Sequencing - BWA + GATK 4.0 (with Metrics)
Class: Workflow
CWL Version: sbg:draft-2
The printed summary shows the name, the class (workflow or command line tool), and the CWL version. Currently, tidycwl supports both sbg:draft-2 and v1.0 workflows. As the standard evolves, we plan to add support for later versions as needed.
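When the storage format is not known in advance, the choice among the readers above can be driven by the file extension. The following is a minimal sketch in base R, not part of tidycwl: guess_cwl_format() is a hypothetical helper, and treating .cwl files as YAML is an assumption (CWL documents with that extension can also be JSON).

```r
# Hypothetical helper (not part of tidycwl): guess the storage format
# of a workflow file from its extension.
guess_cwl_format <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "json") {
    "json"
  } else if (ext %in% c("yaml", "yml", "cwl")) {
    # Assumption: files ending in .cwl are treated as YAML here.
    "yaml"
  } else {
    stop("Cannot guess the CWL format from extension: '", ext, "'")
  }
}

guess_cwl_format("cwl/sbg/workflow/gatk4-wgs.json")
```

The returned string could then be supplied as the format argument, for example read_cwl(path, format = guess_cwl_format(path)), assuming read_cwl() takes the file path as its first argument.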
After reading the workflow into R, let’s parse the main components from the CWL.
Besides the type (parse_type()) and metadata (parse_meta()), we are most interested in the core components of a workflow, namely the inputs, outputs, and intermediate steps.
flow %>%
parse_inputs() %>%
names()
[1] "sbg:fileTypes" "label" "id"
[4] "sbg:includeInPorts" "type" "description"
[7] "sbg:category" "sbg:toolDefaultValue"
flow %>%
parse_outputs() %>%
names()
[1] "source" "label" "required"
[4] "id" "sbg:includeInPorts" "type"
[7] "sbg:fileTypes"
flow %>%
parse_steps() %>%
names()
[1] "sbg:x" "inputs" "outputs" "run" "id" "sbg:y" "scatter"
Depending on whether these components are represented as YAML/JSON dictionaries or lists in the workflow, the parsed results can be data frames or lists. This is deliberate: we want to keep transformations of the original data minimal at this stage. On their own, these raw results are less convenient to work with than the output of the more granular parsers described next.
We can use the get_*_*() functions to get the critical parameters, such as the ID, label, or documentation, from the parsed inputs, outputs, and steps. For example, use get_steps_label() to get the labels of the steps in the workflow:
flow %>%
parse_steps() %>%
get_steps_label()
[1] "SBG Genome Coverage" "SBG Untar fasta"
[3] "Sambamba Merge" "SBG FASTQ Quality Adjuster"
[5] "Tabix Index" "GATK CollectAlignmentSummaryMetrics"
[7] "SBG Prepare Intervals" "FastQC"
[9] "BWA INDEX 0.7.17" "SBG Pair FASTQs by Metadata"
[11] "SBG FASTA Indices" "Tabix BGZIP"
[13] "BWA MEM Bundle 0.7.17" "GATK HaplotypeCaller"
[15] "GATK IndexFeatureFile" "GATK IndexFeatureFile"
[17] "GATK IndexFeatureFile" "GATK MergeVcfs"
[19] "GATK GenotypeGVCFs" "GATK MergeVcfs"
[21] "GATK ApplyBQSR" "GATK BaseRecalibrator"
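Note that step labels are not guaranteed to be unique: several labels in the output above repeat. The base R sketch below copies the label vector from that output and counts the repeats. The variable names are illustrative only. When a unique key is needed, the step IDs (presumably available through the same get_*_*() family) are likely a safer choice than labels.

```r
# Step labels copied from the get_steps_label() output shown above.
step_labels <- c(
  "SBG Genome Coverage", "SBG Untar fasta",
  "Sambamba Merge", "SBG FASTQ Quality Adjuster",
  "Tabix Index", "GATK CollectAlignmentSummaryMetrics",
  "SBG Prepare Intervals", "FastQC",
  "BWA INDEX 0.7.17", "SBG Pair FASTQs by Metadata",
  "SBG FASTA Indices", "Tabix BGZIP",
  "BWA MEM Bundle 0.7.17", "GATK HaplotypeCaller",
  "GATK IndexFeatureFile", "GATK IndexFeatureFile",
  "GATK IndexFeatureFile", "GATK MergeVcfs",
  "GATK GenotypeGVCFs", "GATK MergeVcfs",
  "GATK ApplyBQSR", "GATK BaseRecalibrator"
)

# Count how often each label occurs and keep the ones used more than once.
label_counts <- table(step_labels)
repeated <- label_counts[label_counts > 1]
repeated
```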
In many cases, it is useful to construct a graph with the parsed inputs, outputs, and steps from the workflow. The functions get_nodes() and get_edges() can help us tidy the graph nodes and edges into data frames. Each row represents a node or an edge, with each variable representing an attribute of the node or edge.
The function get_graph() is a wrapper which returns everything in a list:
get_graph(
flow %>% parse_inputs(),
flow %>% parse_outputs(),
flow %>% parse_steps()
) %>% str()
List of 2
$ nodes:'data.frame': 36 obs. of 3 variables:
..$ id : chr [1:36] "intervals_file" "dbsnp" "mills" "fastq" ...
..$ label: chr [1:36] "Target BED" "dbsnp" "Mills" "Fastq" ...
..$ group: chr [1:36] "input" "input" "input" "input" ...
$ edges:'data.frame': 43 obs. of 5 variables:
..$ from : chr [1:43] "SBG_FASTA_Indices" "Sambamba_Merge" "reference" "BWA_MEM_Bundle_0_7_17" ...
..$ to : chr [1:43] "SBG_Genome_Coverage" "SBG_Genome_Coverage" "SBG_Untar_fasta" "Sambamba_Merge" ...
..$ port_from: chr [1:43] "fasta_reference" "merged_bam" NA "aligned_reads" ...
..$ port_to : chr [1:43] "fasta" "bam" "input_tar_with_reference" "bams" ...
..$ type : chr [1:43] "step_to_step" "step_to_step" "input_to_step" "step_to_step" ...
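Because the nodes and edges are plain data frames, ordinary data frame operations are enough to query the graph. The sketch below rebuilds only the first four rows of each table from the str() output above (not the full 36 nodes and 43 edges) and uses base R to count edges by type, count incoming edges per target, and list what feeds a given step. The variable names are illustrative.

```r
# First four rows of the node and edge tables, copied from the str() output above.
nodes <- data.frame(
  id    = c("intervals_file", "dbsnp", "mills", "fastq"),
  label = c("Target BED", "dbsnp", "Mills", "Fastq"),
  group = c("input", "input", "input", "input"),
  stringsAsFactors = FALSE
)

edges <- data.frame(
  from      = c("SBG_FASTA_Indices", "Sambamba_Merge", "reference", "BWA_MEM_Bundle_0_7_17"),
  to        = c("SBG_Genome_Coverage", "SBG_Genome_Coverage", "SBG_Untar_fasta", "Sambamba_Merge"),
  port_from = c("fasta_reference", "merged_bam", NA, "aligned_reads"),
  port_to   = c("fasta", "bam", "input_tar_with_reference", "bams"),
  type      = c("step_to_step", "step_to_step", "input_to_step", "step_to_step"),
  stringsAsFactors = FALSE
)

# Number of edges of each type.
edge_types <- table(edges$type)

# Number of incoming edges for each target node.
in_degree <- table(edges$to)

# Nodes that feed directly into a given step.
upstream <- edges$from[edges$to == "SBG_Genome_Coverage"]
upstream
```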
With tidycwl, we can visualize the workflow graph by calling visualize_graph(), which is built on the visNetwork package with an automatic hierarchical layout:
if (rmarkdown::pandoc_available("1.12.3")) {
get_graph(
flow %>% parse_inputs(),
flow %>% parse_outputs(),
flow %>% parse_steps()
) %>% visualize_graph()
}
Users can interact with the visualization by zooming in/out and dragging the view or nodes. The graphical details can be further fine-tuned by passing additional parameters to visualize_graph().
The visualizations can be exported as HTML or static images (PNG/JPEG/PDF) with export_html() and export_image().
The workflow visualizations can be directly embedded in Shiny apps by using the Shiny widget in tidycwl. Check out the documentation for render_cwl() and cwl_output() for an example app.