Get documentation/description for steps
get_steps_doc(steps)
steps: Steps object parsed by parse_steps()
Returns a vector of step documentation/descriptions, one element per step
library(tidycwl)

# steps represented by a dictionary
system.file("cwl/sbg/workflow/rnaseq-salmon.json", package = "tidycwl") %>%
read_cwl_json() %>%
parse_steps() %>%
get_steps_doc()
#> [1] "This tool takes multiple abundance estimates files outputted by tools like RSEM, Kallisto or Salmon and creates a single expression counts matrix, based on the input column that the user specifies (the default is 'tpm', but any other string can be input here, like 'fpkm', 'counts' or similar), that can be used for further downstream analysis.\n\nThis tool can also be used to aggregate any kind of results in tab-delimited format and create a matrix like file, it was just originally developed for creating expression matrices. \n\n### Common Issues ###\nNone"
#> [2] "This tool takes multiple abundance estimates files outputted by tools like RSEM, Kallisto or Salmon and creates a single expression counts matrix, based on the input column that the user specifies (the default is 'tpm', but any other string can be input here, like 'fpkm', 'counts' or similar), that can be used for further downstream analysis.\n\nThis tool can also be used to aggregate any kind of results in tab-delimited format and create a matrix like file, it was just originally developed for creating expression matrices. \n\n### Common Issues ###\nNone"
#> [3] "Tool accepts list of FASTQ files groups them into separate lists. This grouping is done using metadata values and their hierarchy (Sample ID > Library ID > Platform unit ID > File segment number) which should create unique combinations for each pair of FASTQ files. Important metadata fields are Sample ID, Library ID, Platform unit ID and File segment number. Not all of these four metadata fields are required, but the present set has to be sufficient to create unique combinations for each pair of FASTQ files. Files with no paired end metadata are grouped in the same way as the ones with paired end metadata, generally they should be alone in a separate list. Files with no metadata set will be grouped together. \n\nIf there are more than two files in a group, this might create errors further down most pipelines and the user should check if the metadata fields for those files are set properly."
#> [4] "**Salmon Index** tool builds an index from a transcriptome FASTA formatted file of target sequences, necessary for the **Salmon Quant** tool. \n\n**Quasi-mapping** is a process of assigning reads to transcripts, without doing the exact base-to-base alignment. Seeing that for estimating transcript abundances, the main information needed is which transcript a read originates from and not the actual mapping coordinates, the idea with the **Salmon** tool was to implement a procedure that does exactly that [1, 2]. \n\nThe result is a software running at speeds orders of magnitude faster than other tools which utilize the full likelihood model, while keeping near-optimal probabilistic RNA-seq quantification results [1, 2]. \n\n*A list of **all inputs and parameters** with corresponding descriptions can be found at the bottom of the page.*\n\n### Common Use Cases\n\n- A **Transcriptome FASTA file** needs to be provided as an input to the tool. \n\n### Changes Introduced by Seven Bridges\n\n- An already generated **Salmon index archive** can be provided to the **Salmon Index** tool (**Transcriptome FASTA or Salmon Index Archive** input), in order to skip indexing and save a little bit of time if this tool is part of a bigger workflow and there already is an index file that can be provided.\n\n### Common Issues and Important Notes\n\n- The input FASTA file (if provided instead of the already generated salmon index) should be a transcriptome FASTA, not a genomic FASTA.\n\n### Performance Benchmarking\n\nThe **Salmon Index** tool builds the index structure for **Salmon** in a very short time, therefore it is expected that all tasks using this tool should finish in under 5 minutes, costing around $0.05 on the default c4.2xlarge instance (AWS). \n\n*Cost can be significantly reduced by using **spot instances**. 
Visit the [Knowledge Center](https://docs.sevenbridges.com/docs/about-spot-instances) for more details.*\n\n\n### References\n\n[1] [Salmon paper](biorxiv.org/content/biorxiv/early/2016/08/30/021592.full.pdf) \n[2] [Rapmap paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908361/)"
#> [5] "**Salmon Quant - Reads** infers transcript abundance estimates from **RNA-seq data**, using a process called **quasi-mapping**. \n\n**Quasi-mapping** is a process of assigning reads to transcripts, without doing the exact base-to-base alignment. Seeing that for estimating transcript abundances, the main information needed is which transcript a read originates from and not the actual mapping coordinates, the idea with the **Salmon** tool was to implement a procedure that does exactly that [1, 2]. \n\nThe result is a software running at speeds orders of magnitude faster than other tools which utilize the full likehood model, while keeping near-optimal probabilistic RNA-seq quantification results [1, 2]. \n\nThe latest version of Salmon (0.9.x) introduces some novel concepts, like **Rich Factorization Classes**, which further increase the precision of the results, at a very negligible increase in runtime. This version of Salmon also supports quantification from already aligned BAM files, utilizing the full likelihood model (the same one as in RSEM), where the results are the same as RSEM, but the execution time is much shorter than in RSEM, this time due to engineering only [3].\n\n*A list of **all inputs and parameters** with corresponding descriptions can be found at the bottom of the page.*\n\n### Common Use Cases\n\n- The main input to the tool are **FASTQ read files** (single end or paired end). \n- A **Salmon index archive** (`-i`) also needs to be provided, in addition to an optional **Gene map** (`--geneMap`) file (which should be of the same annotations that were used in generating the **Transcriptome FASTA file**) if gene-level abundance results are desired. \n- The tool will generate transcript abundance estimates in plaintext format, and an optional file containing gene abundance estimates, if the input **Gene map** (`--gene-map`) file is provided. 
\n- In addition to the default output (**Quantification file**), additional outputs can be produced if the proper options are turned on for them (e.g. **Equivalent class counts** by setting `--dumpEq`, **Unmapped reads** by setting `--writeUnmappedNames`, **Bootstrap data** by setting `--numBootstraps` or `--numGibbsSamples`, **Mapping info** by setting `--write-mappings`...).\n- The **GC bias correction** option (`--gcBias`) will correct for GC bias and improve quantification accuracy, but at the cost of increased runtime (a rough estimate would be a **double** increase in runtime per sample). \n- The use of *data-driven likelihood factorization* is achieved with the **Range factorization bins** parameter (`--rangeFactorizationBins`) and can be used to bring an increase in accuracy at a very small increase in runtime [3]. \n\n### Changes Introduced by Seven Bridges\n\n- All output files will be prefixed by the input sample ID (inferred from **Sample ID** metadata if existent, or from filename otherwise), instead of having identical names between runs. \n\n### Common Issues and Important Notes\n\n- For paired-end read files, it is important to properly set the **Paired End** metadata field on your read files.\n- For FASTQ reads in multi-file format (i.e. two FASTQ files for paired-end 1 and two FASTQ files for paired-end2), the proper metadata needs to be set (the following hierarchy is valid: **Sample ID/Library ID/Platform Unit ID/File Segment Number)**.\n- The GTF and FASTA files need to have compatible transcript IDs. \n\n### Performance Benchmarking\n\nThe main advantage of the Salmon software is that it is not computationally challenging, as alignment in the traditional sense is not performed. 
\nBelow is a table describing the runtimes and task costs for a couple of samples with different file sizes:\n\n| Experiment type | Input size | Paired-end | # of reads | Read length | Duration | Cost | Instance (AWS) |\n|:---------------:|:-----------:|:----------:|:----------:|:-----------:|:--------:|:-----:|:----------:|\n| RNA-Seq | 2 x 4.5 GB | Yes | 20M | 101 | 5min | $0.05| c4.2xlarge |\n| RNA-Seq | 2 x 17.4 GB | Yes | 76M | 101 | 15min | $0.15 | c4.2xlarge |\n\n*Cost can be significantly reduced by using **spot instances**. Visit the [Knowledge Center](https://docs.sevenbridges.com/docs/about-spot-instances) for more details.*\n\n### References\n\n[1] [Salmon paper](biorxiv.org/content/biorxiv/early/2016/08/30/021592.full.pdf) \n[2] [Rapmap paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908361/) \n[3] [Data-driven likelihood factorization](https://academic.oup.com/bioinformatics/article/33/14/i142/3953977)"
# steps represented by a list
system.file("cwl/sbg/workflow/rnaseq-salmon.cwl", package = "tidycwl") %>%
read_cwl_yaml() %>%
parse_steps() %>%
get_steps_doc()
#> [1] "This tool takes multiple abundance estimates files outputted by tools like RSEM, Kallisto or Salmon and creates a single expression counts matrix, based on the input column that the user specifies (the default is 'tpm', but any other string can be input here, like 'fpkm', 'counts' or similar), that can be used for further downstream analysis.\n\nThis tool can also be used to aggregate any kind of results in tab-delimited format and create a matrix like file, it was just originally developed for creating expression matrices. \n\n### Common Issues ###\nNone"
#> [2] "This tool takes multiple abundance estimates files outputted by tools like RSEM, Kallisto or Salmon and creates a single expression counts matrix, based on the input column that the user specifies (the default is 'tpm', but any other string can be input here, like 'fpkm', 'counts' or similar), that can be used for further downstream analysis.\n\nThis tool can also be used to aggregate any kind of results in tab-delimited format and create a matrix like file, it was just originally developed for creating expression matrices. \n\n### Common Issues ###\nNone"
#> [3] "Tool accepts list of FASTQ files groups them into separate lists. This grouping is done using metadata values and their hierarchy (Sample ID > Library ID > Platform unit ID > File segment number) which should create unique combinations for each pair of FASTQ files. Important metadata fields are Sample ID, Library ID, Platform unit ID and File segment number. Not all of these four metadata fields are required, but the present set has to be sufficient to create unique combinations for each pair of FASTQ files. Files with no paired end metadata are grouped in the same way as the ones with paired end metadata, generally they should be alone in a separate list. Files with no metadata set will be grouped together. \n\nIf there are more than two files in a group, this might create errors further down most pipelines and the user should check if the metadata fields for those files are set properly."
#> [4] "**Salmon Index** tool builds an index from a transcriptome FASTA formatted file of target sequences, necessary for the **Salmon Quant** tool. \n\n**Quasi-mapping** is a process of assigning reads to transcripts, without doing the exact base-to-base alignment. Seeing that for estimating transcript abundances, the main information needed is which transcript a read originates from and not the actual mapping coordinates, the idea with the **Salmon** tool was to implement a procedure that does exactly that [1, 2]. \n\nThe result is a software running at speeds orders of magnitude faster than other tools which utilize the full likelihood model, while keeping near-optimal probabilistic RNA-seq quantification results [1, 2]. \n\n*A list of **all inputs and parameters** with corresponding descriptions can be found at the bottom of the page.*\n\n### Common Use Cases\n\n- A **Transcriptome FASTA file** needs to be provided as an input to the tool. \n\n### Changes Introduced by Seven Bridges\n\n- An already generated **Salmon index archive** can be provided to the **Salmon Index** tool (**Transcriptome FASTA or Salmon Index Archive** input), in order to skip indexing and save a little bit of time if this tool is part of a bigger workflow and there already is an index file that can be provided.\n\n### Common Issues and Important Notes\n\n- The input FASTA file (if provided instead of the already generated salmon index) should be a transcriptome FASTA, not a genomic FASTA.\n\n### Performance Benchmarking\n\nThe **Salmon Index** tool builds the index structure for **Salmon** in a very short time, therefore it is expected that all tasks using this tool should finish in under 5 minutes, costing around $0.05 on the default c4.2xlarge instance (AWS). \n\n*Cost can be significantly reduced by using **spot instances**. 
Visit the [Knowledge Center](https://docs.sevenbridges.com/docs/about-spot-instances) for more details.*\n\n\n### References\n\n[1] [Salmon paper](biorxiv.org/content/biorxiv/early/2016/08/30/021592.full.pdf) \n[2] [Rapmap paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908361/)"
#> [5] "**Salmon Quant - Reads** infers transcript abundance estimates from **RNA-seq data**, using a process called **quasi-mapping**. \n\n**Quasi-mapping** is a process of assigning reads to transcripts, without doing the exact base-to-base alignment. Seeing that for estimating transcript abundances, the main information needed is which transcript a read originates from and not the actual mapping coordinates, the idea with the **Salmon** tool was to implement a procedure that does exactly that [1, 2]. \n\nThe result is a software running at speeds orders of magnitude faster than other tools which utilize the full likehood model, while keeping near-optimal probabilistic RNA-seq quantification results [1, 2]. \n\nThe latest version of Salmon (0.9.x) introduces some novel concepts, like **Rich Factorization Classes**, which further increase the precision of the results, at a very negligible increase in runtime. This version of Salmon also supports quantification from already aligned BAM files, utilizing the full likelihood model (the same one as in RSEM), where the results are the same as RSEM, but the execution time is much shorter than in RSEM, this time due to engineering only [3].\n\n*A list of **all inputs and parameters** with corresponding descriptions can be found at the bottom of the page.*\n\n### Common Use Cases\n\n- The main input to the tool are **FASTQ read files** (single end or paired end). \n- A **Salmon index archive** (`-i`) also needs to be provided, in addition to an optional **Gene map** (`--geneMap`) file (which should be of the same annotations that were used in generating the **Transcriptome FASTA file**) if gene-level abundance results are desired. \n- The tool will generate transcript abundance estimates in plaintext format, and an optional file containing gene abundance estimates, if the input **Gene map** (`--gene-map`) file is provided. 
\n- In addition to the default output (**Quantification file**), additional outputs can be produced if the proper options are turned on for them (e.g. **Equivalent class counts** by setting `--dumpEq`, **Unmapped reads** by setting `--writeUnmappedNames`, **Bootstrap data** by setting `--numBootstraps` or `--numGibbsSamples`, **Mapping info** by setting `--write-mappings`...).\n- The **GC bias correction** option (`--gcBias`) will correct for GC bias and improve quantification accuracy, but at the cost of increased runtime (a rough estimate would be a **double** increase in runtime per sample). \n- The use of *data-driven likelihood factorization* is achieved with the **Range factorization bins** parameter (`--rangeFactorizationBins`) and can be used to bring an increase in accuracy at a very small increase in runtime [3]. \n\n### Changes Introduced by Seven Bridges\n\n- All output files will be prefixed by the input sample ID (inferred from **Sample ID** metadata if existent, or from filename otherwise), instead of having identical names between runs. \n\n### Common Issues and Important Notes\n\n- For paired-end read files, it is important to properly set the **Paired End** metadata field on your read files.\n- For FASTQ reads in multi-file format (i.e. two FASTQ files for paired-end 1 and two FASTQ files for paired-end2), the proper metadata needs to be set (the following hierarchy is valid: **Sample ID/Library ID/Platform Unit ID/File Segment Number)**.\n- The GTF and FASTA files need to have compatible transcript IDs. \n\n### Performance Benchmarking\n\nThe main advantage of the Salmon software is that it is not computationally challenging, as alignment in the traditional sense is not performed. 
\nBelow is a table describing the runtimes and task costs for a couple of samples with different file sizes:\n\n| Experiment type | Input size | Paired-end | # of reads | Read length | Duration | Cost | Instance (AWS) |\n|:---------------:|:-----------:|:----------:|:----------:|:-----------:|:--------:|:-----:|:----------:|\n| RNA-Seq | 2 x 4.5 GB | Yes | 20M | 101 | 5min | $0.05| c4.2xlarge |\n| RNA-Seq | 2 x 17.4 GB | Yes | 76M | 101 | 15min | $0.15 | c4.2xlarge |\n\n*Cost can be significantly reduced by using **spot instances**. Visit the [Knowledge Center](https://docs.sevenbridges.com/docs/about-spot-instances) for more details.*\n\n### References\n\n[1] [Salmon paper](biorxiv.org/content/biorxiv/early/2016/08/30/021592.full.pdf) \n[2] [Rapmap paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908361/) \n[3] [Data-driven likelihood factorization](https://academic.oup.com/bioinformatics/article/33/14/i142/3953977)"
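The examples above return one documentation string per step, whether the steps come as a dictionary or as a list. As a rough illustration of that idea, the sketch below builds a small mock steps object and pulls out one string per step using base R only. It is not tidycwl's implementation: the helper `extract_step_docs()`, the mock `steps` object, and the assumption that each step stores its text under `run$doc` are hypothetical and chosen only for this example.

```r
# Hypothetical stand-in for a parsed steps object: each step embeds a tool
# under `run`, and that tool may carry a `doc` field (assumed layout).
steps <- list(
  list(id = "index", run = list(doc = "Builds an index.\n\n### Notes\nNone")),
  list(id = "quant", run = list(doc = "Quantifies reads.")),
  list(id = "group", run = list())  # this step has no documentation
)

# Illustrative helper (not part of tidycwl): return one string per step,
# using NA when a step has no `doc` field.
extract_step_docs <- function(steps) {
  vapply(
    steps,
    function(step) {
      doc <- step$run$doc
      if (is.null(doc)) NA_character_ else as.character(doc)
    },
    character(1),
    USE.NAMES = FALSE
  )
}

docs <- extract_step_docs(steps)

# Printing the vector shows "\n" escapes, as in the output above;
# cat() writes the text with real line breaks instead.
cat(docs[1], "\n")
```

The same `cat()` call can be applied to an element of the vector returned by `get_steps_doc()` to read a long Markdown description with its line breaks rendered.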