User guide#

The Rp-Bp pipeline#

You want to “quality control” your Ribo-seq samples for downstrean analyses, or run the Ribo-seq ORF discovery pipeline? First, you need to prepare genome indices and annotations for your organism. This has to be done once for any given reference genome and annotation.

The pipeline itself consists of two “modules”: the ORF profile construction, where periodic read lengths and ribosome P-site offsets are inferred from the data; and the translation prediction, where translation events are predicted.

How to prepare the configuration file#

A single YAML configuration file can be used for both index creation and running the pipeline. The following keys are read from the configuration file:

Index creation

gtf (required, input) Path to the reference annotations (format GTF2).
fasta (required, input) Path to the reference genome sequence (format FASTA).
ribosomal_fasta (required, input) Path to the ribosomal sequence (format FASTA).
de_novo_gtf (optional, input) An additional GTF containing annotations constructed from a de novo assembly. See More about prepare-rpbp-genome.
start_codons (optional, input) A list of strings to use as start codons. Default: ATG.
stop_codons (optional, input) A list of strings to use as stop codons. Default: TAA, TGA, TAG.
genome_name (required, output) A descriptive name to use for the created files.
genome_base_path (required, output) Output path (directory) for the transcript fasta and ORFs.
ribosomal_index (required, output) Output path (directory/filename) for the Bowtie 2 index.
star_index (required, output) Output path (directory) for the STAR index.
orf_note (optional, output) An additional description used in the filenames. It should not contain spaces or special characters.

Pipeline

In addition to the above required keys:

riboseq_samples (required, input) A dictionary key: value, where key is used to construct filenames, and value is the full path to the FASTQ.gz file for a given sample. The key should not contain spaces or special characters.
riboseq_biological_replicates (optional, input) A dictionary key: value, where key is a condition, and value contains all samples which are replicates of the condition. Items of the value list must match the riboseq_samples key.
adapter_file (optional, input) Path to adapter sequences to be removed (FASTA).
adapter_sequence (optional, input) A single adapter sequence to be removed.
riboseq_data (required, output) The base output location for all created files.
riboseq_sample_name_map (optional, output) A dictionary key: value, where key is the same as riboseq_samples key, and value is a fancy name for key to use in downstream analyses.
riboseq_condition_name_map (optional, output) A dictionary key: value, where key is the same as riboseq_biological_replicates key, and value is a fancy name for key to use in downstream analyses.
project_name (optional, output) An additional description used in the filenames for downstream analyses. It should not contain spaces or special characters. See Visualization and quality control.
note (optional, output) An additional description used in the filenames. It should not contain spaces or special characters.

A configuration file with the above required keys is sufficient to run the pipeline with default parameters. To change the default parameters, see Default parameters and options. A “template” configuration file is available to download with the Tutorials. See also More about biological replicates, for an example of how to use biological replicates.

How to prepare genome indices and annotations#

This has to be done once for any given reference genome and annotation.

Attention

Under the hood, Rp-Bp uses STAR to align reads to the genome. If the STAR version changes, the STAR index may have to be re-generated.

General usage#

prepare-rpbp-genome <config> [options]

Rp-Bp can be run with the SLURM scheduler. For all options, consult the API. See also How to prepare the configuration file.

Required input#

Reference annotations#

The reference annotations (format GTF2), with transcript, exon, and CDS features (start_codon and stop_codon features are not required). The attribute field must include transcript_id and gene_id, and optionally transcript_biotype, gene_name, and gene_biotype. The annotations must match the version of the reference genome sequence.

Caution

The GTF2 format does not include the stop codon in the terminal exon, i.e. the stop codon is not included in the CDS. This is important, as Rp-Bp treats annotations according to the GTF2 (GTF) specifications.

Reference genome sequence#

The input FASTA file should contain all top level sequence regions excluding haplotypes (this may significantly increase runtime, and bias the alignments, since these sequences can substiantially overlap the main reference). This generally matches the primary assembly file from Ensembl for the chosen assembly release. The identifiers must match those in the GTF annotation file.

Ribosomal sequence#

A separate FASTA file for the ribosomal DNA (rDNA) sequence/cluster, which is generally not included in the genome assembly. This file can also include other sequences to filter out, depending on the goal of the analysis (.e.g snRNAs). We typically include the following

The large and small ribosomal subunit sequences, e.g. from NCBI.
The genomic tRNA sequences e.g. from GtRNAdb.
Mt_rRNA, Mt_tRNA and rRNA genes from BioMart. In particular, we select those options for the “Gene type” filter. For “Attributes”, we select “Sequences”, and then specifically “Exon sequences”. Additionally, including the “Gene type” in the header can be helpful for identifying where reads mapped, for quality control purposes.

Output files#

Output files are written in the BED, FASTA, or TAB-delimited formats.

<ribosomal_index> The Bowtie 2 index files.
<star_index> The STAR index files.

The base path for the following file is: <genome_base_path>

<genome_name>.annotated.bed.gz A BED12+ file containing all annotated transcripts, including “biotype”, “gene_id”, “gene_name”, and “gene_biotype” information.

The base path for the following files is: <genome_base_path>/transcript-index

<genome_name>.transcripts.annotated.fa A FASTA file with the annotated transcript sequences.
<genome_name>.orfs-genomic[.orf_note].bed.gz. A BED12+ with the ORFs extracted from all transcripts. The ORFs are numbered, and their length is also reported. The ORF ids are of the form: transcript_seqname:start-end:strand. The start codon is included, but the stop codon is not.
<genome_name>.orfs-exons[.orf_note].bed.gz. A BED6+ file with the ORF exons. The extra columns are exon_index, giving the order of the exon in the transcript, and transcript_start, giving the start position of that index in transcript coordinates.
<genome_name>.orfs-labels[.orf_note].tab.gz. A TAB-delimited file with ORF categories and all compatible transcripts. See More about prepare-rpbp-genome to learn about ORF categories or labels.

Note

If a de_novo_gtf file is provided, intermediate output files are split into annotated and de-novo. The files used by the pipeline, as described above, are the “concatenation” of the respective annotated and de-novo files. In addition, a GTF file is created by concatenating gtf and de_novo_gtf. This new GTF file is written to genome_base_path. Be careful not to overwrite any existing GTF file there!

How to run the pipeline#

See ORF profile construction and Translation prediction for a short description of required input and output files. See also More about biological replicates.

Rp-Bp output files are written in the BED, FASTA, sparse matrix market (MTX), or CSV format. Output from Flexbar, Bowtie2, and STAR are written in FASTQ or BAM formats.

Important

All Ribo-seq samples (including biological replicates) in the configuration file must be from the same organism and use the same genome_base_path, star_index, ribosomal_index, etc. Samples from different organisms or using different annotations must be “split” into different configuration files, and run separately.

General usage#

# Only create the ORF profiles (estimate periodicity).
run-all-rpbp-instances <config> --profiles-only [options]

# Run the ORF discovery pipeline for all samples in the configuration file (only samples, i.e. do not merge the replicates).
run-all-rpbp-instances <config> [options]

# Run the ORF discovery pipeline for all samples in the configuration file, merge the replicates, and make predictions for merged replicates.
run-all-rpbp-instances <config> --merge-replicates --run-replicates [options]

Rp-Bp can be run with the SLURM scheduler. For all options, consult the API. See also How to prepare the configuration file.

Tip

To be able to perform read filtering quality control, use the -k/--keep-intermediate-files option. Intermediate files e.g. Flexbar, or Bowtie2 output can be deleted afterwards, see Visualization and quality control.

More about biological replicates#

The Rp-Bp pipeline handles replicates by adding the ORF profiles. The Bayes factors and predictions are then calculated based on the combined profiles. The --merge-replicates flag indicates that the replicates should be merged. By default, if the --merge-replicates flag is given, then predictions will not be made for the individual samples, unless the --run-replicates flag is also given, in which case predictions will be made for both the merged replicates as well as the individual samples. This is how you would prepare a configuration file for four samples of two different “conditons”:

# example of 4 samples: 2 controls and 2 conditions
# <sample_name> below is replaced by ctrl1, ctrl2, cond1, and cond2
# <condition_name> below is replaced by ctrl and cond

riboseq_samples:
 ctrl1: /path/to/sample1.fastq.gz
 ctrl2: /path/to/sample2.fastq.gz
 cond1: /path/to/sample3.fastq.gz
 cond2: /path/to/sample4.fastq.gz

riboseq_biological_replicates:
 ctrl:
  - ctrl1
  - ctrl2
 cond:
  - cond1
  - cond2

# fancy names to use for downstream analyses
riboseq_sample_name_map:
 ctrl1: Ctrl-1
 ctrl2: Ctrl-2
 cond1: Cond-1
 cond2: Cond-2

riboseq_condition_name_map:
 ctrl: Ctrl
 cond: Cond

ORF profile construction#

To run the periodicity estimation only, pass the --profiles-only option.

Note

This part of the pipeline uses Flexbar, Bowtie2, and STAR to process and align Ribo-seq reads, however you can estimate periodicity (and predict translation events) using your own existing alignment files (BAM format), see How to use existing alignment files

Required input#

All the input files are those specified by the configuration file.

Output files#

The base path for the following files is: <riboseq_data>/without-adapters

<sample_name>[.note].fastq.gz Clean reads (adapters and low-quality reads removed).

The base path for the following files is: <riboseq_data>/with-rrna

<sample_name>[.note].fastq.gz Reads aligning to the ribosomal index. They may be kept for quality control, but are not used.

The base path for the following files is: <riboseq_data>/without-rrna

<sample_name>[.note].fastq.gz Reads not aligning to the ribosomal index, i.e. after in-silico rRNA removal. These reads are used for the genome alignment step.

The base path for the following files is: <riboseq_data>/without-rrna-mapping

<sample_name>[.note].bam A sorted BAM file with genome alignments (the Aligned.sortedByCoord.out.bam STAR output).
<sample_name>[.note]-unique.bam A sorted BAM file with unique alignments (multimapping reads removed).

Note

If keep_riboseq_multimappers is True in the configuration file, then there will be no -unique files. In general, we do not recommend to keep multimappers.

The base path for the following files is: <riboseq_data>/metagene-profiles

<sample_name>[.note][-unique].metagene-profile.csv.gz A CSV file with the metagene profiles constructed from aligned reads (given by the “position” or offset and “count” columns) for all read lengths (“length” column) found in a given sample. It include profiles for the annotated translation initiation site and translation termination site (“type” column).
<sample_name>[.note][-unique].metagene-periodicity-bayes-factors.csv.gz A CSV file with the model outputs and Bayes factor estimates for all P-site offsets and read lengths.
<sample_name>[.note][-unique].periodic-offsets.csv.gz A CSV file with the best P-site offset for each read length. All read lengths are included, even if the estimates do not meet the prediction criteria (filtering occurs on the fly).

The base path for the following files is: <riboseq_data>/orf-profiles

<sample_name>[.note][-unique].length-<lengths>.offset-<offsets>.profiles.mtx.gz A MTX file with the profiles for all ORFs (“orf_num”, “orf_position”, i.e. position within the ORF, and “read_count”). The matrix market format uses base-1 indexing!
<condition_name>[.note][-unique].profiles.mtx.gz Same as above for condition, if using --merge-replicates.

Translation prediction#

Without the --profiles-only option, the pipeline will predict which ORFs show evidence of translation, using only the periodic footprint lengths. The --merge-replicates options is used to predict translation events in merged profiles, see More about biological replicates.

Tip

If you first created profiles and estimated periodicity using the --profiles-only option, you can decide to continue with the translation prediction step at a later stage. You only have to run-all-rpbp-instances <config> [--merge-replicates] [--run-replicates] using the same configuration file. Steps for which output files already exists will be skipped, unless the --overwrite option is set.

Required input#

All the input files are those specified by the configuration file. In addition, metagene and ORF profile output files are required (see output files from ORF profile construction). If the pipeline is run sequentially, you do not normally have to worry about the intermediate output.

Output files#

The base path for the following files is: <riboseq_data>/orf-predictions

<sample_name>[.note][-unique].length-<lengths>.offset-<offsets>.bayes-factors.bed.gz A BED12+ file with model outputs for all ORFs. Additional columns include the ORF number, ORF length, model outputs, Bayes factor mean and variance, and P-site coverage across 3 frames.
<sample_name>[.note][-unique].length-<lengths>.offset-<offsets>[.filtered].predicted-orfs.bed.gz Same format as above, with the predicted translation events. This file contains the translated Ribo-seq ORFs.
<sample_name>[.note][-unique].length-<lengths>.offset-<offsets>[.filtered].predicted-orfs.dna.fa A FASTA file with the predicted translation events. The FASTA header matches the “id” column in the corresponding BED file. This file contains the DNA sequence for each translated Ribo-seq ORF.
<sample_name>[.note][-unique].length-<lengths>.offset-<offsets>[.filtered].predicted-orfs.protein.fa A FASTA file with the predicted translation events. The FASTA header matches the “id” column in the corresponding BED file. This file contains the protein sequence for each translated Ribo-seq ORF.
<condition_name>[.note][-unique].bayes-factors.bed.gz Same as above for condition, if using --merge-replicates.
<condition_name>[.note][-unique][.filtered].predicted-orfs.bed.gz Same as above for condition, if using --merge-replicates.
<condition_name>[.note][-unique][.filtered].predicted-orfs.dna.fa Same as above for condition, if using --merge-replicates.
<condition_name>[.note][-unique][.filtered].predicted-orfs.protein.fa Same as above for condition, if using --merge-replicates.

Attention

Translation events are predicted using Bayesian model selection. Our model does not distinguishes between overlapping ORFs. To select the best overlapping ORF among a group of overlapping ORFs, we first select the longest ORF, then the highest Bayes factor. This is referred to as the filtered predictions.

In previous versions, both filtered and unfiltered (including all overlapping ORFs) predictions were written to file. In general, we recommend to use filtered predictions. Unless the --write-unfiltered option is used, Rp-Bp now only outputs the filtered predictions. If using --write-unfiltered, unfiltered predictions are also written to file, without the [.filtered] flag. Hence to avoid confusion with older results, the filtered predictions have kept the [.filtered] flag.

Note

If smoothing parameters (see Default parameters and options) are given in the configuration file, the following string .frac-<smoothing_fraction>.rw-<smoothing_reweighting_iterations> is also added to the file names. Default values (unless they are explicitly given in the configuration file) are not written.

Default parameters and options#

The parameters and options decribed below are all optional. All parameters and options have default values that do not normally need to be modified.

Important

Rp-Bp parameters can be changed via the configuration file, and options for external programs (Flexbar, STAR) are handled via command line arguments. You do not need to include Rp-Bp parameters in the configuration file, unless you wish to change their values.

Flexbar and STAR options#

Default options for external programs (Flexbar, STAR) are overridden via command line using --flexbar-options or --star-options. Currently, no options can be passed to Bowtie2.

Flexbar#

max-uncalled Default: 1.
pre-trim-left Default: 0.
qtrim-format Default: sanger.
qtrim Default: TAIL.
qtrim-threshold Default: 10.
zip-output Default: GZ.

STAR#

readFilesCommand Default: zcat (gzcat for macOS).
limitBAMsortRAM Default: 0 (set to --mem at run-time).
alignIntronMin Default: 20.
alignIntronMax Default: 100000.
outFilterMismatchNmax Default: 1.
outFilterMismatchNoverLmax Default: 0.04.
outFilterType Default: BySJout.
outFilterIntronMotifs Default: RemoveNoncanonicalUnannotated.
outSAMattributes Default: AS NH HI nM MD.
outSAMtype Default: BAM SortedByCoordinate.
sjdbOverhang Default: 33.
seedSearchStartLmaxOverLread Default: 0.5.
winAnchorMultimapNmax Default: 100.

Rp-Bp parameters#

keep_riboseq_multimappers If True in the configuration file, then multimapping riboseq reads will not be removed. They will be treated as “normal” reads in every place they map, i.e. the weight of the read will not be distributed fractionally, probabilistically, etc. We do not in general recommend to use this option.
models_base The path to the compiled models, if installed in a different location. The models are included with the source distribution and compiled as part of the installation. Do not change this, unless you know what you are doing!

Shared MCMC parameters#

seed The random seed for the MCMC sampling, used for periodicity estimation and translation prediction. Default: 8675309.
chains The number of chains to use in the MCMC sampling, used for periodicity estimation and translation prediction. Default: 2

Metagene and periodicity estimation parameters#

metagene_start_upstream The number of bases upstream of the translation initiation site to begin constructing the metagene profile. Default: 300.
metagene_start_downstream The number of bases downstream of the translation initiation site to end the metagene profile. Default: 300.
metagene_end_upstream The number of bases upstream of the translation termination site to begin constructing the metagene profile. Default: 300.
metagene_end_downstream The number of bases downstream of the translation termination site to end the metagene profile. Default: 300.
periodic_offset_start The position, relative to the translation initiation site, to begin calculating periodicity Bayes factors. Default: -20 (inclusive).
periodic_offset_end The position, relative to the translation initiation site, to stop calculating periodicity Bayes factors. Default: 0 (inclusive).
metagene_profile_length The length of the profile to use in the models. metagene_profile_length + periodic_offset_end must be consistent with the length of the extracted metagene profile. Default: 21.
metagene_iterations The number of iterations to use for each chain in the MCMC sampling. Default: 500 (includes warmup).
min_metagene_profile_count Read lengths with fewer than this number of reads will not be used. Default: 1000.
min_metagene_bf_mean If max_metagene_bf_var and min_metagene_bf_likelihood are None (null in YAML), this is taken as a hard threshold on the estimated Bayes factor mean. Default: 5.
max_metagene_bf_var A hard threshold on the estimated Bayes factor variance. Default: None.
min_metagene_bf_likelihood A threshold on the likelihood of periodicity. Default: 0.5.

Note

A profile is periodic if [P(bf > min_metagene_bf_mean)] > min_metagene_bf_likelihood. By default, we do not filter on the variance. If given, then both filters are applied and the result is the intersection.

Fixed lengths and offsets#

use_fixed_lengths If True in the configuration file, fixed values given by lengths and offsets are used (no periodicity estimation).
lengths A list of read lengths to use for creating the profiles if the use_fixed_lengths option is True. Presumably, these are lengths that have periodic metagene profiles.
offsets The P-site offset to use for each read length specifed by lengths if the use_fixed_lengths option is True. The number of offsets must match the number of lengths, and they are assumed to match. For example lengths: [26, 29] with offsets: [9, 12] means only reads of lengths 26 bp and 29 bp are used to create the profiles. The 26 bp reads will be shifted by 9 bp in the 5’ direction, while reads of length 29 bp will be shifted by 12 bp.

Smoothing parameters#

smoothing_fraction The fraction of the data used when estimating each y-value for LOWESS. Default: 0.2.
smoothing_reweighting_iterations The number of residual-based reweightings to perform for LOWESS. See the statsmodels documentation. Default: 0.

Translation prediction parameters#

orf_min_length_pre ORFs with length < orf_min_length_pre (nucleotides) are not processed. Default: 0 (ignore option).
orf_max_length_pre ORFs with length > orf_max_length_pre (nucleotides) are not processed. Default: 0 (ignore option).
orf_min_length Only ORFs with length > orf_min_length (nucleotides) are kept in the final set. Default: 8.
orf_min_profile_count_pre ORF with profile sum < orf_min_profile_count_pre are not processed. Default: 5.
orf_min_profile_count Only ORFs with profile sum > orf_min_profile_count are kept in the final set. Default: None.
translation_iterations The number of iterations to use for each chain in the MCMC sampling. Default: 500 (includes warmup).
min_bf_mean If max_bf_var and min_bf_likelihood are None (null in YAML), this is taken as a hard threshold on the estimated Bayes factor mean. Default: 5.
max_bf_var A hard threshold on the estimated Bayes factor variance. Default: None.
min_bf_likelihood A threshold on the likelihood to select an ORF as translated. Default: 0.5.
chisq_alpha For the chi-square test, this value is first Bonferroni corrected based on the number of ORFs which pass the smoothing filters. It is then used as the significance threshold to select translated ORFs. Default: 0.01.

Note

A Ribo-seq ORF is translated if [P(bf > min_bf_mean)] > min_bf_likelihood. By default, we do not filter on the variance. If given, then both filters are applied and the result is the intersection.

Attention

Chi-square values are reported, but they are not used for prediction, unless the chi_square_only flag is present in the configuration file, in which case the translation models are not fit to the data, and the posterior distributions are not estimated. This is mostly kept for historical reasons, and may eventually be removed.