RSeQC v2.3.3
RSeQC v2.3.2
RSeQC v2.3.1
Deep transcriptome sequencing (RNA-seq) provides massive and valuable information about functional elements in the genome. Ideally, transcriptome sequencing should be able to directly identify and quantify all RNA species, small or large, low or high abundance. However, RNA-seq is a complicated, multistep process involving sample preparation, amplification, fragmentation, purification and sequencing. A single improper operation would result in biased or even unusable data. Therefore, it is always a good practice to check the quality of your RNA-seq data before pursuing any directions listed above. Here we developed RSeQC package to comprehensively evaluate RNA-seq datasets generated from clinical tissues or other well annotated organisms such as mouse, fly and yeast. For organisms lacking reference gene models, many modules will not work.
RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data. “Basic modules” quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while “RNA-seq specific modules” investigate sequencing saturation status of both splicing junction detection and expression estimation, mapped reads clipping profile, mapped reads distribution, coverage uniformity over gene body, reproducibility, strand specificity and splice junction annotation
We only provide rRNA bed files for human and mouse. These ribosome RNAs were downloaded from UCSC table browser, we provide them here to facilitate users with NO WARRANTY in completeness. We are appreciated if user can make current list more complete or provide additional gene list for other species.
Prerequisite: gcc; python2.7; numpy; R
Install RSeQC globally (with root privilege):
tar zxf RSeQC-VERSION.tar.gz
cd RSeQC-VERSION
python setup.py install
export PYTHONPATH=/usr/local/lib/python2.7/site-packages:$PYTHONPATH
export PATH=/user/local/bin:$PATH
Install RSeQC at user’s location:
tar zxf RSeQC-VERSION.tar.gz
cd RSeQC-VERSION
python setup.py install --root=/home/user
export PYTHONPATH=/home/user/lib/python2.7/site-packages/:$PYTHONPATH
export PATH=/user/local/bin:$PATH
Finally, type: python -c ‘from qcmodule import SAM’. If no error message comes out, RSeQC modules have been installed successfully.
RSeQC accepts 4 file formats as input:
download this script and save as ‘fetchChromSizes’:
chmod 0755 fetchChromSizes
fetchChromSizes hg19 >hg19.chrom.sizes
fetchChromSizes danRer7 >zebrafish.chrom.sizes
Visualization is the most straightforward and effective way to QC your RNA-seq data. For example, change of expression or new splicing can be easily checked by visually comparing two RNA-seq tracks using genome browser such as UCSC, IGB and IGV. bam2wig.py converts all types of RNA-seq data from BAM format into wiggle format in one-stop. wiggle file can then be easily converted into bigwig using UCSC wigToBigWig tool. Bigwig is indexed, binary format of wiggle file, and it’s particular useful to display large, continuous dataset on genome browser. To use bam2wig.py, BAM file must be sorted and indexed properly using samTools. Below example shows how to sort and index BAM file using samTools:
samtools sort -m 1000000000 input.bam input.sorted.bam
samtools index input.sorted.bam input.sorted.bam.bai
After indexing, .bam and .bai file should be placed in the same directory
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM format. BAM file must be sorted and indexed using samTools. .bam and .bai files should be placed in the same directory. | |
| -s CHROMSIZE, --chromSize=CHROMSIZE | |
| Chromosome size file. Tab or space separated text file with 2 columns: first column is chromosome name, second column is chromosome size. Chromosome name (such as “chr1”) should be consistent between this file and BAM file. | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output wiggle files(s). One wiggle file will be generated for non-strand specific data, two wiggle files (“Prefix_Forward.wig” and “Prefix_Reverse.wig”) will be generated for strand specific RNA-seq data. | |
| -t TOTAL_WIGSUM, --wigsum=TOTAL_WIGSUM | |
| Specified wigsum. wigsum of 100000000 equals to coverage achieved by 1 million 100nt reads. Ignore this option to disable normalization | |
| -u, --skip-multi-hits | |
| Presence this option render the program to skip multiple hit reads. | |
| -d STRAND_RULE, --strand=STRAND_RULE | |
| How read(s) were stranded during sequencing. For example: –strand=‘1++,1–,2+-,2-+’ means that this is a pair-end, strand-specific RNA-seq, and the strand rule is: read1 mapped to ‘+’ => parental gene on ‘+’; read1 mapped to ‘-‘ => parental gene on ‘-‘; read2 mapped to ‘+’ => parental gene on ‘-‘; read2 mapped to ‘-‘ => parental gene on ‘+’. If you are not sure about the strand rule, run ‘infer_experiment.py’ default=none (Not a strand specific RNA-seq data). | |
This program is used to calculate reads mapping statistics from provided BAM or SAM file. This script determines “uniquely mapped reads” from the “NH” tag in BAM/SAM file (please note “NH” is an optional tag, bam_stat.py may fail if aligner does NOT provide this tag):
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. | |
Example:
bam_stat.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam
Output
| Category | count |
|---|---|
| Total Records: | 9810963 |
| QC failed: | 0 |
| Optical/PCR duplicate: | 0 |
| Non Primary Hits | 0 |
| Unmapped reads: | 0 |
| Multiple mapped reads: | 1512031 |
| Uniquely mapped: | 8298932 |
| Read-1: | 4258636 |
| Read-2: | 4040296 |
| Reads map to ‘+’: | 4153400 |
| Reads map to ‘-‘: | 4145532 |
| Non-splice reads: | 8013159 |
| Splice reads: | 285773 |
| Reads mapped in proper pairs: | 5978742 |
This program is used to estimate clipping profile of RNA-seq reads from BAM or SAM file. Note that to use this funciton, CIGAR strings within SAM/BAM file should have ‘S’ operation (This means your reads aligner should support clipped mapping).
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). | |
Example:
clipping_profile.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output
Output
Read coverage over gene body. This module is used to check if reads coverage is uniform and if there is any 5’/3’ bias. This module scales all transcripts to 100 nt and calculates the number of reads covering each nucleotide position. Finally, it generates a plot illustrating the coverage profile along the gene body. NOTE: this module requires lots of memory for large BAM files, because it load the entire BAM file into memory. We add another script “geneBody_coverage2.py” into v2.3.1 which takes bigwig (instead of BAM) as input. It only use 200M RAM, but users need to convert BAM into WIG, and then WIG into BigWig.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. | |
| -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL | |
| Reference gene model in bed format. [required] | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). [required] | |
Example:
geneBody_coverage.py -r hg19.refseq.bed12 -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
Output:
Similar to geneBody_coverage.py. This module takes bigwig instead of BAM as input, and thus requires much less memory. The BigWig file could be arbitrarily large.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Coverage signal file in bigwig format | |
| -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL | |
| Reference gene model in bed format. [required] | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). [required] | |
| -t GRAPH_TYPE, --graph-type=GRAPH_TYPE | |
| Graphic file type in “pdf”, “jpeg”, “bmp”, “bmp”, “tiff” or “png”.default=png [optional] | |
This program is used to speculate how RNA-seq sequencing were configured, especially how reads were stranded for strand-specific RNA-seq data, through comparing reads’ mapping information to the underneath gene model. For pair-end RNA-seq, there are two different ways to strand reads (such as Illumina ScriptSeq protocol):
For single-end RNA-seq, there are also two different ways to strand reads:
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Input alignment file in SAM or BAM format | |
| -r REFGENE_BED, --refgene=REFGENE_BED | |
| Reference gene model in bed fomat. | |
| -s SAMPLE_SIZE, --sample-size=SAMPLE_SIZE | |
| Number of reads sampled from SAM/BAM file. default=200000 | |
Example 1:
infer_experiment.py -r hg19.refseq.bed12 -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam
Output:
This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4992
Fraction of reads explained by "1+-,1-+,2++,2--": 0.5008
Fraction of reads explained by other combinations: 0.0000
Conclusion: We can infer that this is NOT a strand specific because 50% of reads can be explained by “1++,1–,2+-,2-+”, while the other 50% can be explained by “1+-,1-+,2++,2–”.
Example 2:
infer_experiment.py -r hg19.refseq.bed12 -i Pairend_StrandSpecific_51mer_Human_hg19.bam
Output:
This is PairEnd Data Fraction of reads explained by "1++,1--,2+-,2-+": 0.9644
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0356
Fraction of reads explained by other combinations: 0.0000
Conclusion: We can infer that this is a strand-specific RNA-seq data. strandness of read1 is consistent with that of gene model, while strandness of read2 is opposite to the strand of reference gene model.
Example 3:
infer_experiment.py -r hg19.refseq.bed12 -i SingleEnd_StrandSpecific_36mer_Human_hg19.bam
Output:
This is SingleEnd Data
Fraction of reads explained by "++,--": 0.9840
Fraction of reads explained by "+-,-+": 0.0160
Fraction of reads explained by other combinations: 0.0000
Conclusion: This is single-end, strand specific RNA-seq data. Strandness of reads are concordant with strandness of reference gene.
This module is used to calculate the inner distance (or insert size) between two paired RNA reads. The distance is the mRNA length between two paired fragments. We first determine the genomic (DNA) size between two paired reads: D_size = read2_start - read1_end, then
NOTE: Not all read pairs were used to estimate the inner distance distribution. Those low quality, PCR duplication, multiple mapped reads were skipped.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s) | |
| -r REF_GENE, --refgene=REF_GENE | |
| Prefix of output files(s). | |
| -l LOWER_BOUND_SIZE, --lower-bound=LOWER_BOUND_SIZE | |
| Lower bound of inner distance (bp). This option is used for ploting histograme. default=-250 | |
| -u UPPER_BOUND_SIZE, --upper-bound=UPPER_BOUND_SIZE | |
| Upper bound of inner distance (bp). This option is used for plotting histogram. default=250 | |
| -s STEP_SIZE, --step=STEP_SIZE | |
| Step size (bp) of histograme. This option is used for plotting histogram. default=5 | |
Example:
inner_distance.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output -r hg19.refseq.bed12
Output:
For a given alignment file (-i) in BAM or SAM format and a reference gene model (-r) in BED format, this program will compare detected splice junctions to reference gene model. splicing annotation is performed in two levels: splice event level and splice junction level.
All detected junctions can be grouped to 3 exclusive categories:
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. | |
| -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL | |
| Reference gene model in bed format. This file is better to be a pooled gene model as it will be used to annotate splicing junctions [required] | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). [required] | |
| -m MIN_INTRON, --min-intron=MIN_INTRON | |
| Minimum intron length (bp). default=50 [optional] | |
Example:
junction_annotation.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output -r hg19.refseq.bed12
Output:
It’s very important to check if current sequencing depth is deep enough to perform alternative splicing analyses. For a well annotated organism, the number of expressed genes in particular tissue is almost fixed so the number of splice junctions is also fixed. The fixed splice junctions can be predetermined from reference gene model. All (annotated) splice junctions should be rediscovered from a saturated RNA-seq data, otherwise, downstream alternative splicing analysis is problematic because low abundance splice junctions are missing. This module checks for saturation by resampling 5%, 10%, 15%, ..., 95% of total alignments from BAM or SAM file, and then detects splice junctions from each subset and compares them to reference gene model.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format.[required] | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). [required] | |
| -r REFGENE_BED, --refgene=REFGENE_BED | |
| Reference gene model in bed fomat. This gene model is used to determine known splicing junctions. [required] | |
| -l PERCENTILE_LOW_BOUND, --percentile-floor=PERCENTILE_LOW_BOUND | |
| Sampling starts from this percentile. A integer between 0 and 100. default=5 | |
| -u PERCENTILE_UP_BOUND, --percentile-ceiling=PERCENTILE_UP_BOUND | |
| Sampling ends at this percentile. A integer between 0 and 100. default=100 | |
| -s PERCENTILE_STEP, --percentile-step=PERCENTILE_STEP | |
| Sampling frequency. Smaller value means more sampling times. A integer between 0 and 100. default=5 | |
| -m MINIMUM_INTRON_SIZE, --min-intron=MINIMUM_INTRON_SIZE | |
| Minimum intron size (bp). default=50 | |
| -v MINIMUM_SPLICE_READ, --min-coverage=MINIMUM_SPLICE_READ | |
| Minimum number of supportting reads to call a junction. default=1 | |
Example:
junction_saturation.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -r hg19.refseq.bed12 -o output
Output:
In this example, current sequencing depth is almost saturated for “known junction” (red line) detection because the number of “known junction” reaches a plateau. In other words, nearly all “known junctions” (expressed in this particular tissue) have already been detected, and continue sequencing will not detect additional “known junction” and will only increase junction coverage (i.e. junction covered by more reads). While current sequencing depth is not saturated for novel junctions (green).
This module allow users to manipulate two BigWig files.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i BIGWIG_FILE1, --bwfile1=BIGWIG_FILE1 | |
| One BigWig file | |
| -j BIGWIG_FILE2, --bwfile2=BIGWIG_FILE2 | |
| Another BigWig file | |
| -a ACTION, --action=ACTION | |
| After pairwise align two bigwig files, perform the follow actions (Only select one keyword):”Add” = add signals. “Average” = average signals. “Division”= divide bigwig2 from bigwig1. Add 1 to both bigwig. “Max” = pick the signal that is larger. “Min” = pick the signal that is smaller. “Product” = multiply signals. “Subtract” = subtract signals in 2nd bigwig file from the corresponiding ones in the 1st bigwig file. “geometricMean” = take the geometric mean of signals. | |
| -o OUTPUT_WIG, --output=OUTPUT_WIG | |
| Output wig file | |
| -s CHROMSIZE, --chromSize=CHROMSIZE | |
| Chromosome size file. Tab or space separated text file with 2 columns: first column is chromosome name, second column is size of the chromosome. | |
| -c CHUNK_SIZE, --chunk=CHUNK_SIZE | |
| Chromosome chunk size. Each chomosome will be cut into samll chunks of this size. Decrease chunk size will save more RAM. default=100000 (bp) | |
Visualizing is the most straightforward and effective way to QC your RNA-seq data. For example, differential expression can be easily checked by comparing two RNA-seq tracks using genome browser. However, one must make sure that all samples are comparable before “visual checking”. Signal values in wig (or bigwig) file are contributed form two factors: 1) total read number. 2) read length. Therefore, only normalized to ‘total read count’ is problematic if read length is different between samples. Here we normalize every bigwig file into the same wigsum. wigsum is the summary of signal value across the genome. for example, wigsum = 100,000,000 equals to the coverage achieved by 1 million 100nt long reads or 2 million 50nt long reads.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i BIGWIG_FILE, --bwfile=BIGWIG_FILE | |
| Input BigWig file. [required] | |
| -o OUTPUT_WIG, --output=OUTPUT_WIG | |
| Output wig file. [required] | |
| -s CHROMSIZE, --chromSize=CHROMSIZE | |
| Chromosome size file. Tab or space separated text file with 2 columns: first column is chromosome name, second column is size of the chromosome. [required] | |
| -t TOTAL_WIGSUM, --wigsum=TOTAL_WIGSUM | |
| Specified wigsum. 100000000 equals to coverage of 1 million 100nt reads. default=100000000 [optional] | |
| -r REFGENE_BED, --refgene=REFGENE_BED | |
| Reference gene model in bed format. [optional] | |
| -c CHUNK_SIZE, --chunk=CHUNK_SIZE | |
| Chromosome chunk size. Each chomosome will be cut into samll chunks of this size. Decrease chunk size will save more RAM. default=100000 (bp) [optional] | |
Provided a BAM/SAM file and reference gene model, this module will calculate how mapped reads were distributed over genome feature (like CDS exon, 5’UTR exon, 3’ UTR exon, Intron, Intergenic regions). When genome features are overlapped (e.g. a region could be annotated as both exon and intron by two different transcripts) , they are prioritize as: CDS exons > UTR exons > Introns > Intergenic regions, for example, if a read was mapped to both CDS exon and intron, it will be assigned to CDS exons.
RSeQC cannot assign those reads that:
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. | |
| -r REF_GENE_MODEL, --refgene=REF_GENE_MODEL | |
| Reference gene model in bed format. | |
Example:
read_distribution.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -r hg19.refseq.bed12
Output:
| Group | Total_bases | Tag_count | Tags/Kb |
|---|---|---|---|
| CDS_Exons | 33302033 | 20002271 | 600.63 |
| 5’UTR_Exons | 21717577 | 4408991 | 203.01 |
| 3’UTR_Exons | 15347845 | 3643326 | 237.38 |
| Introns | 1132597354 | 6325392 | 5.58 |
| TSS_up_1kb | 17957047 | 215331 | 11.99 |
| TSS_up_5kb | 81621382 | 392296 | 4.81 |
| TSS_up_10kb | 149730983 | 769231 | 5.14 |
| TES_down_1kb | 18298543 | 266161 | 14.55 |
| TES_down_5kb | 78900674 | 729997 | 9.25 |
| TES_down_10kb | 140361190 | 896882 | 6.39 |
Two strategies were used to determine reads duplication rate: * Sequence based: reads with exactly the same sequence content are regarded as duplicated reads. * Mapping based: reads mapped to the same genomic location are regarded as duplicated reads. For splice reads, reads mapped to the same starting position and splice the same way are regarded as duplicated reads.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). | |
| -u UPPER_LIMIT, --up-limit=UPPER_LIMIT | |
| upper limit of duplicated times. Only used for plotting, default=500 (times) | |
Example:
read_duplication.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
Output:
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). | |
Example:
read_GC.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
Output:
This module is used to check the nucleotide composition bias. Due to random priming, certain patterns are over represented at the beginning (5’end) of reads. This bias could be easily examined by NVC (Nucleotide versus cycle) plot. NVC plot is generated by overlaying all reads together, then calculating nucleotide composition for each position of read (or each sequencing cycle). In ideal condition (genome is random and RNA-seq reads is randomly sampled from genome), we expect A%=C%=G%=T%=25% at each position of reads.
NOTE: this program expect a fixed read length
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Input file in BAM or SAM format.[required] | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). [required] | |
| -x, --nx | Flag option. Presense of this flag tells program to include N,X in output NVC plot [required] |
Example:
read_NVC.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
Output:
According to SAM specification, if Q is the character to represent “base calling quality” in SAM file, then Phred Quality Score = ord(Q) - 33. Here ord() is python function that returns an integer representing the Unicode code point of the character when the argument is a unicode object, for example, ord(‘a’) returns 97. Phred quality score is widely used to measure “reliability” of base-calling, for example, phred quality score of 20 means there is 1/100 chance that the base-calling is wrong, phred quality score of 30 means there is 1/1000 chance that the base-calling is wrong. In general: Phred quality score = -10xlog(10)P, here P is probability that base-calling is wrong.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. [required] | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). [required] | |
| -r REDUCE_FOLD, --reduce=REDUCE_FOLD | |
| To avoid making huge vector in R, nucleotide with particular phred score represented less than this number will be ignored. Increase this number save more memory while reduce precision. This option only applies to the ‘boxplot’. default=1000 | |
Example:
read_quality.py -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam -o output
Output:
Heatmap: use different color to represent nucleotide density (“blue”=low density,”orange”=median density,”red”=high density”)
calculate hexamer (6mer) frequency. If ‘-r’ was specified, hexamer frequency was also calculated for the reference genome. If ‘-g’ was provided, hexamer frequency was also calculated for the mRNA sequences.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_READ, --input=INPUT_READ | |
| Read sequence in fasta or fastq format. Multiple fasta/fastq files should be separated by ‘,’. For example: read.fq,read2.fa,read3,fa | |
| -r REF_GENOME, --refgenome=REF_GENOME | |
| Reference genome sequence in fasta format. Optional | |
| -g REF_GENE, --refgene=REF_GENE | |
| Reference mRNA sequence in fasta format. Optional | |
Given a BAM file and reference gene model, this program will calculate the raw count and RPKM values for transcript at exon, intron and mRNA level. For strand specific RNA-seq data, program will assign read to its parental gene according to strand rule, if you don’t know the strand rule, run infer_experiment.py. Please note that chromosome ID, genome cooridinates should be concordant between BAM and BED files.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM format (SAM is not supported). [required] | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). [required] | |
| -r REFGENE_BED, --refgene=REFGENE_BED | |
| Reference gene model in bed fomat. [required] | |
| -d STRAND_RULE, --strand=STRAND_RULE | |
| How read(s) were stranded during sequencing. For example: –strand=‘1++,1–,2+-,2-+’ means that this is a pair-end, strand-specific RNA-seq, and the strand rule is: read1 mapped to ‘+’ => parental gene on ‘+’; read1 mapped to ‘-‘ => parental gene on ‘-‘; read2 mapped to ‘+’ => parental gene on ‘-‘; read2 mapped to ‘-‘ => parental gene on ‘+’. If you are not sure about the strand rule, run ‘infer_experiment.py’ default=none (Not a strand specific RNA-seq data) | |
| -u, --skip-multi-hits | |
| How to deal with multiple hit reads. Presence this option renders program to skip multiple hits reads. | |
| -e, --only-exonic | |
| How to count total reads. Presence of this option renders program only used exonic (UTR exons and CDS exons) reads, otherwise use all reads. | |
Example:
RPKM_count.py -d '1++,1--,2+-,2-+' -r hg19.refseq.bed12 -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output
Output:
| chrom | start | end | accession | score | gene strand | tag count (+) | tag count (-) | RPKM (+) | RPKM (-) |
|---|---|---|---|---|---|---|---|---|---|
| chr1 | 29213722 | 29313959 | NM_001166007_intron_1 | 0 | ‘+’ | 431 | 4329 | 0.086 | 0.863 |
| chr1 | 29314417 | 29319841 | NM_001166007_intron_2 | 0 | ‘+’ | 31 | 1 | 0.114 | 0.004 |
| chr1 | 29320054 | 29323726 | NM_001166007_intron_3 | 0 | ‘+’ | 32 | 0 | 0.174 | 0.000 |
| chr1 | 29213602 | 29213722 | NM_001166007_exon_1 | 0 | ‘+’ | 164 | 0 | 27.321 | 0.000 |
| chr1 | 29313959 | 29314417 | NM_001166007_exon_2 | 0 | ‘+’ | 1699 | 4 | 74.158 | 0.175 |
| chr1 | 29319841 | 29320054 | NM_001166007_exon_3 | 0 | ‘+’ | 528 | 1 | 49.554 | 0.094 |
The precision of any sample statitics (RPKM) is affected by sample size (sequencing depth); “resampling” or “jackknifing” is a method to estimate the precision of sample statistics by using subsets of available data. This module will resample a series of subsets from total RNA reads and then calculate RPKM value using each subset. By doing this we are able to check if the current sequencing depth was saturated or not (or if the RPKM values were stable or not) in terms of genes’ expression estimation. If sequencing depth was saturated, the estimated RPKM value will be stationary or reproducible. By default, this module will calculate 20 RPKM values (using 5%, 10%, ... , 95%,100% of total reads) for each transcripts.
In the output figure, Y axis is “Percent Relative Error” or “Percent Error” which is used to measures how the RPKM estimated from subset of reads (i.e. RPKMobs) deviates from real expression level (i.e. RPKMreal). However, in practice one cannot know the RPKMreal. As a proxy, we use the RPKM estimated from total reads to approximate RPKMreal.
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. [required] | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output files(s). [required] | |
| -r REFGENE_BED, --refgene=REFGENE_BED | |
| Reference gene model in bed fomat. [required] | |
| -d STRAND_RULE, --strand=STRAND_RULE | |
| How read(s) were stranded during sequencing. For example: –strand=‘1++,1–,2+-,2-+’ means that this is a pair-end, strand-specific RNA-seq, and the strand rule is: read1 mapped to ‘+’ => parental gene on ‘+’; read1 mapped to ‘-‘ => parental gene on ‘-‘; read2 mapped to ‘+’ => parental gene on ‘-‘; read2 mapped to ‘-‘ => parental gene on ‘+’. If you are not sure about the strand rule, run ‘infer_experiment.py’ default=none (Not a strand specific RNA-seq data) | |
| -l PERCENTILE_LOW_BOUND, --percentile-floor=PERCENTILE_LOW_BOUND | |
| Sampling starts from this percentile. A integer between 0 and 100. default=5 | |
| -u PERCENTILE_UP_BOUND, --percentile-ceiling=PERCENTILE_UP_BOUND | |
| Sampling ends at this percentile. A integer between 0 and 100. default=100 | |
| -s PERCENTILE_STEP, --percentile-step=PERCENTILE_STEP | |
| Sampling frequency. Smaller value means more sampling times. A integer between 0 and 100. default=5 | |
| -c RPKM_CUTOFF, --rpkm-cutoff=RPKM_CUTOFF | |
| Transcripts with RPKM smaller than this number will be ignored in visualization plot. default=0.01 | |
Example:
RPKM_saturation.py -r hg19.refseq.bed12 -d '1++,1--,2+-,2-+' -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output
Output:
All transcripts were sorted in ascending order according to expression level (RPKM). Then they are divided into 4 groups:
BAM/SAM file containing more than 100 million alignments will make module very slow. Follow example below to visualize a particular transcript (using R console):
pdf("xxx.pdf") #starts the graphics device driver for producing PDF graphics
x <- seq(5,100,5) #resampling percentage (5,10,15,...,100)
rpkm <- c(32.95,35.43,35.15,36.04,36.41,37.76,38.96,38.62,37.81,38.14,37.97,38.58,38.59,38.54,38.67, 38.67,38.87,38.68, 38.42, 38.23) #Paste RPKM values calculated from each subsets
scatter.smooth(x,100*abs(rpkm-rpkm[length(rpkm)])/(rpkm[length(rpkm)]),type="p",ylab="Precent Relative Error",xlab="Resampling Percentage")
dev.off() #close graphical device
Provide gene list (bed) and BAM file, this module will split the original BAM file into 3 small BAM files: 1. *.in.bam: reads that are mapped to exon regions of the gene list (or reads consumed by gene list). 2. *.ex.bam: reads that cannot be mapped the exon regions of the original gene list. 3. *.junk.bam: qcfailed reads or unmapped reads. It is particular useful if the input gene list is ribosomal RNA, in this situation, user can estimate how many reads are originated from ribosomal RNA. Download rRNA
| --version | show program’s version number and exit |
| -h, --help | show this help message and exit |
| -i INPUT_FILE, --input-file=INPUT_FILE | |
| Alignment file in BAM or SAM format. BAM file should be sorted and indexed | |
| -r GENE_LIST, --genelist=GENE_LIST | |
| Gene list in bed foramt. All reads hits to exon regions (defined by this gene list) will be spilt into a separate BAM file, the remaining reads will saved into another BAM file. | |
| -o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX | |
| Prefix of output BAM files. “prefix.in.bam” file contains reads hit to the gene list specified by “-r”, “prefix.ex.bam” contains reads that cannot mapped gene list. “prefix.junk.bam” contains qcfailed or unmapped reads. | |
Example:
python2.7 split_bam.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -r hg19.rRNA.bed -o output
Output:
Total records: 44826454
output.in.bam (Reads consumed by input gene list): 5185
output.ex.bam (Reads not consumed by input gene list): 44821269
output.junk.bam (qcfailed, unmapped reads): 0