RNA-seq analysis
Denis Puthier, Claire Rioualen and Jacques van Helden
Transcriptome analysis
Tentative definition
Transcriptome: the set of all RNA produced by a cell
Some players of the RNA world
miRNA
Regulatory RNA (acts mostly by binding the 3' UTR of target genes)
snRNA
Uridine-rich
Several are related to splicing mechanism
Some are found in the nucleolus (snoRNA)
Related to rRNA biogenesis
Some players of the RNA world
eRNA
Enhancer RNA
And many others (e.g. lncRNA)
Microarray drawbacks
Cross-hybridization
Probe design issues
Limited content: can only measure the expression of known transcripts, not discover new ones
Indirect record of expression level
Complementary probes
Relative abundance
Even more powerful technology: RNA-seq
RNA-seq simplified overview
Ribo-depletion vs poly-A selection
Protocol variations in library construction
Unstranded library construction
(1) RNA fragments
(2) Reverse transcription and RNA degradation
(3) Second-strand synthesis with dUTP
(4) Ligation of adapters
Stranded library construction
(1) RNA fragments
(2) Reverse transcription and RNA degradation
(3) Second-strand synthesis with dUTP
(4) Ligation of adapters
Example of stranded single-end RNA-seq alignment
Forward (Red)
Reverse (Blue)
Cd3g (strand -) Cd3d (strand +)
Example of unstranded single-end RNA-seq alignment
Forward (Red)
Reverse (Blue)
Stranded RNA-Seq result
[Figure: number of reads on the + (Watson) and - (Crick) strands, shown with the transcript models]
Stranded RNA-seq makes it possible to extract the signal produced from both strands
Sequencing variation: single-end vs paired-end
RNA-seq library preparation: PE vs SE
Paired-end vs Single-end
Better reconstruction of transcripts with paired-end
Paired-end is more expensive
PE should be preferred
Bioinformatic workflow: overview
...
Our dataset
Loading the dataset
Protocol
1. In the upper left corner, click on Unnamed history and rename this workspace to DM1.
2. Select Shared Data > Data Libraries > P5424 > DM1 > DM1_chr18_R1.fq > Import this
dataset into selected history.
3. In the new window select DM1 as Destination history. Click on Import library dataset.
4. Select Analyze Data in the upper menu. Using the pencil icon, rename the dataset to DM1_R1.
5. Select Shared Data > Data Libraries > P5424 > DM1 > DM1_chr18_R2.fq > Import this
dataset into selected history. In the new window select DM1 as Destination history. Click on
Import library dataset.
6. Select Analyze Data in the upper menu. Using the pencil, rename the dataset to DM1_R2.
7. Click the eye icon to display the content of DM1_R1 file.
Q1: How is the quality encoded?
Q2: What can you say about the quality of the first encountered reads?
Name your history (for example: DM1)
Get a dataset from a Shared Data Library
Import a file from the shared library to your history
- DM1_chr18-20Mto50M_R1.fq
- DM1_chr18-20Mto50M_R2.fq
Imported files should now appear in DM1 history
Change (simplify) file names in your history
The raw data are provided in fastq format
Performing FastQC analysis (raw data)
Protocol
1. Use NGS: QC and manipulation > FastQC:Read QC.
2. Select the first fastq file (DM1_R1) and press Execute.
3. Display the data for the corresponding FastQC result (use the view (eye) icon above the dataset
name in the right panel).
4. Carefully inspect all the statistics.
5. Perform the same operation for DM1_R2 file.
FastQC result page
Trimming
Protocol
1. Search for the Sickle tool using the Galaxy search engine (upper left corner).
2. Set Single-End or Paired-End reads to Paired-end (two separate files).
3. Set forward reads to DM1_R1
4. Set reverse reads to DM1_R2
5. Set Quality Threshold to 20, Length Threshold to 25.
6. Execute.
7. Rename Paired-End forward strand output of Sickle to DM1_R1_trim
8. Rename Paired-End reverse strand output of Sickle to DM1_R2_trim
9. Rename Paired-End singleton output of Sickle to DM1_singleton_trim
10. Perform a new FastQC analysis using the trimmed read as input
Q: How many reads do you retrieve after trimming? How does this compare with the input fastq
files?
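The effect of the two thresholds used above (quality 20, length 25) can be illustrated with a small sketch. This is not Sickle's actual algorithm (Sickle works with a sliding window); it is a simplified 3'-end trimmer, and the function name `trim_3prime` and the example read are hypothetical.

```python
def trim_3prime(seq, qual, q_threshold=20, min_length=25, offset=33):
    """Cut a read at the first base whose Phred score is below q_threshold.

    Simplified illustration only (NOT Sickle's sliding-window method).
    Returns (trimmed_seq, trimmed_qual), or None when the trimmed read is
    shorter than min_length and should therefore be discarded.
    """
    scores = [ord(c) - offset for c in qual]  # Phred+33 decoding
    cut = len(seq)
    for i, q in enumerate(scores):
        if q < q_threshold:
            cut = i
            break
    if cut < min_length:
        return None
    return seq[:cut], qual[:cut]


# Hypothetical 30-base read whose last three bases have a low quality ('#' = 2)
print(trim_3prime("A" * 30, "I" * 27 + "###"))
```

A read that becomes shorter than the length threshold is dropped, which is why the number of reads after trimming can be lower than in the input files.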
Trimming: sickle input form
Alignment: splice-aware aligners
Reads that overlap several exons may not be mapped properly by
splice-unaware aligners (e.g. Bowtie)
[Figure: fragments aligned to a genome with exons E1, E2, E3]
Splice-aware aligners?
Reads that overlap several exons may not be mapped properly by
splice-unaware aligners (e.g. Bowtie)
[Figure: fragments aligned to a genome with exons E1, E2, E3]
We will obtain spliced reads (gapped alignments)
Example of splice-aware aligners
TopHat
Part of a complete pipeline (the tuxedo pipeline)
Calls Bowtie to perform initial, unspliced alignments
STAR
Developed in the context of the ENCODE project
Very fast (much faster than TopHat)
Needs ~30 GB of memory for the human/mouse genome
Based on an uncompressed suffix array index
Usage is painful
Compatible with the tuxedo pipeline
Aligned reads
Stranded paired-end sequencing on total RNA (contains immature RNA)
Alignment performed with TopHat
Gene: Il2RA
Getting the sequence of mouse chromosome 18 at UCSC
Most of the time the Galaxy server will provide you with an already indexed genome that TopHat can use
to perform read alignment. In this practical, we restrict the alignment to mouse chromosome 18 (this will
be faster). We thus need to download the sequence of mouse chromosome 18. This sequence will be
provided to TopHat in the subsequent steps (TopHat will index the sequence internally by calling
bowtie-build).
NB: the chromosome sequence can also be obtained from the Ensembl FTP site.
Protocol
1. In your browser, open a connection to the UCSC ftp site for the mouse genome (assembly mm9):
http://hgdownload.soe.ucsc.edu/goldenPath/mm9/chromosomes/
2. Copy the link address of chr18.fa.gz.
3. Select Tools > Get Data > Upload File.
4. Click Paste / Fetch data.
5. In the text area (URL/Text) paste the link to the chr18 sequence.
6. Select fasta as File Format and mm9 as a reference genome.
7. Press Start. After a few seconds the query is submitted to your Galaxy server and the Status appears in green.
Click Close to quit the Download window.
8. Once the sequence has been fetched from UCSC, rename the record in the history to chr18_mm9.fa.
Gene Transfer Format (GTF)
My GTF file format is rich
chr source type start end score strand frame attributes (keys/values)
chr6 refGene exon 80837264 80837341 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80838878 80838946 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80877395 80877528 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80878592 80878747 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80880999 80881107 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80910651 80910748 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80912819 80912929 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
chr6 refGene exon 80982852 80982938 . + . transcript_id "NM_183050"; gene_id "BCKDHB";
In order to provide TopHat with the location of known exons in the mouse genome, we will
download a file in GTF format (Gene Transfer Format). You can get more information about this
format on the UCSC or GENCODE web sites. GTF files can be obtained from the UCSC table
browser or from the Ensembl FTP site.
NB: it is very important at this step to ensure that the fasta file and the GTF file are
obtained from the same genome release. The chromosome sequences and gene positions vary
between releases.
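As an illustration of the nine-column layout shown above, here is a minimal sketch of a GTF record parser. The function name `parse_gtf_line` is hypothetical, and the sketch assumes tab-separated columns and simple `key "value";` attributes, as in the refGene example.

```python
def parse_gtf_line(line):
    """Split one GTF record into its 9 columns and parse the attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GTF record has 9 tab-separated columns")
    chrom, source, feature, start, end, score, strand, frame, attr_text = cols
    attributes = {}
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    return {
        "chrom": chrom, "source": source, "feature": feature,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "frame": frame,
        "attributes": attributes,
    }


# First exon of the example above, rebuilt with tab separators
record = parse_gtf_line("\t".join([
    "chr6", "refGene", "exon", "80837264", "80837341", ".", "+", ".",
    'transcript_id "NM_183050"; gene_id "BCKDHB";',
]))
exon_length = record["end"] - record["start"] + 1
print(record["attributes"]["gene_id"], exon_length)
```

Because GTF coordinates are 1-based and inclusive, the exon length is end - start + 1.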
Protocol
1. Select Shared Data > Data Libraries > P5424_chr18-20Mto50M > GTF >
chr18_20M-50M_gencode_vM1.gtf > Import this dataset into selected history. In the
new window select DM1 as Destination history. Click on Import library dataset.
2. Select Analyze Data in the upper menu.
3. Check the first lines of the GTF file. What kind of information is enclosed in this file?
GTF file content
https://genome.ucsc.edu/FAQ/FAQformat.html#format4
Read mapping with TopHat
Protocol
1. Select NGS: Mapping > Tophat from the toolbox.
2. Set data type to Paired-end (individual datasets).
3. Set RNA-Seq FASTQ file, forward reads to DM1_R1_trim. Set reverse reads to
DM1_R2_trim.
4. Set Use a built in reference genome or one from your history to Use a genome from
history.
5. Set Select the reference genome to chr18_mm9.fa.
6. Set TopHat settings to use to Full parameter list.
a. Set Maximum number of alignments to be allowed to 1.
b. Set Library Type to FR First strand (this is determined by library construction kit).
c. Set Use Own Junctions to yes. Set Use Gene Annotation Model from History.
Set Gene Model Annotations to chr18_20M-50M_gencode_vM1.gtf
7. Press Execute. Rename the accepted_hits dataset to DM1_alignments. Rename the
splice-junction bed file to DM1_splice_junctions.bed.
Mapping with TopHat
Checking the number of aligned reads
Exercise
We will use samtools flagstat to assess the number of aligned reads available in the bam file.
1. Select Statistics > flagstat.
2. Select the BAM file and press Execute.
Q1: Look carefully at the statistics. What is the meaning of each record?
NB: "Properly paired" means that both mates of a read pair map to the same chromosome, in
forward-reverse orientation, with an insert size compatible with the fragment size.
Loading TopHat results with the Integrative Genomics Viewer (IGV)
Protocol
1. In Galaxy, select the TopHat accepted hits and download the dataset (bam) and its index (bai).
2. Select the splice junctions file (control_splice_junctions.bed, renamed DM1_splice_junctions.bed
in the TopHat step) and download the bed file.
If this file does not have a .bed extension, rename it to add one.
3. Open a connection to the Integrative Genomics Viewer (IGV) download page:
https://software.broadinstitute.org/software/igv/download
4. Download and open IGV, or launch it with 750 MB or 1.2 GB of memory depending on your machine.
5. In IGV, load the mm9 genome.
6. Select the menu function File > Load from file, locate the download folder on your
computer, select the bam file, and click Open.
7. Note: you do not need to load the bam index (bai) file because it will be loaded automatically
with the bam file of the same name.
8. In the same way, load the splice junction file (bed format).
Download the aligned reads (bam, bai) and splice junctions (bed)
Load the mouse genome without its sequence in IGV (do not activate the download sequence option!)
Load your TopHat results in IGV: load the bam file but not the bai (it will be loaded automatically
with the bam), then the splice junction bed file.
Loaded TopHat results
Viewing the results with the Integrative Genomics Viewer (IGV)
Protocol
1. Select mm9 as a genome and browse to chromosome 18.
Note: the download sequence option should be inactive.
2. Go to the Egr1 gene (by typing Egr1 in the GO text area). Zoom to view alignments.
3. In the left panel right click on the bam track and select View as pairs.
4. In the left panel right click on the bam track and set Color Alignments by > Read
strand.
5. Load control_splice_junctions.bed into IGV (File > Load from file).
6. Zoom out to view the number of alignments supporting exon junctions.
7. Hover over a junction in the control_splice_junctions.bed track. What does the Depth value
represent?
Select a chromosome (here, chr18)
Select a gene (e.g. Egr1)
Select viewing options
Viewing results with IGV
Exercise
1. What is the strand of the Egr1 gene?
2. Regarding reads, what do the blue and pink colors indicate?
3. Hover over a paired read. What are the meanings of the following tags/keys:
a. CIGAR? Mapped? Mapping quality? Secondary? Duplicate? Mate-is mapped?
Insert-size? Pair-orientation? First in pair? Second in pair?
4. What is the meaning of:
a. NH? NM?
5. Mouse over several paired alignments on Egr1. What are the values of the
pair-orientation keys?
Viewing results with IGV
Exercise
1. Go to the internal exons of Etf1 (this gene is located just 40 kb away, on the 3' side of Egr1).
a. What is the strand of Etf1? What are the values of the pair-orientation key on paired
alignments?
b. Look at additional gene examples.
2. What can you conclude regarding paired alignment values?
a. How would you isolate the signal emitted from the plus and minus strands?
3. Looking at Nr3c1, you will find some signal extending from the 5' region.
a. Is it produced by the plus or the minus strand?
BAM files are fat
BAM files are fat because they contain exhaustive information about read
alignments.
Memory issues (only a fraction of the BAM can be visualized at once).
We need a more lightweight file format containing only genomic coverage
information:
Wig (not compressed, not indexed)
Coverage files (wig, bigwig, tdf)
Window:   wi  wi+1  wi+2  wi+3
Coverage: 4   7     3     1
A lightweight format, especially when compressed
Fast access when indexed
Wiggle example:
fixedStep chrom=chr1 start=100 step=50 span=50
4
7
3
1
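The fixedStep layout can be sketched in a few lines. The function name `to_fixedstep_wig` is hypothetical; the values are those of the example above (four windows of 50 bp starting at position 100 on chr1).

```python
def to_fixedstep_wig(chrom, start, step, coverage, span=None):
    """Format per-window coverage values as a fixedStep wiggle block."""
    header = f"fixedStep chrom={chrom} start={start} step={step}"
    if span is not None:
        header += f" span={span}"
    # One value per line, one line per window
    return "\n".join([header] + [str(value) for value in coverage])


wig_text = to_fixedstep_wig("chr1", 100, 50, [4, 7, 3, 1], span=50)
print(wig_text)
```

Only the header stores coordinates; each following line holds a single value, which is what makes the format lightweight compared with a BAM file.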
Creating a bigwig file - (1) getting chromosome sizes
Protocol
1. In Galaxy, use Get Data > UCSC Main table browser.
2. Set : Clade to Mammal, Genome to Mouse, assembly to July 2007 (NCBI37/mm9), group to All tables,
database to mm9 and table to chromInfo.
3. Set output format to all fields from selected table and Send output to Galaxy.
4. Click get output. In the result web page press Send query to galaxy.
5. Rename the dataset to mm9_chrom_info_txt.
Q1: What does this file contain? Check the information in each column of the result.
Q2: Remove the column header line (starting with a #) and the random chromosomes from the text file.
Q3: Use the Galaxy tools Text Manipulation > Cut and Statistics > Summary Statistics > Column or
expression > select C2 to compute the median size of a mouse chromosome.
Getting gene coordinates from UCSC table browser
Remove the header line
Remove random chromosomes
Creating a bigwig track (2) - create a wiggle
Protocol
1. Use the BAM to Wiggle tool to convert the BAM file to a wiggle format (the uncompressed and
unindexed version of the BigWig format).
2. Select the tophat accepted hits as input file.
3. Set Chromosome size file to mm9_chrom_info_txt.
4. Set Strand-specific to Paired-end RNA-Seq.
5. Set Pair-End Read Type to read1 (positive > negative; negative > positive), read2 (positive
> positive; negative > negative).
6. Press Execute.
7. Rename the output tracks to DM1_Fwd_wig and DM1_Rev_wig
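The read-type rule selected in step 5 can be written as a small function. This is a sketch, assuming standard SAM flag bits (0x10 read reverse strand, 0x40 first in pair, 0x80 second in pair); the function name `transcript_strand` is hypothetical.

```python
READ_REVERSE = 0x10    # read aligned to the reverse strand
FIRST_IN_PAIR = 0x40   # read1
SECOND_IN_PAIR = 0x80  # read2


def transcript_strand(flag):
    """Strand of the originating transcript under the rule of step 5.

    read1: positive > negative, negative > positive (strand is flipped)
    read2: positive > positive, negative > negative (strand is kept)
    """
    read_strand = "-" if flag & READ_REVERSE else "+"
    if flag & FIRST_IN_PAIR:
        return "-" if read_strand == "+" else "+"
    if flag & SECOND_IN_PAIR:
        return read_strand
    raise ValueError("flag describes neither read1 nor read2")


# Flags of two typical properly paired alignments
for flag in (99, 147, 83, 163):
    print(flag, transcript_strand(flag))
```

Both mates of a pair are assigned to the same transcript strand, which is what allows the signal to be split into a forward and a reverse coverage track.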
Convert bam to wiggle
Creating a bigwig track (3)
Protocol
For each one of the two wiggle files:
...
Searching for novel transcript models
[Figure: fragments aligned to a genome with exons E1, E2, E3]
Searching for novel transcript models: cufflinks
[Figure: read pairs and gapped alignments]
Searching for novel transcripts with cufflinks
Protocol
1. In the toolbox, select NGS: RNA Analysis > Cufflinks.
2. Select the Tophat accepted hits file as SAM or BAM file of aligned RNA-Seq reads.
3. Set Use Reference Annotation to Use reference annotation as guide.
4. Set Reference Annotation to chr18_20M-50M_gencode_vM1.gtf.
5. Set Set advanced Cufflinks options to Yes.
6. Set Library prep used for input read to fr-firststrand.
7. Press Execute.
8. Rename assembled transcript dataset to DM1_cufflinks_transcripts.
Q1: Have a look at the assembled transcripts file produced by cufflinks. What are the attributes
provided?
Q2: Move downwards in the assembly table and check the gene_id and transcript_id attributes.
Why do some of them start with ENS and others with CUFF?
Q3: Download the assembled transcript file produced by cufflinks and load it into IGV. What
can we say about the transcripts produced by the Pura gene?
Cufflinks parameters
Extracting a workflow
Protocol
1. Create a new history: History > Create new and rename it PI1.
2. Select Shared Data > Data Libraries > P5424_chr18-20Mto50M
3. Check the box besides the genome_data folder.
This folder contains 3 files required as input for the workflow:
a. the sequence of mouse chromosome 18: chr18_mm9_fa
b. annotations for the selected region of chromosome 18:
chr18_20M-50M_gencode_vM1.gtf
c. mouse chromosome sizes: mm9_chr_size.txt.
4. Check the
5. Check the PI1 folder, which contains the two fastq files with the reads R1 and
R2 of the PI1 sample.
6. Click To History and import it in the PI1 history. This will import all the files
contained in the two selected folders
7. Click on Analyze Data to go back to your history (PI1). You should see the five
datasets.
Running the workflow on another sample
Protocol
1. In the top menu, select workflow > RNA-Seq mapping and transcript discovery > edit.
2. Have a look at your new workflow. Check the input files and figure out which of the files
imported in the previous slide should be used where.
3. Select workflow > RNA-Seq mapping and transcript discovery > run. Set the proper
input files.
4. Click Run workflow at the bottom of the page.
5. Rename the bigwig files to PI1_Fwd_cov.bigwig and PI1_Rev_cov.bigwig, respectively.
6. Rename assembled transcript dataset to PI1_cufflinks_transcripts.
7. Rename the accepted_hits dataset to PI1_tophat_alignments.
8. Rename the splice-junction bed file to PI1_tophat_splice_junctions.bed.
9. Save these 5 results on your computer, and load them in IGV.
Q1: Go to the Egr1 gene. What can you see?
Creating a workspace to compare two samples
Protocol
1. Create a new history entitled DM versus PI.
2. Import the following files from
Shared Libraries > P5424_chr18-20Mto50M > genome_data
a. The gtf file (chr18_20M-50M_gencode_vM1.gtf).
b. The chromosome sizes (mm9_chr_size.txt).
3. Click Analyse data and use History > Dataset actions > Copy datasets to copy
the following datasets in this history.
a. The assembled transcripts from cufflinks, which should be renamed
PI1_cufflinks_transcripts and DM1_cufflinks_transcripts, resp.
b. The tophat accepted hits (that should have been renamed PI1_alignments and
DM1_alignments) for all samples.
Merging the reference and inferred genomic annotations
We now have at least three different GTF files (depending on whether you have
processed DM2,DM3,PI2,PI3):
The reference annotation
The discovered transcripts in the control sample(s).
The discovered transcripts in the activated sample(s).
We will ask cuffmerge to merge the novel annotations (obtained through cufflinks)
with the reference (known annotation) and to classify the transcripts. It will
annotate transcripts by producing a GTF file containing flags. Some of these flags
may indicate that:
The transcript is unknown (class code u).
The transcript is a novel isoform of a known transcript (class code j).
The transcript is the same as the original/known transcript (class code =).
For a full description of all possible flags (class codes), please refer to the
cuffmerge web site (section Transfrag class codes).
Here we will concentrate on retrieving the position of novel transcripts.
Combining and comparing transcripts with cuffmerge
Protocol
1. Come back to Analyse Data, and select cuffmerge in the tool search box.
2. In the option GTF file(s) produced by Cufflinks, select the assembled transcript files
(DM1_transcripts).
3. Set Use Reference Annotation to Yes.
4. Set Reference Annotation to chr18_20M-50M_gencode_vM1.gtf.
5. Press Execute.
6. Rename the result cuffmerge_transcripts_PI_and_DM.
Selecting unknown transcripts discovered by cuffmerge
Protocol
1. Use Filter and sort > Select lines that match an expression.
2. In cuffmerge_transcripts_PI_and_DM select lines matching the pattern class_code "u".
Note: the letter "u" stands for unknown genes (not present in the reference annotations).
3. Rename the result unknown_genes_PI_and_DM
4. Merge the novel gene annotations with the reference annotations
(chr18_20M-50M_gencode_vM1.gtf) using Text Manipulation > Concatenate datasets
(you will need to click Insert datasets to specify the second dataset).
5. Rename the file to enhanced_annotations.gtf.
Q1: How many transcripts were classified as unknown?
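Outside Galaxy, the selection and concatenation steps above amount to a text filter. A minimal sketch follows; the function name `select_unknown` and the three example records (including the CUFF identifiers and coordinates) are hypothetical.

```python
def select_unknown(gtf_lines, code="u"):
    """Keep the GTF lines whose attributes contain class_code "<code>"."""
    pattern = f'class_code "{code}"'
    return [line for line in gtf_lines if pattern in line]


# Hypothetical merged annotation (attribute column shortened for readability)
merged = [
    'chr18\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "XLOC_1"; class_code "=";',
    'chr18\tCufflinks\texon\t900\t990\t.\t-\t.\tgene_id "XLOC_2"; class_code "u";',
    'chr18\tCufflinks\texon\t500\t650\t.\t+\t.\tgene_id "XLOC_3"; class_code "j";',
]
unknown = select_unknown(merged)

# Concatenating reference and novel records gives the enhanced annotation
reference = ['chr18\tgencode\texon\t100\t200\t.\t+\t.\tgene_id "ENSMUSG_example";']
enhanced = reference + unknown
print(len(unknown), len(enhanced))
```

Note that this counts GTF lines (exons), not transcripts; counting transcripts would require collecting distinct transcript_id values.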
Merging reference and novel annotations
Quantification
Counting reads per gene
Protocol
We assume that you have aligned the reads for samples DM1 (control) and PI1 (activated cells).
...
Protocol
NB: Differential analysis will be done on the full expression matrix provided as shared library.
1. Create a new history named Differential_expression.
2. Go to Shared Data > Data Libraries > P5424_chr18-20Mto50M > COUNTS.
3. Import DM_vs_PI_gene_counts.txt into the Differential_expression history.
4. Click Analyse data and select Differential_Count from the toolbox.
5. Set Title for job outputs to P5424_PI_vs_DM.
6. Set Treatment Name to PMA_Ionomycine.
7. Select columns containing treatment to PI1, PI2, PI3.
8. Set control name to DMSO.
9. Select columns containing control to DM1, DM2, DM3.
10. Set Run this model using edgeR to Do not run edgeR.
11. Set Do not run DESeq2 to Run DESeq2.
12. Set Run Voom to Do not run Voom.
13. Set FDR (Type II error) control method to fdr.
14. Click Execute.
Q1: What is the first plot?
Q2: What can you say from the second plot?
Q3: Look at the produced html file. What can you infer from the heatmap?
Extracting significant genes
Protocol
1. Select the tool Filter data on any column using simple expressions.
2. Use DifferentialCounts_topTable_DESeq2.xls as a dataset.
3. Set With following condition to abs(c3) > 1 and c7 < 0.01 and Execute.
4. Rename the output DESeq2_DEG_logFC1_padj0.01
5. With the tool Cut columns from a table, select the first column of the
DESeq2_DEG_logFC1_padj0.01 table, in order to obtain the list of gene names.
6. Rename the output DESeq2_DEG_logFC1_padj0.01_genes
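The Galaxy filter expression abs(c3) > 1 and c7 < 0.01 can be mimicked as follows. This sketch assumes, as the output name suggests, that column 3 holds the log fold change and column 7 the adjusted p-value; the function name `filter_rows` and the example rows are hypothetical.

```python
def filter_rows(rows, lfc_col=3, padj_col=7, min_abs_lfc=1.0, max_padj=0.01):
    """Keep rows where abs(c<lfc_col>) > min_abs_lfc and c<padj_col> < max_padj.

    Column numbers are 1-based, as in the Galaxy expression.
    """
    kept = []
    for row in rows:
        try:
            lfc = float(row[lfc_col - 1])
            padj = float(row[padj_col - 1])
        except (ValueError, IndexError):
            continue  # header line, NA values or truncated rows
        if abs(lfc) > min_abs_lfc and padj < max_padj:
            kept.append(row)
    return kept


# Hypothetical table: gene, baseMean, log fold change, three unused columns, padj
table = [
    ["gene", "baseMean", "logFC", "c4", "c5", "c6", "padj"],
    ["GeneA", "120", "2.5", ".", ".", ".", "0.0001"],
    ["GeneB", "300", "-1.8", ".", ".", ".", "0.004"],
    ["GeneC", "80", "0.4", ".", ".", ".", "0.0002"],
    ["GeneD", "45", "3.1", ".", ".", ".", "NA"],
]
significant = filter_rows(table)
gene_names = [row[0] for row in significant]  # equivalent of cutting column 1
print(gene_names)
```

Using the absolute value keeps both up- and down-regulated genes; the sign of column 3 can then be used to split them into two lists.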
An example list.
PubMed query for all of them?
What is the biological meaning of a gene list?
Is my list enriched in genes whose function is known?
N genes in total
m genes known to be associated with a term/function T (white)
n genes not associated with the term/function T (black)
k selected genes (e.g. upregulated in the tumor compared to its normal counterpart)
x genes associated with term/function T among the k selected genes

         Term    !Term      Total
List     x       k-x        k
!List    m-x     n-(k-x)    N-k
Total    m       n          N

What is the probability of observing x genes associated with term/function T among the k selected genes?
X follows a hypergeometric distribution
Hypergeometric test / Fisher exact test
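With the notation of the table above, the enrichment p-value is the probability of observing at least x annotated genes among the k selected ones. A minimal sketch using only the standard library (the function names and the toy numbers are illustrative assumptions):

```python
from math import comb


def hypergeom_pmf(x, N, m, k):
    """P(X = x): x annotated genes among k drawn, m annotated among N."""
    return comb(m, x) * comb(N - m, k - x) / comb(N, k)


def enrichment_pvalue(x, N, m, k):
    """One-sided p-value P(X >= x) of the hypergeometric test."""
    upper = min(m, k)
    return sum(hypergeom_pmf(i, N, m, k) for i in range(x, upper + 1))


# Toy example: N = 10 genes, m = 4 annotated with T, k = 3 selected, x = 2 hits
p_value = enrichment_pvalue(2, N=10, m=4, k=3)
print(p_value)
```

The upper tail is summed because the question is whether the list contains more genes annotated with T than expected by chance.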
Where are these lists coming from?
gProfiler - A web server for functional interpretation
http://biit.cs.ut.ee/gprofiler/
Getting the list of up/down-regulated genes
Exercise
1. Using Galaxy, extract the lists of up- and down-regulated genes into two separate datasets.
2. Use these lists as two separate inputs for gProfiler (http://biit.cs.ut.ee/gprofiler/).
3. Which functional terms appear significant with the hypergeometric test?
a. For the up-regulated genes ?
b. For the down-regulated genes ?
Thank you
Unstranded RNA-seq library limitations
+ (Watson)
>>>>>> Ea1 >>>>> >>>>>> Ea2 >>>>> >>>>>> Ea3 >>>>>
- (Crick)
<<<<<<<<<<<< Eb1 <<<<<<<<<<<<<<
UNSTRANDED: ambiguous reads and non-ambiguous reads; ambiguous reads should be discarded from counting
STRANDED
Quantification
Quantification is most generally performed at the gene level.
Some specialized software may provide you with transcript abundance estimations.
Cufflinks (tuxedo pipeline)
Kallisto
Known issues
Positive association between gene counts and length.
May be problematic for gene-wise comparisons.
Suggests higher expression of longer genes.
Unstranded data may lead to ambiguous reads that should be discarded.
Intersample normalization: library size
Inter-sample normalization is a prerequisite for differential expression analysis.
This normalization is mostly applied because of some imbalance in read counts
between samples.
Example
Sample 1 has twice as many reads as sample 2 (24 vs 12)
Gene expression will appear higher in sample 1 even though expression is unchanged.
A basic normalisation factor could be the library size (total number of reads).
However, this might lead to biases (see next slides).
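The library-size idea can be sketched with the totals of the example (24 vs 12 reads). The per-gene counts below are hypothetical, chosen only so that they sum to these totals; the function name is illustrative.

```python
def scale_by_library_size(counts, target=1_000_000):
    """Rescale raw counts so that each library sums to `target` reads."""
    total = sum(counts)
    return [count * target / total for count in counts]


# Hypothetical per-gene counts: sample 1 was sequenced twice as deeply
sample1 = [8, 4, 6, 6]   # 24 reads in total
sample2 = [4, 2, 3, 3]   # 12 reads in total

norm1 = scale_by_library_size(sample1)
norm2 = scale_by_library_size(sample2)
print(norm1)
print(norm2)
```

After scaling, the two samples give the same values, because here the only difference between them is sequencing depth. When a few highly expressed genes change between conditions, this simple factor becomes biased.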
[Figure: ratio (sample2/sample1) for each gene, with gene G5 labelled; the displayed ratios are 0.5]
TMM Normalization (Robinson and Oshlack, 2010)
Trimmed Mean of M values
Outline
Compute the M values (log ratios).
Take the trimmed mean of the M values as the scaling factor.
Multiply read counts by the scaling factor (the factors are rescaled so that they multiply to one).
If there are more than two columns
The library whose 3rd quartile is closest to the mean of the 3rd quartiles is used as the reference.
Very similar to RLE
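The outline above can be sketched as follows. This is a deliberately simplified version: it takes an unweighted trimmed mean of the M values only, whereas the published method also trims on absolute expression and uses precision weights. The function name `tmm_factor`, the 30% trim and the counts are assumptions made for the illustration.

```python
from math import log2


def tmm_factor(sample, reference, trim=0.3):
    """Simplified TMM: 2 ** (trimmed mean of the M values).

    M = log2 of the ratio of library-size-scaled counts (sample / reference).
    Unweighted sketch; genes with a zero count in either library are skipped.
    """
    n_sample, n_reference = sum(sample), sum(reference)
    m_values = sorted(
        log2((s / n_sample) / (r / n_reference))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    drop = int(len(m_values) * trim)  # number of values trimmed at each end
    kept = m_values[drop:len(m_values) - drop]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts: only the last gene differs between the two libraries
reference = [10, 20, 30, 40, 1000]
sample = [10, 20, 30, 40, 100]
factor = tmm_factor(sample, reference)
print(factor)
```

Trimming removes the gene that differs strongly, so the factor reflects the majority of genes whose expression is unchanged.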
Intra-sample normalization
Here the objective is to compare the expression levels of genes within the same
sample
Counts?
Problem with long transcripts
They produce many fragments
They will appear artifactually highly expressed compared to others
Proposed method
RPKM
Reads per kilobase per million mapped reads (SE)
FPKM
Fragments per kilobase per million mapped reads (PE)
RPKM/FPKM normalization
A 2 kb transcript with 3000 alignments in a sample of 10 million mappable
reads
RPKM = 3000/(2 * 10) = 150
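The same calculation as a function, using the numbers of the example (the function name `rpkm` is illustrative):

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# 2 kb transcript, 3000 alignments, 10 million mappable reads
value = rpkm(3000, 2000, 10_000_000)
print(value)  # 150.0
```

For paired-end data, FPKM uses the same formula with fragments (read pairs) counted instead of individual reads.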
Differential expression analysis
Ontologies for almost everything!
https://www.bioontology.org/
BioPortal at http://bioportal.bioontology.org/
Network analysis through data mining
GeneMania (http://genemania.org/)
Yet other applications of RNA-seq
Thank you
TopHat pipeline
RNA-seq reads are mapped against the whole reference genome (Bowtie).
TopHat allows Bowtie to report more than one alignment for a read (default=10),
and suppresses all alignments for reads that have more than this number.
Reads that do not map are set aside (initially unmapped reads, or IUM reads).
TopHat then assembles the mapped reads using the assembly module in Maq.
An initial consensus of mapped regions (islands) is computed.
The ends of exons in the pseudo-consensus will initially be covered by few reads
(most reads covering the ends of exons will also span splice junctions).
TopHat therefore adds a small amount of flanking sequence on each side of each island
(default=45 bp).
TopHat pipeline
Weakly expressed genes should be poorly covered
Exons may have gaps
To map reads to splice junctions, TopHat first enumerates all canonical
donor and acceptor sites within the island sequences (as well as their
reverse complements).
Next, TopHat considers all pairings of these sites that could form
canonical (GT-AG) introns between neighboring (but not necessarily
adjacent) islands.
By default, TopHat examines potential introns longer than 70 bp and shorter than 20,000 bp
(more than 93% of mouse introns in the UCSC known gene set fall within this range).
Sequences flanking potential donor/acceptor splice sites within neighboring
regions are joined to form potential splice junctions.
Reads are then mapped onto this junction library.
Mapping reads spanning exons
Bowtie, a very popular aligner (for unspliced alignments)
Burrows-Wheeler Transform-based algorithm
Two phases: seed and extend.
The Burrows-Wheeler Transform of a text T, BWT(T), can be constructed as follows:
The character $ is appended to T, where $ is a character not in T that is
lexicographically less than all characters in T.
The Burrows-Wheeler Matrix of T, BWM(T), is obtained by computing the matrix whose
rows comprise all cyclic rotations of T sorted lexicographically.
Example with T = acaacg:
Cyclic rotations      Sorted rotations, BWM(T)
acaacg$  1            $acaacg  7
caacg$a  2            aacg$ac  3
aacg$ac  3            acaacg$  1
acg$aca  4            acg$aca  4
cg$acaa  5            caacg$a  2
g$acaac  6            cg$acaa  5
$acaacg  7            g$acaac  6
BWT(T) = gc$aaac (the last column of BWM(T))
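The construction described above can be reproduced in a few lines. This naive version sorts all rotations explicitly, which is fine for a toy string but not how a real index is built; the function name `bwt` is illustrative.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via explicit sorting of all cyclic rotations."""
    if sentinel in text:
        raise ValueError("the sentinel must not occur in the text")
    t = text + sentinel
    # Rows of the Burrows-Wheeler Matrix, sorted lexicographically
    # ('$' sorts before the letters in ASCII order)
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    # BWT(T) is the last column of the matrix
    return "".join(row[-1] for row in rotations)


print(bwt("acaacg"))  # gc$aaac
```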
Bowtie principle
Burrows-Wheeler Matrices have a property called the Last First (LF) Mapping.
The ith occurrence of character c in the last column corresponds to
the same text character as the ith occurrence of c in the first
column
Example: searching AAC in ACAACG
[Figure: backward search over the sorted rotations, listed in row order 7, 3, 1, 4, 2, 5, 6]
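The LF mapping is what allows a pattern to be matched from its last character to its first. Below is a toy sketch of this backward search; the function names are hypothetical, the index is rebuilt naively on each call, and Bowtie's real implementation is far more compact and handles mismatches.

```python
def bwt_index(text):
    """Return (row order, last column) of the sorted rotations of text + '$'."""
    t = text + "$"
    order = sorted(range(len(t)), key=lambda i: t[i:] + t[:i])
    last = "".join(t[i - 1] for i in order)  # character preceding each row start
    return order, last


def backward_search(text, pattern):
    """0-based start positions of exact occurrences of pattern in text."""
    order, last = bwt_index(text)
    first = sorted(last)
    start_row = {}  # first row of the matrix beginning with each character
    for row, ch in enumerate(first):
        start_row.setdefault(ch, row)
    lo, hi = 0, len(last)  # current range of rows, [lo, hi)
    for ch in reversed(pattern):
        if ch not in start_row:
            return []
        # LF mapping: rank of ch in the last column -> row in the first column
        lo = start_row[ch] + last[:lo].count(ch)
        hi = start_row[ch] + last[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(order[row] for row in range(lo, hi))


print(backward_search("acaacg", "aac"))  # [2], i.e. the 3rd base of the text
```

Each step narrows the range of rows whose prefix matches the part of the pattern processed so far: c, then ac, then aac.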
Some key results of ENCODE analysis
15 cell lines studied
RNA-seq, CAGE-seq, RNA-PET
Long RNA-seq (76) vs short (36)
Subnuclear compartments
chromatin, nucleoplasm and nucleoli
The world of long non-coding RNA (lncRNA)
Long: i.e. cDNA of at least 200 bp
A considerable fraction (29%) of lncRNAs are detected in only one of the cell
lines tested (vs 7% of protein coding)
10% expressed in all cell lines (vs 53% of protein-coding genes)
More weakly expressed than coding genes
The nucleus is the center of accumulation of ncRNAs
Some lncRNAs are functional
Some results regarding their implication in cancer
May help recruitment of chromatin modifiers
May also reveal the underlying activity of enhancers
A large fraction are divergent transcripts
The Gencode database (hs/mm)
Aligner output: SAM/BAM files
SAM = Sequence Alignment/Map
BAM: binary/compressed version of SAM
Store information related to alignments
Read alignment coordinates
Mapping quality
CIGAR String
Bitwise FLAG
read paired, read mapped in proper pair, read unmapped, ...
...
Bitwise flag
Several pieces of information are encoded in the 2nd column (FLAG) of the
SAM/BAM file:
read paired
read mapped in proper pair
read unmapped
mate unmapped
read reverse strand
mate reverse strand
first in pair
second in pair
not primary alignment
...
These binary values are stored in a single column.
Bitwise flag
00000000001 2^0 = 1 (read paired)
00000000010 2^1 = 2 (read mapped in proper pair)
00000000100 2^2 = 4 (read unmapped)
00000001000 2^3 = 8 (mate unmapped)
00000010000 2^4 = 16 (read reverse strand)
00000001001 2^0+ 2^3 = 9 (read paired, mate unmapped)
00000001101 2^0+2^2+2^3 = 13 (read paired, read unmapped, mate unmapped)
...
http://picard.sourceforge.net/explain-flags.html
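Decoding a flag is a matter of testing each bit. A minimal sketch (the function name `decode_flag` is illustrative; the bit values are those of the SAM specification):

```python
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """List the properties whose bit is set in a SAM bitwise flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


for flag in (9, 13, 99):
    print(flag, decode_flag(flag))
```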
The extended CIGAR string
Example operations:
M alignment match (can be a sequence match or mismatch!)
I insertion to the reference
D deletion from the reference
http://samtools.sourceforge.net/SAM1.pdf
Reference: ATTCAGATGCAGTA
Read:      ATTCA--TGCAGTA    CIGAR: 5M2D7M
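A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below also computes how many reference and read bases an alignment covers; the function names are illustrative, and the second CIGAR string is a hypothetical spliced read.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_length(cigar):
    """Number of reference bases spanned (M, D, N, = and X consume reference)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("5M2D7M"), reference_length("5M2D7M"), read_length("5M2D7M"))
# Hypothetical spliced read: the N operation is the intron gap
print(reference_length("30M500N20M"), read_length("30M500N20M"))
```

Spliced (gapped) alignments produced by splice-aware aligners use the N operation for the skipped intron.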
RNA-seq: library construction (simplified)
Illumina sequencing general principle
http://www.illumina.com/company/video-hub/HMyCqWhwB8E.html
The Sanger quality score
Sanger quality score (Phred quality score): measures the quality of each base call
Based on p, the probability of error (the probability that the corresponding base
call is incorrect).
Qsanger = -10 log10(p)
p = 10^(-Q/10)
Example: p = 0.01 <=> Qsanger = 20
Quality scores are encoded as ASCII characters with an offset of 33.
Note that SRA has adopted the Sanger quality score, although original fastq files may use
a different quality encoding (see: http://en.wikipedia.org/wiki/FASTQ_format)
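The two formulas are inverses of each other, as this short sketch shows (function names are illustrative):

```python
from math import log10


def error_prob_to_phred(p):
    """Q = -10 * log10(p)"""
    return -10 * log10(p)


def phred_to_error_prob(q):
    """p = 10 ** (-Q / 10)"""
    return 10 ** (-q / 10)


print(error_prob_to_phred(0.01))  # the example above: p = 0.01 <=> Q = 20
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Each increase of 10 in Q divides the error probability by 10, so Q = 30 corresponds to one expected error in 1000 base calls.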
ASCII 33
Storing Phred scores as single characters gives a simple and space-efficient
encoding:
Character ! means a quality of 0
Character " means a quality of 1
Character # means a quality of 2
...
Range 0-40
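Decoding a quality line therefore amounts to subtracting 33 from the ASCII code of each character. A minimal sketch (function names are illustrative):

```python
def decode_qualities(quality_string, offset=33):
    """Convert a fastq quality line into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def encode_qualities(scores, offset=33):
    """Convert a list of Phred scores back into a fastq quality line."""
    return "".join(chr(score + offset) for score in scores)


print(decode_qualities('!"#I'))  # [0, 1, 2, 40]
```

This helps answer the first question of the practical: if the quality line contains characters such as ! or #, whose codes are below 64, the file cannot be using the older offset-64 encoding.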
Quality control for high throughput sequence data
Quality control
First step of analysis
Ensure proper quality of sequencing experiment.
Quality control with FastQC program
[Figure: FastQC quality plots; axis labels: Quality, Nb Reads, Mean Phred Score]
Tools to create reproducible workflows
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
E.g. make, Snakemake, Galaxy, Taverna...
http://www.bioconductor.org/help/course-materials/2009/EMBLJune09/Talks/RNAseq-Paul.pdf
Galaxy server (https://usegalaxy.org/)
Interface to a computing cluster
Highly flexible
Large palette of bioinformatic programs
Easy to add your own
Fully reproducible workflows
Snakemake
A make-like solution