transcriptomics: from
sampling to data analysis
Outline
1. Introduction to transcriptomes
2. Sample collection
3. RNA extraction methods and RNA quality assessment and quantification
4. RNA sequencing techniques
5. Bioinformatic Analyses - Typical pipeline: Quality assessment, trimming,
6. Special type of analyses: mapping onto genome, quantification of expression, variant calling (SNPs)
Transcriptomes give us information about gene expression
Identify differentially expressed genes, identify functional changes…
Pros
• Easy, accessible way to see and quantify gene expression
• Immediate access to the protein coding portion of the genome
• Identify alternative splicing
• Identify Single Nucleotide Polymorphisms (SNPs) in coding regions
Cons
• Snapshot in time (different times, different expression patterns)
• Absence of a gene does not mean it is not present in the genome.
• Difficult to ensure that you have sampled a single cell type.
• Statistical analysis is highly dependent on experimental design.
The stage of gene expression we capture
RNAseq captures the mature messenger RNA (mRNA)
Targets the characteristic poly-A tail of the mRNA
The assumption is that the amount of mRNA for any gene is reflective of its impact on the cell function
Sampling design
VERY IMPORTANT: what is your research question?
-- will you have enough to address your question?
Things to bear in mind:
• What tissues to target – relevant to your research question
• Homogeneous sampling of tissues - to the extent you can manage
• Replicates – account for variation and are important to validate results
• Developmental stage of studied individuals
• Consult sequencing specialists (Ian Carr and Steve Moss) for advice on sampling
Some techniques commonly used to stabilise RNA
• Snap freezing (liquid nitrogen) – immediate storage at -80°C.
• RNAlater (Ambion) – small pieces of tissue (< 0.5 cm in length) placed in 5 volumes of RNAlater. Long-term storage: -20°C or -80°C.
• NAP buffer (“homemade”) – similar to RNAlater.
• Other commercial products customised to sample types (e.g. blood)
Preserving tissue with RNAlater
Followed by RNAlater and NAP buffer
Camacho-Sanchez et al. 2013. Molecular Ecology Resources 13, 663–673
[Figure: RNA extraction steps. Labels: Tissue, Separate phases, Bind total RNA, Elute]
UV Spectroscopy
• Measures absorbance of diluted RNA sample at 260 and 280 nm
• Nucleic acid concentration is calculated using Beer-Lambert law
A = ε C l (A: absorbance, ε: extinction coefficient, C: concentration, l: path length)
e.g. A260 = 1.0 is equivalent to ~40 μg/mL RNA
A260/A280 ratio indicates RNA purity
• 1.8-2.1 indicates highly purified RNA
IMPORTANT CONSIDERATIONS:
• pH
• Cuvette
• RNA dilution range
• Does not discriminate between DNA and RNA (use RNase-free DNase to remove contaminating DNA)
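The A260 conversion and the A260/A280 purity check above can be sketched in a few lines of Python. This is a minimal illustration, not a lab protocol: the function names, the example absorbance readings and the 50-fold dilution are hypothetical, and a 1 cm path length is assumed.

```python
def rna_concentration_ug_per_ml(a260, dilution_factor=1.0,
                                path_length_cm=1.0, factor=40.0):
    """Estimate RNA concentration (ug/mL) from absorbance at 260 nm.

    Uses the rule of thumb from the slide: A260 = 1.0 ~ 40 ug/mL RNA
    (assumed here to hold for a 1 cm path length).
    """
    return (a260 / path_length_cm) * factor * dilution_factor


def purity_ratio(a260, a280):
    """A260/A280 ratio, used as an indicator of RNA purity."""
    return a260 / a280


def looks_pure(ratio, low=1.8, high=2.1):
    """True if the ratio falls in the 1.8-2.1 range quoted on the slide."""
    return low <= ratio <= high


# Hypothetical readings for a sample diluted 50-fold before measurement.
conc = rna_concentration_ug_per_ml(a260=0.25, dilution_factor=50)
ratio = purity_ratio(a260=0.25, a280=0.125)
print(conc, ratio, looks_pure(ratio))
```

Because absorbance at 260 nm does not distinguish RNA from DNA, a good ratio here says nothing about DNA contamination.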
[Figure labels: RNA isolation, isolated RNA]
• READ LENGTH
• DEPTH OF COVERAGE
  • Determined by number of samples (libraries) in one lane
  • 3X more reads per sample = 1.5X cost increase = ~25% more differentially expressed genes detected
• REPLICATES, RANDOMISATION AND MULTIPLEXING
Intermediate genomes (Drosophila, C. elegans)   10      70-130    50-100   SE or PE for positional info
Large genomes (human/mouse)                     15-25   100-200   >100     SE or PE for positional info
• Liu, Y., Zhou, J., and White, KP. (2014) RNA-seq differential expression studies: more sequence or more replication? Bioinformatics Feb 1;30(3):301-4
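Since depth is set by how many libraries share a lane, the trade-off can be sketched with simple arithmetic. The lane yield (300 million reads) and the per-sample target used below are hypothetical numbers chosen only for illustration, and the sketch assumes libraries are pooled in equal amounts.

```python
def reads_per_sample(lane_reads, n_libraries):
    """Expected reads per library when n_libraries share one lane equally."""
    if n_libraries < 1:
        raise ValueError("need at least one library per lane")
    return lane_reads / n_libraries


def max_libraries_per_lane(lane_reads, target_reads_per_sample):
    """Largest number of libraries that still reaches the target depth."""
    return int(lane_reads // target_reads_per_sample)


# Hypothetical lane yield; real yields depend on the instrument and run.
lane_yield = 300_000_000

print(reads_per_sample(lane_yield, 12))
print(max_libraries_per_lane(lane_yield, 10_000_000))
```

In practice pooling is never perfectly even, so it may be worth leaving some margin below the computed maximum.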
Bioinformatics - Analysis of
transcriptomic data
Pasteurella in Saiga Antelope host
2 objectives:
1) Get expression level of virulent Pasteurella genes (counting reads)
2) Identify other possible mutations (variant calling)
Transcriptomic pipeline
NGS data – what it looks like
Sequencing quality check
Fastq quality score: Q = -10 log10 P, where P is the probability that the base call is incorrect
Quality score   Probability of incorrect identification   Accuracy of base identification
40              1 in 10000                                 99.99 %
30              1 in 1000                                  99.9 %
20              1 in 100                                   99 %
10              1 in 10                                    90 %
FastQC: visualisation
Trimmomatic: trim reads
Cutadapt: remove adaptors
[Figure: FastQC interface]
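The relationship Q = -10 log10 P can be checked with a short Python sketch. The helper names are hypothetical, and the decoding step assumes the common Phred+33 ASCII encoding of FASTQ quality strings; older files may use a different offset.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with quality score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Quality score for an error probability p (Q = -10 log10 P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    Assumes Phred+33 encoding (each character's ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in qual]


def mean_quality(qual, offset=33):
    """Arithmetic mean of the Phred scores in a quality string."""
    scores = decode_quality_string(qual, offset)
    return sum(scores) / len(scores)


for q in (40, 30, 20, 10):
    p = phred_to_error_prob(q)
    print(q, p, f"{(1 - p) * 100:.2f} % accuracy")

print(decode_quality_string("I5+!"))
```

Tools such as Trimmomatic use scores like these to decide where to trim a read; the sketch only covers the score conversion, not a trimming algorithm.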
Mapping reads to Pasteurella genome
Output file:
BAM (Binary Alignment Map): compressed binary version of SAM.
SAM (Sequence Alignment Map): tab-delimited text.
Picard Tools
Samtools
SAM format and alignment statistics
SAM format
Statistics:
Samtools ‘flagstat’
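To make the link between the SAM format and alignment statistics concrete, here is a minimal Python sketch that reads the FLAG field (column 2) of each SAM record and tallies a few categories. It is only a rough approximation of the kind of summary Samtools ‘flagstat’ reports, not a reimplementation; the example records, read names and reference name are hypothetical.

```python
# Bit values of the SAM FLAG field, as defined in the SAM specification.
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def summarise_sam(lines):
    """Tally basic alignment categories from SAM text lines."""
    stats = {"total": 0, "mapped": 0, "properly_paired": 0,
             "secondary": 0, "supplementary": 0, "duplicates": 0}
    for line in lines:
        # Header lines start with '@'; skip them and any blank lines.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        stats["total"] += 1
        if not flag & UNMAPPED:
            stats["mapped"] += 1
        if flag & PROPER_PAIR:
            stats["properly_paired"] += 1
        if flag & SECONDARY:
            stats["secondary"] += 1
        if flag & SUPPLEMENTARY:
            stats["supplementary"] += 1
        if flag & DUPLICATE:
            stats["duplicates"] += 1
    return stats


# Hypothetical records (11 mandatory columns each).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tref1\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII",
    "r1\t147\tref1\t200\t60\t10M\t=\t100\t-110\tACGTACGTAC\tIIIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t256\tref1\t500\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
print(summarise_sam(example))
```

For BAM files the records would first need to be converted to text or read through a dedicated library, since BAM is binary.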
Mpileup file
SAM file → Samtools ‘mpileup’ → Mpileup file
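A line of mpileup output summarises every read base covering one reference position, which is what variant callers work from. The Python sketch below parses one such line and counts bases, assuming the standard text pileup columns (reference name, position, reference base, depth, read bases, base qualities). The function names and the example line are hypothetical, and reference-skip symbols (`>` and `<`) are simply ignored.

```python
from collections import Counter


def count_bases(ref_base, bases):
    """Count the bases observed at one position of a pileup string."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Start of a read: the next character is its mapping quality.
            i += 2
            continue
        if ch == "$":
            # End of a read: carries no base information.
            i += 1
            continue
        if ch in "+-":
            # Indel: sign, length in digits, then that many bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if ch in ".,":
            counts[ref_base.upper()] += 1      # match on either strand
        elif ch == "*":
            counts["*"] += 1                   # base deleted in this read
        elif ch.upper() in "ACGTN":
            counts[ch.upper()] += 1            # mismatch on either strand
        i += 1
    return counts


def parse_mpileup_line(line):
    """Split one pileup line and attach per-base counts."""
    chrom, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "depth": int(depth), "counts": count_bases(ref, bases)}


# Hypothetical pileup line: 7 reads, two of which show a T instead of A.
record = parse_mpileup_line("ref1\t100\tA\t7\t..,^I.T$t+2AG,\tIIIIIII")
print(record["counts"], record["counts"]["T"] / record["depth"])
```

A real variant caller such as Varscan also weighs base and mapping qualities and strand balance, which this count does not.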
Count reads mapping a region
Sample 1 Sample 2
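Counting the reads that map to a region amounts to an interval-overlap test per read, repeated for each sample. The sketch below is a deliberately simplified Python illustration with hypothetical coordinates (1-based, inclusive); tools such as HTSeq-count apply more careful rules for reads that span several features or map ambiguously.

```python
def count_reads_in_region(read_intervals, region_start, region_end):
    """Count reads whose alignment overlaps [region_start, region_end].

    read_intervals is an iterable of (start, end) tuples, 1-based inclusive.
    """
    return sum(1 for start, end in read_intervals
               if start <= region_end and end >= region_start)


# Hypothetical alignments for two samples on the same reference sequence.
sample_1 = [(90, 139), (120, 169), (300, 349)]
sample_2 = [(10, 59), (100, 149), (140, 189), (195, 244)]

# Hypothetical gene spanning positions 100-200.
gene = (100, 200)

for name, reads in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    print(name, count_reads_in_region(reads, *gene))
```

Raw counts like these are not directly comparable between samples sequenced to different depths, so some normalisation is usually applied before comparing expression.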
Compare reference to ‘sample’ Pasteurella
IGV
Summary
FastQC
Trimmomatic/Cutadapt
BWA, Bowtie, STAR, Tophat
Samtools
  • Flagstat
  • Mpileup
HTSeq-count
Varscan
Need help?
Sequencing advice:
Ian Carr
Steve Moss
M. O’Connell
Simon Goodman