Lec 4 - Short Read Alignment

Lecture 4.
Short Read Alignment
The Chinese University of Hong Kong

CSCI3220 Algorithms for Bioinformatics
Lecture outline
1. Massively parallel sequencing and short
reads
– The short read alignment problem
2. Suffix trie/tree/array
3. Burrows-Wheeler Transform (BWT)
Last update: 30-Jul-2019 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Part 1
MASSIVELY PARALLEL SEQUENCING AND SHORT READS
DNA sequencing
• DNA sequencing is the experimental procedure used to find out the exact text string of a DNA sequence
– Input: Multiple copies of an unknown DNA
(biological) sequence
• Blood sample of a patient
• Some cultured bacteria
• A worm
• ...
– Output: (Text) sequences of fragments of the DNA
sequence
Illustration
Multiple copies of an unknown DNA biological sequence
TACCAGCGGACCGCTGAC
TACCAGCGGACCGCTGAC
TACCAGCGGACCGCTGAC
Step 1: Breaking down into fragments
Step 2: Sequencing
Text sequences of fragments:
TACCAG, GGACCG, GAC, CGCTGAC, TACCAG, CTGAC, TACCAGC, CGGAC, CGCT, CGGAC
Sequencing by synthesis
• Use one strand as template, synthesize the
other strand
• Different ways to detect what base is added:
– Give a different color for each type of nucleotide
– Supply only one type of nucleotide at a time, and
see if some signals (e.g., light) can be detected
– Stop whenever a certain nucleotide is added.
Then deduce the nucleotide by DNA lengths
• Can only handle up to a certain length of DNA
Sequencing by synthesis
Image credit: Illumina
Massively parallel sequencing (MPS)
• Sequencing many short fragments in parallel
– Also called “next-generation” or “deep” sequencing
Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010); Azco Biotech
Shotgun sequencing
• Breaking down the long DNA sequence into
multiple fragments due to the experimental
limitation
– Whole genome shotgun
– Hierarchical approach: slightly easier to get back the original sequence
Image credit: Commins et al., Biological Procedures Online 11(1):52-78, (2009)
Short reads
• The output of MPS is a list of short sequences
– Each is called a read
– Example: ACA, ATA, ATA, ATT, TAG, TAT, TTC
• Some properties of current MPS reads:
– About 100-200 nucleotides long (very short as
compared to the human genome)
– May overlap, since multiple copies of the original DNA
are sequenced
• Millions or even billions of reads from one experiment
– The DNA sample may contain variations due to
heterozygosity, somatic mutations and mixed
population of cells
– May also have contamination and sequencing errors
Computational problems
• Two main computational problems
• Sequence alignment (this lecture):
– Given a reference sequence s of length n, how to find
out the position of each read r of length m in the
reference?
– Example situations:
• Sequencing the DNA in a cancer sample – The sequence of
normal human DNA can serve as a reference
• Sequencing the DNA of a strain of bacteria – The sequences of other strains of the same bacteria can serve as a reference
• Sequence assembly (next lecture):
– Is it possible to assemble the short reads back to the
original DNA?
Short read alignment
• Example:
– Original sequence: s=TATACATTAG
– Short reads:
ACA, ATA, ATA, ATT, TAG, TAT, TTC
– Alignment:
TATACATTAG
   ACA
 ATA
 ATA
     ATT
       TAG
TAT
TTC (no exact match in s: variation or error)
Image source: http://img1.etsystatic.com/000/0/6103070/il_fullxfull.203233493.jpg
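The alignment example above can be reproduced with a brute-force scan. This is only a minimal sketch for illustration: it is not one of the indexed methods studied in this lecture, and the names align_reads and max_mismatches are hypothetical. It takes O(nm) time per read, which is exactly what an index is meant to avoid.

```python
def align_reads(reference, reads, max_mismatches=0):
    """Brute-force alignment: try every start position for every read.

    Returns {read: [(1-based start position, number of mismatches), ...]}.
    """
    hits = {}
    for read in set(reads):  # identical reads align identically
        m = len(read)
        found = []
        for start in range(len(reference) - m + 1):
            window = reference[start:start + m]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                found.append((start + 1, mismatches))
        hits[read] = found
    return hits


s = "TATACATTAG"
reads = ["ACA", "ATA", "ATA", "ATT", "TAG", "TAT", "TTC"]
exact = align_reads(s, reads)                       # exact matches only
relaxed = align_reads(s, reads, max_mismatches=1)   # allow one mismatch
```

With exact matching, TTC has no hit. Allowing one mismatch gives it two candidate positions, which illustrates why a variation or sequencing error can also make a mapping non-unique.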
Short read alignment
• Basically a local alignment problem, but need
to align millions or billions of short sequences
with a very long reference sequence,
expecting almost exact matches
• Need to build indexes on reads or reference
– Once the indexes are built, the searching time
should depend only on the size of searching
results (number of hits and their locations), not
the length of the reference
– We will mainly study methods for exact matches
Indices
• Main considerations:
– Space requirement
– Time requirement for building index
– Time requirement for searching
• Main approaches:
– Hash-table-based (Similar to FASTA and BLAST)
• BFAST, ELAND, MAQ, MOSAIK, SHRiMP, SOAP, ZOOM, ...
– Suffix-tree or Burrows-Wheeler-Transform-based
• Bowtie, BWA, SOAP2, ...
– Probabilistic structures with certain chance of wrong
answer
• Bloom Filter, Quotient Filter, …
Last update: 28-Aug-2019 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019
Hashing
• In general, hashing schemes have to face two
problems
– Some allocated space would be wasted if the k-mers do not appear in the sequence
– Collisions could occur
• A typical tradeoff between space and time
• Source of the problem: The hash function has
no knowledge of the sequence
– We now study some data structures that make
use of some information about the sequence
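A minimal sketch of a hash-table index over k-mers, assuming Python's built-in dict as the hash table (so collisions are resolved internally). The names build_kmer_index and lookup are hypothetical and are not taken from any of the tools listed earlier.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start + 1)
    return index


def lookup(index, read, k):
    """Exact lookup of a read whose length equals k."""
    if len(read) != k:
        raise ValueError("this sketch only handles reads of length k")
    return index.get(read, [])


s = "TATACATTAG"
index = build_kmer_index(s, 3)
distinct_kmers = len(index)   # how many of the 4**3 possible 3-mers occur in s
```

For this s, only 8 of the 64 possible 3-mers occur, so a table with one slot per possible k-mer would be mostly empty. This is the wasted space mentioned above.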
Part 2
SUFFIX TRIE/TREE/ARRAY
Suffixes
• Given a sequence s[1..n], a suffix is either a sub-sequence s[i..n] for
any i between 1 and n, or the empty string (which is sometimes
represented by s[n+1..n])
• Example: s[1..10]=TATACATTAG
• Suffixes:
– s[1..10] TATACATTAG
– s[2..10] ATACATTAG
– s[3..10] TACATTAG
– s[4..10] ACATTAG
– s[5..10] CATTAG
– s[6..10] ATTAG
– s[7..10] TTAG
– s[8..10] TAG
– s[9..10] AG
– s[10..10] G
– Empty string
Suffixes with end symbol
• To show the empty string and to mark where the sequence ends,
we will use the symbol $ to indicate the end of a sequence, and
define s[n+1] to be $
• Example: s[1..11]=TATACATTAG$
• Suffixes:
– s[1..11] TATACATTAG$
– s[2..11] ATACATTAG$
– s[3..11] TACATTAG$
– s[4..11] ACATTAG$
– s[5..11] CATTAG$
– s[6..11] ATTAG$
– s[7..11] TTAG$
– s[8..11] TAG$
– s[9..11] AG$
– s[10..11] G$
– s[11..11] $
Subsequence and suffixes
• Important concept: Every subsequence of s is a prefix of a suffix of s (recall: optimal local alignment)
• Example:
– s=TATACATTAG$
– The subsequence s[4..7]=ACAT is a prefix of the suffix s[4..11]=ACATTAG$
• Therefore, finding whether a short read appears in a reference sequence is equivalent to checking whether the short read is a prefix of a suffix of the reference
– To facilitate the searching of subsequences, we can put the suffixes into a tree
• Suffixes:
TATACATTAG$
ATACATTAG$
TACATTAG$
ACATTAG$
CATTAG$
ATTAG$
TTAG$
TAG$
AG$
G$
$
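A minimal sketch of the idea that every subsequence is a prefix of a suffix. It enumerates all suffixes explicitly, which is slow and only meant to make the definition concrete. The names suffixes and occurrences are hypothetical.

```python
def suffixes(s):
    """Return [(1-based start, suffix)] for s, which is assumed to end with '$'."""
    return [(i + 1, s[i:]) for i in range(len(s))]


def occurrences(s, read):
    """1-based positions where read occurs, found as a prefix of some suffix."""
    return [start for start, suffix in suffixes(s) if suffix.startswith(read)]


s = "TATACATTAG$"
acat_positions = occurrences(s, "ACAT")   # ACAT is a prefix of s[4..11]
catc_positions = occurrences(s, "CATC")   # CATC is a prefix of no suffix
```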
Suffix trie
• Tree:
– A set of nodes
– A set of edges, each connecting two nodes
– No cycles
• Suffix trie of sequence s:
– A rooted tree
– Every edge is labeled with one character from s
• Sibling nodes are ordered alphabetically, with the end-of-sequence
character $ ordered before all other characters, i.e., $ < A < C < G < T
– Every path from the root to a leaf represents a suffix of s
– Every suffix of s is represented by a path from the root to a leaf
– Suffixes share edges for their common prefixes
Suffix trie
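• s=TATACATTAG$
• Suffixes:
TATACATTAG$
ATACATTAG$
TACATTAG$
ACATTAG$
CATTAG$
ATTAG$
TTAG$
TAG$
AG$
G$
$
Figure: the suffix trie of s, with one character on each edge. Edge characters at each depth, from left to right:
Depth 1: $ A C G T
Depth 2: C G T A $ A T
Depth 3: A $ A T T C G T A
Depth 4: T C A T A $ A G
Depth 5: T A G A T C $
Depth 6: A T $ G T A
Depth 7: G T $ A T
Depth 8: $ A G T
Depth 9: G $ A
Depth 10: $ G
Depth 11: $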
Suffix trie
• s=TATACATTAG$
• To search for a length-m subsequence, simply follow the path from the root until
– The subsequence is found (the subsequence appears in s), e.g., ACAT, OR
– You cannot go any further (the subsequence does not appear in s), e.g., CATC
• Both cases take O(m) time – independent of n
– Since each node has no more than 5 children, which is a constant
• A suffix trie can be constructed in time proportional to its size
– Worst case O(n^2) nodes
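A minimal sketch of a suffix trie and the O(m) search described above, assuming nested Python dicts as nodes (one key per outgoing edge character). The names build_suffix_trie and trie_contains are hypothetical. Construction here inserts every suffix character by character, so it takes time proportional to the total length of all suffixes.

```python
def build_suffix_trie(s):
    """Build a suffix trie as nested dicts; s is assumed to end with a unique '$'."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            # Reuse the existing edge for a shared prefix, or create a new one
            node = node.setdefault(ch, {})
    return root


def trie_contains(root, query):
    """Follow the path labelled by query from the root; O(len(query)) steps."""
    node = root
    for ch in query:
        if ch not in node:
            return False   # cannot go any further: query does not appear in s
        node = node[ch]
    return True


trie = build_suffix_trie("TATACATTAG$")
root_edges = sorted(trie)   # '$' sorts before the letters in ASCII order
```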
Suffix tree
• A suffix tree is a compact form of a suffix trie, where non-branching paths are collapsed to a single edge
Figure: the suffix trie and the corresponding suffix tree of s=TATACATTAG$. Edge labels of the suffix tree (indentation shows depth):
$
A
  CATTAG$
  G$
  T
    ACATTAG$
    TAG$
CATTAG$
G$
T
  A
    CATTAG$
    G$
    TACATTAG$
  TAG$
Suffix tree
• The tree has no more than 2n nodes. Why? Hint: How many leaf nodes are there?
• The tree can be constructed in O(n) time
– We do not go into details
• How much space does each edge require?
– No need to store the long edge labels as strings in the tree. Can use pointers to the original sequence s.
– Constant space per edge
– Conclusion: O(n) space for the whole tree
Figure: the suffix tree of s=TATACATTAG$, with each edge label stored as a pair of positions (start-end) in s (indentation shows depth):
11-11: $
2-2: A
  5-11: CATTAG$
  10-11: G$
  3-3: T
    4-11: ACATTAG$
    8-11: TAG$
5-11: CATTAG$
10-11: G$
1-1: T
  2-2: A
    5-11: CATTAG$
    10-11: G$
    3-11: TACATTAG$
  8-11: TAG$
Searching using a suffix tree
Searching for TATA:
• Level 1: Go to the T node, next to find prefix match for ATA
• Level 2: Go to the A node, next to find prefix match for TA
• Level 3: Go to the TACATTAG$ node
– Done!
• Occurrence location: 3 (start index of the current node) - 2 (total number of characters in its ancestors) = 1
• How to find all occurrences of a query?
Searching using a suffix tree
Searching for A:
• Level 1: Go to the A node
– Done!
• Finding all occurrences of the query: visit all leaf nodes under the current node:
– 5 (start index of the CATTAG$ node) - 1 (total number of characters in its ancestors) = 4
– 10 (start index of the G$ node) - 1 (total number of characters in its ancestors) = 9
– 4 (start index of the ACATTAG$ node) - 2 (total number of characters in its ancestors) = 2
– 8 (start index of the TAG$ node) - 2 (total number of characters in its ancestors) = 6
Some limitations
• While a suffix tree is already quite space- and time-efficient, it also has some drawbacks:
– Some construction algorithms are quite complex
– The total space needed is usually 20n bytes or more for a sequence of length n, due to overheads originating from the tree structure and position indices
• Think about the length of the human genome and the
maximum amount of memory for a 32-bit machine
• An alternative data structure: suffix array
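A rough back-of-the-envelope calculation for the question above. It assumes a human genome of roughly 3.1 billion nucleotides, a round figure that is not given on the slide, and uses the 20 bytes per character quoted above.

```python
GENOME_LENGTH = 3_100_000_000   # assumption: about 3.1 billion nucleotides
BYTES_PER_CHARACTER = 20        # suffix tree overhead quoted on the slide
ADDRESSABLE_32_BIT = 2 ** 32    # bytes addressable with a 32-bit pointer

suffix_tree_bytes = BYTES_PER_CHARACTER * GENOME_LENGTH
ratio = suffix_tree_bytes / ADDRESSABLE_32_BIT
```

Under these assumptions the suffix tree needs about 62 billion bytes, more than ten times what a 32-bit machine can address.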
Suffix array
• An array storing the original locations of the suffixes when they are sorted in lexicographic order
– Compare the first character. If order can be determined, stop. Otherwise, move to the next character. And so on.
• s=TATACATTAG$
Before sorting:          After sorting:
Suffix       Location    Suffix       Location
TATACATTAG$  1           $            11
ATACATTAG$   2           ACATTAG$     4
TACATTAG$    3           AG$          9
ACATTAG$     4           ATACATTAG$   2
CATTAG$      5           ATTAG$       6
ATTAG$       6           CATTAG$      5
TTAG$        7           G$           10
TAG$         8           TACATTAG$    3
AG$          9           TAG$         8
G$           10          TATACATTAG$  1
$            11          TTAG$        7
• The suffix array is the Location column after sorting: 11, 4, 9, 2, 6, 5, 10, 3, 8, 1, 7
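A minimal sketch that builds the suffix array by sorting the suffixes directly. It is a conceptual method, far slower than the linear-time constructions used in practice, and the name suffix_array is hypothetical. It relies on '$' sorting before A, C, G and T, which holds for Python's default string comparison.

```python
def suffix_array(s):
    """Return the 1-based suffix array of s; s is assumed to end with a unique '$'."""
    # Sort the start positions by the suffix that begins at each position
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


sa = suffix_array("TATACATTAG$")
```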
Using a suffix array
• Recall that a subsequence of s is also a prefix of a suffix of s, therefore
– Finding a subsequence can be done by a binary search on the suffix array
– All occurrences of the subsequence in s must be located on adjacent rows
– Example: Searching for the subsequence AT
Suffix       Location  Comparison during the binary search
$            11
ACATTAG$     4
AG$          9         s[9]=A? Yes. s[9+1]=T? No. s[9+1]=G<T
ATACATTAG$   2         s[2]=A? Yes. s[2+1]=T? Yes. AT found in s!
ATTAG$       6
CATTAG$      5         s[5]=A? No. s[5]=C>A
G$           10
TACATTAG$    3
TAG$         8
TATACATTAG$  1
TTAG$        7
• How to find all occurrences of AT?
– Linear scan
– Smart binary search
• Straightforward use of suffix array requires O(m log n) time
• Can improve to O(m) time by using extended suffix array – We don’t study here
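A minimal sketch of the "smart binary search" idea, assuming two binary searches: one for the first row whose length-m prefix is at least the query, and one for the first row whose prefix is greater than the query. All occurrences lie on the rows in between. The function names are hypothetical. Each comparison looks at up to m characters, giving O(m log n) time.

```python
def suffix_array(s):
    """1-based suffix array by direct sorting (conceptual, not linear time)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_occurrences(s, sa, query):
    """Return the sorted 1-based positions of query in s using the suffix array."""
    m = len(query)

    def prefix(row):
        start = sa[row] - 1          # convert the 1-based location to a 0-based index
        return s[start:start + m]    # first m characters of the suffix on this row

    # First row whose prefix is >= query
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First row whose prefix is > query
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])      # adjacent rows first..lo-1 hold all occurrences


s = "TATACATTAG$"
sa = suffix_array(s)
at_positions = find_occurrences(s, sa, "AT")
```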
Time and space requirements
• A suffix array can be constructed in O(n) time
– Can read off from suffix tree
• If each position is stored as an integer, and each integer
takes 4 bytes, the whole suffix array needs 4n bytes
– For large n, we cannot assume 4 bytes are sufficient. In
general, each index takes log n bits. The total size is thus
O(n log n)
• Already quite good. Can we do even better?
Goal: O(n log |Σ|), where Σ is the alphabet
– For DNA, |Σ| = |{A, C, G, T}| = 4 << n
• Methods:
– Compressed suffix arrays (not discussed here)
– Burrows-Wheeler Transform (our next topic)
Part 3
BURROWS-WHEELER TRANSFORM
Burrows-Wheeler Transform (BWT)
• Proposed by Michael Burrows and David Wheeler
in 1994.
• A very compact structure that can be used for
text search
• Input: sequence s
• Conceptual* method:
– Find all rotations of s and put them in a matrix
– Sort the rows of the matrix in lexicographic order
– Output the sequence in the last column, b
*: In this lecture, whenever you see a method described as “conceptual”, it
means it is used to illustrate some key ideas, but is usually too slow or too
memory-demanding to be practical, and we will discuss better alternatives.
Rotations and transformed string
• Input: s=TATACATTAG$
Rotations:     Sorted rotations:
TATACATTAG$    $TATACATTAG
ATACATTAG$T    ACATTAG$TAT
TACATTAG$TA    AG$TATACATT
ACATTAG$TAT    ATACATTAG$T
CATTAG$TATA    ATTAG$TATAC
ATTAG$TATAC    CATTAG$TATA
TTAG$TATACA    G$TATACATTA
TAG$TATACAT    TACATTAG$TA
AG$TATACATT    TAG$TATACAT
G$TATACATTA    TATACATTAG$
$TATACATTAG    TTAG$TATACA
• Amazingly, we can use b (together with some auxiliary data structures) to check whether an input string is a subsequence of s efficiently
• Output: b=GTTTCAAAT$A
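A minimal sketch of the conceptual construction above: form all rotations, sort them, and read off the last column. The name bwt_by_rotations is hypothetical. It uses O(n^2) memory, so it only serves to illustrate the definition.

```python
def bwt_by_rotations(s):
    """Conceptual BWT: last column of the sorted rotation matrix of s (s ends with '$')."""
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort()   # '$' sorts before A, C, G, T
    return "".join(row[-1] for row in rotations)


b = bwt_by_rotations("TATACATTAG$")
```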
Rotations and transformed string
• Input: s=TATACATTAG$
Rotations:     Sorted rotations:   Suffixes     Location   Sorted suffixes  Location
TATACATTAG$    $TATACATTAG         TATACATTAG$  1          $                11
ATACATTAG$T    ACATTAG$TAT         ATACATTAG$   2          ACATTAG$         4
TACATTAG$TA    AG$TATACATT         TACATTAG$    3          AG$              9
ACATTAG$TAT    ATACATTAG$T         ACATTAG$     4          ATACATTAG$       2
CATTAG$TATA    ATTAG$TATAC         CATTAG$      5          ATTAG$           6
ATTAG$TATAC    CATTAG$TATA         ATTAG$       6          CATTAG$          5
TTAG$TATACA    G$TATACATTA         TTAG$        7          G$               10
TAG$TATACAT    TACATTAG$TA         TAG$         8          TACATTAG$        3
AG$TATACATT    TAG$TATACAT         AG$          9          TAG$             8
G$TATACATTA    TATACATTAG$         G$           10         TATACATTAG$      1
$TATACATTAG    TTAG$TATACA         $            11         TTAG$            7
• Output: b=GTTTCAAAT$A
– Did you notice the correspondence between the sorted rotations and the sorted suffixes (due to the unique $)?
– Without the $ symbol, it may not be true. Consider CAA
Things to learn about BWT
1. How to construct b conceptually (i.e., slowly)
2. How to construct b efficiently
3. Basic properties of the sorted rotations
4. Getting s back from b conceptually
5. Getting s back from b efficiently
6. Using b to search for sub-sequences of s
Quick construction of b
• The procedure we have described for constructing the output b is very slow and memory demanding
• It can be quickly obtained from the suffix array
– Recall that a suffix array can be constructed in linear time and space for a fixed alphabet
– s=TATACATTAG$ (positions 1 to 11)
– b=GTTTCAAAT$A
– The first character of b is the character before the first letter in the first row of the sorted rotations. In general, the i-th character of b is the character in s just before the start of the i-th sorted suffix
• b[1] = s[11-1] = s[10] = G
• b[2] = s[4-1] = s[3] = T
• b[3] = s[9-1] = s[8] = T
• ...
• b[10] = s[11] = $ (the suffix at location 1 has no preceding character, so it wraps around to the end of s)
Sorted rotations:   Sorted suffixes  Location
$TATACATTAG         $                11
ACATTAG$TAT         ACATTAG$         4
AG$TATACATT         AG$              9
ATACATTAG$T         ATACATTAG$       2
ATTAG$TATAC         ATTAG$           6
CATTAG$TATA         CATTAG$          5
G$TATACATTA         G$               10
TACATTAG$TA         TACATTAG$        3
TAG$TATACAT         TAG$             8
TATACATTAG$         TATACATTAG$      1
TTAG$TATACA         TTAG$            7
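A minimal sketch of the quick construction: given the suffix array, each character of b is the character just before the corresponding suffix. The suffix array is built here by plain sorting to keep the example self-contained, whereas the slide assumes a linear-time construction. The function names are hypothetical.

```python
def suffix_array(s):
    """1-based suffix array by direct sorting (conceptual, not linear time)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def bwt_from_suffix_array(s, sa):
    """b[i] = s[SA[i] - 1] in 1-based terms, wrapping around to '$' when SA[i] = 1."""
    # With 0-based Python strings, 1-based position p-1 is index p-2;
    # for p = 1 the index -1 conveniently refers to the last character, '$'.
    return "".join(s[p - 2] for p in sa)


s = "TATACATTAG$"
b = bwt_from_suffix_array(s, suffix_array(s))
```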
Properties of the sorted rotations
• Simple properties for warm-up
• Property 1: All rows in the sorted rotation matrix are different
– Due to the $ symbol
• Property 2: Every column in the matrix has the whole set of characters in s
s=TATACATTAG$
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
b=GTTTCAAAT$A
• Property 3: Different occurrences of the same character tend to cluster in b
– E.g., three of the A’s are clustered, so are three of the T’s
– Why? Because if a length-k pattern appears multiple times in s (e.g., TA), some rotations will have:
• The length-(k-1) suffix of the pattern (A) at the beginning of the rotation → these rotations will be close in the matrix (though not always next to each other – check AT)
• The first character of the pattern (T) in the last column
– Significance? Easier to perform data compression
s=TATACATTAG$
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
b=GTTTCAAAT$A
• Property 4: The input s can be obtained back
from the output b
• Conceptual method:
1. Create an empty matrix
2. Add b as the leftmost column of the matrix
3. Sort the rows of the matrix
4. Repeat 2 and 3 until the matrix has n+1 columns
5. s can be read from the first row by moving the
leading $ back to the tail
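A minimal sketch of the conceptual method above: repeatedly add b as the leftmost column and sort the rows. The name inverse_bwt_conceptual is hypothetical. Like the method itself, it is slow and memory demanding.

```python
def inverse_bwt_conceptual(b):
    """Recover s from b by rebuilding the sorted rotation matrix column by column."""
    rows = [""] * len(b)                  # step 1: empty matrix, one row per character
    for _ in range(len(b)):               # step 4: repeat until all columns are rebuilt
        # step 2: add b as the leftmost column; step 3: sort the rows
        rows = sorted(ch + row for ch, row in zip(b, rows))
    first = rows[0]                       # the row starting with '$'
    return first[1:] + first[0]           # step 5: move the leading '$' to the tail


s = inverse_bwt_conceptual("GTTTCAAAT$A")
```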
Getting back the original sequence
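• s=TATACATTAG$, b=GTTTCAAAT$A
• In the table below, the columns alternate between the matrix after adding b as the leftmost column and the matrix after sorting the rows. The wider columns continue in a second block.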
G $ G$ $T G$T $TA G$TA $TAT G$TAT $TATA G$TATA $TATAC G$TATAC $TATACA
T A TA AC TAC ACA TACA ACAT TACAT ACATT TACATT ACATTA TACATTA ACATTAG
T A TA AG TAG AG$ TAG$ AG$T TAG$T AG$TA TAG$TA AG$TAT TAG$TAT AG$TATA
T A TA AT TAT ATA TATA ATAC TATAC ATACA TATACA ATACAT TATACAT ATACATT
C A CA AT CAT ATT CATT ATTA CATTA ATTAG CATTAG ATTAG$ CATTAG$ ATTAG$T
A C AC CA ACA CAT ACAT CATT ACATT CATTA ACATTA CATTAG ACATTAG CATTAG$
A G AG G$ AG$ G$T AG$T G$TA AG$TA G$TAT AG$TAT G$TATA AG$TATA G$TATAC
A T AT TA ATA TAC ATAC TACA ATACA TACAT ATACAT TACATT ATACATT TACATTA
T T TT TA TTA TAG TTAG TAG$ TTAG$ TAG$T TTAG$T TAG$TA TTAG$TA TAG$TAT
$ T $T TA $TA TAT $TAT TATA $TATA TATAC $TATAC TATACA $TATACA TATACAT
A T AT TT ATT TTA ATTA TTAG ATTAG TTAG$ ATTAG$ TTAG$T ATTAG$T TTAG$TA
G$TATACA $TATACAT G$TATACAT $TATACATT G$TATACATT $TATACATTA G$TATACATTA $TATACATTAG

TACATTAG ACATTAG$ TACATTAG$ ACATTAG$T TACATTAG$T ACATTAG$TA TACATTAG$TA ACATTAG$TAT
TAG$TATA AG$TATAC TAG$TATAC AG$TATACA TAG$TATACA AG$TATACAT TAG$TATACAT AG$TATACATT
TATACATT ATACATTA TATACATTA ATACATTAG TATACATTAG ATACATTAG$ TATACATTAG$ ATACATTAG$T
CATTAG$T ATTAG$TA CATTAG$TA ATTAG$TAT CATTAG$TAT ATTAG$TATA CATTAG$TATA ATTAG$TATAC
ACATTAG$ CATTAG$T ACATTAG$T CATTAG$TA ACATTAG$TA CATTAG$TAT ACATTAG$TAT CATTAG$TATA
AG$TATAC G$TATACA AG$TATACA G$TATACAT AG$TATACAT G$TATACATT AG$TATACATT G$TATACATTA
ATACATTA TACATTAG ATACATTAG TACATTAG$ ATACATTAG$ TACATTAG$T ATACATTAG$T TACATTAG$TA
TTAG$TAT TAG$TATA TTAG$TATA TAG$TATAC TTAG$TATAC TAG$TATACA TTAG$TATACA TAG$TATACAT
$TATACAT TATACATT $TATACATT TATACATTA $TATACATTA TATACATTAG $TATACATTAG TATACATTAG$
ATTAG$TA TTAG$TAT ATTAG$TAT TTAG$TATA ATTAG$TATA TTAG$TATAC ATTAG$TATAC TTAG$TATACA
• Why does the procedure work?
– Essentially we are reconstructing the sorted rotation
matrix
– When the reconstruction matrix contains only one
column, after sorting it is exactly the first column of
the sorted rotation matrix
– When we add b as the new first column of the
reconstruction matrix, it is like placing the last column
of the sorted rotation matrix before the first column
– When this matrix is sorted, we get the first two
columns of the sorted rotation matrix
• Every row contains a different subsequence of s
– And so on
• s=TATACATTAG$, b=GTTTCAAAT$A
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
Reconstruction (intermediate columns omitted at "..."):
G $ G$ $T G$T $TA G$TA $TAT ... G$TATACATTA $TATACATTAG
T A TA AC TAC ACA TACA ACAT ... TACATTAG$TA ACATTAG$TAT
T A TA AG TAG AG$ TAG$ AG$T ... TAG$TATACAT AG$TATACATT
T A TA AT TAT ATA TATA ATAC ... TATACATTAG$ ATACATTAG$T
C A CA AT CAT ATT CATT ATTA ... CATTAG$TATA ATTAG$TATAC
A C AC CA ACA CAT ACAT CATT ... ACATTAG$TAT CATTAG$TATA
A G AG G$ AG$ G$T AG$T G$TA ... AG$TATACATT G$TATACATTA
A T AT TA ATA TAC ATAC TACA ... ATACATTAG$T TACATTAG$TA
T T TT TA TTA TAG TTAG TAG$ ... TTAG$TATACA TAG$TATACAT
$ T $T TA $TA TAT $TAT TATA ... $TATACATTAG TATACATTAG$
A T AT TT ATT TTA ATTA TTAG ... ATTAG$TATAC TTAG$TATACA
• Good, but the method seems very slow?
– Yes, and we will see how to get back s from b faster
• Property 5: The i-th occurrence of a character x in the last column corresponds to the i-th occurrence of x in the first column
– E.g., The second T in the last column is also the second T in the first column
• Which is the one at position 8 in s
• These T’s can have a different order in s
– Why? Consider the following:
• Order of the rotations starting with x
• Order of the rotations ending with x
Both depend on the remaining n characters
s=TATACATTAG$ (positions 1 to 11)
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
b=GTTTCAAAT$A
Applications of property 5
• First application: Getting back the original sequence fast
• b=GTTTCAAAT$A
• First column of sorted rotation matrix (by sorting characters in the last column or counting the number of occurrences of each character): $AAAACGTTTT
• Conceptual back-tracing:
– Character before $: G
– Character before G: second A (A)
– Character before second A: second T (T)
– Character before second T: fourth T (T)
– Character before fourth T: fourth A (A)
– Character before fourth A: C
– Character before C: first A (A)
– Character before first A: first T (T)
– Character before first T: third A (A)
– Character before third A: third T (T)
– Character before third T: $
• Therefore the original sequence is s=TATACATTAG$
• If we have stored the location of the first occurrence of each character in the first column, back-tracing can be done very fast (without really storing the first column).
First and last columns of the sorted rotation matrix:
Row  First  Last
1    $      G
2    A      T
3    A      T
4    A      T
5    A      C
6    C      A
7    G      A
8    T      A
9    T      T
10   T      $
11   T      A
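A minimal sketch of the fast back-tracing, assuming we precompute (1) the row of the first occurrence of each character in the first column and (2) for each row of b, which occurrence of its character it is. Property 5 then gives the next row directly. The names inverse_bwt, first_row and rank are hypothetical.

```python
def inverse_bwt(b):
    """Recover s from b in a number of steps linear in len(b), using property 5."""
    # Row (1-based) of the first occurrence of each character in the first column
    counts = {}
    for ch in b:
        counts[ch] = counts.get(ch, 0) + 1
    first_row = {}
    row = 1
    for ch in sorted(counts):             # '$' sorts before A, C, G, T
        first_row[ch] = row
        row += counts[ch]

    # rank[i]: b[i] is the rank[i]-th occurrence of that character in b
    seen = {}
    rank = []
    for ch in b:
        seen[ch] = seen.get(ch, 0) + 1
        rank.append(seen[ch])

    # Start from row 1 (the rotation starting with '$') and trace backwards
    out = ["$"]
    r = 1
    for _ in range(len(b) - 1):
        ch = b[r - 1]                     # character before the current one in s
        out.append(ch)
        r = first_row[ch] + rank[r - 1] - 1   # same occurrence in the first column
    return "".join(reversed(out))


s = inverse_bwt("GTTTCAAAT$A")
```

The first column itself is never materialized: only one number per distinct character is kept for it.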
• Second application: Text search
• Suppose we want to search for a subsequence r from s. All occurrences of r appear as prefixes in the sorted rotation matrix, and are in adjacent rows.
– For example: TA (rows 8 to 10)
• Therefore, we only need to find out the row numbers of the first and last rows that start with r
– Now we study how we can find these numbers if we only have b without materializing the rotation matrix
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
• Say we want to search for TA. Conceptually:
– From b, we can get back the first column of the rotation matrix
– We know that A appears between the 2nd and 5th rows in the first column
– We then check the corresponding entries in b, and find TA between the 1st and 3rd occurrences of T
– We can then find out their actual locations in s from the suffix array
• We can either save the array on disk or save only a portion in memory, and compute the remaining on the fly
s=TATACATTAG$
Row  First  Last  Sorted suffix  Location
1    $      G     $              11
2    A      T     ACATTAG$       4
3    A      T     AG$            9
4    A      T     ATACATTAG$     2
5    A      C     ATTAG$         6
6    C      A     CATTAG$        5
7    G      A     G$             10
8    T      A     TACATTAG$      3
9    T      T     TAG$           8
10   T      $     TATACATTAG$    1
11   T      A     TTAG$          7
• Another example: CAT
– 1st to 4th occurrences of T → rows 8-11 in the first column → 3rd to 4th occurrences of A → rows 4-5 in the first column → 1st to 1st occurrences of C → rows 6-6 in the first column
• How to make this conceptual procedure fast?
• From occurrence to row number: store the row number of the first occurrence of each character in the first column
– $: 1, A: 2, C: 6, G: 7, T: 8
– 3rd occurrence of A is on row 2+3-1 = 4
• From row number to occurrences: store the number of times a character appears up to the current row in the last column
– A: 00000123334
– Up to row 7, 2 A’s have occurred in b
– Up to row 11, 4 A’s have occurred in b
– Therefore rows 8-11 contain the 3rd and 4th occurrences of A
– With these numbers, we do not need to store the first column
– Again, may precompute only some of these numbers
s=TATACATTAG$
First column: $ A A A A C G T T T T
Last column (b): G T T T C A A A T $ A
Locations of the sorted suffixes: 11 4 9 2 6 5 10 3 8 1 7
Summary of BWT
• What do we need to store?
– The last column b of the sorted rotation matrix
• O(n) construction time by using suffix array
• O(n log|Σ|) space, where |Σ| is the size of the alphabet (4 for DNA sequences)
– Location of the first occurrence of each character in the first column
• O(|Σ| log n) construction time by using suffix array
• O(|Σ|) space
– Number of times each character occurs in the last column within the first i rows for all i
• O(n) construction time
• O(|Σ|n log n) space – Can be stored in a special way that requires much less space
– The suffix array
• O(n) construction time
• O(n log n) space – No need to reside in memory
s=TATACATTAG$
Locations of the sorted suffixes (the suffix array): 11 4 9 2 6 5 10 3 8 1 7
Summary of BWT
• Getting back the original sequence s
– Trace back in n steps, using either the suffix array
or the array that stores the location of the first
occurrence of each character
• Searching for a query sequence r
– Iteratively compute the range of rows involved for
different suffixes of r
Complete searching example
Each structure is either not stored, fully stored, stored in a special way, or stored on disk
Position: 1 2 3 4 5 6 7 8 9 10 11
s (not stored): T A T A C A T T A G $
First column (not stored): $ A A A A C G T T T T
Last column (b) (fully stored): G T T T C A A A T $ A
First occurrence position in the first column (F) (fully stored): A: 2, C: 6, G: 7, T: 8
– To handle the situation that some nucleotides do not appear in s, a more precise definition is “Number of occurrences of lexicographically smaller characters plus one”
Occurrences within the first i rows in the last column (O) (stored in a special way):
i 1 2 3 4 5 6 7 8 9 10 11
A 0 0 0 0 0 1 2 3 3 3 4
C 0 0 0 0 1 1 1 1 1 1 1
G 1 1 1 1 1 1 1 1 1 1 1
T 0 1 2 3 3 3 3 3 4 4 4
Suffix array (SA) (stored on disk): 11 4 9 2 6 5 10 3 8 1 7
An index of this kind is called an “FM-index” (Full-text index in Minute space)
Complete searching example
Position: 1 2 3 4 5 6 7 8 9 10 11
s: T A T A C A T T A G $
First column: $ A A A A C G T T T T
Last column (b): G T T T C A A A T $ A
First occurrence position in the first column (F): A: 2, C: 6, G: 7, T: 8
Occurrences within the first i rows in the last column (O):
i 1 2 3 4 5 6 7 8 9 10 11
A 0 0 0 0 0 1 2 3 3 3 4
C 0 0 0 0 1 1 1 1 1 1 1
G 1 1 1 1 1 1 1 1 1 1 1
T 0 1 2 3 3 3 3 3 4 4 4
Suffix array (SA): 11 4 9 2 6 5 10 3 8 1 7
Searching for AT:
1. O[11, T]=4 occurrences of T in total
2. In the first column these T’s appear on row F[T]=8 to row F[T]+4-1=11
3. On row 8-1=7, O[8-1, A]=2 A’s have appeared in last column
4. On row 11, O[11, A]=4 A’s have appeared in last column
5. Therefore these rows cover the (2+1)=3rd to the 4th occurrences of A
6. In the first column these A’s appear on row F[A]+3-1=4 to row F[A]+4-1=5
7. These AT’s appear at position SA[4]=2 and position SA[5]=6 of the input string s
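A minimal sketch of the complete search, assuming F, O and SA are stored as plain Python dicts and lists rather than in the compact forms mentioned above. The query is processed from its last character to its first, narrowing a range of rows at each step. The names build_index and search are hypothetical.

```python
def build_index(s):
    """Build b, F, O and SA for s (s is assumed to end with a unique '$')."""
    sa = sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])   # 1-based locations
    b = "".join(s[p - 2] for p in sa)       # character before each sorted suffix
    alphabet = sorted(set(s))
    # F[x]: number of lexicographically smaller characters in s, plus one
    F = {x: sum(1 for c in s if c < x) + 1 for x in alphabet}
    # O[x][i]: occurrences of x within the first i rows of b; O[x][0] = 0
    O = {x: [0] for x in alphabet}
    for c in b:
        for x in alphabet:
            O[x].append(O[x][-1] + (1 if c == x else 0))
    return b, F, O, sa


def search(query, F, O, sa):
    """Return the sorted 1-based positions of query in s."""
    top, bottom = 1, len(sa)                # start with all rows
    for x in reversed(query):
        if x not in F:
            return []
        first = O[x][top - 1] + 1           # first occurrence of x within rows top..bottom
        last = O[x][bottom]                 # last occurrence of x within rows top..bottom
        if first > last:
            return []                       # x does not precede any current match
        top = F[x] + first - 1              # same occurrences in the first column
        bottom = F[x] + last - 1
    return sorted(sa[r - 1] for r in range(top, bottom + 1))


b, F, O, sa = build_index("TATACATTAG$")
at_positions = search("AT", F, O, sa)
```

For AT, the row range goes from 1-11 to 8-11 (rows starting with T) and then to 4-5 (rows starting with AT), matching steps 1 to 7 above.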
Inexact matching
• So far we have been studying exact matching
• Some inexact matching strategies:
– Search for exact matches of length-k subsequences of
the query r, then combine the results
– Search for exact matches of sequences that are within
a certain distance from r
• We have seen how to do that with a hash table
• For a suffix tree, we need to traverse the tree with
backtracking
• For BWT, we do something equivalent to traversing the suffix
tree, but some bounds can be calculated to reduce the
amount of traversal needed
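A minimal sketch of the first strategy (exact matches of shorter pieces, then combining the results). It assumes the read is cut into d+1 non-overlapping seeds when up to d mismatches are allowed, so that at least one seed must match exactly. Python's str.find stands in for the exact-match index here, and each candidate position is verified by counting mismatches. The names find_all and seed_and_verify are hypothetical.

```python
def find_all(text, pattern):
    """All 0-based start positions of pattern in text (stand-in for an index lookup)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def seed_and_verify(reference, read, max_mismatches):
    """Return [(1-based position, mismatches)] with at most max_mismatches mismatches."""
    m = len(read)
    n_seeds = max_mismatches + 1
    seed_len = m // n_seeds
    if seed_len == 0:
        raise ValueError("read too short for this many mismatches")

    # Exact search for each seed gives candidate start positions for the whole read
    candidates = set()
    for j in range(n_seeds):
        offset = j * seed_len
        seed = read[offset:offset + seed_len]
        for hit in find_all(reference, seed):
            start = hit - offset
            if 0 <= start <= len(reference) - m:
                candidates.add(start)

    # Combine the results: verify every candidate against the full read
    results = []
    for start in sorted(candidates):
        window = reference[start:start + m]
        mismatches = sum(1 for a, c in zip(read, window) if a != c)
        if mismatches <= max_mismatches:
            results.append((start + 1, mismatches))
    return results


hits = seed_and_verify("TATACATTAG", "TTC", max_mismatches=1)
```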
Epilogue
CASE STUDY, SUMMARY AND FURTHER READINGS
Case study: Competitions
• Many different methods have been proposed for
performing short read alignments
– Which one is the best?
• In order to propose a new method, you need to
show that the method has some advantages over
the previous ones
– Consumes less memory (theoretically or in practice)
– Runs faster
– Provides more features (e.g., inexact matches)
– Is simpler
– Has an efficient implementation
– ...
• One potential problem: Cherry picking
– How can we know that one is not showing the
best results of his/her method, based on a
carefully chosen set of data and parameter values?
• May perform well only in this setting
– Benchmark datasets
• Can still “overfit” if you know the answers
– Public competitions
• Some famous public competitions related to
bioinformatics:
– Assemblathon for sequence assembly
– CAPRI (Critical Assessment of PRediction of Interactions)
– CASP (Critical Assessment of protein Structure Prediction)
– DREAM (Dialogue for Reverse Engineering Assessments
and Methods)
– RGASP (RNAseq Genome Annotation Assessment Project)
– Some competitions on TopCoder
– Some of the KDD Cup competitions associated with the
yearly KDD (Knowledge Discovery and Data Mining)
conference
• Some have attractive awards!
Summary
• Massively parallel sequencing allows the sequencing of
many short DNA fragments in parallel, achieving high
throughput
– Many: millions or even billions
– Short: In the order of a hundred nucleotides
• Different strategies to map the short reads to a
reference:
– Hash table: Tradeoff between space (unused slots) and
time (resolving collisions)
– Suffix trie/tree/array: Compact structures, proportional to
the length of the indexed sequence
– Burrows-Wheeler Transform (BWT): Similar to suffix array, but usually requires less space in main memory because it needs fewer bits per input character and can be compressed
Other practical issues
• We have only focused on methods for finding a
short read from a long sequence efficiently. In
real applications, there are many other issues:
– Non-unique mapping: One read can map to multiple
places of s
– Incorporating quality scores from sequencing
machines
– Larger structural variants (e.g., indels) that cannot be
handled by inexact matches
– Parallelization using multiple cores/machines
– ...
Further readings
• Chapter 3 of Algorithms in Bioinformatics: A
Practical Introduction
– More details about suffix trees (such as suffix links)
– Additional applications (such as finding longest
common prefix of two sequences)
– More detailed complexity analyses
– Free slides available
• A paper that describes a method called BWA for
aligning short sequencing reads using BWT
– Li and Durbin, Fast and accurate short read alignment
with Burrows-Wheeler transform. Bioinformatics
25(14):1754-1760, (2009)
