Last update: 30-Jul-2019 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019 2
Part 1
Illustration
Multiple copies of an unknown DNA biological sequence
TACCAGCGGACCGCTGAC
TACCAGCGGACCGCTGAC
TACCAGCGGACCGCTGAC
Sequencing by synthesis
• Use one strand as template, synthesize the
other strand
• Different ways to detect what base is added:
– Give a different color for each type of nucleotide
– Supply only one type of nucleotide at a time, and
see if some signals (e.g., light) can be detected
– Stop whenever a certain nucleotide is added.
Then deduce the nucleotide by DNA lengths
• Can only handle up to a certain length of DNA
Massively parallel sequencing (MPS)
• Sequencing many short fragments in parallel
– Also called “next-generation” or “deep” sequencing
Image credit: Metzker, Nature Reviews Genetics 11:31-46, (2010); Azco Biotech
Shotgun sequencing
• Breaking down the long DNA sequence into
multiple fragments due to the experimental
limitation
Short reads
• The output of MPS is a list of short sequences
– Each is called a read
– Example: ACA, ATA, ATA, ATT, TAG, TAT, TTC
• Some properties of current MPS reads:
– About 100-200 nucleotides long (very short as
compared to the human genome)
– May overlap, since multiple copies of the original DNA
are sequenced
• Millions or even billions of reads from one experiment
– The DNA sample may contain variations due to
heterozygosity, somatic mutations and mixed
population of cells
– May also have contamination and sequencing errors
Computational problems
• Two main computational problems
• Sequence alignment (this lecture):
– Given a reference sequence s of length n, how to find
out the position of each read r of length m in the
reference?
– Example situations:
• Sequencing the DNA in a cancer sample – The sequence of
normal human DNA can serve as a reference
• Sequencing the DNA of a bacterial strain – The sequence
of other strains of the same bacterium can serve as a reference
• Sequence assembly (next lecture):
– Is it possible to assemble the short reads back to the
original DNA?
Short read alignment
• Example:
– Original sequence: s=TATACATTAG
– Short reads:
ACA, ATA, ATA, ATT, TAG, TAT, TTC
– Alignment:
TATACATTAG
   ACA
 ATA
 ATA
     ATT
       TAG
TAT
      TTC   (last base differs from the reference: variation or error)
Image source: http://img1.etsystatic.com/000/0/6103070/il_fullxfull.203233493.jpg
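The alignment in this example can be reproduced with a brute-force scan. The sketch below is only an illustration (the function name `best_hits` is made up here, and no index is used): it slides each read along the reference and reports the 1-based start positions with the fewest mismatches.

```python
def best_hits(reference, read):
    """Return (fewest mismatches, 1-based start positions achieving it)."""
    m = len(read)
    best, positions = m + 1, []
    for start in range(len(reference) - m + 1):
        window = reference[start:start + m]
        # Count positions where the read and the reference window differ.
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best:
            best, positions = mismatches, [start + 1]
        elif mismatches == best:
            positions.append(start + 1)
    return best, positions


s = "TATACATTAG"
for r in ["ACA", "ATA", "ATT", "TAG", "TAT", "TTC"]:
    print(r, best_hits(s, r))
```

Every read except TTC has an exact match. TTC has two equally good placements with one mismatch (positions 3 and 7), a small instance of non-unique mapping. The scan costs O(nm) per read, which is why the rest of the lecture builds indexes.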
Short read alignment
• Basically a local alignment problem, but need
to align millions or billions of short sequences
with a very long reference sequence,
expecting almost exact matches
• Need to build indexes on reads or reference
– Once the indexes are built, the searching time
should depend only on the length of the read and the
size of the searching results (number of hits and their
locations), not the length of the reference
– We will mainly study methods for exact matches
Indices
• Main considerations:
– Space requirement
– Time requirement for building index
– Time requirement for searching
• Main approaches:
– Hash-table-based (Similar to FASTA and BLAST)
• BFAST, ELAND, MAQ, MOSAIK, SHRiMP, SOAP, ZOOM, ...
– Suffix-tree or Burrows-Wheeler-Transform-based
• Bowtie, BWA, SOAP2, ...
– Probabilistic structures with certain chance of wrong
answer
• Bloom Filter, Quotient Filter, …
Last update: 28-Aug-2019 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2019 14
Hashing
• In general, hashing schemes have to face two
problems
– Some allocated space would be wasted if the k-
mers do not appear in the sequence
– Collisions could occur
• A typical tradeoff between space and time
• Source of the problem: The hash function has
no knowledge of the sequence
– We now study some data structures that make
use of some information about the sequence
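A minimal sketch of a hash-based k-mer index, assuming a Python dictionary as the hash table (collision handling is then done internally by the dictionary, and the helper name `build_kmer_index` is made up for this example):

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i + 1)
    return index


index = build_kmer_index("TATACATTAG", 3)
print(len(index), "of", 4 ** 3, "possible 3-mers occur")
print(index.get("ATA"), index.get("TTC"))
```

Only 8 of the 64 possible 3-mers occur in this reference. A table with one pre-allocated slot per possible k-mer would leave the other 56 slots empty, which is the wasted space mentioned above.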
Part 2
SUFFIX TRIE/TREE/ARRAY
Suffixes
• Given a sequence s[1..n], a suffix is either a sub-sequence s[i..n] for
any i between 1 and n, or the empty string (which is sometimes
represented by s[n+1..n])
• Example: s[1..10]=TATACATTAG
• Suffixes:
– s[1..10] TATACATTAG
– s[2..10] ATACATTAG
– s[3..10] TACATTAG
– s[4..10] ACATTAG
– s[5..10] CATTAG
– s[6..10] ATTAG
– s[7..10] TTAG
– s[8..10] TAG
– s[9..10] AG
– s[10..10] G
– Empty string
Suffixes with end symbol
• To show the empty string and to mark where the sequence ends,
we will use the symbol $ to indicate the end of a sequence, and
define s[n+1] to be $
• Example: s[1..11]=TATACATTAG$
• Suffixes:
– s[1..11] TATACATTAG$
– s[2..11] ATACATTAG$
– s[3..11] TACATTAG$
– s[4..11] ACATTAG$
– s[5..11] CATTAG$
– s[6..11] ATTAG$
– s[7..11] TTAG$
– s[8..11] TAG$
– s[9..11] AG$
– s[10..11] G$
– s[11..11] $
Subsequence and suffixes
• Important concept: Every subsequence of s (here meaning a
contiguous run of characters, i.e., a substring) is a prefix
of a suffix of s (recall: optimal local alignment)
• Example:
– s=TATACATTAG$
– The subsequence s[4..7]=ACAT is a prefix of the suffix
s[4..11]=ACATTAG$
• Therefore, finding whether a short read appears in a
reference sequence is equivalent to checking whether the
short read is a prefix of a suffix of the reference
– To facilitate the searching of subsequences, we can put
the suffixes into a tree
• Suffixes:
– TATACATTAG$
– ATACATTAG$
– TACATTAG$
– ACATTAG$
– CATTAG$
– ATTAG$
– TTAG$
– TAG$
– AG$
– G$
– $
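The "prefix of a suffix" idea can be checked directly. This is a naive sketch for illustration only (function names are made up here); it enumerates all suffixes explicitly, which takes quadratic space and is exactly what the tree structures that follow avoid.

```python
def suffixes(s):
    """Return (1-based location, suffix) pairs for every suffix of s."""
    return [(i + 1, s[i:]) for i in range(len(s))]


def occurrences(reference, read):
    """Locations of the suffixes that have read as a prefix."""
    return [loc for loc, suffix in suffixes(reference) if suffix.startswith(read)]


s = "TATACATTAG$"
print(occurrences(s, "ACAT"))  # the suffix at location 4 starts with ACAT
print(occurrences(s, "TA"))
print(occurrences(s, "CATC"))
```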
Suffix trie
• Tree:
– A set of nodes
– A set of edges, each connecting two nodes
– No cycles
• Suffix trie of sequence s:
– A rooted tree
– Every edge is labeled with one character from s
• Sibling nodes are ordered alphabetically, with the end-of-sequence
character $ ordered before all other characters, i.e., $ < A < C < G < T
– Every path from the root to a leaf represents a suffix of s
– Every suffix of s is represented by a path from the root to a leaf
– Suffixes share edges for their common prefixes
Suffix trie
• s=TATACATTAG$
• Suffixes:
– TATACATTAG$
– ATACATTAG$
– TACATTAG$
– ACATTAG$
– CATTAG$
– ATTAG$
– TTAG$
– TAG$
– AG$
– G$
– $
[Figure: suffix trie of s. The root has five children ($, A, C, G, T), and every root-to-leaf path spells one suffix.]
Suffix trie
• s=TATACATTAG$
• To search for a length-m subsequence, simply follow
the path from the root until
– The subsequence is found (the subsequence appears in
s), e.g., ACAT, OR
– You cannot go any further (the subsequence does not
appear in s), e.g., CATC
• Both cases take O(m) time – independent of n
– Since each node has no more than 5 children
($, A, C, G, T), which is a constant
• A suffix trie can be constructed in O(n^2) time
[Figure: suffix trie of s]
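A minimal sketch of a suffix trie, assuming nested dictionaries as nodes (one dictionary key per outgoing edge). This is an illustrative construction that inserts every suffix character by character, so it takes quadratic time and space.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dicts."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            # Reuse the child if the edge exists, otherwise create it.
            node = node.setdefault(ch, {})
    return root


def trie_contains(root, query):
    """Follow the query from the root; O(m) steps for a length-m query."""
    node = root
    for ch in query:
        if ch not in node:
            return False  # cannot go any further
        node = node[ch]
    return True


trie = build_suffix_trie("TATACATTAG$")
print(trie_contains(trie, "ACAT"), trie_contains(trie, "CATC"))
```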
Suffix tree
• A suffix tree is a compact form of a suffix trie, where
non-branching paths are collapsed to a single edge
[Figure: a suffix trie (left) and the corresponding suffix tree (right). In the suffix tree, the edges leaving the root are labeled $, A, CATTAG$, G$ and T.]
Suffix tree
• The tree has no more than 2n nodes. Why?
Hint: How many leaf nodes are there?
• The tree can be constructed in O(n) time
– We do not go into details
• How much space does each edge require?
– No need to store the long edge labels as strings in the
tree. Can use pointers to the original sequence s.
– Constant space per edge
– Conclusion: O(n) space for the whole tree
[Figure: suffix tree of s=TATACATTAG$; each edge is labeled "start-end: label", where start-end is a position range in s]
root
  11-11: $
  2-2: A
    5-11: CATTAG$
    10-11: G$
    3-3: T
      4-11: ACATTAG$
      8-11: TAG$
  5-11: CATTAG$
  10-11: G$
  1-1: T
    2-2: A
      5-11: CATTAG$
      10-11: G$
      3-11: TACATTAG$
    8-11: TAG$
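The node bound can be checked on the running example. The sketch below is an illustration under a simplifying assumption: instead of a linear-time suffix tree construction (not covered in this lecture), it builds the quadratic suffix trie and counts only the nodes that would survive collapsing, i.e., the root, the branching nodes and the leaves.

```python
def suffix_tree_node_counts(s):
    """Return (leaves, internal nodes) of the suffix tree of s."""
    # Build the suffix trie as nested dicts.
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    leaves, internal = 0, 1  # the root is counted as an internal node
    stack = list(root.values())
    while stack:
        node = stack.pop()
        if not node:
            leaves += 1      # end of a suffix
        elif len(node) > 1:
            internal += 1    # branching node, kept by the suffix tree
        # Nodes with exactly one child are collapsed into an edge.
        stack.extend(node.values())
    return leaves, internal


leaves, internal = suffix_tree_node_counts("TATACATTAG$")
print(leaves, internal, leaves + internal)
```

There is one leaf per suffix, and every internal node other than the root branches, so there are fewer internal nodes than leaves. For this example that gives 11 leaves and 5 internal nodes.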
Searching using a suffix tree
Searching for TATA:
• Level 1: Go to the T node, next to find prefix match for ATA
• Level 2: Go to the A node, next to find prefix match for TA
• Level 3: Go to the TACATTAG$ node
– Done!
• Occurrence location: 3 (start index of the current node) –
2 (total number of characters in its ancestors) = 1
• How to find all occurrences of a query?
[Figure: suffix tree of s=TATACATTAG$ with position-range edge labels; the search follows the edges 1-1: T, 2-2: A and 3-11: TACATTAG$]
Searching using a suffix tree
Searching for A:
• Level 1: Go to the A node
– Done!
• Finding all occurrences of query: visit all leaf nodes
under the current node:
– 5 (start index of the CATTAG$ node) – 1 (total number of
characters in its ancestors) = 4
– 10 (start index of the G$ node) – 1 (total number of
characters in its ancestors) = 9
– 4 (start index of the ACATTAG$ node) – 2 (total number of
characters in its ancestors) = 2
– 8 (start index of the TAG$ node) – 2 (total number of
characters in its ancestors) = 6
[Figure: suffix tree of s=TATACATTAG$ with position-range edge labels; the four leaves under the A node are 5-11: CATTAG$, 10-11: G$, 4-11: ACATTAG$ and 8-11: TAG$]
Some limitations
• While suffix tree is already quite space- and
time-efficient, it also has some drawbacks:
– Some construction algorithms are quite complex
– The total space needed is usually 20n bytes or
more for a sequence of length n due to overheads
originating from the tree structure and position
indices
• Think about the length of the human genome and the
maximum amount of memory for a 32-bit machine
• An alternative data structure: suffix array
Suffix array
• An array storing the original locations of the suffixes when
they are sorted in lexicographic order
– Compare the first character. If order can be determined, stop.
Otherwise, move to the next character. And so on.
• s=TATACATTAG$
Before sorting:            After sorting:
Suffix        Location     Suffix        Location
TATACATTAG$   1            $             11
ATACATTAG$    2            ACATTAG$      4
TACATTAG$     3            AG$           9
ACATTAG$      4            ATACATTAG$    2
CATTAG$       5            ATTAG$        6
ATTAG$        6            CATTAG$       5
TTAG$         7            G$            10
TAG$          8            TACATTAG$     3
AG$           9            TAG$          8
G$            10           TATACATTAG$   1
$             11           TTAG$         7
• The suffix array is the Location column after sorting:
11, 4, 9, 2, 6, 5, 10, 3, 8, 1, 7
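A minimal sketch of building the suffix array by sorting, for illustration only. It relies on Python's built-in string comparison, in which `$` sorts before the letters A, C, G, T, and it is much slower than the linear-time constructions mentioned later.

```python
def suffix_array(s):
    """1-based start locations of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda loc: s[loc - 1:])


print(suffix_array("TATACATTAG$"))
```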
Using a suffix array
• Recall that a subsequence of s is also a prefix of a suffix of s,
Suffix      Location
$           11
ACATTAG$    4
AG$         9      s[9]=A? Yes. s[9+1]=T? No. s[9+1]=G<T
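Because the suffixes are sorted, the rows that start with a query form one contiguous block, which can be located by binary search. The sketch below is one possible way to do this (the function name `sa_search` is made up here); each comparison looks only at the first m characters of a suffix, read directly from s.

```python
def sa_search(s, sa, query):
    """Return the sorted 1-based locations where query occurs in s."""
    m = len(query)

    def prefix(row):
        start = sa[row] - 1
        return s[start:start + m]

    # First row whose length-m prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First row whose length-m prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


s = "TATACATTAG$"
sa = [11, 4, 9, 2, 6, 5, 10, 3, 8, 1, 7]  # suffix array of s
print(sa_search(s, sa, "AT"), sa_search(s, sa, "TA"), sa_search(s, sa, "CATC"))
```

Each of the O(log n) comparisons costs at most m character comparisons, so this search takes O(m log n) time.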
Time and space requirements
• A suffix array can be constructed in O(n) time
– Can read off from suffix tree
• If each position is stored as an integer, and each integer
takes 4 bytes, the whole suffix array needs 4n bytes
– For large n, we cannot assume 4 bytes are sufficient. In
general, each index takes log n bits. The total size is thus
O(n log n)
• Already quite good. Can we do even better?
Goal: O(n log |Σ|), where Σ is the alphabet
– For DNA, |Σ| = |{A, C, G, T}| = 4 << n
• Methods:
– Compressed suffix arrays (not discussed here)
– Burrows-Wheeler Transform (our next topic)
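A rough back-of-the-envelope comparison of these space figures. The genome length used here (about 3.1 billion bases) is an assumed round number for illustration, and the 2-bits-per-base figure ignores the auxiliary structures that a BWT-based index also needs.

```python
import math

n = 3_100_000_000                 # assumed approximate human genome length
address_space_32bit = 2 ** 32     # bytes addressable by a 32-bit machine

suffix_tree_bytes = 20 * n        # "20n bytes or more" from the limitations slide
suffix_array_bytes = 4 * n        # one 4-byte integer per position
bits_per_index = math.ceil(math.log2(n))   # bits needed to store one position
two_bit_sequence_bytes = n * 2 // 8        # log2(4) = 2 bits per DNA base

for name, size in [("suffix tree", suffix_tree_bytes),
                   ("suffix array", suffix_array_bytes),
                   ("2-bit sequence", two_bit_sequence_bytes)]:
    print(f"{name}: {size / 1e9:.1f} GB, fits in 32-bit space: {size <= address_space_32bit}")
print("bits per index:", bits_per_index)
```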
Part 3
BURROWS-WHEELER TRANSFORM
Burrows-Wheeler Transform (BWT)
• Proposed by Michael Burrows and David Wheeler
in 1994.
• A very compact structure that can be used for
text search
• Input: sequence s
• Conceptual* method:
– Find all rotations of s and put them in a matrix
– Sort the rows of the matrix in lexicographic order
– Output the sequence in the last column, b
*: In this lecture, whenever you see a method described as “conceptual”, it
means it is used to illustrate some key ideas, but is usually too slow or too
memory-demanding to be practical, and we will discuss better alternatives.
Rotations and transformed string
• Input: s=TATACATTAG$
Rotations:       Sorted rotations:
TATACATTAG$      $TATACATTAG
ATACATTAG$T      ACATTAG$TAT
TACATTAG$TA      AG$TATACATT
ACATTAG$TAT      ATACATTAG$T
CATTAG$TATA      ATTAG$TATAC
ATTAG$TATAC      CATTAG$TATA
TTAG$TATACA      G$TATACATTA
TAG$TATACAT      TACATTAG$TA
AG$TATACATT      TAG$TATACAT
G$TATACATTA      TATACATTAG$
$TATACATTAG      TTAG$TATACA
• Output: b=GTTTCAAAT$A
• Amazingly, we can use b (together with some auxiliary data
structures) to check whether an input string is a subsequence
of s efficiently
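The conceptual method can be written in a few lines. This is only a sketch of the slow rotation-matrix approach (the function name is made up here), and it materializes all rotations, so it uses quadratic memory.

```python
def bwt_by_rotations(s):
    """Sort all rotations of s and read off the last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt_by_rotations("TATACATTAG$"))
```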
Quick construction of b
• The procedure we have described for constructing the output
b is very slow and memory demanding
• It can be quickly obtained from the suffix array
– Recall that a suffix array can be constructed in linear time and space
for a fixed alphabet
– s=TATACATTAG$ (positions 1 to 11)
– b=GTTTCAAAT$A
– First character of b is the character before the first letter in the first
row of the sorted rotations
• b[1] = s[11-1] = s[10] = G
• b[2] = s[4-1] = s[3] = T
• b[3] = s[9-1] = s[8] = T
• ...
– In general, b[i] is the character of s just before the suffix in row i,
i.e., s[Location-1]; for the suffix at location 1 the rotation wraps
around, so b[i] = $
Sorted rotations   Sorted suffixes   Location
$TATACATTAG        $                 11
ACATTAG$TAT        ACATTAG$          4
AG$TATACATT        AG$               9
ATACATTAG$T        ATACATTAG$        2
ATTAG$TATAC        ATTAG$            6
CATTAG$TATA        CATTAG$           5
G$TATACATTA        G$                10
TACATTAG$TA        TACATTAG$         3
TAG$TATACAT        TAG$              8
TATACATTAG$        TATACATTAG$       1
TTAG$TATACA        TTAG$             7
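A minimal sketch of reading b off the suffix array, assuming 1-based locations as in the table (the function name is made up for this example):

```python
def bwt_from_suffix_array(s, sa):
    """b[i] is the character of s just before the suffix in sorted row i."""
    # For location loc (1-based), the preceding character is s[loc - 2] in
    # Python's 0-based indexing. For loc = 1 this is s[-1], i.e., the final
    # $, which is the wrap-around case.
    return "".join(s[loc - 2] for loc in sa)


s = "TATACATTAG$"
sa = [11, 4, 9, 2, 6, 5, 10, 3, 8, 1, 7]
print(bwt_from_suffix_array(s, sa))
```

Given the suffix array, this takes one pass over it and never builds the rotation matrix.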
Properties of the sorted rotations
• Simple properties for warm-up
• Property 1: All rows in the sorted rotation matrix are different
– Due to the $ symbol
• Property 2: Every column in the matrix has the whole set of
characters in s
s=TATACATTAG$
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
b=GTTTCAAAT$A
Properties of the sorted rotations
• Property 3: Different occurrences of the same character tend to
cluster in b
– E.g., three of the A’s are clustered, so are three of the T’s
– Why? Because if a length-k pattern appears multiple times in s
(e.g., TA), some rotations will have:
• The length-(k-1) suffix of the pattern (A) at the beginning of the
rotation → these rotations will be close in the matrix (though not
always next to each other – check AT)
• The first character of the pattern (T) in the last column
– Significance? Easier to perform data compression
s=TATACATTAG$
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
b=GTTTCAAAT$A
Properties of the sorted rotations
• Property 4: The input s can be obtained back
from the output b
• Conceptual method:
1. Create an empty matrix
2. Add b as the leftmost column of the matrix
3. Sort the rows of the matrix
4. Repeat 2 and 3 until the matrix has n+1 columns
5. s can be read from the first row by moving the
leading $ back to the tail
Getting back the original sequence
• s=TATACATTAG$, b=GTTTCAAAT$A
G $ G$ $T G$T $TA G$TA $TAT G$TAT $TATA G$TATA $TATAC G$TATAC $TATACA
T A TA AC TAC ACA TACA ACAT TACAT ACATT TACATT ACATTA TACATTA ACATTAG
T A TA AG TAG AG$ TAG$ AG$T TAG$T AG$TA TAG$TA AG$TAT TAG$TAT AG$TATA
T A TA AT TAT ATA TATA ATAC TATAC ATACA TATACA ATACAT TATACAT ATACATT
C A CA AT CAT ATT CATT ATTA CATTA ATTAG CATTAG ATTAG$ CATTAG$ ATTAG$T
A C AC CA ACA CAT ACAT CATT ACATT CATTA ACATTA CATTAG ACATTAG CATTAG$
A G AG G$ AG$ G$T AG$T G$TA AG$TA G$TAT AG$TAT G$TATA AG$TATA G$TATAC
A T AT TA ATA TAC ATAC TACA ATACA TACAT ATACAT TACATT ATACATT TACATTA
T T TT TA TTA TAG TTAG TAG$ TTAG$ TAG$T TTAG$T TAG$TA TTAG$TA TAG$TAT
$ T $T TA $TA TAT $TAT TATA $TATA TATAC $TATAC TATACA $TATACA TATACAT
A T AT TT ATT TTA ATTA TTAG ATTAG TTAG$ ATTAG$ TTAG$T ATTAG$T TTAG$TA
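The conceptual reconstruction (prepend b as the leftmost column, sort, repeat) can be sketched as follows. This is an illustration of the slow method only, and the function name is made up for this example.

```python
def inverse_bwt_by_sorting(b):
    """Rebuild the sorted rotation matrix column by column, then read s."""
    rows = [""] * len(b)
    for _ in range(len(b)):
        # Add b as the new leftmost column, then sort the rows.
        rows = sorted(ch + row for ch, row in zip(b, rows))
    first = rows[0]                # the rotation that starts with $
    return first[1:] + first[0]    # move the leading $ back to the tail


print(inverse_bwt_by_sorting("GTTTCAAAT$A"))
```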
Getting back the original sequence
• Why does the procedure work?
– Essentially we are reconstructing the sorted rotation
matrix
– When the reconstruction matrix contains only one
column, after sorting it is exactly the first column of
the sorted rotation matrix
– When we add b as the new first column of the
reconstruction matrix, it is like placing the last column
of the sorted rotation matrix before the first column
– When this matrix is sorted, we get the first two
columns of the sorted rotation matrix
• Every row contains a different subsequence of s
– And so on
Getting back the original sequence
• s=TATACATTAG$, b=GTTTCAAAT$A
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
Reconstruction (first eight steps, then the last two):
G $ G$ $T G$T $TA G$TA $TAT ... G$TATACATTA $TATACATTAG
T A TA AC TAC ACA TACA ACAT ... TACATTAG$TA ACATTAG$TAT
T A TA AG TAG AG$ TAG$ AG$T ... TAG$TATACAT AG$TATACATT
T A TA AT TAT ATA TATA ATAC ... TATACATTAG$ ATACATTAG$T
C A CA AT CAT ATT CATT ATTA ... CATTAG$TATA ATTAG$TATAC
A C AC CA ACA CAT ACAT CATT ... ACATTAG$TAT CATTAG$TATA
A G AG G$ AG$ G$T AG$T G$TA ... AG$TATACATT G$TATACATTA
A T AT TA ATA TAC ATAC TACA ... ATACATTAG$T TACATTAG$TA
T T TT TA TTA TAG TTAG TAG$ ... TTAG$TATACA TAG$TATACAT
$ T $T TA $TA TAT $TAT TATA ... $TATACATTAG TATACATTAG$
A T AT TT ATT TTA ATTA TTAG ... ATTAG$TATAC TTAG$TATACA
Properties of the sorted rotations
• Property 5: The i-th occurrence of a character x in the last
column corresponds to the i-th occurrence of x in the first column
– E.g., The second T in the last column is also the second T in the
first column
• Which is the one at position 8 in s
• These T’s can have a different order in s
– Why? Consider the following:
• Order of the rotations starting with x
• Order of the rotations ending with x
Both depend on the remaining n characters
s=TATACATTAG$ (positions 1 to 11)
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
b=GTTTCAAAT$A
Applications of property 5
• First application: Getting back the original sequence fast
• b=GTTTCAAAT$A
• First column of sorted rotation matrix (by sorting characters in the last
column or counting the number of occurrences of each character): $AAAACGTTTT
• Conceptual back-tracing:
– Character before $: G
– Character before G: second A (A)
– Character before second A: second T (T)
– Character before second T: fourth T (T)
– Character before fourth T: fourth A (A)
– Character before fourth A: C
– Character before C: first A (A)
– Character before first A: first T (T)
– Character before first T: third A (A)
– Character before third A: third T (T)
– Character before third T: $
• Therefore the original sequence is s=TATACATTAG$
• If we have stored the location of the first occurrence of each character in
the first column, back-tracing can be done very fast (without really storing
the first column).
First and last columns of the sorted rotation matrix:
Row   First   Last
1     $       G
2     A       T
3     A       T
4     A       T
5     A       C
6     C       A
7     G       A
8     T       A
9     T       T
10    T       $
11    T       A
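The back-tracing can be sketched with two small tables derived from b alone: the row of the first occurrence of each character in the first column, and, for each row, how many times its last-column character has appeared in earlier rows. The code is an illustrative sketch with made-up names, and it uses 0-based rows internally.

```python
def inverse_bwt_lf(b):
    """Recover s from b by repeatedly stepping to the preceding character."""
    # first[x]: 0-based row of the first occurrence of x in the first column.
    counts = {}
    for ch in b:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # rank[i]: number of occurrences of b[i] in the rows before row i.
    seen, rank = {}, []
    for ch in b:
        rank.append(seen.get(ch, 0))
        seen[ch] = rank[-1] + 1
    # Row 0 starts with $; its last character is the one before $ in s.
    row, out = 0, ["$"]
    for _ in range(len(b) - 1):
        ch = b[row]
        out.append(ch)
        # Property 5: the i-th x in the last column is the i-th x in the first.
        row = first[ch] + rank[row]
    return "".join(reversed(out))


print(inverse_bwt_lf("GTTTCAAAT$A"))
```

Each step is a constant-time table lookup, so the whole sequence is recovered in time linear in its length, without sorting and without storing the first column.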
Applications of property 5
• Second application: Text search
• Suppose we want to search for a subsequence r from s. All
occurrences of r appear as prefixes in the sorted rotation matrix,
and are in adjacent rows.
– For example: TA
• Therefore, we only need to find out the row numbers of the first
and last rows that start with r
– Now we study how we can find these numbers if we only have b
without materializing the rotation matrix
Sorted rotations:
$TATACATTAG
ACATTAG$TAT
AG$TATACATT
ATACATTAG$T
ATTAG$TATAC
CATTAG$TATA
G$TATACATTA
TACATTAG$TA
TAG$TATACAT
TATACATTAG$
TTAG$TATACA
Applications of property 5
• Say we want to search for TA. Conceptually:
– From b, we can get back the first column of the rotation matrix
– We know that A appears between the 2nd and 5th rows in the first
column
– We then check the corresponding entries in b, and find TA between
the 1st and 3rd occurrences of T
– We can then find out their actual locations in s from the suffix array
• We can either save the array on disk or save only a portion in
memory, and compute the remaining on the fly
s=TATACATTAG$
Row   First   Last   Sorted suffix   Location
1     $       G      $               11
2     A       T      ACATTAG$        4
3     A       T      AG$             9
4     A       T      ATACATTAG$      2
5     A       C      ATTAG$          6
6     C       A      CATTAG$         5
7     G       A      G$              10
8     T       A      TACATTAG$       3
9     T       T      TAG$            8
10    T       $      TATACATTAG$     1
11    T       A      TTAG$           7
Applications of property 5
• Another example: CAT
– 1st to 4th occurrences of T → rows 8-11 in the first column
→ 3rd to 4th occurrences of A → rows 4-5 in the first column
→ 1st to 1st occurrences of C → rows 6-6 in the first column
• How to make this conceptual procedure fast?
• From occurrence to row number: store the row number of the first
occurrence of each character in the first column
– $: 1, A: 2, C: 6, G: 7, T: 8
– 3rd occurrence of A is on row 2+3-1 = 4
• From row number to occurrences: store the number of times a
character appears up to the current row in the last column
– A: 00000123334
– Up to row 7, 2 A’s have occurred in b
– Up to row 11, 4 A’s have occurred in b
– Therefore rows 8-11 contain the 3rd and 4th occurrences of A
– With these numbers, we do not need to store the first column
– Again, may precompute only some of these numbers
s=TATACATTAG$
Row   First   Last   Sorted suffix   Location
1     $       G      $               11
2     A       T      ACATTAG$        4
3     A       T      AG$             9
4     A       T      ATACATTAG$      2
5     A       C      ATTAG$          6
6     C       A      CATTAG$         5
7     G       A      G$              10
8     T       A      TACATTAG$       3
9     T       T      TAG$            8
10    T       $      TATACATTAG$     1
11    T       A      TTAG$           7
Summary of BWT
• What we need to store?
– The last column b of the sorted rotation matrix
• O(n) construction time by using suffix array
• O(n log|Σ|) space, where |Σ| is the size of the alphabet
(4 for DNA sequences)
– Location of the first occurrence of each character in
the first column
• O(|Σ| log n) construction time by using suffix array
• O(|Σ|) space
– Number of times each character occurs in the last
column within the first i rows for all i
• O(n) construction time
• O(|Σ|n log n) space – Can be stored in a special way
that requires much less space
Sorted suffixes   Location
$                 11
ACATTAG$          4
AG$               9
ATACATTAG$        2
ATTAG$            6
CATTAG$           5
G$                10
TACATTAG$         3
TAG$              8
TATACATTAG$       1
TTAG$             7
Summary of BWT
• Getting back the original sequence s
– Trace back in n steps, using either the suffix array
or the array that stores the location of the first
occurrence of each character
• Searching for a query sequence r
– Iteratively compute the range of rows involved for
different suffixes of r
Complete searching example
Legend: Not stored / Fully stored / Stored in a special way / Stored on disk
Position          1  2  3  4  5  6  7  8  9  10 11
s                 T  A  T  A  C  A  T  T  A  G  $
First column      $  A  A  A  A  C  G  T  T  T  T
Last column (b)   G  T  T  T  C  A  A  A  T  $  A
First occurrence position in the first column (F):
A: 2, C: 6, G: 7, T: 8
– To handle the situation that some nucleotides do not appear in s, a
more precise definition is “Number of occurrences of lexicographically
smaller characters plus one”
Number of occurrences in the last column up to each position:
Position          1  2  3  4  5  6  7  8  9  10 11
A                 0  0  0  0  0  1  2  3  3  3  4
C                 0  0  0  0  1  1  1  1  1  1  1
G                 1  1  1  1  1  1  1  1  1  1  1
T                 0  1  2  3  3  3  3  3  4  4  4
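The search over these tables can be sketched as below. The names `build_fm_index` and `backward_search` are made up for this illustration, rows are 1-based as in the tables, and the occurrence counts are stored in full rather than in a space-saving form.

```python
def build_fm_index(b):
    """Return (F, occ) computed from the last column b."""
    alphabet = sorted(set(b))
    # F[x]: number of lexicographically smaller characters plus one.
    F, total = {}, 0
    for ch in alphabet:
        F[ch] = total + 1
        total += b.count(ch)
    # occ[x][i]: occurrences of x in rows 1..i of b, with occ[x][0] = 0.
    occ = {ch: [0] for ch in alphabet}
    for c in b:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (c == ch))
    return F, occ


def backward_search(b, query):
    """Return (first row, last row) of rotations starting with query, or None."""
    F, occ = build_fm_index(b)
    top, bottom = 1, len(b)
    for ch in reversed(query):      # process the query from its last character
        if ch not in F:
            return None
        top = F[ch] + occ[ch][top - 1]
        bottom = F[ch] + occ[ch][bottom] - 1
        if top > bottom:
            return None
    return top, bottom


b = "GTTTCAAAT$A"
sa = [11, 4, 9, 2, 6, 5, 10, 3, 8, 1, 7]
top, bottom = backward_search(b, "TA")
print((top, bottom), sorted(sa[row - 1] for row in range(top, bottom + 1)))
print(backward_search(b, "CAT"), backward_search(b, "CATC"))
```

For CAT the row range shrinks from 8-11 (T) to 4-5 (AT) to 6-6 (CAT), the same sequence of ranges as in the CAT example. The suffix array is consulted only at the end, to turn rows into locations in s.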
Inexact matching
• So far we have been studying exact matching
• Some inexact matching strategies:
– Search for exact matches of length-k subsequences of
the query r, then combine the results
– Search for exact matches of sequences that are within
a certain distance from r
• We have seen how to do that with a hash table
• For a suffix tree, we need to traverse the tree with
backtracking
• For BWT, we do something equivalent to traversing the suffix
tree, but some bounds can be calculated to reduce the
amount of traversal needed
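A minimal sketch of the second strategy, assuming Hamming distance 1 as the distance threshold: enumerate every sequence within one substitution of r and look each one up exactly. Python's `str.find` stands in here for an exact-match index (hash table, suffix array or BWT), and all function names are made up for this example.

```python
def hamming_neighbors(read, alphabet="ACGT"):
    """All sequences within Hamming distance 1 of read, including read."""
    neighbors = {read}
    for i, original in enumerate(read):
        for ch in alphabet:
            if ch != original:
                neighbors.add(read[:i] + ch + read[i + 1:])
    return neighbors


def find_all(reference, query):
    """1-based positions of all exact occurrences of query."""
    positions, start = [], reference.find(query)
    while start != -1:
        positions.append(start + 1)
        start = reference.find(query, start + 1)
    return positions


def inexact_search(reference, read):
    """Map each hit position to the reference sequence found there."""
    hits = {}
    for candidate in hamming_neighbors(read):
        for pos in find_all(reference, candidate):
            hits[pos] = candidate
    return dict(sorted(hits.items()))


print(inexact_search("TATACATTAG", "TTC"))
```

The number of candidates grows quickly with the read length and the allowed distance, which is why practical tools prune the search instead of enumerating every neighbor.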
Epilogue
Case study: Competitions
• One potential problem: Cherry picking
– How can we know that one is not showing the
best results of his/her method, based on a
carefully chosen set of data and parameter values?
• May perform well only in this setting
– Benchmark datasets
• Can still “overfit” if you know the answers
– Public competitions
Case study: Competitions
• Some famous public competitions related to
bioinformatics:
– Assemblathon for sequence assembly
– CAPRI (Critical Assessment of PRediction of Interactions)
– CASP (Critical Assessment of protein Structure Prediction)
– DREAM (Dialogue for Reverse Engineering Assessments
and Methods)
– RGASP (RNAseq Genome Annotation Assessment Project)
– Some competitions on TopCoder
– Some of the KDD Cup competitions associated with the
yearly KDD (Knowledge Discovery and Data Mining)
conference
• Some have attractive awards!
Summary
• Massively parallel sequencing allows the sequencing of
many short DNA fragments in parallel, achieving high
throughput
– Many: millions or even billions
– Short: In the order of a hundred nucleotides
• Different strategies to map the short reads to a
reference:
– Hash table: Tradeoff between space (unused slots) and
time (resolving collisions)
– Suffix trie/tree/array: Compact structures, proportional to
the length of the indexed sequence
– Burrows-Wheeler Transform (BWT): Similar to suffix array,
but usually requires less space in main memory due to fewer
bits per input character and the possibility of compression
Other practical issues
• We have only focused on methods for finding a
short read from a long sequence efficiently. In
real applications, there are many other issues:
– Non-unique mapping: One read can map to multiple
places of s
– Incorporating quality scores from sequencing
machines
– Larger structural variants (e.g., indels) that cannot be
handled by inexact matches
– Parallelization using multiple cores/machines
– ...
Further readings
• Chapter 3 of Algorithms in Bioinformatics: A
Practical Introduction
– More details about suffix trees (such as suffix links)
– Additional applications (such as finding longest
common prefix of two sequences)
– More detailed complexity analyses
– Free slides available
• A paper that describes a method called BWA for
aligning short sequencing reads using BWT
– Li and Durbin, Fast and accurate short read alignment
with Burrows-Wheeler transform. Bioinformatics
25(14):1754-1760, (2009)