
Genome Sequencing, Genome Assembly and Annotation Strategy for Genomes


Presented By: Dushyant Singh Baghel, MD & CEO, Nucleome Informatics Pvt Ltd
dushyant@nucleomeinfo.com, +91 99858 23215
AGENDA
- The Genome Sequencing and assembly Recommendations
- Sample and DNA Isolation
- Library Prep
- Sequencing
- Assembly, Phasing, and Polishing
- Iso-Seq for Genome Annotation
- Assembly Improvement by Hybrid Assembly
- 10x Genomics
- Optical Mapping
- Deliverables
The Plant Genome Sequencing and Assembly Strategy
Stand-alone PacBio sequencing and FALCON assembly with polishing, followed by Iso-Seq for annotation
PLANT & ANIMAL DE NOVO ASSEMBLY AND GENOME
ANNOTATION WORKFLOW

HMW DNA Extraction → Template Prep → Long-read Sequencing → Genome Assembly → Genome Annotation

http://www.pacb.com/applications/whole-genome-sequencing/plant-animal/
Sample and DNA Isolation
SAMPLE CHOICE

- High molecular weight (HMW) DNA (50+ kb) is a critical component of long-read
sequencing, so choosing a sample type to best enable HMW DNA isolation is
key

- General Sample Guidelines:
  - Animals:
    - Single individual
    - Blood, fresh tissue, or flash-frozen tissue best source of DNA
    - Qiagen MagAttract HMW DNA Kit used by some customers successfully
  - Plants:
    - Dark treated leaf tissue
    - Youngest leaves possible
    - Consider extra cleanup step after DNA isolation to remove secondary metabolites
IMPORTANCE OF HIGH QUALITY TEMPLATE DNA

- Quality and size of starting DNA is critical to obtaining longest reads
- If shearing is necessary, we recommend using the Megaruptor for accuracy
- The more the DNA is handled, the more it shears, so starting DNA as large as possible sets you up for success

[Figure: HMW DNA → Long Reads; gel of DNA sheared with Megaruptor (Diagenode), lanes 1-7, size ladder 10 kb to 60 kb]
Library Preparation
LIBRARY PREPARATION


Similar to DNA, you want your library to be as large as possible to get the most genomic information out of your sample.

Protocol online: http://www.pacb.com/wp-content/uploads/Procedure-Checklist-Preparing-Greater-Than-30-kb-SMRTbell%C2%AE-Libraries-Using-Megaruptor%C2%AE-Shearing-and-BluePippin%E2%84%A2-Size-Selection-on-PacBio-RS-II-and-Sequel%C2%AE-Systems.pdf
LIBRARY SIZE SELECTION


Size Selection:
- Removes smaller library fragments
- Enables reads up to ≥60 kb long
- Aim for largest cut possible (30 kb+)
- BluePippin or ELF (SageScience)

[Figure: gel of a 30 kb SMRTbell Library, size ladder 10 kb to 60 kb]
Sequencing
SEQUENCING ON THE SEQUEL SYSTEM


Sequel System:
- 1,000,000 ZMW per SMRT Cell
- 4+ Gb of data per SMRT cell
- N50 read length ~15 kb
- Up to 10 hour movies

Recommendations:
- 50 to 60-fold coverage for relatively inbred, diploid genomes
- >80-fold coverage for outbred, highly repetitive (>75%) or polyploid genomes
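
The coverage recommendations above translate into a SMRT Cell count once the genome size is known. The following is a minimal planning sketch, not a PacBio tool: the function name `smrt_cells_needed` is hypothetical, and the default yield of 4 Gb per cell is taken from the "4+ Gb of data per SMRT cell" figure above, so real runs may need fewer cells.

```python
import math


def smrt_cells_needed(genome_size_gb: float,
                      target_coverage: float,
                      yield_per_cell_gb: float = 4.0) -> int:
    """Estimate how many SMRT Cells are needed for a target fold coverage.

    genome_size_gb     -- estimated haploid genome size in Gb
    target_coverage    -- desired fold coverage (e.g. 60 for inbred diploids)
    yield_per_cell_gb  -- assumed data yield per SMRT Cell in Gb (assumption: 4.0)
    """
    if genome_size_gb <= 0 or target_coverage <= 0 or yield_per_cell_gb <= 0:
        raise ValueError("all inputs must be positive")
    total_bases_gb = genome_size_gb * target_coverage
    # Round up: a partial cell still has to be run as a whole cell.
    return math.ceil(total_bases_gb / yield_per_cell_gb)


# A 1 Gb inbred diploid genome at 60-fold coverage
print(smrt_cells_needed(1.0, 60))   # 15
# The same genome treated as highly repetitive, at 80-fold coverage
print(smrt_cells_needed(1.0, 80))   # 20
```

The estimate is deliberately conservative because it uses the lower bound of the per-cell yield.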
Assembly
GENOME ASSEMBLY


SMRT Link:
- Easy-to-use web interface for small genomes (<1 Gb)
- Push-button assembly
- Does not allow unzipping of assembly
- Polishing automatic

Command Line Tools:
- Flexible parameterization for assembly of large, complex genomes
- FALCON is diploid aware for a more accurate representation of genome
- FALCON-Unzip extends phase blocks of FALCON assemblies

http://www.pacb.com/products-and-services/analytical-software/
http://pb-falcon.readthedocs.io/en/latest/index.html
GENOME POLISHING
Quiver and Arrow Algorithms
Takes your genome assembly to >99% consensus accuracy!

- Map raw reads back to genome sequence
- Compute consensus base and base quality
- Hidden Markov Model, trained on sequencing chemistry characteristics
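
Consensus accuracy is often quoted as a Phred-scaled quality value (QV), where QV = -10 · log10(error rate). The sketch below only illustrates that conversion; it is not part of Quiver or Arrow, and the function names are hypothetical.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a consensus accuracy (0 <= accuracy < 1) to a Phred QV."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


def phred_to_accuracy(qv: float) -> float:
    """Convert a Phred QV back to an accuracy fraction."""
    return 1.0 - 10.0 ** (-qv / 10.0)


# 99% consensus accuracy corresponds to one error per 100 bases
print(round(accuracy_to_phred(0.99), 1))   # 20.0
```

On this scale each additional 10 QV points means a tenfold drop in error rate, so a polished assembly is better compared by QV than by raw percentage.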
Iso-Seq Method for Annotation
GENOME ANNOTATION WITH ISO-SEQ METHOD


Analysis:
- Reference-based mode
- Reference-free de novo mode

http://www.pacb.com/applications/rna-sequencing/rna-sequencing-for-plant-and-animal-sciences/
FULL-LENGTH TRANSCRIPT ISOFORM SEQUENCING (ISO-SEQ)
[Figure: a gene and its mRNA isoforms]

- Short-read technologies: Insufficient Connectivity. Reads spanning splice junctions leave Splice Isoform Uncertainty
- PacBio’s Iso-Seq solution: Full-length cDNA Sequence Reads. Splice Isoform Certainty – No Assembly Required
PacBio Iso-Seq Method: ‘No Assembly Required’ Full-Length Sequencing of Transcripts

Experimental Pipeline:
1. Poly(A) mRNA
2. cDNA synthesis with adapters
3. Size selection & PCR amplification
4. SMRTbell ligation
5. PacBio Sequel System sequencing

[Figure: structure of a raw read (SMRT adapter, 5’ primer, coding sequence, poly(A) tail, 3’ primer, SMRT adapter) and of reads of insert]

Informatics Pipeline:
6. PacBio raw sequence reads
7. Classify sequence reads
8. Isoform clusters
9. Nonredundant transcript isoforms
10. Final isoforms, leading to evidence-based gene models

Processing steps along the informatics pipeline, in order: remove adapters and artifacts, reads clustering, consensus calling, quality filtering, map to reference genome.

User Bulletin – Guidelines for Preparing cDNA Libraries for Isoform Sequencing
DevNet: Iso-Seq wiki page
ISO-SEQ METHOD DISCOVERS MORE ISOFORMS PER GENE

- Prepare full-length transcripts using the Clontech SMARTer PCR cDNA Synthesis Kit with as little as 1 ng of poly A+ RNA or 2 ng of total RNA
- Optional size-selection protocols to enrich for transcripts >4 kb
- Multiplex samples with barcoding, up to 24 samples in a single library prep
- For genome annotation, we recommend:
  - High-quality RNA (assessed with RIN) from fresh tissue
  - Multiple tissue types, ideally from same individual as reference genome
  - 1-2 SMRT Cells per sample on the Sequel System

BEST PRACTICES: LONG-READ RNA SEQUENCING
http://www.pacb.com/wp-content/uploads/Application-Brief-RNA-sequencing-Best-Practices.pdf


SIGNIFICANT VALUE IN ONLY A FEW SMRT CELLS

SMRT Cells per sample, by experimental goal:
- Targeted, gene-specific isoform characterization: 1 on the PacBio RS II; <1 on the Sequel System
- General survey of full-length isoforms in a transcriptome (moderate to high expression levels) with or without size selection: 1-8 on the PacBio RS II; 1 on the Sequel System
- A comprehensive survey of full-length isoforms in the transcriptome across 3-4 size fractions: 12-16 on the PacBio RS II; 1-2 on the Sequel System
- Deep sequencing for comprehensive isoform discovery and identification of low abundance transcripts across 3-4 size fractions: >16 on the PacBio RS II; 2+ on the Sequel System
The Genome Assembly Improvement Strategy
10x Genomics library, HiSeq sequencing, Supernova assembly, Optical Mapping
Beyond draft genome assembly?

Ref: Bionano Slide Deck, PAG 2017
10X GENOMICS CHROMIUM GENOME SOLUTION FOR LARGER GENOMES
Loading and sequencing coverage requirement for 10x Genomics library
- For genomes 1.6 to 3.2 Gb:
  - We load 1.25 ng for a 3.2 Gb genome
  - Sequence 800-1200 M reads for a 3.2 Gb genome
- For genomes 0.1 to 1.6 Gb, we load less and sequence deeper:
  - We load 0.625 ng for any genome in this smaller range.
  - Sequence deeply: 400-600 M reads for any genome in this smaller range.
- It should be supported with datasets from other sequencers like HiSeq/PacBio

Characteristics of Supported Genomes
- Genome size: Genomes in the range of 1 Gb-3.2 Gb have been tested.
- Diploid genomes: The Supernova Assembler is well-tested on diploid genomes. Haploid and polyploid genomes have not been evaluated.
- Other genome characteristics: The Supernova Assembler has not been tested on genomes having repeat content far greater than human, nor on genomes having extreme GC content.
- Clonality: We strongly recommend that DNA be obtained from an individual organism or clonal population.
- DNA size: We recommend that DNA have size 50 kb, or preferably, 100 kb or greater.
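
The read counts above can be converted into approximate fold coverage. This is a rough sketch under two assumptions that the slide does not state: "M reads" is taken to mean individual reads (not read pairs), and the read length is taken to be 150 bp. The function name `short_read_coverage` is hypothetical.

```python
def short_read_coverage(n_reads: int,
                        read_length_bp: int,
                        genome_size_bp: int) -> float:
    """Raw fold coverage = total sequenced bases / genome size."""
    if n_reads < 0 or read_length_bp <= 0 or genome_size_bp <= 0:
        raise ValueError("read length and genome size must be positive")
    return n_reads * read_length_bp / genome_size_bp


READ_LENGTH_BP = 150  # assumption, not given in the slide

# 800-1200 M reads on a 3.2 Gb genome
low = short_read_coverage(800_000_000, READ_LENGTH_BP, 3_200_000_000)
high = short_read_coverage(1_200_000_000, READ_LENGTH_BP, 3_200_000_000)
print(low, high)   # 37.5 56.25

# 400 M reads on a 0.5 Gb genome: far deeper, as the slide recommends
print(short_read_coverage(400_000_000, READ_LENGTH_BP, 500_000_000))   # 120.0
```

Under these assumptions the smaller genomes end up with much higher raw coverage for the same run, which matches the "load less and sequence deeper" guidance.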
High Quality Assemblies of larger genomes based on 10x Genomics Chromium System + HiSeq and Supernova Assembler
>270 unique genomes analysed using the Bionano Irys System
HUMAN example 3 (best ever assembled human paper – Korean AK1)
The AK1 paper attached:
best ever assembled animal genome paper
Combination of Bionano Genomics and 10x Genomics data produces high quality mammalian genome at low cost
The publication describing the assembly is available on the pre-print server bioRxiv (http://www.biorxiv.org/content/early/2017/04/19/128348). The annotated assembly is available from NCBI at https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Neomonachus_schauinslandi/100/
The final assembly produced a scaffold N50 of 29.65 Mbp, representing a 215x improvement over an assembly produced with Illumina short-read sequencing alone. The longest scaffold of the combined Bionano and 10x Genomics assembly is 84.77 Mbp, approaching full-chromosome scale. Bionano maps improved contiguity, ensured correct order and orientation, and sized gaps and corrected gap sizes in the 10x Genomics de novo assembly. As an orthogonal, non-sequencing-based assembly method, Bionano maps validated the assembly, which is not possible with sequence-based scaffolding technologies.
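
Scaffold N50, the contiguity metric quoted above, is the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the scaffold lengths in the usage lines are made-up illustration values, not data from the assembly described above.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # First length at which the cumulative sum reaches half the total
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Illustration values only (lengths in arbitrary units)
print(n50([80, 70, 50, 40, 30, 20]))   # 70
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))   # 8
```

If the 215x figure refers to scaffold N50, the short-read-only assembly would have had a scaffold N50 of roughly 29.65 / 215 ≈ 0.14 Mbp.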
Conclusion
- HMW gDNA is key for de novo genome assembly programs
- Perform a genome survey before starting a large genome
sequencing and assembly program
- Design the strategy based on genome survey results and budget
- Stand-alone PacBio or 10x Genomics (from 100 Mb to 3.2 Gb) assemblies can be supported with Optical Mapping datasets
- Hybrid assembly generally yields an assembly with fewer scaffolds and a higher N50.
- Polish the genome using PacBio reads and fill gaps using clean Illumina reads
- For functional annotation of the genome, Iso-Seq and Illumina-based RNA-Seq data will be required
- Assembly of the genome is tricky and needs patience and
perseverance.