1) Methods for Studying Microbial Genomes 2) Analysis and Interpretation of Whole Genome Sequences
Principle of PFGE
Two factors influence DNA migration rates through conventional gels:
- charge differences between DNA fragments
- the molecular sieve effect of the gel pores
DNA fragments normally travel through agarose pores as spherical coils. Fragments greater than 20 kb in size form extended coils and are therefore not subjected to the molecular sieve effect.
The charge effect is countered by the proportionally increased friction applied to the molecules, so fragments greater than 20 kb do not resolve.
PFGE works by periodically altering the orientation of the electric field. The large extended-coil DNA fragments are forced to change orientation, and size-dependent separation is re-established because the time taken for the DNA to reorient is size dependent.
The most important factor in PFGE resolution is switching time: longer switching times generally increase the size of DNA fragments that can be resolved.
Switching times are optimised for the expected size of the DNA being run on the PFGE gel.
Switch time ramping increases the region of the gel in which DNA separation is linear with respect to size.
A number of different apparatus have been developed to generate this switching of electric fields; the most commonly used in modern laboratories are FIGE (Field Inversion Gel Electrophoresis) and CHEF (Contour-clamped Homogeneous Electric Field).
[Figure: whole genome sequencing overview, with gap-filling steps]
Library construction
Both conventional (small insert) and large insert genomic DNA libraries should be constructed.
The small insert library is used for the bulk of the sequencing, in order to generate suitable coverage of the complete genome.
The large insert library (BAC, PAC, cosmid etc.) is used as a scaffold during the sequence closure phase.
It is crucial to ensure that both libraries are as random as possible; mechanical shearing is often used to generate small DNA fragments.
It is also important that each clone contains only one DNA fragment, so specialised methods for library construction must be used.
DNA Sequencing
DNA sequences are generated using vector primers for both ends of the inserts.
At least 6X coverage of the genome is required, although 9 to 10X coverage is often generated.
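The coverage figures above can be made concrete with a little arithmetic. The sketch below computes coverage as total sequenced bases divided by genome size, and estimates the fraction of the genome left unsequenced under an idealised Poisson model of perfectly random reads (an assumption; real libraries are never perfectly random). The function names are illustrative only.

```python
import math

def coverage(total_bases_sequenced, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return total_bases_sequenced / genome_size

def expected_unsequenced_fraction(cov):
    """Fraction of the genome expected to be missed at a given coverage,
    assuming reads land uniformly at random (Poisson approximation)."""
    return math.exp(-cov)

# Illustrative numbers: 11 Mb of reads over a 1.83 Mb genome.
c = coverage(11e6, 1.83e6)
missed = expected_unsequenced_fraction(c)
print(f"coverage = {c:.2f}X, expected unsequenced fraction = {missed:.4%}")
```

Under this model the unsequenced fraction falls off exponentially with coverage, which is one way to see why 6X is treated as a minimum and why higher coverage leaves fewer gaps to close.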
Linking Clones
One of the most effective means of contig ordering and gap filling is the use of linking clones.
Linking clones are those whose terminal sequences (from either end of the insert) belong to different contigs.
If the orientation of the sequences and the distance to the end of each contig are compatible with the size of the insert, the two contigs are likely to be linked.
The larger the insert, the more likely a clone will be a linking clone. This is why random sequencing is also performed on large insert clones: they are far more likely to form linking clones.
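The compatibility test described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular assembler: the `EndRead` record, its fields and the function names are hypothetical, and the insert size is treated as a single known value rather than a range.

```python
from dataclasses import dataclass

@dataclass
class EndRead:
    contig: str         # name of the contig the end sequence was placed in
    position: int       # coordinate of the start of the read on that contig
    strand: str         # '+' reads toward the contig's right end, '-' toward its left end
    contig_length: int

def distance_to_contig_end(read):
    """Distance from the read start to the contig end the read points toward."""
    if read.strand == "+":
        return read.contig_length - read.position
    return read.position

def is_linking_clone(fwd, rev, insert_size):
    """True if the two end reads sit in different contigs and the sequence
    between each read and its contig end fits inside the insert."""
    if fwd.contig == rev.contig:
        return False
    used = distance_to_contig_end(fwd) + distance_to_contig_end(rev)
    return used <= insert_size

def implied_gap(fwd, rev, insert_size):
    """Rough gap size implied by a linking clone: the part of the insert
    not accounted for by the two contigs."""
    return insert_size - (distance_to_contig_end(fwd) + distance_to_contig_end(rev))

fwd = EndRead("contig1", 9000, "+", 10000)   # 1000 bp from the right end
rev = EndRead("contig2", 1500, "-", 8000)    # 1500 bp from the left end
print(is_linking_clone(fwd, rev, 40000), implied_gap(fwd, rev, 40000))
```

The orientation of each read decides which end of its contig faces the gap, so the same check also says how the two contigs should be oriented relative to each other.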
[Figure: random sequencing of a linking clone whose FWD and REV end sequences fall in Contig 1 and Contig 2, spanning the gap between them]
Once all possible linking clones are identified, gaps are classified into two categories: those with linking clones (template available for sequencing) and physical gaps without linking clones (no DNA template for the region).
For gaps with suitable linking clones, the gaps are confirmed by PCR and closed by primer walk sequencing.
[Figure: primer walking from the FWD and REV end sequences of a linking clone across the gap between Contig 1 and Contig 2]
Physical Gaps
Contigs separated by physical gaps (no linking clones) are usually spanned by PCR on genomic DNA, using primers from each end of the contigs.
The PCR products can then be sequenced to close the gaps.
Without linking clones, other techniques to order the contigs must be used in order to guide the selection of PCR products.
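A short count illustrates why contig ordering matters before attempting PCR. The sketch below enumerates every pairing of contig ends from different contigs, which is what would have to be tried if nothing were known about contig order. The function name and contig labels are hypothetical.

```python
from itertools import combinations

def candidate_end_pairs(contig_names):
    """All pairs of contig ends that could flank a gap when the contig
    order and orientation are unknown (ends of the same contig excluded)."""
    ends = [(name, side) for name in contig_names for side in ("left", "right")]
    return [(a, b) for a, b in combinations(ends, 2) if a[0] != b[0]]

contigs = [f"contig{i}" for i in range(1, 11)]
print(len(candidate_end_pairs(contigs)))  # 2 * n * (n - 1) pairs for n contigs
```

For n unordered contigs there are 2n(n-1) candidate end pairs, whereas once the order and orientation are known only one primer pair per gap is needed.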
[Figure: linking clones joining contigs into Supercontig 1, Supercontig 2 and Supercontig 3]
For those contigs without linking clones, how do you fill the gaps?
Physical Gaps
Contigs can be ordered by:
- peptide linking: contig ends having regions with homology to the same gene (or operon / gene cluster)
- Southern hybridisation of labelled contig-terminal oligonucleotides against large restriction fragments
[Figure: a PCR product spanning the physical gap between Contig 2 and Contig 6, sequenced from its FWD and REV ends by primer walking]
Haemophilus influenzae
First complete genome sequence of a free-living organism (1995).
Important pathogen.
Genome is around 1.83 megabases in size.
Random sequencing was done for both small insert and large insert (lambda) libraries.
Sequencing reactions were performed by eight individuals using fourteen ABI 377 DNA sequencers per day over a three month period.
In total around 33000 sequencing reactions were performed on 20000 templates, with plasmid extraction performed in a 96 well format.
11 Mb of sequence was initially used to generate 140 contigs.
Gaps were closed by lambda linking clones (23), peptide links (2), Southern analysis (37) and PCR (42).
Genome Annotation
Annotation is the process carried out after sequencing has been completed.
Many different tools are required: bioinformatics, databases, literature, sequence and experimental data.
Proteins: similarity searches against reference databases; calculations and predictions (MW, structure, location etc.)
Manual editing
Identifying ORFs
Most genomes will contain genes with very little or no homology to known genes of other organisms. For this reason all of the possible ORFs need to be identified without relying totally on homology.
The most efficient means of identifying potential genes in genome sequences is a three step process:
1) submit the entire sequence as a 6-frame translation for BLAST analysis, in order to identify some protein coding regions on the basis of high levels of homology
2) use these initial coding regions to determine the sequence characteristics (GC content, codon bias etc.) that distinguish coding and non-coding regions of the genome (training the software)
3) reanalyse the genome sequence using this data (plus potential ribosome binding sequences) in order to identify all the potential genes
Using this process it has been experimentally shown that around 94% of genes can be accurately predicted.
Algorithms are also available to identify ORFs without using the training procedure, with only slightly reduced accuracy.
GLIMMER is a gene prediction program and is used by:
- BASYS: http://wishart.biology.ualberta.ca/basys/cgi/submit.pl
- JCVI (formerly TIGR): http://www.tigr.org/
- SABIA: http://www.sabia.lncc.br/
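The idea of scanning all six reading frames can be sketched without any trained model. The code below is a naive ORF finder (ATG to the first in-frame stop codon), plus a GC content helper of the kind used as a training feature in step 2. It is not how GLIMMER works, and it ignores alternative start codons and ribosome binding sites; all function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for each ATG...stop ORF in one reading frame.
    min_codons counts codons before the stop codon."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None

def six_frame_orfs(seq, min_codons=100):
    """Scan three frames on each strand. Coordinates for '-' strand ORFs
    refer to the reverse-complemented sequence."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end))
    return results

print(six_frame_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

A minimum length cut-off is needed because short ORFs occur by chance in any sequence; the trained approach described above replaces this crude filter with sequence statistics learned from the genome itself.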
[Figure: No homologues (%) for E. coli (4277), Pyrococcus horikoshii (2064), Haemophilus influenzae (1709), B. subtilis (4099) and Methanococcus jannaschii (1735)]
In addition to the 20 standard amino acids, two new but rare amino acids have now been identified:
- 21st: selenocysteine (Sec)
- 22nd: pyrrolysine (Pyl)
The Sec and Pyl containing proteins are predominantly found in members of class δ-proteobacteria, phylum Proteobacteria.
Metagenome analysis of the uncultured δ-proteobacteria of the gutless and mouthless worm Olavius algarvensis shows the highest proportion of Sec and Pyl containing proteins to date, suggesting that symbiosis promotes use of Sec and Pyl in the genetic code.
The same Olavius algarvensis metagenome also shows the widest use of the genetic code: 63 out of the 64 possible codons.
Sec: 99 genes, clustering into 30 protein families present in the domains Bacteria, Archaea and Eucarya. Sec is coded by UGA (UGA also acts as a stop codon).
Structural genomics
In order to gain a complete understanding of an organism and fully exploit the potential offered by microbial genome sequencing, it is essential that these unidentified ORFs are assigned a function.
In most cases classical molecular biology tools will be necessary for this task; however, some suggestion of function for these ORFs would greatly improve the efficiency of this process.
One possibility is structural genomics: the process of determining the three-dimensional structures of all the gene products encoded in a microbial genome (1000s of structures!).
Function can then be inferred on the basis of 3D structure comparisons to other proteins.
This relies on the principle that structure determines function: although two proteins with similar amino acid sequences can be assumed to have similar structures, two proteins with similar structures don't necessarily have the same amino acid sequence.
Microarray hybridisation
A completely annotated microbial genome sequence, whilst a powerful scientific tool, still doesn't provide all of the information needed to understand the complete biology of an organism, as it is essentially a static picture of the genome.
For truly complete characterisation, the dynamic nature of gene expression within a microbial cell needs to be determined.
Microarray technology allows whole organism gene expression to be investigated.
PCR products of every gene from a complete genome sequence are bound in a high density array on a glass slide.
These arrays are probed with fluorescently labelled cDNA prepared from whole RNA under specific environmental conditions.
The level of cDNA for each ORF is then quantified using high resolution image scanners.
Example: a microarray containing 97% of the predicted ORFs from Mycobacterium tuberculosis was used to investigate the response to the antituberculosis drug isoniazid (INH).
INH was found to induce several genes related to outer lipid envelope biosynthesis, consistent with the drug's physiological mode of action.
A number of additional genes were also induced, which may provide potential drug targets in the future.
Archaeal Genomes
Analysis of the 5 complete genomes available for members of the domain Archaea has provided new insights into relationships between Archaea, Bacteria and Eukaryotes.
Around 35% of the Archaeal genes form a stable core conserved throughout the domain; most of these encode proteins involved in transcription, translation and DNA metabolism, and some central metabolic pathways.
The remainder of the genome is classified as a variable shell.
A relatively high proportion of the variable shell genes are most homologous to their bacterial counterparts, which suggests horizontal gene transfer events.
A relatively high proportion of the stable core genes are most similar to Eukaryotic genes.
[Figure: A - Stable core; B - Variable shell]
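The stable core / variable shell split can be expressed as simple set operations once genes have been grouped into families. The sketch below assumes each genome is already described by a set of gene family identifiers (how those families are assigned is outside its scope); the genome and family names are made up for illustration.

```python
def core_and_shell(genomes):
    """genomes maps a genome name to the set of gene families it contains.
    Core = families present in every genome; shell = all other families."""
    family_sets = list(genomes.values())
    core = set.intersection(*family_sets)
    shell = set.union(*family_sets) - core
    return core, shell

genomes = {
    "genome_A": {"rpoB", "ef-tu", "famX"},
    "genome_B": {"rpoB", "ef-tu", "famY"},
    "genome_C": {"rpoB", "ef-tu", "famX", "famZ"},
}
core, shell = core_and_shell(genomes)
print(sorted(core), sorted(shell))
```

With only a handful of genomes the core estimate is sensitive to which genomes are included, so a figure such as the 35% quoted above should be read as specific to the genomes analysed.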
Summary
Microbial genome sequencing and analysis is a rapidly expanding and increasingly important strand of microbiology.
Important information about the specific adaptations and evolution of an organism can be determined from genome sequencing.
However, genome sequencing is merely a strong starting point on the road to completely understanding the biology of microorganisms.
Further characterisation of ORFs of unknown function, in combination with gene expression analysis and proteomics, is required.