Overview
Gene Finding
  GenScan
Comparative Genomics
  Gene Finding
    Slam
    Twinscan
  rVista
  ConSite
  Toucan
Programming Tools
  Languages
    Perl
    BioPerl
    BioJava
    Bio???
Link References
Acknowledgements
1953
2003
http://www.geneticscongress2003.com/index.php
Genomic data
Whole genome data sets, according to
http://www.ebi.ac.uk/genomes/ as of 28-May-03:
Archaea 16
Bacteria 107
Organelles 308
Phages 112
Plasmids 280
Viroids 40
Viruses 880
TOTAL: 1743
[Screenshot: eukaryote genome listing from http://www.ebi.ac.uk/genomes/ with per-chromosome links, FastA files and proteome pages for each organism. Organisms named in the listing include Leishmania major, Oryza sativa, Plasmodium falciparum and Trypanosoma brucei; one entry is marked "source MIPS".]
http://gnn.tigr.org/sequenced_genomes/genome_guide_p1.shtml
gnn.tigr.org
http://www.genomesonline.org/
It's a fact:
Counting at 1 base per second, 24 hours a day,
it would take you about
95 years to count the DNA in one cell.
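The figure can be checked with a few lines of arithmetic. The sketch below assumes a genome of roughly 3 billion bases (one haploid copy of the human genome); that number is an assumption, not something stated on the slide.

```python
# Assumption: ~3 billion bases, i.e. one haploid copy of the human genome.
GENOME_BASES = 3_000_000_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25


def years_to_count(bases, bases_per_second=1.0):
    """Years needed to count `bases` at a constant rate with no breaks."""
    return bases / bases_per_second / SECONDS_PER_YEAR


print(f"about {years_to_count(GENOME_BASES):.0f} years")
```

Under that assumption the result lands at roughly 95 years, in line with the slide.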
Francis Collins
Director, National Human Genome Research Institute
25th April 2003
"Here in the very month of the 50th anniversary of the
discovery of DNA's double helix, I am pleased and honored,
perhaps I should say exhilarated, to declare the goals
of the Human Genome Project to be completed."
Gene Finding
Gene finding is about detecting coding regions and
inferring gene structure.
Gene finding is difficult.
DNA sequence signals have low information
content (degenerate and highly unspecific).
It is difficult to discriminate real signals from chance matches.
Sequencing errors.
Prokaryotes: high gene density and simple gene structure,
but short genes carry little information and genes may overlap.
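To make "detecting coding regions" concrete, here is a minimal sketch of the simplest possible approach: scanning the forward strand for open reading frames (ATG to the first in-frame stop codon). It is an illustration only, not how GENSCAN or any other tool mentioned here works; the function name and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return half-open (start, end) spans of ATG...stop ORFs, forward strand only."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Made-up toy sequence with a single short ORF.
print(find_orfs("CCATGAAATTTTAGGG"))
```

A scan like this illustrates the problems listed above: short ORFs arise by chance, so a low threshold yields many false positives, and a single sequencing error can shift the frame or create a false stop codon.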
Gene Finding
A good review of gene finding has been
prepared by Lorenzo Cerutti of the Swiss
Institute of Bioinformatics for an EMBnet
course (September 2002) entitled "Gene
Finding".
It is at:
http://www.ch.embnet.org/CoursEMBnet/Pages02/slides/gene_finding.pdf
Gene Finders
GenScan: uses generalized hidden Markov models to predict complete gene structure.
http://genes.mit.edu/GENSCAN.html
FGENES: uses linear discriminant analysis.
http://genomic.sanger.ac.uk/gf/gf.shtml
GeneFinder:
http://www.cshl.org/genefinder
GRAIL 1, 1a, 2:
http://compbio.ornl.gov
Wise2:
http://www.sanger.ac.uk/Software/Wise2.
Procrustes:
http://hto-13.usc.edu/software/procrustes/index.html
Gene Finders
1. Overall performance is best for HMMgene and GENSCAN.
2. The accuracy of most programs depends on G+C content. HMMgene and
GENSCAN are the exceptions: they use different parameter sets for different
G+C contents.
3. For almost all the tested programs, medium-length exons (70-200 nucleotides
long) are the most accurately predicted. Accuracy decreases for shorter and
longer exons, except for HMMgene.
4. Internal exons are much more likely to be correctly predicted than initial or
terminal exons (start/stop codon detection is a weakness).
5. Initial and terminal exons are the most likely to be missed completely.
6. Only HMMgene and GENSCAN have reliable scores for exon prediction.
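Comparisons like the one above are usually reported at the exon level, counting a predicted exon as correct only when both of its boundaries match the annotation. The sketch below shows one common way to compute exon-level sensitivity and specificity. It is an assumption that the evaluation summarised here used exactly these definitions, and the coordinates are made up.

```python
def exon_level_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) using exact (start, end) matches.

    sensitivity = correctly predicted exons / annotated exons
    specificity = correctly predicted exons / predicted exons
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    exact = len(true_set & pred_set)
    sensitivity = exact / len(true_set) if true_set else 0.0
    specificity = exact / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity


# Hypothetical annotation and prediction for a three-exon gene.
annotated = [(100, 250), (400, 520), (700, 900)]
predicted = [(100, 250), (400, 530)]
sn, sp = exon_level_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Note that the second predicted exon overlaps the annotation almost completely but still counts as wrong, which is why exact-boundary measures are harsh on initial and terminal exons.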
GenScan
GENSCAN was developed by Chris Burge and
Comparative Genomics
[Figure: evolutionary relationships between metazoans that are sequenced, or due for sequencing, including human, mouse, rat and C. elegans. Evolutionary distances are in millions of years.]
Comparative Genomics
Comparative genomics may be defined as the
http://www.nature.com/cgi-taf/DynaPage.taf?file=/nrg/journal/v4/n4/full/nrg1043_fs.html
Comparative Genomics
There has been an explosion in the
availability of tools, which may make it
difficult to decide which tool is most
suitable for your research.
Indeed, to interpret these resources, you
must be aware of the differences between
them and between their underlying
assumptions.
Example of a comparative gene finder
Employs a generalised pair hidden Markov model
approach for predicting gene structures within
syntenic genomic sequences
Performs gene finding and alignment of the
sequences simultaneously
SLAM
SLAM has been used for whole-genome annotation projects.
For the mouse/human analysis, SLAM used a human/mouse synteny map,
giving segments which were further broken up into 300kb pieces.
These pieces were aligned by AVID.
SLAM was then run on all syntenic pieces using the AVID alignments as guides.
Coding lengths < 120 were discarded.
SLAM also predicted conserved non-coding regions (CNS), the first de novo
prediction of CNS in the human and mouse genomes.
The results are available at
http://bio.math.berkeley.edu/slam/mouse/
A similar result is available for Human/Rat.
seq1 SLAM CDS 2421 2478 . + 2 gene_id "000001"; transcript_id "000001.1"; frame "1"; exontype "internal"
seq1 SLAM CDS 3127 3805 . + 1 gene_id "000001"; transcript_id "000001.1"; frame "1"; exontype "internal"
------------------------------------------------------------
seq2 SLAM CDS 2134 2191 . + 2 gene_id "000001"; transcript_id "000001.1"; frame "2"; exontype "internal"
seq2 SLAM CDS 2867 3545 . + 1 gene_id "000001"; transcript_id "000001.1"; frame "2"; exontype "internal"
------------------------------------------------------------
> Protein 1: (244,244) aa (incomplete protein)
...
http://baboon.math.berkeley.edu/~syntenic/slam.html
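Annotation lines like the SLAM output above follow the familiar nine-column GFF/GTF layout (sequence, source, feature, start, end, score, strand, frame, attributes). As a rough sketch, they can be read in plain Python as below. The `Feature` type and `parse_line` helper are hypothetical names, and the code assumes whitespace-separated columns as they appear on the slide rather than the tab-separated form real files normally use.

```python
import re
from typing import NamedTuple


class Feature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    strand: str
    attributes: dict


# Attributes look like: gene_id "000001"; transcript_id "000001.1"; ...
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_line(line):
    """Parse one annotation line into a Feature (columns split on whitespace)."""
    fields = line.split(None, 8)
    if len(fields) < 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    return Feature(fields[0], fields[1], fields[2], int(fields[3]),
                   int(fields[4]), fields[6], dict(ATTRIBUTE.findall(fields[8])))


lines = [
    'seq1 SLAM CDS 2421 2478 . + 2 gene_id "000001"; transcript_id "000001.1"; frame "1"; exontype "internal"',
    'seq1 SLAM CDS 3127 3805 . + 1 gene_id "000001"; transcript_id "000001.1"; frame "1"; exontype "internal"',
]
features = [parse_line(line) for line in lines]
# Coordinates are 1-based and inclusive, so length is end - start + 1.
cds_length = sum(f.end - f.start + 1 for f in features)
print(features[0].attributes["exontype"], cds_length)
```

Summing CDS lengths per transcript in this way is the kind of step that a length filter such as "coding lengths < 120 were discarded" would build on.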
TwinScan
One of the first gene predictors to substantially
exceed the performance of GENSCAN on a
genomic scale by using mouse-human
comparison was TWINSCAN (Korf et al. 2001).
http://genes.cs.wustl.edu/query.html
http://www.sanger.ac.uk/cgibin/doublescan/submit
Regulatory Sequence
Leroy Hood brought out this point in
his talk at the Bio2001 meeting in San
Diego (24-28 June 2001) with his statement
that
"The difference between man and
monkey is gene regulation."
Toucan
http://www.esat.kuleuven.ac.be/~saerts/software/toucan.php/
Trafac
http://trafac.chmcc.org/trafac/index.jsp
rVista
http://teapot.jgi-psf.org/ovcharen/rvista/index.html
ConSite
http://forkhead.cgb.ki.se/cgi-bin/consite
ConSite
The method is implemented as a graphical web
application, ConSite, which is at:
http://forkhead.cgb.ki.se/cgi-bin/consite or
http://www.phylofoot.org/
Various tools are made available at phylofoot.org.
Toucan
http://www.esat.kuleuven.ac.be/~saerts/software/toucan.php
Programming Tools
Perl
Perl is remarkably good for slicing, dicing,
twisting, wringing, smoothing,
summarizing and otherwise mangling text!
Perl's powerful regular expression
matching and string manipulation
operators simplify this job in a way that is
unequalled by any other modern
language.
day.
Lincoln Stein.
Perl one-liners!
Take a BLAST output file and print all of the
GIs (GenBank identifiers) matched, one per
line.
Solution: one line of Perl.
perl -pe 'next unless ($_) = /^>gi\|(\d+)/;$_.="\n"' filename.blast
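For comparison, the same extraction can be sketched in Python. This is a rough equivalent rather than a replacement for the one-liner: it uses the same pattern (a line starting with ">gi|" followed by digits), and the sample header lines are invented for illustration.

```python
import re
import sys

# Same pattern as the Perl one-liner: ">gi|" at line start, then the digits.
GI_PATTERN = re.compile(r"^>gi\|(\d+)")


def gi_numbers(lines):
    """Yield the GI number from every line that starts with '>gi|<digits>'."""
    for line in lines:
        match = GI_PATTERN.match(line)
        if match:
            yield match.group(1)


# Invented sample lines in the style of a BLAST report.
sample = [
    "Query= my_sequence",
    ">gi|12345|ref|NP_000001.1| hypothetical protein",
    "          Length = 244",
    ">gi|678|gb|AAA00001.1| another hypothetical protein",
]
for gi in gi_numbers(sample):
    print(gi)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gi_numbers.py filename.blast
    with open(sys.argv[1]) as handle:
        for gi in gi_numbers(handle):
            print(gi)
```

The Python version is longer, which is rather the point of the slide: for this kind of text filtering, Perl's command-line switches make a one-line solution practical.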
Perl Modules/Programs
Perl can be used for complex programs.
The RepeatMasker program is written in Perl. It
calls other programs written in other
languages (Crossmatch, written in C).
SLiPPER is a 4500-line program written in Perl. It
calls RepeatMasker and Primer3 repeatedly and
processes their output files, writing
summarised results to disk.
SLiPPER
Sequence Length Polymorphism and Primer FindER
Utility of SLiPPER
[Figure: number of SSRs (y-axis, 0 to 40) along an x-axis running from 0 to 1,400,000, shown separately for dimer repeats and multimer repeats. Legend entries: "polymorphic" and "B6 = NZO".]
What is BioPerl?
BioPerl is a toolkit of Perl modules useful in
building bioinformatics solutions in Perl.
It is built in an object-oriented manner.
The collection of modules can be used to
run a large range of bioinformatics
programs and process their output files.
There are modules to carry out analyses,
to graph data and to read many data
formats.
BioJava
http://www.biojava.org/
The BioJava Project is an open-source
project dedicated to providing Java tools for
processing biological data.
BioJava is a general bioinformatics toolkit. It
provides a framework for building everything
from simple scripts to complete applications.
BioJava is designed to be used as a library.
BioJava
http://www.biojava.org/
Dynamic programming
IO
Processing, storing, manipulating
Visualising
GFF
Blast
Meme
Sequence Databases
BioCorba interoperability
ACeDB client
DAS client
LINKS
Internet Resources
Prediction of exons and gene structure
SLAM....http://baboon.math.berkeley.edu/~syntenic/slam.html
SGP-1....http://soft.ice.mpg.de/sgp-1
TwinScan....http://genes.cs.wustl.edu
Finding regulatory regions by phylogenetic footprinting
ConSite....http://forkhead.cgb.ki.se/cgi-bin/consite
rVISTA....http://teapot.jgi-psf.org/ovcharen/rvista/index.html
Toucan....http://www.esat.kuleuven.ac.be/~saerts/software/toucan.php
Whole-genome alignments in genome browsers
ECR browser....http://nemo.lbl.gov/ecrBrowser
Ensembl....http://www.ensembl.org
UCSC....http://genome.ucsc.edu
A comprehensive, straightforward links page, one of the best!
http://apollo11.isto.unibo.it/
Genome aligners
AVID....http://www-gsd.lbl.gov/vista/details_avid.htm
BLASTZ....http://bio.cse.psu.edu
BLAT....http://genome.ucsc.edu
Exonerate....http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/Exonerate.html
GLASS....http://crossspecies.lcs.mit.edu
LAGAN/MLAGAN....http://lagan.stanford.edu
MegaBLAST....http://www.ncbi.nih.gov/blast/tracemb.html
MUMmer....http://www.tigr.org/software/mummer
PatternHunter....http://www.bioinformaticssolutions.com/products/ph.php
WABA....http://www.cse.ucsc.edu/~kent/xenoAli/index.html
http://www.ornl.gov/TechResources/Human_Genome/education/images.html
Other Website Image Galleries and
Resources
NIH NHGRI Press Photos
CSHL Eugenics Archive
RasMol Protein Gallery
Photos of normal and abnormal chromosomes
Access Excellence Graphics Gallery
The Why Files Cool Image Gallery
Genetics Animation Gallery
Molecular Expressions Photo Gallery
Gene Maps
1999 Online Gene Map from NCBI.
Clickable 1996 Gene Map from Science magazine.
You can click on any one of the 24 different human
chromosomes and see examples of genes found.
Chromosome Maps
human chromosome 16
human chromosome 19
Acknowledgements
WEHI Bioinformatics group
Tim Beissbarth
Alex Gout
Terry Speed
All the others in Bioinformatics who provide a
great environment to work in and with.
Grant Morahan
WEHI ITS - who provide the best infrastructure
of anywhere I know of.