Tools For Genomic Data

WEHI Postgraduate Seminar Series 2003
Tools for Maximising the Value of Genomic Data
Keith Satterley, Bioinformatics,
The Walter & Eliza Hall Institute of Medical Research
2 June 2003
keith@wehi.edu.au
http://bioinf.wehi.edu.au/resources/presentations.html
Overview
1. Genomic data: what is it, where is it
2. Gene Finding
   1. GenScan
3. Comparative Genomics
   1. Gene Finding
      1. Slam
      2. Twinscan
   2. Finding Regulatory Regions
      1. rVista
      2. Consite
      3. Toucan
4. Programming Tools
   1. Languages
      1. Perl
      2. BioPerl
      3. BioJava
      4. Bio???
   2. Slipper - a Perl program & results
5. Link References
6. Acknowledgements
1953
2003
http://www.geneticscongress2003.com/index.php
Genomic data
Whole genome data sets, according to
http://www.ebi.ac.uk/genomes/ as at 28-May-03:
Archaea 16
Bacteria 107
Organelles 308
Phages 112
Plasmids 280
Viroids 40
Viruses 880
TOTAL: 1743
Eukaryota (completed chromosomes)
Source: http://www.ebi.ac.uk/genomes/
(The source page also has a Proteins column with FastA and Proteome pages links, and a "source MIPS" label whose row is not recoverable here.)

Description: Chromosomes
Anopheles gambiae (Ensembl project data): 2L 2R 3L 3R X
MUSTARD: Arabidopsis thaliana complete genome: I II III IV V
WORM: Caenorhabditis elegans complete genome: I II III IV V X
FLY: Drosophila melanogaster complete genome: X, 2-4, Y
Encephalitozoon cuniculi complete genome: I II III IV V VI VII VIII IX X XI
HUMAN: Homo sapiens complete genome (Ensembl project data): 1-22 X Y
Homo sapiens complete genome parts (CON files): 14 21q
Leishmania major
MOUSE: Mus musculus complete genome: 1-19 X
Oryza sativa
Plasmodium falciparum: 1-14
RAT: Rattus norvegicus complete genome: 1-20 X
YEAST: Saccharomyces cerevisiae strain S288C complete genome: I-XVI
YEAST: Schizosaccharomyces pombe strain 972h- complete genome: I-III
Trypanosoma brucei
http://gnn.tigr.org/sequenced_genomes/genome_guide_p1.shtml
gnn.tigr.org
GOLD Genomes Online Database
http://www.genomesonline.org/
It's a Fact:
Counting at 1 base per second, 24 hours a day,
it would take you about
95 years to count the DNA in one cell.
Francis Collins
Director, National Human Genome Research Institute
25 April 2003
"Here in the very month of the 50th anniversary of the
discovery of DNA's double helix, I am pleased and honored
(perhaps I should say exhilarated) to declare the goals
of the Human Genome Project to be completed."
3.1 million years to count to 100 trillion!
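The two counting figures can be checked with a few lines of arithmetic. This is a minimal Python sketch; the value of 3 billion bases for "the DNA in one cell" is an assumption (the slide does not state the number it used), and the function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def years_to_count(n_items, per_second=1.0):
    """Years needed to count n_items at a constant rate, with no breaks."""
    return n_items / per_second / SECONDS_PER_YEAR


# Assumption: "the DNA in one cell" is taken as roughly 3 billion bases.
genome_years = years_to_count(3e9)
# 100 trillion, the figure quoted on the slide.
trillion_years = years_to_count(100e12)

print(f"{genome_years:.0f} years for 3e9 bases")
print(f"{trillion_years / 1e6:.2f} million years for 1e14")
```

Under these assumptions the results land at about 95 years and a little over 3.1 million years, in line with the slide.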
"...the information that will matter to you about your life is a fraction of your genetic
code, probably less than 1 percent." J. Craig Venter, 25-04-2003 (Bio-IT World)
http://www.genomesonline.org/
Most Recent Genomics News

BETHESDA, Md., May 20, 2003
By June, researchers from the Whitehead/MIT Center and the
Genome Sequencing Center at Washington University School
of Medicine expect to complete the sequencing work
(approximately four-fold coverage) necessary to create an
initial working draft of the genome of the chimpanzee (Pan
troglodytes).
The Whitehead/MIT team expects to complete a high-quality
draft of the dog genome sequence within the next 12 months.
After the genome of the boxer is sequenced, researchers plan
to sample and analyze DNA from 10 to 20 other dog breeds,
including the beagle, to study genetic variation within the
canine species.
http://www.genome.gov/11007358
Gene Finding
Gene finding is about detecting coding regions and
inferring gene structure.
Gene finding is difficult:
- DNA sequence signals have low information content (degenerate and highly unspecific)
- It is difficult to discriminate real signals
- Sequencing errors
- Prokaryotes: high gene density and simple gene structure; short genes have little information; overlapping genes
- Eukaryotes: low gene density and complex gene structure; alternative splicing; pseudogenes
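The simplest form of "detecting coding regions" is an open reading frame (ORF) scan. The Python sketch below is an illustration only: it looks at the forward strand, ignores introns, and is not how any of the gene finders named in this talk work. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs, 0-based half-open, of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon, inclusive.
    min_codons counts every codon in the ORF, including ATG and the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example_orfs = find_orfs("CCATGAAATTTTAGGG")
print(example_orfs)
```

A scan like this shows why short genes are hard: with a low `min_codons`, random sequence produces many spurious ORFs, and in eukaryotes introns break the reading frame altogether.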
Gene Finding
A good gene finding review has been
prepared by Lorenzo Cerutti of the Swiss
Institute of Bioinformatics. It is an EMBnet
course (September 2002) entitled "Gene
Finding".
It is at:
http://www.ch.embnet.org/CoursEMBnet/Pages02/slides/gene_finding.pdf
Gene Finders
GenScan - Uses generalized hidden Markov models to predict complete gene structure.
http://genes.mit.edu/GENSCAN.html
MZEF - Designed to predict only internal coding exons.
http://www.cshl.org/genefinder
FGENES - Uses linear discriminant analysis.
http://genomic.sanger.ac.uk/gf/gf.shtml
GeneFinder
http://www.cshl.org/genefinder
GRAIL 1, 1a, 2
http://compbio.ornl.gov
HMMgene - Designed to predict complete gene structure.
http://genome.cbs.dtu.dk/services/HMMgene
Genewise - Uses HMMs. Genewise is part of the Wise2 package.
http://www.sanger.ac.uk/Software/Wise2
Procrustes - Predicts gene structure from homology found in proteins.
http://hto-13.usc.edu/software/procrustes/index.html
GeneMark.hmm - Recently modified to predict gene structure in eukaryotes.
http://opal.biology.gatech.edu/GeneMark
Geneid - Recently updated to a new and faster version.
http://www1.imim.es/geneid.html
Gene Finders
1. Overall performance is best for HMMgene and GENSCAN.
2. The accuracy of some programs depends on G+C content; HMMgene and GENSCAN
are exceptions because they use different parameter sets for different
G+C contents.
3. For almost all the tested programs, medium exons (70-200 nucleotides
long) are most accurately predicted. Accuracy decreases for shorter and
longer exons, except for HMMgene.
4. Internal exons are much more likely to be correctly predicted than initial or
terminal exons (start/stop codon detection is weak).
5. Initial and terminal exons are most likely to be missed completely.
6. Only HMMgene and GENSCAN have reliable scores for exon prediction.
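Point 2 turns on the G+C content of the input sequence. A minimal Python sketch of measuring it, and of choosing a parameter set from it, might look like the following. The bin boundaries and labels in `PARAMETER_BINS` are hypothetical placeholders, not the values any particular program uses.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


# Hypothetical bins: (exclusive upper bound on G+C fraction, parameter set label).
PARAMETER_BINS = [(0.43, "low"), (0.51, "medium"), (0.57, "high"), (1.01, "very_high")]


def pick_parameter_set(seq):
    """Return the label of the first bin whose upper bound exceeds the G+C fraction."""
    fraction = gc_content(seq)
    for upper, label in PARAMETER_BINS:
        if fraction < upper:
            return label
    return PARAMETER_BINS[-1][1]


print(gc_content("GGCCAATT"), pick_parameter_set("GGCCAATT"))
```

A program without such a switch applies one set of statistics to every sequence, which is one plausible reason its accuracy would vary with G+C content.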
Gene prediction limits
1. Existing predictors are for protein coding regions
2. Non-coding areas are not detected (5' and 3' UTR)
3. Non-coding RNA genes are missed
4. Predictions are for typical genes
5. Partial genes are often missed
6. Training sets may be biased
7. Atypical genes use other grammars
GenScan
GENSCAN was developed by Chris Burge and
Samuel Karlin, Department of Mathematics, Stanford University.
Genscan is a general probabilistic model of the
gene structure of human genomic sequences.
Genscan identifies complete exon/intron structures
of genes in both strands of genomic DNA.
The new Genscan Web Server is at
http://genes.mit.edu/GENSCAN.html
Genscan is also available for WEHI people, with a greater choice of options, at
http://www.wehi.edu.au/resources/PBC/index.html
Burge, C. and Karlin, S. (1997) Prediction of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268, 78-94.
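To give a feel for what a probabilistic model of gene structure does, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm, written in Python. It is a sketch only: GENSCAN uses a much richer generalized HMM with many states, and every probability below is invented for illustration.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are made up for illustration only.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Most probable state path for seq under the toy HMM, computed in log space."""
    if not seq:
        return []
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[k][s] = best predecessor of state s at position k + 1
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


toy_path = viterbi("ATATATAT" + "GCGCGCGCGCGC")
print(toy_path)
```

With these invented numbers the A/T-rich prefix is labelled noncoding and the G/C-rich suffix coding. Real gene models replace the two states with exons, introns, splice sites and intergenic regions, each with its own length and composition statistics.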
Quotes from the 50/50 series of
interviews by Bio-IT World
Gene Myers
Professor, Dept. of Electrical Engineering & Computer Sciences,
University of California, Berkeley.
"If you take a sequence and just run a gene
prediction program on it, the programs don't
usually do very well. But if you take human
and mouse sequence, and compare them
against each other, looking for similar
regions, you get better predictions. And
the more genomes we have, the better it will
get."

Richard Durbin
Head of Informatics, Wellcome Trust Sanger Institute.
"Looking at the similarity between the human
genome and other species is a really powerful way
to get at functional sequences and to allow us to
work on them in different species.
Several groups, including ours, have gene-finding
methods for comparative genomics. This is an
active area where we will see significant advances
in the next few years."
Comparative Genomics
The assumption that underlies comparative genomics is that the two
genomes had a common ancestor and that each organism is a combination
of the ancestor and the action of evolution.
Evolution can be broadly thought of as the combination of two processes:
mutational forces that generate random mutations in the genome
sequence, and selection pressures that
1. Eliminate random mutations (negative selection),
2. Have no effect on mutations (neutral selection), or
3. Increase the frequency of mutant alleles in the population as a result
of a gain in fitness (positive selection).
The combined action of mutation and selection is represented generally by a
RATE MATRIX of base-pair changes between the two observed genomes.
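The raw material for such a matrix is a table of how often each base in one genome is aligned to each base in the other. The Python sketch below tabulates those observed frequencies from a pairwise alignment. It is a simplification: turning observed changes into evolutionary rates requires a substitution model and a divergence time, which are not attempted here, and the function name is illustrative.

```python
from collections import Counter

BASES = "ACGT"


def substitution_frequencies(aligned_a, aligned_b):
    """Row-normalised table of base changes between two aligned sequences.

    table[x][y] is the fraction of columns with x in the first sequence
    that have y in the second. Columns with a gap ('-') or an ambiguous
    base in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x in BASES and y in BASES:
            counts[(x, y)] += 1
    table = {}
    for x in BASES:
        row_total = sum(counts[(x, y)] for y in BASES)
        table[x] = {y: (counts[(x, y)] / row_total if row_total else 0.0)
                    for y in BASES}
    return table


freqs = substitution_frequencies("ACGTAC-A", "ACGTAT-G")
print(freqs["A"])
```

Regions under negative selection show up as rows dominated by the diagonal (few changes), which is the signal the comparative tools in the following slides exploit.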
Comparative Genomics
[Figure: Evolutionary relationship between metazoans that are sequenced, or due for sequencing. Evolutionary distances are in millions of years. Labels include Human, Mouse, Rat and C. elegans.]
Comparative genomics may be defined as the
derivation of genomic information following
comparison of the information content of two or
more species' genome sequences.
There is a good article in Nature Reviews Genetics, April
2003, Vol 4 No 4, pp 251-262:
"Comparative Genomics: Genome-Wide
Analysis in Metazoan Eukaryotes",
Abel Ureta-Vidal, Laurence Ettwiller &
Ewan Birney, 2003
http://www.nature.com/cgi-taf/DynaPage.taf?file=/nrg/journal/v4/n4/full/nrg1043_fs.html
The similarity is such that human chromosomes can be cut

(schematically at least) into about 150 pieces (only about 100 are
large enough to appear here), then reassembled into a reasonable
approximation of the mouse genome.
http://www.ornl.gov/TechResources/Human_Genome/graphics/slides/ttmousehuman.html
There has been an explosion in the
availability of tools, which may make it
difficult to decide which tool is most
suitable for your research.
Indeed, to interpret these resources, you
must be aware of the differences between
them and between their underlying
assumptions.
Whole Genome Alignments
Kbrowser: http://hanuman.math.berkeley.edu/cgibin/kbrowser
A multiple genome browser, currently set up for human, mouse and rat,
based on the MAVID alignments and the UCSC genome browser.
Comparative Gene Prediction
SLAM
http://baboon.math.berkeley.edu/~syntenic/slam.html
Example of a comparative gene finder.
Employs a generalised pair hidden Markov model
approach for predicting gene structures within
syntenic genomic sequences.
Performs gene finding and alignment of the
sequences simultaneously.
SLAM
SLAM has been used for whole genome annotation projects.
For the Mouse/Human analysis, SLAM used a human/mouse synteny map,
giving segments which are further broken up into 300 kb pieces.
These pieces are aligned by AVID.
SLAM then ran on all syntenic pieces using AVID alignments as guides.
Coding lengths < 120 were discarded.
SLAM also predicted conserved non-coding regions (CNS), the first de novo
prediction of CNS in the human and mouse genome.
The results are available at
http://bio.math.berkeley.edu/slam/mouse/
A similar result is available for Human/Rat.
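Two of the bookkeeping steps above, cutting a syntenic segment into 300 kb pieces and discarding short coding predictions, can be sketched in a few lines of Python. The function names are hypothetical, and how SLAM itself handles piece boundaries is not described in the source.

```python
PIECE_SIZE = 300_000      # 300 kb pieces, as described above
MIN_CODING_LENGTH = 120   # predictions with coding length below this are discarded


def split_segment(start, end, piece_size=PIECE_SIZE):
    """Split the half-open interval [start, end) into consecutive pieces."""
    return [(s, min(s + piece_size, end)) for s in range(start, end, piece_size)]


def keep_long_predictions(coding_lengths, minimum=MIN_CODING_LENGTH):
    """Keep only predictions whose coding length is at least `minimum`."""
    return [n for n in coding_lengths if n >= minimum]


pieces = split_segment(0, 700_000)
kept = keep_long_predictions([58, 120, 737])
print(pieces, kept)
```

The last piece of a segment is simply shorter than the rest here; a production pipeline might instead overlap pieces so that genes spanning a boundary are not lost.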
seq1 SLAM CDS 2421 2478 . + 2 gene_id "000001"; transcript_id "000001.1"; frame "1"; exontype "internal"
seq1 SLAM CDS 3127 3805 . + 1 gene_id "000001"; transcript_id "000001.1"; frame "1"; exontype "internal"
------------------------------------------------------------
seq2 SLAM CDS 2134 2191 . + 2 gene_id "000001"; transcript_id "000001.1"; frame "2"; exontype "internal"
seq2 SLAM CDS 2867 3545 . + 1 gene_id "000001"; transcript_id "000001.1"; frame "2"; exontype "internal"
------------------------------------------------------------
> Protein 1: (244,244) aa (incomplete protein)
   1 KCEAIASDCF LSGNVDIELK DHNNCISKIN VEDQKNCALS WAFASIYHLE 50
      CE IAS CF LSGNVDIE K D ++C S I   E+Q NC LS W F S  HLE
   1 TCERIASSCF LSGNVDIEWK DKSSCFSSIE TEEQGNCNLS WLFTSKTHLE 50
http://baboon.math.berkeley.edu/~syntenic/slam.html
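Output lines in this GFF-like layout are easy to load into a script. The Python sketch below assumes nine whitespace-separated columns followed by `key "value";` attribute pairs, as in the sample lines shown on the slide; it has not been checked against the full SLAM output specification, and the function name is illustrative.

```python
import re

ATTRIBUTE_PATTERN = re.compile(r'(\w+) "([^"]*)"')


def parse_prediction_line(line):
    """Parse one GFF-style line into a dict with typed coordinates."""
    fields = line.strip().split(None, 8)
    if len(fields) < 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": dict(ATTRIBUTE_PATTERN.findall(attr_text)),
    }


sample = ('seq1 SLAM CDS 2421 2478 . + 2 gene_id "000001"; '
          'transcript_id "000001.1"; frame "1"; exontype "internal"')
record = parse_prediction_line(sample)
# GFF coordinates are 1-based and inclusive, so length is end - start + 1.
exon_length = record["end"] - record["start"] + 1
print(record["attributes"]["gene_id"], exon_length)
```

Grouping the parsed records by `gene_id` and summing exon lengths gives the total coding length of each predicted gene.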
TwinScan
One of the first gene predictors to substantially
exceed the performance of GENSCAN on a
genomic scale by using mouse-human
comparison was TWINSCAN (Korf et al. 2001).
http://genes.cs.wustl.edu/query.html
Other Comparative Gene Predictors
DoubleScan....http://www.sanger.ac.uk/cgibin/doublescan/submit
A program for comparative ab initio prediction of
protein coding genes in mouse and human DNA.
Generates exon candidates in both sequences.
SGP-1....http://soft.ice.mpg.de/sgp-1
SGP-1 is a similarity-based gene prediction
program. Given two genomic DNA sequences, it
post-processes the pairwise local alignment to
predict single or multiple gene models of protein
coding genes in forward and reverse strands.
Regulatory Sequence
Leroy Hood brought out this point in
his talk at the Bio2001 meeting in San
Diego (24-28 June 2001) with his statement
that
"The difference between man and
monkey is gene regulation."

Lincoln Stein
Associate Professor, Cold Spring Harbor Laboratory.
"I think the places that we should be looking at
now are the non-repetitive, unique, noncoding DNA. If they are conserved, they
must be important. There are discoveries in
there."
Finding regulatory regions
rVISTA....http://teapot.jgi-psf.org/ovcharen/rvista/index.html
Consite....http://forkhead.cgb.ki.se/cgi-bin/consite
Footprinter....http://abstract.cs.washington.edu/~blanchem/FootPrinterWeb/FootprinterInput.pl
Toucan....http://www.esat.kuleuven.ac.be/~saerts/software/toucan.php/
Trafac....http://trafac.chmcc.org/trafac/index.jsp
VISTA is a set of tools for comparative genomics. It was designed to visualize
long sequence alignments of DNA from two or more species with annotation
information.
- AVID: the alignment engine behind VISTA, a program for globally aligning DNA sequences of arbitrary length.
- mVISTA (main VISTA): a program for visualizing alignments of an arbitrary number of genomic sequences from different species.
- rVISTA (regulatory VISTA): combines a transcription factor binding site database search with comparative sequence analysis.
rVista
A program that combines transcription factor
binding site (TFBS) searches with
comparative sequence analysis.
1. Human and mouse sequences are aligned using the global alignment program MAVID.
2. Potential transcription factor binding sites are predicted by the Match program, based on the TRANSFAC Professional 5.3 library.
3. The human-mouse sequence conservation of the DNA region spanning each transcription factor binding site is assessed using a novel strategy.
Human and/or mouse annotation determines the genomic location of each
predicted transcription factor hit.
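The idea behind the steps above, keeping only those predicted binding sites that sit in conserved sequence, can be caricatured in Python. This sketch substitutes an exact string match for the Match/TRANSFAC prediction step and plain percent identity for rVISTA's own conservation test, so it illustrates the principle only. The function name, the example motif and the `min_identity` cutoff are hypothetical.

```python
def conserved_sites(human_aln, mouse_aln, motif, min_identity=0.8):
    """Alignment columns where `motif` occurs in human and the window is conserved.

    Both inputs are rows of the same pairwise alignment (equal length, '-' for gaps).
    Returns (column, identity) pairs; a gap in mouse counts as a mismatch.
    """
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")
    motif = motif.upper()
    width = len(motif)
    hits = []
    for i in range(len(human_aln) - width + 1):
        human_window = human_aln[i:i + width].upper()
        if human_window != motif:
            continue
        mouse_window = mouse_aln[i:i + width].upper()
        identity = sum(a == b for a, b in zip(human_window, mouse_window)) / width
        if identity >= min_identity:
            hits.append((i, identity))
    return hits


human = "AAA" + "TGACGTCA" + "AAA" + "TGACGTCA" + "TT"
mouse = "AAA" + "TGACGTCA" + "AAA" + "TGTTTTCA" + "TT"
site_hits = conserved_sites(human, mouse, "TGACGTCA")
print(site_hits)
```

In the example the motif occurs twice in the human row, but only the first occurrence is conserved in mouse, so only that one is reported. Filtering of this kind is what cuts down the large number of chance matches that any binding-site search produces on its own.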
ConSite
http://forkhead.cgb.ki.se/cgi-bin/consite
Identification of conserved regulatory
elements by comparative genome analysis
Boris Lenhard*, Albin Sandelin*, Luis
Mendoza*, Pär Engström*,
Niclas Jareborg* and Wyeth W Wasserman*
BioMed Central - Open Access
Journal of Biology
ConSite - Identification of conserved regulatory
elements by comparative genome analysis
ConSite is a web-based tool for detecting
transcription factor binding sites in
genomic sequences using phylogenetic
footprinting.
Two orthologous genomic sequences are
aligned, and transcription factor binding
sites are reported only for those regions of
the alignment that exceed a certain
threshold of conservation.
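A conservation threshold of this kind is commonly applied to a sliding window over the alignment. The Python sketch below computes percent identity per window and returns the windows above a cutoff. The window size and threshold defaults are hypothetical, and ConSite's actual conservation calculation is not described in the source.

```python
def conserved_windows(seq_a, seq_b, window=50, threshold=0.7):
    """Start columns of alignment windows whose identity exceeds `threshold`.

    seq_a and seq_b are rows of one pairwise alignment ('-' for gaps).
    Identity is the fraction of window columns with the same base in both rows;
    gap columns never count as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    starts = []
    for i in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[i:i + window].upper(), seq_b[i:i + window].upper())
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        if matches / window > threshold:
            starts.append(i)
    return starts


row_a = "ACGTACGTAC" + "TTTTTTTTTT"
row_b = "ACGTACGTAC" + "GGGGGGGGGG"
window_starts = conserved_windows(row_a, row_b, window=10, threshold=0.7)
print(window_starts)
```

Binding-site predictions would then be reported only if they fall inside one of the returned windows. Raising the threshold trades sensitivity for fewer false positives.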
ConSite
The method is implemented as a graphical web
application, ConSite, which is at:
http://forkhead.cgb.ki.se/cgi-bin/consite or
http://www.phylofoot.org/
Various tools are made available at phylofoot.org.
Sequence View
Toucan
http://www.esat.kuleuven.ac.be/~saerts/software/toucan.php
Toucan is a workbench for regulatory
sequence analysis on metazoan genomes:
comparative genomics, detection of significant
transcription factor binding sites, and detection
of cis-regulatory modules (combinations of
binding sites) in sets of coexpressed/coregulated
genes.
It is a standalone Java application that is tightly linked
with Ensembl, and was built using the BioJava
package.
Perl: A Programming Language

What is Perl?
Perl actually stands for
Practical Extraction and Report Language,
and was invented by Larry Wall.
Perl is supported by its users and was all
written by volunteers.
Programming Tools
Perl
Perl is remarkably good for slicing, dicing,
twisting, wringing, smoothing,
summarizing and otherwise mangling text!
Perl's powerful regular expression
matching and string manipulation
operators simplify this job in a way that is
unequalled by any other modern
language.
Perl & Genome Data

Although genome informatics groups are
constantly tinkering with other "high level"
languages such as Python, Tcl and recently Java,
nothing comes close to Perl's popularity.
In short, when the genome project was
foundering in a sea of incompatible data formats,
rapidly-changing techniques, and monolithic data
analysis programs that were already antiquated
on the day of their release, Perl saved the
day.
Lincoln Stein.
Perl One-Liners!
Take a BLAST output and print all of the
gi's (GenBank Identifiers) matched, one per
line.
Solution: one line of Perl.
perl -pe 'next unless ($_) = /^>gi\|(\d+)/;$_.="\n"' filename.blast
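For comparison, the same extraction can be written as a short Python function. This is a sketch that assumes, as the one-liner does, that hit lines begin with `>gi|` followed by digits; the function name is illustrative.

```python
import re

GI_PATTERN = re.compile(r"^>gi\|(\d+)")


def gi_numbers(lines):
    """Yield the gi number from each line that starts with '>gi|<digits>'."""
    for line in lines:
        match = GI_PATTERN.match(line)
        if match:
            yield match.group(1)


sample_lines = [
    ">gi|12345|gb|AAA00001.1| example protein\n",
    "Query= my_sequence\n",
    ">gi|678|ref|NP_000001.1| another hit\n",
]
found = list(gi_numbers(sample_lines))
print("\n".join(found))
```

To run it on a file, pass an open file handle, for example `gi_numbers(open("filename.blast"))`. The Perl version is shorter, which is the point the slide is making; the Python version trades brevity for named parts.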
Perl Modules/Programs
Perl can be used for complex programs.
The RepeatMasker program is written in Perl. It
calls other programs written in other
languages (Crossmatch, written in C).
Slipper is a 4500-line program written in Perl. It
calls RepeatMasker and Primer3 repeatedly and
processes the output files from them, writing
summarised results to disk.
SLiPPER
Sequence Length Polymorphism and Primer FindER
Programming: Keith Satterley, Specifications: Grant Morahan

Division of Bioinformatics & Genetics,
The Walter & Eliza Hall Institute of Medical Research
Slipper
Masks Alu etc. repeats (using RepeatMasker);
Selects SSLRs with user-specified parameters;
Designs primers (using Primer3).
Grant Morahan selects and tests chosen SSLRs
to become Microsatellite Markers on the Mouse
Genome.
Aim: to derive a first-generation systematic map of the
mouse, with sub-centiMorgan (1 Mb) resolution,
then extend to 10 times this density over 50 strains.
UTILITY OF SLIPPER
[Figure: Number of SSRs (axis 0 to 40) against position on chromosome (bp, axis 0 to 1,400,000). Legend: Dimer repeats; Multimer repeats; O* = polymorphic; O* B6 = NZO.]
Possible SLIPPER Data Analysis
STRAIN RELATEDNESS AND EVOLUTION
- graphic depiction of allele sharing between strains
- probability of IBD v allele convergence by mutation
- comparison of close strain relatedness
(e.g. B6 v B10; B6 v Ka; D1 v D2; NOD v NOR)
- overall strain relatedness > cladogram
- pairwise strain dimorphism rate,
useful for choosing 2 strains to be used in a cross
- comparison of results for reduced strains set with MIT markers
- comparison of haplotypes with Phenome database
O|B|F - Open Bioinformatics Foundation
The Open Bioinformatics Foundation is a non-profit,
volunteer-run organization focused on supporting open
source programming in bioinformatics.
The foundation grew out of the volunteer projects
Bioperl, BioJava and Biopython.
It:
- Underwrites and supports the BOSC conferences.
- Organizes and supports developer-centric "hackathon" events.
- Manages servers, bank account & other assets.
Open Bioinformatics Foundation

PROJECTS
BioPerl
BioJava
BioPython
BioRuby
BioPipe
BioSQL / OBDA
MOBY
DAS
BioPathways
EMBOSS
Open Bioinformatics Foundation
June 27-28 2003 -- 4th Annual Bioinformatics

Open Source Conference
www.open-bio.org/bosc/
ISMB 2003 - Brisbane
Normally held in Europe and North America.
For 2 days beforehand:
BOSC (Open Source Conference),
Biopathways, BioOntology, Text Mining & WEB03.
Tutorials on Sunday: choose 2 from 15 offered.
ISMB for 4 days: over 50 talks, none in parallel!
http://www.iscb.org/ismb2003/index.shtml
The bioperl project

Officially organized in 1995 and existing
informally for several years prior, The
Bioperl Project is an international
association of developers of open source
Perl tools for bioinformatics, genomics and
life science research.
What is BioPerl
Bioperl is a toolkit of Perl modules useful in
building bioinformatics solutions in Perl.
It is built in an object-oriented manner.
The collection of modules can be used to
run a large range of bioinformatics
programs and process their output files.
There are modules to carry out analyses,
to graph data and to read many data
formats.
BioJava
http://www.biojava.org/
The BioJava Project is an open-source
project dedicated to providing Java tools for
processing biological data.
BioJava is a general bioinformatics toolkit. It
provides a framework for building everything
from simple scripts to complete applications.
BioJava is designed to be used as a library.
BioJava
http://www.biojava.org/
Currently, there are objects for:
- Sequences and features
  - IO
  - Processing, storing, manipulating
  - Visualising
- Dynamic programming
  - Single-sequence and pair-wise HMMs
  - Viterbi-path, Forward and Backward algorithms
  - Training models
  - Sampling sequences from models
- External file formats and programs
  - GFF
  - Blast
  - Meme
- Sequence Databases
  - BioCorba interoperability
  - ACeDB client
  - DAS client
Other Open Source Projects
BioDAS - Distributed Annotation System
(DAS): a server system for the sharing of
Reference Sequences.
Biopython - Tools for computational
molecular biology, written in Python (an excellent
language for beginners, yet superb for
experts).
BioRuby, BioSQL, MOBY, BioPathways
and BioOpera.
LINKS
Internet Resources
Finding regulatory regions by phylogenetic footprinting
Consite....http://forkhead.cgb.ki.se/cgi-bin/consite
rVISTA....http://teapot.jgi-psf.org/ovcharen/rvista/index.html
Toucan....http://www.esat.kuleuven.ac.be/~saerts/software/toucan.php
Whole-genome alignments in genome browser
ECR browser....http://nemo.lbl.gov/ecrBrowser
Ensembl....http://www.ensembl.org
UCSC....http://genome.ucsc.edu
A comprehensive, straightforward Links Page, one of the best!
http://apollo11.isto.unibo.it/
Genome Links from Ewan Birney et al.
Genome aligners
AVID....http://www-gsd.lbl.gov/vista/details_avid.htm
BLASTZ....http://bio.cse.psu.edu
BLAT....http://genome.ucsc.edu
Exonerate....http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/Exonerate.html
GLASS....http://crossspecies.lcs.mit.edu
LAGAN/MLAGAN....http://lagan.stanford.edu
MegaBLAST....http://www.ncbi.nih.gov/blast/tracemb.html
MUMmer....http://www.tigr.org/software/mummer
PatternHunter....http://www.bioinformaticssolutions.com/products/ph.php
WABA....http://www.cse.ucsc.edu/~kent/xenoAli/index.html
Prediction of exons or coding regions

DIALIGN2....http://bibiserv.techfak.uni-bielefeld.de/dialign
ExoFish....http://www.genoscope.cns.fr/proxy/cgi-bin/exofish.cgi
OrthoSeq....http://www.phylofoot.org/cgi-bin/orthoseq.cgi
ROSETTA/GLASS....http://crossspecies.lcs.mit.edu
Prediction of exons and gene structure

DoubleScan....http://www.sanger.ac.uk/Software/analysis/doublescan
SLAM....http://baboon.math.berkeley.edu/~syntenic/slam.html
TwinScan....http://genes.cs.wustl.edu
Genomics Web sites
Functional and Comparative Genomics Research - More technical information

on HGP involvement with comparative and functional genomics.
Virtual Library of Genetics - Links to genetic and genomic information
organized by organism.
Microbial Genome Program - U.S. Department of Energy program to study the
genetic material of microbes that may be useful in helping DOE fulfill its
missions.
DOE Joint Genome Institute - Consortium of U.S. Department of Energy
researchers developing and exploiting new technologies as a means for
discovering and characterizing the basic principles and relationships
underlying living systems.
A Quick Guide to Sequenced Genomes - Illustrated index of organisms that
have had their genomes sequenced. From the Genome News Network.
Model Organisms for Biomedical Research - Information on model organisms
from the National Institutes of Health.
Mouse Genome Resources - Gateway to mouse resources in and beyond
National Center for Biotechnology Information (NCBI) resources.
Functional Genomics - Gateway to functional genomics sources from Science.
Ecce homology: A primer on comparative genomics - From Modern Drug
Discovery, a publication of the American Chemical Society.
Image Gallery links
http://www.ornl.gov/TechResources/Human_Genome/education/images.html
Gallery 1: Genome Science

Gallery 2: Genome Tools and Technologies
Gallery 3: Genomes to Life
Gallery 4: Human Genome Project
Gallery 5: Ethical, Legal, and Social Issues;
Genomic Medicine
Other Website Image Galleries and
Resources
NIH NHGRI Press Photos
CSHL Eugenics Archive
RasMol Protein Gallery
Photos of normal and abnormal chromosomes
Access Excellence Graphics Gallery
The Why Files Cool Image Gallery
Genetics Animation Gallery
Molecular Expressions Photo Gallery
Gene Maps
1999 Online Gene Map from NCBI.
Clickable 1996 Gene Map from Science magazine.
You can click on any one of the 24 different human
chromosomes and see examples of genes found.
Chromosome Maps
human chromosome 16
human chromosome 19
U.S. Government Image Galleries

Argonne National Laboratory Photo Gallery
Brookhaven National Laboratory Image Library
Fermi National Laboratory Photo Database
Jefferson Laboratory Picture Exchange
Lawrence Berkeley National Laboratory Image Gallery
Lawrence Livermore National Laboratory Image Gallery
Los Alamos National Laboratory Photo Gallery
National Human Genome Research Institute Image Gallery
National Renewable Energy Laboratory Photo Library
Oak Ridge National Laboratory Image Gallery
Pacific Northwest National Laboratory Photo Gallery
Stanford Linear Accelerator Center Photo Archives
Sandia National Laboratory Photo Gallery
U.S. DOE Image Gallery
Acknowledgements
WEHI Bioinformatics group
Tim Beissbarth
Alex Gout
Terry Speed
All the others in Bioinformatics who provide a
great environment to work in and with.
Grant Morahan
WEHI ITS - who provide the best infrastructure
of anywhere I know of.
"Year by year we are becoming better equipped to
accomplish the things we are striving for.
But what are we actually striving for?"
- Bertrand de Jouvenel, 1903-1987
"Success is the ability to go
from failure to failure without
losing your enthusiasm."
- Winston Churchill, 1874-1965
