
BioJava Core API

Java for Bioinformatics?


 Cross-platform: develop on one
platform, deploy on any.
 Widely accepted industry standard.
 Lots of support libraries for modern
technologies (XML, WebServices,
JDBC).
 Scales well from small programs to
industrial-strength, enterprise-sized ones.
Java for Bioinformatics?
 Object Oriented.
 Rapid development due to
 Very strict types
 Simple clear syntax
 Exception handling and recovery
 Cross platform
 Extensive class library
 Code reuse
What is BioJava?
 A collection of Java objects that
represent and manipulate biological
data
 Not a program, rather a programming
library
 Open source (LGPL) open for all
development, even commercial. Not
‘sticky’ or ‘viral’.
What is BioJava?
 Collection of objects to assist
bioinformatics research
 Started at EBI/Sanger in 1998 by
Matthew Pocock and Thomas Down
 25+ developers have contributed (5
core)
What is BioJava?
 BioJava has acquired 1100+ classes,
130,000+ lines of code.
 Uses CVS version control, JUnit
testing and ANT builds.
 It now has a fairly stable API.
 76 packages!
Where is BioJava
 Home Page
 www.biojava.org
 BioJava in Anger
 http://www.biojava.org/docs/bj_in_anger/
 Mailing Lists
 biojava-l@biojava.org
 biojava-dev@biojava.org
 Nightly Builds
 http://www.derkholm.net/autobuild/
Obtaining BioJava
 Download
 http://www.biojava.org/download/
 Get binaries, source and docs
 biojava-live (requires cvs)
 cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biojava login
 Password is ‘cvs’
 cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/biojava checkout biojava-live
 cvs update -Pd
Compiling biojava-live
 Requires the ANT build tool
 http://jakarta.apache.org/ant/
 The ANT tool will use build.xml to
 Arrange source code
 Compile source
 Make jar file
 Make Java docs
 Build demos
 Build and Run tests
 Change to biojava-live; type ant
 Unit testing requires JUnit
 http://junit.sourceforge.net/
Setting up BioJava
 Put the following JAR files on your class
path:
 biojava.jar
 bytecode-0.92.jar
 commons-cli.jar
 commons-collections-2.1.jar
 commons-dbcp-1.1.jar
 commons-pool-1.1.jar
Object-Oriented Patterns and
BioJava Design
BioJava Design
 Uses some reasonably “advanced”
concepts
 Design by Interface
 Protected or Private constructors
 Factory classes and Methods
 Flyweight/ Singleton objects
Interfaces Hide
Implementation
 In BioJava there are several
implementations of the Distribution
interface.
 Any can be legally returned by a method
that returns a Distribution (the returning
method may even return different ones
depending on the situation).
 Any can be legally used as an argument to a
method that requires a Distribution.
 All are guaranteed to contain a minimal set
of common methods.
Flyweight and Singleton
Objects
 A Singleton is a class with only one instance
and only one access point.
 A Singleton will need a Private constructor
and may be static (e.g. AlphabetManager).
 A Flyweight object uses sharing to support
large numbers of fine-grained objects
efficiently.
 For example in BioJava there is only ever
one instance of the DNA Symbol “A”. A
sequence of A’s is really just a list of
pointers to that one object.
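The flyweight idea above can be sketched in plain Java. This is an illustration only, not BioJava's implementation: the class `FlyweightSketch` and its nested `Base` type are hypothetical names introduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch; Base is a hypothetical stand-in, not BioJava's Symbol.
final class FlyweightSketch {

    static final class Base {
        private static final Map<Character, Base> POOL = new HashMap<>();
        private final char token;

        // Private constructor: no caller can create extra instances.
        private Base(char token) {
            this.token = token;
        }

        // Single access point: always returns the shared instance for a token.
        static synchronized Base of(char token) {
            return POOL.computeIfAbsent(Character.toLowerCase(token), c -> new Base(c));
        }

        char token() {
            return token;
        }
    }

    // A "sequence" is just an array of references to the shared objects.
    static Base[] parse(String residues) {
        Base[] seq = new Base[residues.length()];
        for (int i = 0; i < residues.length(); i++) {
            seq[i] = Base.of(residues.charAt(i));
        }
        return seq;
    }

    public static void main(String[] args) {
        Base[] seq = parse("aaca");
        System.out.println(seq[0] == seq[1]); // true: both point at the one 'a'
        System.out.println(seq[0] == seq[2]); // false: 'a' and 'c' differ
    }
}
```

Because every 'a' resolves to the same object, identity comparison with `==` is enough, which is the property the later "Symbols are Canonical" slide relies on.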
Factory and Static methods
 Sometimes it is useful to prevent a user
from directly constructing an object via a
constructor.
 If the construction is complex.
 If the choice of the optimal implementation is
best left to the API developer.
 If important resources are best protected from
end users e.g. Singletons/ Flyweights.
 Rather than instantiating the object via its
constructor a static method or Factory object
is used
Examples
 Static method:
 FiniteAlphabet dna = DNATools.getDNA();

 Static field:
 DistributionFactory df = DistributionFactory.DEFAULT;

 Factory method:
 Distribution d = df.createDistribution(dna);
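A minimal sketch of how an interface, private constructors and a factory method fit together, in plain Java. All names here (`FactorySketch`, `Weights`, `Uniform`, `Table`, `create`) are hypothetical and are not part of BioJava.

```java
// Plain-Java sketch of "design by interface" plus a factory method.
final class FactorySketch {

    // Callers only ever see this interface, never the implementations.
    interface Weights {
        double weightOf(char symbol);
    }

    // Implementation 1: every symbol of the alphabet is equally likely.
    private static final class Uniform implements Weights {
        private final String alphabet;

        private Uniform(String alphabet) {
            this.alphabet = alphabet;
        }

        public double weightOf(char symbol) {
            return alphabet.indexOf(symbol) >= 0 ? 1.0 / alphabet.length() : 0.0;
        }
    }

    // Implementation 2: an explicit weight per symbol.
    private static final class Table implements Weights {
        private final String alphabet;
        private final double[] weights;

        private Table(String alphabet, double[] weights) {
            this.alphabet = alphabet;
            this.weights = weights.clone();
        }

        public double weightOf(char symbol) {
            int i = alphabet.indexOf(symbol);
            return i >= 0 ? weights[i] : 0.0;
        }
    }

    // Factory method: the API developer, not the caller, picks the implementation.
    static Weights create(String alphabet, double[] weights) {
        if (weights == null) {
            return new Uniform(alphabet);
        }
        if (weights.length != alphabet.length()) {
            throw new IllegalArgumentException("one weight per symbol is required");
        }
        return new Table(alphabet, weights);
    }

    public static void main(String[] args) {
        Weights uniform = create("acgt", null);
        Weights biased = create("acgt", new double[] {0.3, 0.2, 0.2, 0.3});
        System.out.println(uniform.weightOf('a'));
        System.out.println(biased.weightOf('t'));
    }
}
```

Either implementation can be returned wherever a `Weights` is expected, so the choice can change later without touching calling code.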
Two Levels of BioJava
 Macro type programming
 Tools classes (SeqIOTools,
DistributionTools etc).
 Static methods for common tasks.
 Full programming
 Lots of customizations and ‘plug and
play’ possible.
 More exposure to the sharp edges of the
API. Less documentation.
Alphabets, Symbols and
Sequences
Symbols
 In BioJava the DNA residue “A” is an
object.
 In Bioperl “A” would be a String.
 The “A” object is one element of a
sequence, not a sequence itself.
 “A” from DNA is not equal to “A” from
RNA or “A” from Protein.
Why not Strings?
 DNA A != RNA A != Protein A
 For Strings “A”.equals(“A”);
 DNA Alphabet also contains
K,Y,W,S,R,M,B,D,H,V,N
Why not Strings?
 Object Y contains C and T. The String “Y”
doesn’t contain anything.
 Translation HashMaps with Strings are
flawed.
 BioJava GGN translates to GLY
 String GGN maps to null
 A fully redundant String to String HashMap
translation table requires 4096 keys!
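A small plain-Java sketch of the difference between an ambiguity code that "contains" bases and a bare String. The map below uses the standard IUPAC nucleotide codes; the class name `AmbiguitySketch` is hypothetical. The figure of 4096 keys equals 16 x 16 x 16; reading the 16 as the 15 IUPAC codes plus a gap symbol is an assumption made here, not something the slide states.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch; not BioJava code.
final class AmbiguitySketch {

    // Each IUPAC nucleotide code mapped to the concrete bases it stands for.
    static final Map<Character, String> IUPAC = new LinkedHashMap<>();
    static {
        IUPAC.put('a', "a");
        IUPAC.put('c', "c");
        IUPAC.put('g', "g");
        IUPAC.put('t', "t");
        IUPAC.put('r', "ag");
        IUPAC.put('y', "ct");
        IUPAC.put('s', "cg");
        IUPAC.put('w', "at");
        IUPAC.put('k', "gt");
        IUPAC.put('m', "ac");
        IUPAC.put('b', "cgt");
        IUPAC.put('d', "agt");
        IUPAC.put('h', "act");
        IUPAC.put('v', "acg");
        IUPAC.put('n', "acgt");
    }

    // True if the (possibly ambiguous) code covers the given concrete base.
    static boolean matches(char code, char base) {
        String bases = IUPAC.get(code);
        return bases != null && bases.indexOf(base) >= 0;
    }

    // Number of distinct triplets over an alphabet of the given size.
    static int tripletCount(int alphabetSize) {
        return alphabetSize * alphabetSize * alphabetSize;
    }

    public static void main(String[] args) {
        System.out.println(matches('y', 'c'));                // true
        System.out.println(matches('y', 'a'));                // false
        System.out.println(tripletCount(IUPAC.size() + 1));   // 4096 (15 codes + gap, assumed)
    }
}
```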
Symbols are Canonical
 DNATools.a() == DNATools.a();
 There is only one instance of ‘a’
 DNATools.a().equals(DNATools.a());
 ProteinTools.a() != DNATools.a();
 Even on remote JVMs!
 During serialization Alphabet indexing is
transient and ‘reconnected’ via
readResolve() methods.
Alphabets
 A set of Symbols
 Alphabets can be infinite
 DoubleAlphabet, IntegerAlphabet
 Some Alphabets have a Finite number
of Symbols
 DNA, RNA etc
 Alphabet and FiniteAlphabet interfaces
org.biojava.bio.Alphabet
boolean contains(Symbol s)
Returns whether or not this Alphabet contains the symbol.
List getAlphabets()
Return an ordered List of the alphabets which make up a compound alphabet.
Symbol getAmbiguity(java.util.Set syms)
Get a symbol that represents the set of symbols in syms.
Symbol getGapSymbol()
Get the 'gap' ambiguity symbol that is most appropriate for this alphabet
String getName()
Get the name of the alphabet.
Symbol getSymbol(java.util.List rl)
Get a symbol from the Alphabet which corresponds to the specified ordered
list of symbols.
SymbolTokenization getTokenization(java.lang.String name)
Get a SymbolTokenization by name.
void validate(Symbol s)
Throws a precanned IllegalSymbolException if the symbol is not contained
within this Alphabet.
org.biojava.bio.FiniteAlphabet
 In addition to the previous methods

void addSymbol(Symbol s)
Adds a symbol to this Alphabet
Iterator iterator()
Retrieve an Iterator over the Symbols in this
Alphabet.
void removeSymbol(Symbol s)
Remove a symbol from this alphabet.
int size()
The number of symbols in the alphabet.
The Default Alphabets
 DNA (a,c,g,t)
 RNA (a,c,g,u)
 PROTEIN (all amino acids including ‘Sel’)
 PROTEIN-TERM (all PROTEIN plus “*”)
 STRUCTURE (PDB structure symbols)
 Alphabet of all integers (Infinite Alphabet)
 Can generate SubIntegerAlphabets
 Alphabet of all doubles (Infinite Alphabet)
Getting the common Alphabets
import org.biojava.bio.symbol.*;
import java.util.*;
import org.biojava.bio.seq.*;

public class AlphabetExample {
  public static void main(String[] args) {
    Alphabet dna, rna, prot;

    //get the DNA alphabet by name
    dna = AlphabetManager.alphabetForName("DNA");
    //get the RNA alphabet by name
    rna = AlphabetManager.alphabetForName("RNA");
    //get the Protein alphabet by name
    prot = AlphabetManager.alphabetForName("PROTEIN");
    //get the protein alphabet that includes the * termination Symbol
    prot = AlphabetManager.alphabetForName("PROTEIN-TERM");

    //get those same Alphabets from the Tools classes
    dna = DNATools.getDNA();
    rna = RNATools.getRNA();
    prot = ProteinTools.getAlphabet();
    //or the one with the * symbol
    prot = ProteinTools.getTAlphabet();
  }
}
SymbolLists are made of
Symbols
 org.biojava.bio.symbol.SymbolList
 A sequence of Symbols from the same
Alphabet.
 Uses biological coordinates from 1 to
length
 cf String from 0 to length-1
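The coordinate difference is an easy source of off-by-one errors, so here is a plain-Java sketch of the conversion. `OneBasedSketch` is a hypothetical wrapper around a String; its method names mirror `symbolAt` and `subStr` from the SymbolList interface, but the class is not BioJava code.

```java
// Plain-Java sketch of 1-based, inclusive "biological" coordinates.
final class OneBasedSketch {
    private final String residues;

    OneBasedSketch(String residues) {
        this.residues = residues;
    }

    int length() {
        return residues.length();
    }

    // Valid indices run from 1 to length(), not 0 to length() - 1.
    char symbolAt(int index) {
        if (index < 1 || index > residues.length()) {
            throw new IndexOutOfBoundsException("index must be in 1.." + residues.length());
        }
        return residues.charAt(index - 1);
    }

    // Both start and end are inclusive, unlike String.substring.
    String subStr(int start, int end) {
        if (start < 1 || end > residues.length() || start > end) {
            throw new IndexOutOfBoundsException("bad range " + start + ".." + end);
        }
        return residues.substring(start - 1, end);
    }

    public static void main(String[] args) {
        OneBasedSketch seq = new OneBasedSketch("atgc");
        System.out.println(seq.symbolAt(1));  // a
        System.out.println(seq.subStr(2, 3)); // tg
    }
}
```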
Doesn’t this waste memory?
 A SymbolList is not really a List of Symbol
Objects.
 Rather a List of Object references.
 Still a bit heavier than a char[] but not
serious.

Figure: the four DNA Symbol objects A, C, G and T, and the sequence AACGTGGGTTCCAACT that refers to them.
The Bigger Picture

Figure: AlphabetManager, the “DNA” and “Protein” alphabets, the DNA Symbols A, C, G and T, and the sequence AACGTGGGTTCCAACT.
The SymbolList interface
void edit(Edit edit)
Apply an edit to the SymbolList as specified by the edit object.
Alphabet getAlphabet()
The alphabet that this SymbolList is over.
Iterator iterator()
An Iterator over all Symbols in this SymbolList.
int length()
The number of symbols in this SymbolList.
String seqString()
Stringify this symbol list.
SymbolList subList(int start, int end)
Return a new SymbolList for the symbols start to end inclusive.
String subStr(int start, int end)
Return a region of this symbol list as a String.
Symbol symbolAt(int index)
Return the symbol at index, counting from 1.
List toList()
Returns a List of symbols.
String to SymbolList
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class StringToSymbolList {
  public static void main(String[] args) {
    try {
      //create a DNA SymbolList from a String
      SymbolList dna = DNATools.createDNA("atcggtcggctta");
      //create a RNA SymbolList from a String
      SymbolList rna = RNATools.createRNA("auugccuacauaggc");
      //create a Protein SymbolList from a String
      SymbolList aa = ProteinTools.createProtein("AGFAVENDSA");
    }
    catch (IllegalSymbolException ex) {
      //this will happen if you use a character in one of your strings that is
      //not an accepted IUB Character for that Symbol.
      ex.printStackTrace();
    }
  }
}
SymbolList to String
import org.biojava.bio.symbol.*;

public class SymbolListToString {
  public static void main(String[] args) {
    SymbolList sl = null;

    //code here to instantiate sl

    //convert sl into a String
    String s = sl.seqString();
  }
}
The Sequence Interface
 A Sequence is a SymbolList with more
information.
 In addition to Annotatable and SymbolList:
String getName()
The name of this sequence.
String getURN()
A Uniform Resource Identifier (URI) which identifies the sequence
represented by this object.
 Also implements FeatureHolder which
allows addition of Feature Objects.
Quickly generate a Sequence
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class StringToSequence {
  public static void main(String[] args) {
    try {
      //create a DNA sequence with the name dna_1
      Sequence dna = DNATools.createDNASequence("atgctg", "dna_1");
      //create an RNA sequence with the name rna_1
      Sequence rna = RNATools.createRNASequence("augcug", "rna_1");
      //create a Protein sequence with the name prot_1
      Sequence prot = ProteinTools.createProteinSequence("AFHS", "prot_1");
    }
    catch (IllegalSymbolException ex) {
      //an exception is thrown if you use a non IUB symbol
      ex.printStackTrace();
    }
  }
}
More Complex Symbols and
Alphabets
Ambiguity Symbols
 Ambiguous or Fuzzy data is a fact of
life, especially with sequencing.
 DNA traces can contain symbols such
as n, r, w, v, h, k, y etc.
 In BioJava DNA symbols a, c, g, t are
AtomicSymbols.
 Ambiguous symbols like y are
BasisSymbols.
BasisSymbols
 A BasisSymbol may be represented as
a list of one or more Symbols.
 BasisSymbol extends Symbol.
 Ambiguity Symbols are always
BasisSymbols
 getSymbols() The list of symbols that
this symbol is composed from.
AtomicSymbols
 AtomicSymbols are not ambiguous.
 They cannot be further divided into
Symbols that are valid members of the
parent Alphabet.
 In the case of compound Alphabets
they can be divided into valid Symbols
from component Alphabets.
AtomicSymbols
 The AtomicSymbol interface extends
BasisSymbol but adds no new
methods only behaviour contracts.
 AtomicSymbol instances guarantee
that getMatches() returns an Alphabet
containing just that Symbol and each
element of the List returned by
getSymbols() is also atomic.
Atomic and Basis

Figure: AlphabetManager and the “DNA” alphabet. W is a BasisSymbol; A and T are AtomicSymbols. Example sequence: AATW.
Translating Ambiguity
 BioJava handles translation of
ambiguity very smoothly.
 DNA ‘n’ = [a,c,g,t]
 Transcribes to RNA ‘n’ [a,c,g,u]
 ggn translates to Gly
 agn translates to [Ser, Arg]
 Most protein ambiguities have no
‘token’ and are printed as ‘X’
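The two translation results above can be reproduced with a plain-Java sketch that expands an ambiguous codon into every concrete codon and collects the amino acids. This is not how BioJava implements it; `AmbiguousTranslationSketch` is a hypothetical class, it only understands the code n, and its table deliberately holds just the eight codons needed for the two examples.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Plain-Java sketch; the codon table is deliberately partial.
final class AmbiguousTranslationSketch {

    static final Map<String, String> TABLE = new HashMap<>();
    static {
        for (String codon : new String[] {"gga", "ggc", "ggg", "ggt"}) {
            TABLE.put(codon, "Gly");
        }
        TABLE.put("aga", "Arg");
        TABLE.put("agg", "Arg");
        TABLE.put("agc", "Ser");
        TABLE.put("agt", "Ser");
    }

    // Only 'n' is treated as ambiguous here; every other char stands for itself.
    static String expand(char code) {
        return code == 'n' ? "acgt" : String.valueOf(code);
    }

    // Translate a codon that may contain 'n' into the set of possible residues.
    static SortedSet<String> translate(String codon) {
        if (codon.length() != 3) {
            throw new IllegalArgumentException("a codon has three bases");
        }
        SortedSet<String> residues = new TreeSet<>();
        for (char a : expand(codon.charAt(0)).toCharArray()) {
            for (char b : expand(codon.charAt(1)).toCharArray()) {
                for (char c : expand(codon.charAt(2)).toCharArray()) {
                    String concrete = "" + a + b + c;
                    String residue = TABLE.get(concrete);
                    if (residue == null) {
                        throw new IllegalArgumentException("not in the partial table: " + concrete);
                    }
                    residues.add(residue);
                }
            }
        }
        return residues;
    }

    public static void main(String[] args) {
        System.out.println(translate("ggn")); // [Gly]
        System.out.println(translate("agn")); // [Arg, Ser]
    }
}
```

A single-element result is an unambiguous residue; a larger set is a protein ambiguity, which is the case the slide says is printed as ‘X’.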
CrossProduct Alphabets
 A CrossProductAlphabet is a combination of
two or more Alphabets.
 Any type of CrossProductAlphabet is
possible
 Dimers (DNA x DNA)
 Codon (DNA x DNA x DNA)
 Conditional ((DNA x DNA) x DNA)
 Mixed ((DNA x DNA x DNA) x PROTEIN)
Finite and Compound Alphas

Figure: in the (DNA x DNA x DNA) alphabet, GNG is a BasisSymbol while ACA and GTG are AtomicSymbols; A, C, G and T are DNA AtomicSymbols. Example sequence: [AAC][GTG]GGTTCCAACT.
What are they good for?
 Codon Symbols (DNA x DNA x DNA).
 Many analysis Classes such as Count
and Distribution use Symbol as an
argument. A hexamer can be an
AtomicSymbol.
 Phred is DNA x Integer
 1st and Higher order Markov Models
use CrossProductAlphabets.
How do I make a
CrossProductAlphabet?
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class CrossProduct {
  public static void main(String[] args) {
    //make a CrossProductAlphabet from a List
    List l = Collections.nCopies(3, DNATools.getDNA());
    Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

    //get the same Alphabet by name
    Alphabet codon2 =
        AlphabetManager.generateCrossProductAlphaFromName(
            "(DNA x DNA x DNA)"
        );

    //show that the two Alphabets are canonical
    System.out.println(codon == codon2);
  }
}
Making Triplet Views on a
SymbolList
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class CodonView {
  public static void main(String[] args) {
    try {
      //make a DNA SymbolList
      SymbolList dna = DNATools.createDNA("atgcccgcgtaa");
      System.out.println("Length of dna " + dna.length());

      //get a Codon View (window size of three)
      SymbolList codons = SymbolListViews.windowedSymbolList(dna, 3);
      System.out.println("Length of codons " + codons.length());

      //get a Triplet View
      SymbolList triplets = SymbolListViews.orderNSymbolList(dna, 3);
      System.out.println("Length of triplets " + triplets.length());
    }
    catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
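The two kinds of view differ in whether the windows overlap. The plain-Java sketch below illustrates that difference on Strings, assuming the windowed view steps by the window size and the order-N view steps by one. `WindowSketch` and its methods are hypothetical names, not the BioJava implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of non-overlapping versus overlapping windows.
final class WindowSketch {

    // Non-overlapping windows: step by the window size (codon-style view).
    static List<String> windowed(String s, int size) {
        if (s.length() % size != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + size);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i += size) {
            out.add(s.substring(i, i + size));
        }
        return out;
    }

    // Overlapping windows: step by one (order-N style view).
    static List<String> overlapping(String s, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= s.length(); i++) {
            out.add(s.substring(i, i + size));
        }
        return out;
    }

    public static void main(String[] args) {
        String dna = "atgcccgcgtaa";
        System.out.println(windowed(dna, 3));           // [atg, ccc, gcg, taa]
        System.out.println(overlapping(dna, 3).size()); // 10
    }
}
```

For a sequence of length 12 and a window of 3 this gives 12 / 3 = 4 non-overlapping windows and 12 - 3 + 1 = 10 overlapping ones.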
Getting a Symbol for a Codon
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class MakeATG {
  public static void main(String[] args) {
    //make a CrossProductAlphabet from a List
    List l = Collections.nCopies(3, DNATools.getDNA());
    Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);

    //get the codon made of atg
    List syms = new ArrayList(3);
    syms.add(DNATools.a());
    syms.add(DNATools.t());
    syms.add(DNATools.g());

    Symbol atg = null;
    try {
      atg = codon.getSymbol(syms);
    }
    catch (IllegalSymbolException ex) {
      //used Symbol from Alphabet that is not a component of codon
      ex.printStackTrace();
    }
    System.out.println("Name of atg: " + atg.getName());
  }
}
Breaking a Codon into its
Parts
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class BreakingComponents {
  public static void main(String[] args) {
    //make the 'codon' alphabet
    List l = Collections.nCopies(3, DNATools.getDNA());
    Alphabet alpha = AlphabetManager.getCrossProductAlphabet(l);

    //get the first symbol in the alphabet
    Iterator iter = ((FiniteAlphabet)alpha).iterator();
    AtomicSymbol codon = (AtomicSymbol)iter.next();
    System.out.print(codon.getName() + " is made of: ");

    //break it into a list of its components
    List symbols = codon.getSymbols();
    for (int i = 0; i < symbols.size(); i++) {
      if (i != 0)
        System.out.print(", ");
      Symbol sym = (Symbol)symbols.get(i);
      System.out.print(sym.getName());
    }
  }
}
Basic Sequence Operations
Getting a section of a
SymbolList
 symbolAt(int i)
 Returns a Symbol
 subList(int min, int max)
 Returns a SymbolList
 subStr(int min, int max)
 Returns the subsection tokenized to a
String
Transcription
 In BioJava DNA sequences and RNA sequences are from
different Alphabets. To convert between them:

//make a DNA SymbolList
SymbolList dna = DNATools.createDNA("atgccgaatcgtaa");

//convert it to RNA
SymbolList rna = DNATools.toRNA(dna);

//just to prove it worked
System.out.println(rna.seqString()); //augccgaaucguaa

//biological transcription (i.e. copy and reverse strand)
rna = DNATools.transcribeToRNA(dna); //5’ atgccgaatcgtaa 3’
System.out.println(rna.seqString()); //5’ uuacgauucggcau 3’
Reverse Complement
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class ReverseComplement {
  public static void main(String[] args) throws Exception {
    SymbolList forward = DNATools.createDNA("atcgctagcgatcg");

    //two step
    SymbolList reverse = SymbolListViews.reverse(forward);
    SymbolList revc1 = DNATools.complement(reverse);

    //one step
    SymbolList revc2 = DNATools.reverseComplement(forward);

    //test for equivalence
    System.out.println(revc1.equals(revc2));
  }
}
Translation
 RNATools contains the “Universal”
RNA to Protein TranslationTable.

 Standard procedure is to transcribe DNA
to RNA and then translate.
Translation Example
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;

public class Translate {
  public static void main(String[] args) {
    try {
      //create a DNA SymbolList
      SymbolList symL = DNATools.createDNA("atggccattgaatga");
      //transcribe to RNA
      symL = DNATools.toRNA(symL);
      //translate to protein
      symL = RNATools.translate(symL);
      //prove that it worked
      System.out.println(symL.seqString());
    }
    catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
Sequence I/O
Don’t ever write another
Parser
 If you can avoid it!
 BioJava supports
 Genbank, GenPept, RefSeq, EMBL, SwissProt, PDB,
Fasta, ABI, LocusLink, Unigene (requires Java 1.4)
 GAME, AGAVE
 Blast, Fasta, HMMER (models and results), BlastXML,
MEME, Phred
 OBDA, BioIndex, BioSQL, DAS, GFF, XFF
 Ensembl (with biojava-ensembl package)
 StAX/ Tag value
 RMI and Serialization
Simple I/O
 Most of BioJava’s simpler I/O operations are
conveniently wrapped up behind static
methods from the SeqIOTools class.
 SeqIOTools can read and write:
 Fasta (protein or DNA)
 EMBL
 GenBank (flat file and XML)
 SwissProt
 GenPept
 MSF (protein or DNA)
 Fasta Alignments
SeqIOTools Reader Methods
SequenceIterator i = SeqIOTools.readGenbank(br);
SequenceIterator i = SeqIOTools.readGenpept(br);
SequenceIterator i = SeqIOTools.readSwissprot(br);
SequenceIterator i = SeqIOTools.readEmbl(br);
etc…

SequenceIterator i = (SequenceIterator)
SeqIOTools.fileToBiojava("fasta", "dna", br);

Alignment a =
(Alignment) SeqIOTools.fileToBiojava("MSF", "rna", br);
Features, Locations,
Annotations
Features and Annotations
 Sequence data often comes with
added information about the various
properties of the sequence (Genbank,
SwissProt etc).
 BioJava divides this information into
global properties (Annotations) and
Localized properties (Features).
Annotatable
 Annotatable is a “mix-in” interface
that indicates the implementing object
contains an Annotation object.
 It defines one method.
 Annotation getAnnotation();
Annotations
 org.biojava.bio.Annotation
 Annotations are used for Global properties.
 Species, Accession Number, xrefs, date,
publication.
 Key – value maps.
 Key and Value are objects but almost always are
Strings.
 Annotation.EMPTY_ANNOTATION
 static convenience class
 good place holder, avoids null pointer exceptions
 immutable
Annotation API
Map asMap()
Return a map that contains the same key/values as this
Annotation.
boolean containsProperty(java.lang.Object key)
Returns whether the property is defined.
Object getProperty(java.lang.Object key)
Retrieve the value of a property by key.
Set keys()
Get a set of key objects.
void removeProperty(java.lang.Object key)
Delete a property
void setProperty(java.lang.Object key, java.lang.Object value)
Set the value of a property.
FeatureHolder
 FeatureHolder is another “mix-in”
interface which allows the
implementing object to hold Features.
 Sequence implements FeatureHolder.
 Features are created by
FeatureHolders.
 FeatureHolders can be filtered.
FeatureHolder methods
boolean containsFeature(Feature f)
Check if the feature is present in this holder.
int countFeatures()
Count how many features are contained.
Feature createFeature(Feature.Template ft)
Create a new Feature, and add it to this FeatureHolder.
Iterator features()
Iterate over the features in no well defined order.
FeatureHolder filter(FeatureFilter filter)
Query this set of features using a supplied FeatureFilter.
FeatureHolder filter(FeatureFilter fc, boolean recurse)
Return a new FeatureHolder that contains all of the children of this one that
passed the filter fc.
FeatureFilter getSchema()
Return a schema-filter for this FeatureHolder.
void removeFeature(Feature f)
Remove a feature from this FeatureHolder.
Features are Annotatable
 Features implement Annotatable
 Can hold an annotation
 Global annotations of a Feature
 /note:
 /db_xref: etc
Features may be nested
 Features implement FeatureHolder!
 Therefore Features may hold nested
Features
 c.f. The AWT Menu is a MenuItem
 e.g. A gene has exons and introns
 Filtering can be recursive
 A Feature cannot hold itself (directly or
indirectly)
Location API
 Locations are objects that specify a minimum and
maximum bound on a region of sequence.
 Contains some useful methods, particularly
getMin() and getMax().
 Many methods have been deprecated and are now
delegated to LocationTools.
 LocationTools is the best place to get new
instances of a Location.
 PointLocation, RangeLocation, CircularLocation,
CompoundLocation.
LocationTools
static boolean areEqual(Location locA, Location locB)
Return whether two locations are equal.
static boolean contains(Location locA, Location locB)
Return true iff all indices in locB are also contained by locA.
static Location flip(Location loc, int len)
Flips a location relative to a length.
static Location intersection(Location locA, Location locB)
Return the intersection of two locations.
static CircularLocation makeCircularLocation(int min, int max, int seqLength)
A simple method to generate a RangeLocation wrapped in a CircularLocation
static Location makeLocation(int min, int max)
Return a contiguous Location from min to max.
static boolean overlaps(Location locA, Location locB)
Determines whether the locations overlap or not.
static Location subtract(Location x, Location y)
Subtract one location from another.
static Location union(java.util.Collection locs)
The n-way union of a Collection of locations.
static Location union(Location locA, Location locB)
Return the union of two locations.
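To make the semantics of these operations concrete, here is a plain-Java sketch of a contiguous, 1-based, inclusive range with `contains`, `overlaps` and `intersection`. `RangeSketch` is a hypothetical class written for this illustration; it covers only the contiguous case and is not BioJava's Location.

```java
// Plain-Java sketch of a contiguous location with inclusive bounds.
final class RangeSketch {
    final int min;
    final int max;

    RangeSketch(int min, int max) {
        if (min < 1 || min > max) {
            throw new IllegalArgumentException("need 1 <= min <= max");
        }
        this.min = min;
        this.max = max;
    }

    // True if every index of other also lies inside this range.
    boolean contains(RangeSketch other) {
        return min <= other.min && other.max <= max;
    }

    // True if the two ranges share at least one index.
    boolean overlaps(RangeSketch other) {
        return min <= other.max && other.min <= max;
    }

    // The shared region, or null when the ranges do not overlap.
    RangeSketch intersection(RangeSketch other) {
        if (!overlaps(other)) {
            return null;
        }
        return new RangeSketch(Math.max(min, other.min), Math.min(max, other.max));
    }

    // The residues covered by this range, using 1-based coordinates.
    String symbols(String sequence) {
        return sequence.substring(min - 1, max);
    }

    public static void main(String[] args) {
        RangeSketch a = new RangeSketch(3, 8);
        RangeSketch b = new RangeSketch(6, 12);
        RangeSketch shared = a.intersection(b);
        System.out.println(shared.min + ".." + shared.max);    // 6..8
        System.out.println(a.symbols("gcagcuaggcggaaggagc")); // agcuag
    }
}
```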
Location Example
 import org.biojava.bio.symbol.*;
 import org.biojava.bio.seq.*;
 
 public class SpecifyRange {
   public static void main(String[] args) {
     try {
       //make a RangeLocation specifying the residues 3-8
       Location loc = LocationTools.makeLocation(3,8);
       //print the location
       System.out.println("Location: "+loc.toString());
 
       //make a SymbolList
       SymbolList sl = RNATools.createRNA("gcagcuaggcggaaggagc");
       System.out.println("SymbolList: "+sl.seqString());
 
       //get the SymbolList specified by the Location
       SymbolList sym = loc.symbols(sl);
       System.out.println("Symbols specified by Location: "+sym.seqString());
     }
     catch (IllegalSymbolException ex) {
       //illegal symbol used to make sl
       ex.printStackTrace();
     }
   }
 }
Filtering Features
 FeatureHolders have a filter method that
accepts a FeatureFilter as an argument.
 Features that are accepted by the
FeatureFilter are returned as a new
FeatureHolder.
 Filtering may be done recursively so that
nested Features are subjected to the same
FeatureFilter .
FeatureFilters
 FeatureFilter is an interface that specifies
one method.
 boolean accept(Feature f)
 There are 26 implementations of
FeatureFilter in BioJava available as inner
classes of the FeatureFilter interface.
 Most commonly used are ByType,
BySource, StrandFilter, OverlapsLocation,
ContainedByLocation.
 Also boolean logic filters: And, Or, Not
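A plain-Java sketch of how a one-method filter interface composes with And and Not. The `Feature` and `Filter` types below are hypothetical stand-ins defined inside the sketch; only the `accept` method name is taken from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of composable feature filters; not BioJava code.
final class FilterSketch {

    // Minimal stand-in for a feature: a type plus an inclusive location.
    static final class Feature {
        final String type;
        final int min;
        final int max;

        Feature(String type, int min, int max) {
            this.type = type;
            this.min = min;
            this.max = max;
        }
    }

    interface Filter {
        boolean accept(Feature f);
    }

    static Filter byType(String type) {
        return f -> f.type.equals(type);
    }

    static Filter containedBy(int min, int max) {
        return f -> min <= f.min && f.max <= max;
    }

    // Boolean logic filters wrap other filters.
    static Filter and(Filter a, Filter b) {
        return f -> a.accept(f) && b.accept(f);
    }

    static Filter not(Filter a) {
        return f -> !a.accept(f);
    }

    // Return the accepted features as a new list; the input is untouched.
    static List<Feature> filter(List<Feature> features, Filter filter) {
        List<Feature> accepted = new ArrayList<>();
        for (Feature f : features) {
            if (filter.accept(f)) {
                accepted.add(f);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Feature> features = Arrays.asList(
                new Feature("exon", 10, 50),
                new Feature("intron", 51, 199),
                new Feature("exon", 200, 300));
        Filter earlyExons = and(byType("exon"), containedBy(1, 100));
        System.out.println(filter(features, earlyExons).size()); // 1
    }
}
```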
Analysis and Distributions
Distributions and Counts
 The Distribution and Count interfaces
are from the org.biojava.bio.dist
package.
 Counts are maps from AtomicSymbols
to counts.
 Distributions are maps from Symbols
to frequencies.
Distributions
 Distributions are central to analysis
 Map Symbols to Frequencies
 Can be trained or weights can be set
 Used heavily in dp (dynamic programming)
package.
 HMM transitions and emissions
 Many implementations, frequently used are:
 SimpleDistribution
 OrderNDistribution
 UniformDistribution
Distribution API
Alphabet getAlphabet()
The alphabet from which this spectrum emits symbols.
Distribution getNullModel()
Retrieve the null model Distribution that this Distribution recognizes.
double getWeight(Symbol s)
Return the probability that Symbol s is emitted by this spectrum.
void registerWithTrainer(DistributionTrainerContext dtc)
Register this distribution with a training context.
Symbol sampleSymbol()
Sample a symbol from this state's probability distribution.
void setNullModel(Distribution nullDist)
Set the null model Distribution that this Distribution recognizes.
void setWeight(Symbol s, double w)
Set the probability or odds that Symbol s is emitted by this state.
DistributionFactory
 Generally a Distribution is created using a
DistributionFactory.
 The DistributionFactory interface contains a
static field called DEFAULT that holds a
default DistributionFactory implementation.
 DistributionFactory df = DistributionFactory.DEFAULT;
 Distribution d = df.createDistribution(dna.getAlphabet());
Distribution Training
 Distributions can be trained on
observed sequences using a
DistributionTrainerContext.
 One or more Distributions can be
registered with the DTC.
 //register the Distributions with the trainer
dtc.registerDistribution(dnaDist);
DistributionTrainerContext
 A DistributionTrainer is assigned to each
registered Distribution by the DTC.
 If unusual training behaviour is required you
can register your own DistributionTrainer at
the same time.
 The dtc can also add pseudocounts if
needed.
 Ambiguities are automagically handled.
 Counts are split according to the null model.
Training Example
//make a DNA SymbolList
SymbolList dna = DNATools.createDNA("atcgctagcgtyagcntatsggca");

//get a DistributionTrainerContext
DistributionTrainerContext dtc = new SimpleDistributionTrainerContext();

//make the Distribution
Distribution dnaDist =
    DistributionFactory.DEFAULT.createDistribution(dna.getAlphabet());

//register the Distribution with the trainer
dtc.registerDistribution(dnaDist);

for (int j = 1; j <= dna.length(); j++) {
  dtc.addCount(dnaDist, dna.symbolAt(j), 1.0);
}

//train the Distribution
dtc.train();
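What "counts are split" means for ambiguity symbols can be shown with a plain-Java sketch. It assumes a uniform null model, so each ambiguous observation is divided equally among the bases it covers, and it handles only the codes that occur in the example sequence (y, s, n). `TrainingSketch` is a hypothetical class, not the BioJava trainer.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of training with ambiguity, assuming a uniform null model.
final class TrainingSketch {

    // Bases covered by each code; only the codes used in the example are handled.
    static String expand(char code) {
        switch (code) {
            case 'a': case 'c': case 'g': case 't':
                return String.valueOf(code);
            case 'y':
                return "ct";
            case 's':
                return "cg";
            case 'n':
                return "acgt";
            default:
                throw new IllegalArgumentException("unsupported code: " + code);
        }
    }

    // Count every observation, splitting ambiguous ones, then normalise.
    static Map<Character, Double> train(String sequence) {
        if (sequence.isEmpty()) {
            throw new IllegalArgumentException("nothing to train on");
        }
        Map<Character, Double> weights = new TreeMap<>();
        for (char base : "acgt".toCharArray()) {
            weights.put(base, 0.0);
        }
        for (char code : sequence.toCharArray()) {
            String bases = expand(code);
            for (char base : bases.toCharArray()) {
                weights.merge(base, 1.0 / bases.length(), Double::sum);
            }
        }
        double total = sequence.length();
        for (Map.Entry<Character, Double> e : weights.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return weights;
    }

    public static void main(String[] args) {
        // "ay": one full count for a, half a count each for c and t.
        System.out.println(train("ay")); // {a=0.5, c=0.25, g=0.0, t=0.25}
        System.out.println(train("atcgctagcgtyagcntatsggca"));
    }
}
```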
setWeight() Example
FiniteAlphabet a = DNATools.getDNA();
Distribution d =
DistributionFactory.DEFAULT.createDistribution(a);
//set the weight of each symbol
d.setWeight(DNATools.a(),0.3);
d.setWeight(DNATools.c(),0.2);
d.setWeight(DNATools.g(),0.2);
d.setWeight(DNATools.t(),0.3);
DistributionTools
 DistributionTools holds static methods for creating
and manipulating Distributions.
 Tasks include:
 Equal emission spectra?
 Shannon Entropy, information, KL Distance.
 Generate biased sequences.
 Make a Distribution[] from an Alignment (each Distribution
represents one position in an Alignment).
 Average two or more Distributions.
 Randomize a Distribution.
 Make a Distribution from a Count.
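Two of the measures named above have short textbook definitions: Shannon entropy is H(p) = -Σ p_i log2 p_i, and the Kullback-Leibler divergence is D(p‖q) = Σ p_i log2(p_i / q_i). The plain-Java sketch below computes both on arrays of weights. `EntropySketch` is a hypothetical class and does not reproduce the signatures in DistributionTools.

```java
// Plain-Java sketch of Shannon entropy and KL divergence, in bits.
final class EntropySketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // H(p) = -sum p_i * log2(p_i); terms with p_i = 0 contribute nothing.
    static double shannonEntropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h -= pi * log2(pi);
            }
        }
        return h;
    }

    // D(p || q) = sum p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0.
    static double klDivergence(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("distributions differ in size");
        }
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                d += p[i] * log2(p[i] / q[i]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        double[] biased = {0.3, 0.2, 0.2, 0.3}; // weights from the setWeight() example
        System.out.println(shannonEntropy(uniform));
        System.out.println(shannonEntropy(biased));
        System.out.println(klDivergence(biased, uniform));
    }
}
```

A uniform DNA distribution carries the maximum of 2 bits per symbol; any bias lowers the entropy, and the divergence from uniform measures by how much.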
Serialization of Distributions
 Distributions are Serializable
 Write to and Read from Binary
 RMI
 XMLDistributionWriter
 Write any Distribution to a stream in XML format.
 XMLDistributionReader
 SAXParser
 Read any Distribution from an XML stream
XML Output
<?xml version="1.0" ?>
<Distribution type="Distribution">
<alphabet name="DNA" />
<weight sym="adenine" prob="0.32178516910737204" />
<weight sym="cytosine" prob="0.04596199299395364" />
<weight sym="guanine" prob="0.1405504188012911" />
<weight sym="thymine" prob="0.4917024190973832" />
</Distribution>
What Else??
 Dynamic Programming (HMMs)
 Bibliography
 Alignments
 Blast and Fasta parsing
What Else??
 BioSQL support
 GUI components
 Chromatograms
 Molecular Biology (pI, mass, restriction
enzymes)
 Molecular Structure
