Static field:
DistributionFactory df = DistributionFactory.DEFAULT;
Factory method:
Distribution d = df.createDistribution(dna);
Two Levels of BioJava
Macro type programming
Tools classes (SeqIOTools,
DistributionTools etc).
Static methods for common tasks.
Full programming
Lots of customizations and ‘plug and
play’ possible.
More exposure to the sharp edges of the
API. Less documentation.
Alphabets, Symbols and
Sequences
Symbols
In BioJava the DNA residue “A” is an
object.
In Bioperl “A” would be a String.
The “A” object is part of the sequence
not the sequence.
“A” from DNA is not equal to “A” from
RNA or “A” from Protein.
Why not Strings?
DNA A != RNA A != Protein A
For Strings “A”.equals(“A”);
DNA Alphabet also contains
K,Y,W,S,R,M,B,D,H,V,N
Why not Strings?
The Object Y contains C and T; the String “Y”
doesn’t contain anything.
Translation HashMaps with Strings are
flawed.
Biojava GGN translates to GLY
String GGN maps to null
A fully redundant String to String HashMap
translation table requires 4096 keys!
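The 4096 figure matches a 16-character DNA code cubed. A minimal plain-Java sketch of that arithmetic (no BioJava involved), assuming the 16 characters are the four bases, the eleven ambiguity codes and a gap character; the class and method names are illustrative only:

```java
import java.util.HashSet;
import java.util.Set;

class TripletKeyCount {
    // Assumed 16-character code: 4 bases, 11 ambiguity codes, 1 gap.
    static final String CODES = "acgtkywsrmbdhvn-";

    // Enumerate every String triplet that can be built from the given characters.
    static Set<String> allTriplets(String codes) {
        Set<String> keys = new HashSet<>();
        for (char x : codes.toCharArray())
            for (char y : codes.toCharArray())
                for (char z : codes.toCharArray())
                    keys.add("" + x + y + z);
        return keys;
    }

    public static void main(String[] args) {
        // Number of keys a fully redundant String table would need: 16 * 16 * 16
        System.out.println(allTriplets(CODES).size());
        // With the four unambiguous bases only: 4 * 4 * 4
        System.out.println(allTriplets("acgt").size());
    }
}
```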
Symbols are Canonical
DNATools.a() == DNATools.a();
There is only one instance of ‘a’
DNATools.a().equals(DNATools.a());
ProteinTools.a() != DNATools.a();
Even on remote JVMs!
During serialization Alphabet indexing is
transient and ‘reconnected’ via
readResolve() methods.
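The general readResolve() technique can be sketched in plain Java. This is not BioJava's implementation: CanonicalSymbol, its registry and the key format are hypothetical stand-ins that only show how a deserialized copy is swapped for the one canonical instance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a canonical symbol; not the BioJava class.
final class CanonicalSymbol implements Serializable {
    private static final long serialVersionUID = 1L;
    private static final Map<String, CanonicalSymbol> REGISTRY = new HashMap<>();

    static final CanonicalSymbol DNA_A = register("DNA", "adenine");
    static final CanonicalSymbol PROTEIN_A = register("PROTEIN", "alanine");

    private final String key;

    private CanonicalSymbol(String key) {
        this.key = key;
    }

    private static CanonicalSymbol register(String alphabet, String name) {
        CanonicalSymbol s = new CanonicalSymbol(alphabet + ":" + name);
        REGISTRY.put(s.key, s);
        return s;
    }

    // Invoked by Java serialization after reading: the freshly built copy
    // is discarded and the registered instance is returned instead.
    private Object readResolve() {
        return REGISTRY.get(key);
    }

    // Serialize and deserialize in memory, as a remote call would do over the wire.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(DNA_A) == DNA_A);     // same instance after the round trip
        System.out.println(roundTrip(DNA_A) == PROTEIN_A); // still a different symbol
    }
}
```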
Alphabets
A set of Symbols
Alphabets can be infinite
DoubleAlphabet, IntegerAlphabet
Some Alphabets have a Finite number
of Symbols
DNA, RNA etc
Alphabet and FiniteAlphabet interfaces
org.biojava.bio.Alphabet
boolean contains(Symbol s)
Returns whether or not this Alphabet contains the symbol.
List getAlphabets()
Return an ordered List of the alphabets which make up a compound alphabet.
Symbol getAmbiguity(java.util.Set syms)
Get a symbol that represents the set of symbols in syms.
Symbol getGapSymbol()
Get the 'gap' ambiguity symbol that is most appropriate for this alphabet
String getName()
Get the name of the alphabet.
Symbol getSymbol(java.util.List rl)
Get a symbol from the Alphabet which corresponds to the specified ordered
list of symbols.
SymbolTokenization getTokenization(java.lang.String name)
Get a SymbolTokenization by name.
void validate(Symbol s)
Throws a precanned IllegalSymbolException if the symbol is not contained
within this Alphabet.
org.biojava.bio.FiniteAlphabet
In addition to the previous methods
void addSymbol(Symbol s)
Adds a symbol to this Alphabet
Iterator iterator()
Retrieve an Iterator over the Symbols in this
Alphabet.
void removeSymbol(Symbol s)
Remove a symbol from this alphabet.
int size()
The number of symbols in the alphabet.
The Default Alphabets
DNA (a,c,g,t)
RNA (a,c,g,u)
PROTEIN (all amino acids including ‘Sel’)
PROTEIN-TERM (all PROTEIN plus “*”)
STRUCTURE (PDB structure symbols)
Alphabet of all integers (Infinite Alphabet)
Can generate SubIntegerAlphabets
Alphabet of all doubles (Infinite Alphabet)
Getting the common Alphabets
import org.biojava.bio.symbol.*;
import java.util.*;
import org.biojava.bio.seq.*;
public class AlphabetExample {
public static void main(String[] args) {
Alphabet dna, rna, prot;
//get the DNA alphabet by name
dna = AlphabetManager.alphabetForName("DNA");
//get the RNA alphabet by name
rna = AlphabetManager.alphabetForName("RNA");
//get the Protein alphabet by name
prot = AlphabetManager.alphabetForName("PROTEIN");
//get the protein alphabet that includes the * termination Symbol
prot = AlphabetManager.alphabetForName("PROTEIN-TERM");
//get those same Alphabets from the Tools classes
dna = DNATools.getDNA();
rna = RNATools.getRNA();
prot = ProteinTools.getAlphabet();
//or the one with the * symbol
prot = ProteinTools.getTAlphabet();
}
}
SymbolLists are made of
Symbols
org.biojava.bio.symbol.SymbolList
A sequence of Symbols from the same
Alphabet.
Uses biological coordinates from 1 to
length
cf String from 0 to length-1
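The difference between the two coordinate systems can be sketched with a plain Java String. The helper names below mirror the SymbolList method names for readability but are hypothetical and do not use BioJava:

```java
class BioCoordinates {
    // Symbol at a 1-based biological position, read from a 0-based String.
    static char symbolAt(String seq, int index) {
        return seq.charAt(index - 1);
    }

    // Region from start to end inclusive, both 1-based.
    // String.substring takes a 0-based start and an exclusive end,
    // so only the start needs shifting.
    static String subStr(String seq, int start, int end) {
        return seq.substring(start - 1, end);
    }

    public static void main(String[] args) {
        String seq = "atgcccgcgtaa";
        System.out.println(symbolAt(seq, 1));              // first residue
        System.out.println(subStr(seq, 1, 3));             // first three residues
        System.out.println(subStr(seq, 10, seq.length())); // last three residues
    }
}
```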
Doesn’t this waste memory?
A SymbolList is not really a List of Symbol
Objects.
Rather a List of Object references.
Still a bit heavier than a char[] but not
serious.
[Figure: the SymbolList AACGTGGGTTCCAACT drawn as references to the four Symbol objects A, C, G and T]
The Bigger Picture
[Figure: the AlphabetManager and its “DNA” and “Protein” Alphabets, the Symbols A, C, G and T, and the SymbolList AACGTGGGTTCCAACT]
The SymbolList interface
void edit(Edit edit)
Apply an edit to the SymbolList as specified by the edit object.
Alphabet getAlphabet()
The alphabet that this SymbolList is over.
Iterator iterator()
An Iterator over all Symbols in this SymbolList.
int length()
The number of symbols in this SymbolList.
String seqString()
Stringify this symbol list.
SymbolList subList(int start, int end)
Return a new SymbolList for the symbols start to end inclusive.
String subStr(int start, int end)
Return a region of this symbol list as a String.
Symbol symbolAt(int index)
Return the symbol at index, counting from 1.
List toList()
Returns a List of symbols.
String to SymbolList
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
[Figure: the AlphabetManager and its “DNA” Alphabet, the AtomicSymbols A and T, the BasisSymbol W, and the SymbolList AATW]
Translating Ambiguity
BioJava handles translation of
ambiguity very smoothly.
DNA ‘n’ = [a,c,g,t]
Transcribes to RNA ‘n’ [a,c,g,u]
ggn translates to Gly
agn translates to [Ser, Arg]
Most protein ambiguities have no
‘token’ and are printed as ‘X’
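The idea behind ambiguity translation can be sketched in plain Java: expand the ambiguous triplet into every concrete codon it matches, translate each one, and collect the distinct results. This is an illustration only, not how BioJava implements it. The class name is hypothetical, only ‘n’ is expanded, and the codon table is deliberately partial (the eight codons needed here).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AmbiguousTranslation {
    // Bases matched by each character; only 'n' is ambiguous in this sketch.
    static final Map<Character, String> BASES = new HashMap<>();
    // Partial codon table: just the gg* and ag* codons of the standard code.
    static final Map<String, String> TABLE = new HashMap<>();

    static {
        BASES.put('a', "a");
        BASES.put('c', "c");
        BASES.put('g', "g");
        BASES.put('t', "t");
        BASES.put('n', "acgt");
        for (String c : new String[] {"gga", "ggc", "ggg", "ggt"}) {
            TABLE.put(c, "Gly");
        }
        TABLE.put("aga", "Arg");
        TABLE.put("agg", "Arg");
        TABLE.put("agc", "Ser");
        TABLE.put("agt", "Ser");
    }

    // Translate every concrete codon matched by the triplet and
    // collect the distinct residues (sorted by name).
    // Codons outside the partial table are not supported by this sketch.
    static Set<String> translate(String codon) {
        Set<String> residues = new TreeSet<>();
        for (char x : BASES.get(codon.charAt(0)).toCharArray())
            for (char y : BASES.get(codon.charAt(1)).toCharArray())
                for (char z : BASES.get(codon.charAt(2)).toCharArray())
                    residues.add(TABLE.get("" + x + y + z));
        return residues;
    }

    public static void main(String[] args) {
        System.out.println(translate("ggn")); // a single residue
        System.out.println(translate("agn")); // an ambiguous result with two residues
    }
}
```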
CrossProduct Alphabets
A CrossProductAlphabet is a combination of
two or more Alphabets.
Any type of CrossProductAlphabet is
possible
Dimers (DNA x DNA)
Codon (DNA x DNA x DNA)
Conditional ((DNA x DNA) x DNA)
Mixed ((DNA x DNA x DNA) x PROTEIN)
Finite and Compound Alphas
[Figure: the DNA AtomicSymbols A, C, G and T, and the SymbolList [AAC][GTG]GGTTCCAACT with its first two triplets bracketed]
What are they good for?
Codon Symbols (DNA x DNA x DNA).
Many analysis Classes such as Count
and Distribution use Symbol as an
argument. A hexamer can be an
AtomicSymbol.
Phred is DNA x Integer
1st and Higher order Markov Models
use CrossProductAlphabets.
How do I make a
CrossProductAlphabet?
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
public class CrossProduct {
public static void main(String[] args) {
//make a CrossProductAlphabet from a List
List l = Collections.nCopies(3, DNATools.getDNA());
Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);
//get the same Alphabet by name
Alphabet codon2 =
AlphabetManager.generateCrossProductAlphaFromName(
"(DNA x DNA x DNA)"
);
//show that the two Alphabets are canonical
System.out.println(codon == codon2);
}
}
Making Triplet Views on a
SymbolList
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
public class CodonView {
public static void main(String[] args) {
try {
//make a DNA SymbolList
SymbolList dna = DNATools.createDNA("atgcccgcgtaa");
System.out.println("Length of dna " + dna.length());
//get a Codon View (window size of three)
SymbolList codons = SymbolListViews.windowedSymbolList(dna, 3);
System.out.println("Length of codons " + codons.length());
//get a Triplet View
SymbolList triplets = SymbolListViews.orderNSymbolList(dna, 3);
System.out.println("Length of triplets "+ triplets.length());
}
catch (Exception ex) {
ex.printStackTrace();
}
}
}
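The two views report different lengths. A plain-Java sketch of the difference, assuming the windowed view is made of non-overlapping windows and the order-N view slides one position at a time; the class and its methods are hypothetical String-based stand-ins, not the SymbolListViews implementation:

```java
import java.util.ArrayList;
import java.util.List;

class TripletViews {
    // Non-overlapping windows: positions 1-3, 4-6, 7-9, ...
    // This sketch rejects sequences whose length is not a multiple of the window size.
    static List<String> windowed(String seq, int size) {
        if (seq.length() % size != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + size);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= seq.length(); i += size) {
            out.add(seq.substring(i, i + size));
        }
        return out;
    }

    // Overlapping windows that advance one position at a time:
    // positions 1-3, 2-4, 3-5, ...
    static List<String> orderN(String seq, int order) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + order <= seq.length(); i++) {
            out.add(seq.substring(i, i + order));
        }
        return out;
    }

    public static void main(String[] args) {
        String dna = "atgcccgcgtaa";
        System.out.println(windowed(dna, 3)); // length / 3 windows
        System.out.println(orderN(dna, 3));   // length - 3 + 1 windows
    }
}
```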
Getting a Symbol for a Codon
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
public class MakeATG {
public static void main(String[] args) {
//make a CrossProductAlphabet from a List
List l = Collections.nCopies(3, DNATools.getDNA());
Alphabet codon = AlphabetManager.getCrossProductAlphabet(l);
//get the codon made of atg
List syms = new ArrayList(3);
syms.add(DNATools.a());
syms.add(DNATools.t());
syms.add(DNATools.g());
Symbol atg = null;
try {
atg = codon.getSymbol(syms);
}
catch (IllegalSymbolException ex) {
//used Symbol from Alphabet that is not a component of codon
ex.printStackTrace();
}
System.out.println("Name of atg: "+ atg.getName());
}
}
Breaking a Codon into its
Parts
import java.util.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;
public class BreakingComponents {
public static void main(String[] args) {
//make the 'codon' alphabet
List l = Collections.nCopies(3, DNATools.getDNA());
Alphabet alpha = AlphabetManager.getCrossProductAlphabet(l);
//get the first symbol in the alphabet
Iterator iter = ((FiniteAlphabet)alpha).iterator();
AtomicSymbol codon = (AtomicSymbol)iter.next();
System.out.print(codon.getName()+" is made of: ");
//break it into a list of its components
List symbols = codon.getSymbols();
for(int i = 0; i < symbols.size(); i++){
if(i != 0)
System.out.print(", ");
Symbol sym = (Symbol)symbols.get(i);
System.out.print(sym.getName());
}
}
}
Basic Sequence Operations
Getting a section of a
SymbolList
symbolAt(int i)
Returns a Symbol
subList(int min, int max)
Returns a SymbolList
subStr(int min, int max)
Returns the subsection tokenized to a
String
Transcription
In BioJava DNA sequences and RNA sequences are from
different Alphabets. To convert between them:
//convert it to RNA
SymbolList rna = DNATools.toRNA(dna);
SequenceIterator i = (SequenceIterator)
SeqIOTools.fileToBiojava("fasta", "dna", br);
Alignment a =
(Alignment) SeqIOTools.fileToBiojava("MSF", "rna", br);
Features, Locations,
Annotations
Features and Annotations
Sequence data often comes with
added information about the various
properties of the sequence (Genbank,
SwissProt etc).
BioJava divides this information into
global properties (Annotations) and
Localized properties (Features).
Annotatable
Annotatable is a “mix-in” interface
that indicates the implementing object
contains an Annotation object.
It defines one method.
Annotation getAnnotation();
Annotations
org.biojava.bio.Annotation
Annotations are used for Global properties.
Species, Accession Number, xrefs, date,
publication.
Key – value maps.
Key and Value are objects but almost always are
Strings.
Annotation.EMPTY_ANNOTATION
static convenience class
good place holder, avoids null pointer exceptions
immutable
Annotation API
Map asMap()
Return a map that contains the same key/values as this
Annotation.
boolean containsProperty(java.lang.Object key)
Returns whether the property is defined.
Object getProperty(java.lang.Object key)
Retrieve the value of a property by key.
Set keys()
Get a set of key objects.
void removeProperty(java.lang.Object key)
Delete a property
void setProperty(java.lang.Object key, java.lang.Object value)
Set the value of a property.
FeatureHolder
FeatureHolder is another “mix-in”
interface which allows the
implementing object to hold Features.
Sequence implements FeatureHolder.
Features are created by
FeatureHolders.
FeatureHolders can be filtered.
FeatureHolder methods
boolean containsFeature(Feature f)
Check if the feature is present in this holder.
int countFeatures()
Count how many features are contained.
Feature createFeature(Feature.Template ft)
Create a new Feature, and add it to this FeatureHolder.
Iterator features()
Iterate over the features in no well defined order.
FeatureHolder filter(FeatureFilter filter)
Query this set of features using a supplied FeatureFilter.
FeatureHolder filter(FeatureFilter fc, boolean recurse)
Return a new FeatureHolder that contains all of the children of this one that
passed the filter fc.
FeatureFilter getSchema()
Return a schema-filter for this FeatureHolder.
void removeFeature(Feature f)
Remove a feature from this FeatureHolder.
Features are Annotatable
Features implement Annotatable
Can hold an annotation
Global annotations of a Feature
/note:
/db_xref: etc
Features may be nested
Features implement FeatureHolder!
Therefore Features may hold nested
Features
c.f. The AWT Menu is a MenuItem
e.g. A gene has exons and introns
Filtering can be recursive
A Feature cannot hold itself (directly or
indirectly)
Location API
Locations are objects that specify a minimum and
maximum bound on a region of sequence.
Contains some useful methods, particularly
getMin() and getMax().
Many methods have been deprecated and are now
delegated to LocationTools.
LocationTools is the best place to get new
instances of a Location.
PointLocation, RangeLocation, CircularLocation,
CompoundLocation.
LocationTools
static boolean areEqual(Location locA, Location locB)
Return whether two locations are equal.
static boolean contains(Location locA, Location locB)
Return true iff all indices in locB are also contained by locA.
static Location flip(Location loc, int len)
Flips a location relative to a length.
static Location intersection(Location locA, Location locB)
Return the intersection of two locations.
static CircularLocation makeCircularLocation(int min, int max, int seqLength)
A simple method to generate a RangeLocation wrapped in a CircularLocation
static Location makeLocation(int min, int max)
Return a contiguous Location from min to max.
static boolean overlaps(Location locA, Location locB)
Determines whether the locations overlap or not.
static Location subtract(Location x, Location y)
Subtract one location from another.
static Location union(java.util.Collection locs)
The n-way union of a Collection of locations.
static Location union(Location locA, Location locB)
Return the union of two locations.
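For contiguous ranges, the overlap, containment and intersection tests reduce to comparisons of the min and max bounds. A minimal plain-Java sketch, assuming inclusive bounds; SimpleRange is a hypothetical class for illustration and does not cover point, circular or compound locations:

```java
// Hypothetical contiguous range with inclusive bounds; not a BioJava class.
final class SimpleRange {
    final int min;
    final int max;

    SimpleRange(int min, int max) {
        if (min > max) {
            throw new IllegalArgumentException("min > max");
        }
        this.min = min;
        this.max = max;
    }

    // Two ranges overlap when each one starts no later than the other ends.
    boolean overlaps(SimpleRange o) {
        return min <= o.max && o.min <= max;
    }

    // True when every index of o lies inside this range.
    boolean contains(SimpleRange o) {
        return min <= o.min && o.max <= max;
    }

    // The shared region, or null when the ranges do not overlap.
    SimpleRange intersection(SimpleRange o) {
        if (!overlaps(o)) {
            return null;
        }
        return new SimpleRange(Math.max(min, o.min), Math.min(max, o.max));
    }

    @Override
    public String toString() {
        return min + ".." + max;
    }

    public static void main(String[] args) {
        SimpleRange a = new SimpleRange(3, 8);
        SimpleRange b = new SimpleRange(6, 12);
        System.out.println(a.overlaps(b));
        System.out.println(a.contains(b));
        System.out.println(a.intersection(b));
    }
}
```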
Location Example
import org.biojava.bio.symbol.*;
import org.biojava.bio.seq.*;
public class SpecifyRange {
public static void main(String[] args) {
try {
//make a RangeLocation specifying the residues 3-8
Location loc = LocationTools.makeLocation(3,8);
//print the location
System.out.println("Location: "+loc.toString());
//make a SymbolList
SymbolList sl = RNATools.createRNA("gcagcuaggcggaaggagc");
System.out.println("SymbolList: "+sl.seqString());
//get the SymbolList specified by the Location
SymbolList sym = loc.symbols(sl);
System.out.println("Symbols specified by Location: "+sym.seqString());
}
catch (IllegalSymbolException ex) {
//illegal symbol used to make sl
ex.printStackTrace();
}
}
}
Filtering Features
FeatureHolders have a filter method that
accepts a FeatureFilter as an argument.
Features that are accepted by the
FeatureFilter are returned as a new
FeatureHolder.
Filtering may be done recursively so that
nested Features are subjected to the same
FeatureFilter.
FeatureFilters
FeatureFilter is an interface that specifies
one method.
boolean accept(Feature f)
There are 26 implementations of
FeatureFilter in BioJava available as inner
classes of the FeatureFilter interface.
Most commonly used are ByType,
BySource, StrandFilter, OverlapsLocation,
ContainedByLocation.
Also boolean logic filters: And, Or, Not
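How accept-style filters compose with boolean logic can be sketched in plain Java using java.util.function.Predicate as an analogue. Feat, byType and overlaps below are hypothetical stand-ins for illustration, not the FeatureFilter inner classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class FilterSketch {
    // Hypothetical minimal feature: a type plus inclusive bounds.
    static final class Feat {
        final String type;
        final int min;
        final int max;

        Feat(String type, int min, int max) {
            this.type = type;
            this.min = min;
            this.max = max;
        }
    }

    // Accepts features of the given type.
    static Predicate<Feat> byType(String type) {
        return f -> f.type.equals(type);
    }

    // Accepts features that overlap the inclusive region min..max.
    static Predicate<Feat> overlaps(int min, int max) {
        return f -> f.min <= max && min <= f.max;
    }

    // Returns the accepted features as a new list, leaving the input untouched.
    static List<Feat> filter(List<Feat> features, Predicate<Feat> filter) {
        List<Feat> accepted = new ArrayList<>();
        for (Feat f : features) {
            if (filter.test(f)) {
                accepted.add(f);
            }
        }
        return accepted;
    }

    static List<Feat> sample() {
        return Arrays.asList(
            new Feat("exon", 10, 50),
            new Feat("intron", 51, 199),
            new Feat("exon", 200, 300));
    }

    public static void main(String[] args) {
        List<Feat> feats = sample();
        // And: exons that overlap 1..100
        System.out.println(filter(feats, byType("exon").and(overlaps(1, 100))).size());
        // Or: anything that is an exon or overlaps 1..100
        System.out.println(filter(feats, byType("exon").or(overlaps(1, 100))).size());
        // Not: everything that is not an exon
        System.out.println(filter(feats, byType("exon").negate()).size());
    }
}
```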
Analysis and Distributions
Distributions and Counts
The Distribution and Count interfaces
are from the org.biojava.bio.dist
package.
Counts are maps from AtomicSymbols
to counts.
Distributions are maps from Symbols
to frequencies.
Distributions
Distributions are central to analysis
Map Symbols to Frequencies
Can be trained or weights can be set
Used heavily in dp (dynamic programming)
package.
HMM transitions and emissions
Many implementations, frequently used are:
SimpleDistribution
OrderNDistribution
UniformDistribution
Distribution API
Alphabet getAlphabet()
The alphabet from which this spectrum emits symbols.
Distribution getNullModel()
Retrieve the null model Distribution that this Distribution recognizes.
double getWeight(Symbol s)
Return the probability that Symbol s is emitted by this spectrum.
void registerWithTrainer(DistributionTrainerContext dtc)
Register this distribution with a training context.
Symbol sampleSymbol()
Sample a symbol from this state's probability distribution.
void setNullModel(Distribution nullDist)
Set the null model Distribution that this Distribution recognizes.
void setWeight(Symbol s, double w)
Set the probability or odds that Symbol s is emitted by this state.
DistributionFactory
Generally a Distribution is created using a
DistributionFactory.
The DistributionFactory interface contains a
static field called DEFAULT that holds a
default DistributionFactory implementation.
DistributionFactory df = DistributionFactory.DEFAULT;
Distribution d = df.createDistribution(dna.getAlphabet());
Distribution Training
Distributions can be trained on
observed sequences using a
DistributionTrainerContext.
One or more Distributions can be
registered with the DTC.
//register the Distributions with the trainer
dtc.registerDistribution(dnaDist);
DistributionTrainerContext
A DistributionTrainer is assigned to each
registered Distribution by the DTC.
If unusual training behaviour is required you
can register your own DistributionTrainer at
the same time.
The dtc can also add pseudocounts if
needed.
Ambiguities are automagically handled.
Counts are split according to the null model.
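One way to picture the splitting of ambiguous counts, sketched in plain Java. The proportional rule used here (each member symbol receives a share of the count in proportion to its null-model weight) is an assumption for illustration, and the class and method names are hypothetical rather than part of BioJava:

```java
import java.util.HashMap;
import java.util.Map;

class CountSplitSketch {
    // Add a count for an observed character. 'members' lists the atomic bases
    // it stands for ("a" for a, "ct" for y); the count is shared between them
    // in proportion to their null-model weights (an assumed rule).
    static void addCount(Map<Character, Double> counts, String members,
                         Map<Character, Double> nullModel, double count) {
        double total = 0.0;
        for (char m : members.toCharArray()) {
            total += nullModel.get(m);
        }
        for (char m : members.toCharArray()) {
            counts.merge(m, count * nullModel.get(m) / total, Double::sum);
        }
    }

    // Normalize the accumulated counts into weights that sum to 1.
    static Map<Character, Double> train(Map<Character, Double> counts) {
        double total = 0.0;
        for (double c : counts.values()) {
            total += c;
        }
        Map<Character, Double> weights = new HashMap<>();
        for (Map.Entry<Character, Double> e : counts.entrySet()) {
            weights.put(e.getKey(), e.getValue() / total);
        }
        return weights;
    }

    // Observe one 'a' and one 'y' under a uniform null model.
    static Map<Character, Double> demo() {
        Map<Character, Double> uniform = new HashMap<>();
        for (char b : "acgt".toCharArray()) {
            uniform.put(b, 0.25);
        }
        Map<Character, Double> counts = new HashMap<>();
        addCount(counts, "a", uniform, 1.0);  // unambiguous: the whole count goes to a
        addCount(counts, "ct", uniform, 1.0); // y: half a count each to c and t
        return train(counts);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```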
Training Example
//make a DNA SymbolList
SymbolList dna = DNATools.createDNA("atcgctagcgtyagcntatsggca");
//get a DistributionTrainerContext
DistributionTrainerContext dtc = new SimpleDistributionTrainerContext();
//make the Distribution
Distribution dnaDist =
DistributionFactory.DEFAULT.createDistribution(dna.getAlphabet());
//register the Distribution with the trainer
dtc.registerDistribution(dnaDist);
for(int j = 1; j <= dna.length(); j++){
dtc.addCount(dnaDist, dna.symbolAt(j), 1.0);
}
//train the Distribution
dtc.train();
setWeight() Example
FiniteAlphabet a = DNATools.getDNA();
Distribution d =
DistributionFactory.DEFAULT.createDistribution(a);
//set the weight of each symbol
d.setWeight(DNATools.a(),0.3);
d.setWeight(DNATools.c(),0.2);
d.setWeight(DNATools.g(),0.2);
d.setWeight(DNATools.t(),0.3);
DistributionTools
DistributionTools holds static methods for creating
and manipulating Distributions.
Tasks include:
Equal emission spectra?
Shannon Entropy, information, KL Distance.
Generate biased sequences.
Make a Distribution[] from an Alignment (each Distribution
represents one position in an Alignment).
Average two or more Distributions.
Randomize a Distribution.
Make a Distribution from a Count.
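As an illustration of one of these tasks, Shannon entropy can be computed directly from a set of weights. The plain-Java sketch below shows the formula only; it is not the DistributionTools implementation, and the class and method names are hypothetical.

```java
class EntropySketch {
    // Shannon entropy in bits: H = -sum(p * log2(p)).
    // Zero weights contribute nothing and are skipped.
    static double shannonEntropy(double[] weights) {
        double h = 0.0;
        for (double p : weights) {
            if (p > 0.0) {
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // A uniform distribution over four symbols carries 2 bits per symbol.
        System.out.println(shannonEntropy(new double[] {0.25, 0.25, 0.25, 0.25}));
        // A mildly biased DNA distribution carries slightly less.
        System.out.println(shannonEntropy(new double[] {0.3, 0.2, 0.2, 0.3}));
    }
}
```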
Serialization of Distributions
Distributions are Serializable
Write to and Read from Binary
RMI
XMLDistributionWriter
Write any Distribution to a stream in XML format.
XMLDistributionReader
SAXParser
Read any Distribution from an XML stream
XML Output
<?xml version="1.0" ?>
<Distribution type="Distribution">
<alphabet name="DNA" />
<weight sym="adenine" prob="0.32178516910737204" />
<weight sym="cytosine" prob="0.04596199299395364" />
<weight sym="guanine" prob="0.1405504188012911" />
<weight sym="thymine" prob="0.4917024190973832" />
</Distribution>
What Else??
Dynamic Programming (HMMs)
Bibliography
Alignments
Blast and Fasta parsing
What Else??
BioSQL support
GUI components
Chromatograms
Molecular Biology (pI, mass, restriction
enzymes)
Molecular Structure