Graph Terminology
degree of a node
number of edges meeting at it
leaf node
a node of degree 1
path
connected sequence of edges between two nodes
cycle
path which returns to its starting node
tree
connected graph with no cycles
subgraph
graph containing a subset of the nodes and edges of another graph
spanning tree
a tree subgraph that contains all the nodes (but not necessarily all the edges) of a graph
connected graph
graph in which there is a path between every pair of nodes
fully-connected graph
graph in which there is an edge between every pair of nodes (in a graph of n nodes, every node has degree n-1)
disconnected graph
graph in which some pairs of nodes have no path between them
component
subgraph in which all pairs of nodes are linked by a path, but no node has a path to a node in another component
forest
graph containing two or more components that are trees
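
The terms above can be illustrated on a small adjacency-list graph. This is only a sketch: the example graph and the function names (`degree`, `leaves`, `components`, `is_forest`) are assumptions made for illustration, not part of the slides.

```python
from collections import deque

# Undirected, acyclic example graph stored as an adjacency list.
# Component 1: a-b, b-c, b-d. Component 2: e-f.
graph = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b"],
    "d": ["b"],
    "e": ["f"],
    "f": ["e"],
}

def degree(g, node):
    """Number of edges meeting at a node."""
    return len(g[node])

def leaves(g):
    """Nodes of degree 1, in sorted order."""
    return sorted(n for n in g if degree(g, n) == 1)

def components(g):
    """Connected components, each returned as a sorted list of nodes."""
    seen, result = set(), []
    for start in g:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in g[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        result.append(sorted(comp))
    return result

def is_forest(g):
    """True if no component contains a cycle (edges == nodes - components)."""
    n_edges = sum(len(nbs) for nbs in g.values()) // 2
    return n_edges == len(g) - len(components(g))
```

Here `b` has degree 3, every other node is a leaf, and the graph is a disconnected graph whose two components are both trees.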
Canonicalisation
a given chemical structure (or graph) can have many valid and unambiguous representations
for comparison purposes it would be useful to have a single unique or canonical representation
process of converting input representation to canonical form is called canonicalisation or canonisation
an obvious approach:
generate all possible valid SMILES
choose the one that comes first alphabetically
this would be effective but very slow, and there is a danger of missing one of the valid SMILES
most methods in use today involve renumbering the atoms in some unique and reproducible way
can be used to number rows in connection table
can determine order of atoms in SMILES
Morgan's algorithm
1. Label each node with its degree
2. Count number of different values
3. Recalculate labels by summing label values at neighbour nodes
4. Count number of different values
5. Repeat from step 3 until there is no increase in the number of different values
worked example (figures omitted: 13-node graph with the current label shown at each node)
initial degrees: 3 different values { 1, 2, 3 }
after first recalculation: 3 different values { 3, 5, 6 }
after third recalculation: 9 different values { 12, 13, 14, 18, 24, 25, 26, 30, 34 }
after fifth recalculation: 10 different values { 42, 61, 68, 102, 109, 116, 127, 133, 138, 150 }
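
The labelling steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `max_rounds` guard and the atom indexing of the example graph (heavy atoms of OC(=O)C(N)CC1C=CC(O)=CC=1, numbered 0-12 in SMILES order) are assumptions made here.

```python
def degrees(adj):
    """Steps 1-2: initial labels are the node degrees."""
    return {n: len(nbs) for n, nbs in adj.items()}

def refine(adj, labels):
    """Step 3: each new label is the sum of the labels at the neighbour nodes."""
    return {n: sum(labels[m] for m in nbs) for n, nbs in adj.items()}

def morgan_labels(adj, max_rounds=100):
    """Steps 3-5: refine until a round fails to increase the number of distinct values."""
    labels = degrees(adj)
    count = len(set(labels.values()))
    for _ in range(max_rounds):
        new = refine(adj, labels)
        new_count = len(set(new.values()))
        if new_count <= count:
            break  # no increase: keep the labels from the previous round
        labels, count = new, new_count
    return labels

# A chain of five nodes: 0-1-2-3-4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

# Heavy-atom graph of OC(=O)C(N)CC1C=CC(O)=CC=1 (indexing is an assumption)
tyrosine = {
    0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4, 5], 4: [3], 5: [3, 6],
    6: [5, 7, 12], 7: [6, 8], 8: [7, 9], 9: [8, 10, 11], 10: [9],
    11: [9, 12], 12: [11, 6],
}

first = degrees(tyrosine)
second = refine(tyrosine, first)
print(sorted(set(first.values())), sorted(set(second.values())))
```

For the five-node chain, `morgan_labels(path)` gives the end nodes, their neighbours and the middle node three different labels. Implementations differ on the exact stopping test; this sketch stops at the first round that does not raise the count, which on the example molecule happens straight away because the count stays at 3 after the first recalculation.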
Morgan's algorithm
most nodes now have different labels
choose node with highest label as node 1
number its neighbours in order of label values
number the neighbours of node 2 next; where label values are tied, choose the one with higher bond order first
move to node 3
continue till all nodes are numbered
we now have a numbering for the rows of the connection table
breadth-first trace
nodes are dealt with in a queue (first in, first out)
[Figure: example graph with final labels and breadth-first numbering 1-13]
depth-first trace
more suitable for assigning atom numbers in SMILES, where we want consecutive numbers to form a path
OC(=O)C(N)CC1C=CC(O)=CC=1
[Figure: example graph with an alternative numbering 1-13 in which consecutive numbers follow a path]
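
The breadth-first numbering can be sketched as follows. It assumes a connected graph and a dictionary of final labels; the function name `bfs_numbering` is hypothetical, and the bond-order tie-break described above is left out, so tied neighbours are simply taken in adjacency-list order.

```python
from collections import deque

def bfs_numbering(adj, labels):
    """Number nodes from 1, starting at the highest label and using a FIFO queue."""
    start = max(adj, key=lambda n: labels[n])
    number = {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Unnumbered neighbours, highest label first. sorted() is stable, so
        # tied neighbours keep their adjacency-list order (no bond-order rule here).
        pending = sorted((m for m in adj[node] if m not in number),
                         key=lambda m: -labels[m])
        for m in pending:
            number[m] = len(number) + 1
            queue.append(m)
    return number

# Five-node chain 0-1-2-3-4 with labels that are highest in the middle
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
numbering = bfs_numbering(path, labels)
```

The middle node becomes node 1, its two (tied, symmetric) neighbours become nodes 2 and 3, and the chain ends become nodes 4 and 5.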
Symmetry perception
if ties between label values cannot be resolved on basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn't matter which is chosen next
Morgan's algorithm is thus also useful for identifying symmetry in molecules
Morgan's algorithm
Provides canonical numbering for the nodes in a graph that doesn't depend on any original numbering
Works by taking more of the graph into account at each iteration
essence of relaxation technique is iteratively updating a value by looking at its immediate neighbours
It is not infallible
some graphs are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent
there are several theoretical papers analysing it mathematically
O. Ivanciuc, Canonical numbering and constitutional symmetry, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003
Canonicalisation
Algorithms are applied to graphs, not chemical structures
Issues such as aromaticity, tautomerism and stereochemistry need to be addressed before canonical numbering of the graph
Daylight's canonicalisation algorithm for SMILES perceives aromatic rings (using its own definition of aromaticity) as first step
Ring perception
How many rings are there in these structures and which ones are they?
Euler Relationship
nodes + rings = edges + components
where rings is the number of edges that must be removed from the graph to leave it with no cycles (a tree, if the graph is connected)
rings is also called the Frèrejacque number or nullity
examples:
6 + 1 = 6 + 1
10 + 2 = 11 + 1
7 + 2 = 8 + 1
23 + 5 = 25 + 3
this is the minimum possible number of rings; it may be useful to identify others
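
Rearranged, the relationship gives rings = edges - nodes + components, which is easy to compute. The sketch below counts components with a small union-find; the function name `ring_count` and the two example graphs are assumptions made for illustration.

```python
def ring_count(nodes, edges):
    """rings = edges - nodes + components, from nodes + rings = edges + components."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative, shortening the path as we go
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    components = len(parent)
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1  # this edge joins two previously separate components
    return len(edges) - len(parent) + components

# A single 6-membered ring: 6 nodes, 6 edges, 1 component
hexagon = [(i, (i + 1) % 6) for i in range(6)]
# Two fused 6-membered rings sharing the edge 4-5: 10 nodes, 11 edges
fused = hexagon + [(4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
```

These two graphs reproduce the first two examples: `ring_count(range(6), hexagon)` is 1 and `ring_count(range(10), fused)` is 2.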
typical choice of rings:
two 6-membered rather than one 6- and one 10-membered
two 5-membered rather than one 5- and one 6-membered
[Figure: example fused ring systems, some containing S and O atoms]
Sometimes a large envelope ring may be aromatic, when smaller rings are not
Ring perception is a complex area where there is no single right answer
[Figure: breadth-first spanning tree of the example graph, nodes numbered 1-13]
move to the next node in the queue
repeat until queue is empty
those edges from original graph not in the spanning tree are ring closures
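
A minimal sketch of this spanning-tree approach: grow a breadth-first tree from a queue, then report every edge of the original graph that the tree did not use. The function name `ring_closures` and the example graph (a 6-membered ring carrying one substituent) are assumptions made for illustration.

```python
from collections import deque

def ring_closures(adj):
    """Grow a breadth-first spanning tree (forest) and return the edges left out of it."""
    visited = set()
    tree_edges = set()
    for start in adj:  # one tree per component
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in visited:
                    visited.add(nb)
                    tree_edges.add(frozenset((node, nb)))
                    queue.append(nb)
    all_edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return sorted(tuple(sorted(e)) for e in all_edges - tree_edges)

# 6-membered ring 0-1-2-3-4-5 with a substituent (node 6) on node 0
ring = {0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0], 6: [0]}
closures = ring_closures(ring)
```

The number of ring closures found this way equals the ring count given by the Euler relationship; which particular edges are reported depends on the start node and the neighbour order.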
Substructure Fragments
Subgraphs can be identified in a structure graph corresponding to functional groups, rings etc.
this can be done by tracing appropriate paths in the graph
subgraphs may overlap
[Figure: example structure with functional-group fragments highlighted]
More systematic subgraphs can also be identified (easier to do algorithmically)
e.g. paths of connected atoms
[Figure: example structure with path fragments highlighted]
fragments give ambiguous representations (different structures can share the same set of fragments)
Fragment codes
many early chemical information systems were based on identifying fragments of this sort
originally the fragments were identified manually and represented on punched cards
special fragment codes (dictionaries of fragments) were devised for different systems
some of these are still in use, though with automated encoding of structures
particularly important are the systems for Markush structures in patents (e.g. Derwent WPI code)
Fingerprints
0 means fragment is not present in structure
1 means fragment is present in structure (perhaps multiple times)
each 0 or 1 can be represented as a single bit in the computer, giving a bitstring
for chemical structures these bitstrings are often called structure fingerprints
fingerprints are typically 150-2500 bits long
where a fixed dictionary of fragments is used there can be a 1:1 relationship between fragment and bit position in fingerprint
disadvantage is that if structure contains no fragments from the dictionary, no bits are set
can be avoided if generalised fragments are used (involving e.g. any atom, any ring bond types)
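
A dictionary fingerprint can be sketched as a list of bits with one position per dictionary entry. The five-entry dictionary and the fragment strings below are made up for illustration; real fragment dictionaries are far larger.

```python
# Hypothetical fragment dictionary: each fragment owns exactly one bit position.
FRAGMENT_DICTIONARY = ["C-O", "C=O", "C-N", "C-C-N", "c:c"]

def dictionary_fingerprint(fragments, dictionary=FRAGMENT_DICTIONARY):
    """Return one 0/1 bit per dictionary entry, in dictionary order."""
    present = set(fragments)  # multiple occurrences still set the bit only once
    return [1 if frag in present else 0 for frag in dictionary]

# "C-O" occurs twice and "C-S" is not in the dictionary
fp = dictionary_fingerprint(["C-O", "C=O", "C-O", "C-C-N", "C-S"])
# A structure with no dictionary fragments sets no bits at all
empty = dictionary_fingerprint(["P-F"])
```

Because of the 1:1 relationship, bit position i can always be traced back to fragment i of the dictionary.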
if fragment set is open-ended, the fragment description (e.g. C-C-N-C-C-O) can be hashed to a number in fixed range (e.g. 1 to 1024) and this is the bit number to be set
disadvantages:
different and unrelated fragments may collide at the same bit position
difficult to work back from bit position to fragment
this usually causes only slight degradation in search performance (false hits), but can be more of a problem in other applications of fingerprints
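
A hashed fingerprint can be sketched by hashing each fragment description to a bit position. The choice of `zlib.crc32` is an assumption made only because it gives the same result on every run; it is not the hash function of any particular chemical software package, and positions here run from 0 to 1023 rather than 1 to 1024.

```python
import zlib

def hashed_fingerprint(fragments, n_bits=1024):
    """Set one bit per fragment, chosen by hashing the fragment description."""
    bits = [0] * n_bits
    for frag in fragments:
        # Deterministic hash reduced to a position in the range 0..n_bits-1
        position = zlib.crc32(frag.encode("utf-8")) % n_bits
        bits[position] = 1
    return bits

fp = hashed_fingerprint(["C-C-N-C-C-O", "C-C-O"])
# With a single bit available, unrelated fragments are forced to collide
tiny = hashed_fingerprint(["C-C", "C-N"], n_bits=1)
```

Nothing in the bitstring records which fragment set a given bit, which is why working back from bit position to fragment is difficult.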
Hashed fingerprints
typically used in software from Daylight Chemical Information Systems Inc.
Dictionary fingerprints
typically used in software from Chemical Abstracts Service and MDL Information Systems Inc.
2D structure depiction
a 2D drawing has to be generated from a representation such as SMILES
depiction algorithms are used for this
identify and lay out ring systems first
complications over orientation of some systems
Chemical Abstracts stores standard depictions of all ring systems it has encountered
many features can be added to improve appearance
3D structure depiction
much more complicated than 2D
need to store standard bond lengths and angles
need to distinguish atoms in different hybridisation states (sp2 vs sp3 carbon)
need to rotate single bonds to avoid bumps (atoms placed too close together)
sophisticated conformation generation programs identify low-energy conformers
very useful for identifying molecules with the correct shape to fit into biological receptor sites
J. Sadowski, 3D structure generation, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 231-261. Wiley, 2003
Nomenclature generation
need to identify/prioritise ring systems first
identify standard numbering for system
J. L. Wisniewski, Chemical nomenclature and structure representation: algorithmic generation and conversion, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003
Summary
there are several important jargon terms used in graph theory, which crop up in chemical informatics
canonicalisation provides a unique numbering for the atoms in a molecule
it's not always obvious how many rings there are, or which ones they are
fingerprints represent the presence or absence of substructure fragments in a molecule
Substructure search
query is a pattern of atoms and bonds
does this pattern occur as a substructure (subgraph) of any of the molecules in my database?
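
The question can be made concrete with a simple backtracking subgraph match. This is a naive sketch under stated assumptions: molecules are stored as a pair of dictionaries (element symbols and neighbour lists), bond types are ignored, and the function name `has_substructure` and the two-molecule database are hypothetical. Practical systems are considerably more elaborate.

```python
def has_substructure(query, target):
    """Backtracking search for a mapping of query atoms onto distinct target atoms.

    Each graph is (atoms, adj): atoms maps index -> element symbol and
    adj maps index -> list of neighbour indices. Bond types are ignored.
    """
    q_atoms, q_adj = query
    t_atoms, t_adj = target
    order = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return True  # every query atom has been placed
        q = order[len(mapping)]
        for t in t_atoms:
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Every already-mapped query neighbour must be bonded to t in the target
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack and try the next target atom
        return False

    return extend({})

# Hypothetical database: N-C-C-O and C-C-C chains (heavy atoms only)
database = {
    "ethanolamine": ({0: "N", 1: "C", 2: "C", 3: "O"},
                     {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}),
    "propane": ({0: "C", 1: "C", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]}),
}
c_o = ({0: "C", 1: "O"}, {0: [1], 1: [0]})
n_o = ({0: "N", 1: "O"}, {0: [1], 1: [0]})
hits = [name for name, mol in database.items() if has_substructure(c_o, mol)]
```

The C-O query matches only the first molecule, while the N-O query matches neither, because the N and O atoms of the first molecule are not bonded to each other.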