Chemical Structure Representation and Search Systems
Graph Terminology
degree of a node
number of edges meeting at it
leaf node
a node of degree 1
path
connected sequence of edges between two nodes
Graph Terminology
cycle
path which returns to its starting node
tree
connected graph with no cycles
subgraph
graph containing a subset of the nodes and edges of another graph
Graph Terminology
spanning tree
a tree subgraph that contains all the nodes (but not necessarily all the edges) of a graph
Graph Terminology
connected graph
graph in which there is a path between every pair of nodes
fully-connected graph
graph in which there is an edge between every pair of nodes (all nodes have degree n-1)
Graph Terminology
disconnected graph
graph in which some pairs of nodes have no path between them
component
subgraph in which all pairs of nodes are linked by a path, but no node has a path to a node in another component
Graph Terminology
forest
graph containing two or more components that are trees
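The terms above can be checked mechanically on a small graph. The following is a minimal sketch, assuming the graph is stored as a dictionary mapping each node to a list of its neighbours; the function names and the example graphs are illustrative and not part of the lecture. Here `is_tree` requires the graph to be connected, which is the usual convention.

```python
from collections import deque

def components(adj):
    """Split an undirected graph {node: [neighbours]} into connected components."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        result.append(component)
    return result

def degree(adj, node):
    # number of edges meeting at the node
    return len(adj[node])

def leaf_nodes(adj):
    return [node for node in adj if degree(adj, node) == 1]

def edge_count(adj):
    # every undirected edge appears in two neighbour lists
    return sum(len(nbrs) for nbrs in adj.values()) // 2

def has_no_cycles(adj):
    # an acyclic graph has exactly (nodes - components) edges
    return edge_count(adj) == len(adj) - len(components(adj))

def is_tree(adj):
    return has_no_cycles(adj) and len(components(adj)) == 1

def is_forest(adj):
    # following the slide: two or more components, all of them trees
    return has_no_cycles(adj) and len(components(adj)) >= 2

# Hypothetical examples: a branched tree, the same tree plus a second tree, a 3-ring.
branched = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
two_trees = {**branched, 4: [5], 5: [4]}
ring = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```

In `branched`, node 1 has degree 3 and nodes 0, 2 and 3 are leaf nodes; `two_trees` is disconnected with two components and so counts as a forest.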
Canonicalisation
a given chemical structure (or graph) can have many valid and unambiguous representations
  different order of rows in connection table
  different order of atoms in SMILES
for comparison purposes it would be useful to have a single unique or canonical representation
process of converting input representation to canonical form is called canonicalisation or canonisation
  process of applying rules (i.e. an algorithm)
Canonicalisation
an obvious approach:
  generate all possible valid SMILES
  choose the one that comes first alphabetically
this would be effective but very slow, and there is a danger of missing one
principle was used for canonicalising Wiswesser Line Notation
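The "generate everything, keep the first alphabetically" idea can be sketched on a connection table rather than on SMILES strings, which keeps the example short. This is only an illustration under that assumption: `canonical_form` and the toy molecules are hypothetical, bond orders and hydrogens are ignored, and the cost grows as n! with the number of atoms, which is why the approach is so slow.

```python
from itertools import permutations

def canonical_form(atoms, bonds):
    """Try every atom ordering and keep the smallest connection table.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):       # perm[old index] = new index
        symbols = [None] * n
        for old, new in enumerate(perm):
            symbols[new] = atoms[old]
        table = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        candidate = (tuple(symbols), tuple(table))
        if best is None or candidate < best:   # "comes first alphabetically"
            best = candidate
    return best

# Ethanol entered with two different atom orders, and dimethyl ether.
ethanol_a = canonical_form(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_form(["O", "C", "C"], [(0, 1), (1, 2)])
ether = canonical_form(["C", "O", "C"], [(0, 1), (1, 2)])
```

Both ethanol inputs give the same result, while dimethyl ether gives a different one, so comparing canonical forms is enough to decide whether two inputs describe the same graph.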
Canonicalisation
most methods in use today involve renumbering the atoms in some unique and reproducible way
  can be used to number rows in connection table
  can determine order of atoms in SMILES
normally involve a node labelling technique called relaxation
  example is Morgan's algorithm (1965)
Morgan's algorithm
1. Label each node with its degree
2. Count number of different values
3 different values { 1, 2, 3 }
[Figure: example structure with each node labelled by its degree]
Morgan's algorithm
3. Recalculate labels by summing label values at neighbour nodes
4. Count number of different values
[Figure: example structure with node labels]
Morgan's algorithm
3. Recalculate labels by summing label values at neighbour nodes
4. Count number of different values
5. Repeat from step 3
[Figure: example structure with recalculated node labels (values 3, 5 and 6)]
Morgan's algorithm
Steps 3 to 5 repeated
8 different values { 5, 6, 10, 11, 12, 13, 14, 16 }
[Figure: example structure with recalculated node labels]
Morgan's algorithm
Steps 3 to 5 repeated
9 different values { 12, 13, 14, 18, 24, 25, 26, 30, 34 }
[Figure: example structure with recalculated node labels]
Morgan's algorithm
Steps 3 to 5 repeated
9 different values { 18, 24, 25, 42, 48, 51, 61, 68, 82 }
[Figure: example structure with recalculated node labels]
Morgan's algorithm
3. Recalculate labels by summing label values at neighbour nodes
4. Count number of different values
5. Repeat from step 3 until there is no increase in the number of different values
10 different values { 42, 61, 68, 102, 109, 116, 127, 133, 138, 150 }
[Figure: example structure with recalculated node labels]
Morgan's algorithm
most nodes now have different labels
choose node with highest label as node 1
number its neighbours in order of label values
[Figure: example structure with final node labels; the highest label is 150]
Morgan's algorithm
move to node 2
number its remaining neighbours in order of label values
  because label values are tied, choose one with higher bond order (green) first
move to node 3
[Figure: example structure with nodes 1 to 5 numbered]
Morgan's algorithm
continue till all nodes are numbered
we now have a numbering for the rows of the connection table
breadth-first trace
nodes are dealt with in a queue (first in, first out)
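The relaxation step and the breadth-first numbering can be sketched as follows. This is a simplified illustration, not a full implementation of Morgan's algorithm: the tyrosine adjacency list is my own transcription (atoms in the order of the SMILES OC(=O)C(N)CC1C=CC(O)=CC=1, hydrogens and bond orders ignored), the number of relaxation rounds is passed in explicitly instead of applying the stopping test, and ties are broken by the input index, which is an assumption made only to keep the code short and is not canonical.

```python
from collections import deque

def build_adjacency(n, edges):
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj

def relax(adj, labels):
    """One relaxation step: new label = sum of the neighbours' current labels."""
    return [sum(labels[j] for j in adj[i]) for i in range(len(adj))]

def morgan_labels(adj, rounds):
    """Start from node degrees, apply `rounds` relaxation steps.

    Also returns the number of different values seen after each step.
    """
    labels = [len(nbrs) for nbrs in adj]
    history = [len(set(labels))]
    for _ in range(rounds):
        labels = relax(adj, labels)
        history.append(len(set(labels)))
    return labels, history

def bfs_numbering(adj, labels):
    """Breadth-first trace from the highest label, neighbours by descending label."""
    start = max(range(len(adj)), key=lambda i: labels[i])
    order, seen, queue = [start], {start}, deque([start])
    while queue:
        node = queue.popleft()                      # first in, first out
        for nbr in sorted(adj[node], key=lambda j: (-labels[j], j)):
            if nbr not in seen:
                seen.add(nbr)
                order.append(nbr)
                queue.append(nbr)
    return order                                    # order[k] is the node numbered k + 1

# Tyrosine heavy-atom graph (assumed transcription of the SMILES).
TYROSINE_EDGES = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
                  (7, 8), (8, 9), (9, 10), (9, 11), (11, 12), (12, 6)]
adj = build_adjacency(13, TYROSINE_EDGES)
labels, history = morgan_labels(adj, rounds=2)
order = bfs_numbering(adj, labels)
```

After two rounds the labels take the 8 different values { 5, 6, 10, 11, 12, 13, 14, 16 }, and the trace starts at the ring carbon carrying the label 16. A complete implementation would also compare atom and bond types when labels are tied.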
Morgan's algorithm
depth-first trace is also possible
  nodes are dealt with in a stack (last in, first out)
more suitable for assigning atom numbers in SMILES where we want consecutive numbers to form a path
OC(=O)C(N)CC1C=CC(O)=CC=1
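A depth-first trace differs from the breadth-first one only in replacing the queue by a stack. The sketch below assumes the same heavy-atom graph as the SMILES above, transcribed by hand with atoms indexed in SMILES order; neighbours are visited in the order they were entered, with no label-based priorities, so it illustrates the traversal only and not a canonical numbering.

```python
def build_adjacency(n, edges):
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    return adj

def dfs_numbering(adj, start):
    """Depth-first trace: nodes are taken from a stack (last in, first out)."""
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # push in reverse so the first-listed neighbour is visited next
        for nbr in reversed(adj[node]):
            if nbr not in seen:
                stack.append(nbr)
    return order

# Tyrosine heavy atoms, indexed in the order they appear in the SMILES (assumed).
TYROSINE_EDGES = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
                  (7, 8), (8, 9), (9, 10), (9, 11), (11, 12), (12, 6)]
dfs_order = dfs_numbering(build_adjacency(13, TYROSINE_EDGES), start=0)
```

Starting from the first oxygen, the trace visits the atoms in exactly the order they are written in the SMILES string, which is the sense in which consecutive numbers follow a path, with branches finished before the main chain continues.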
Symmetry perception
if ties between label values cannot be resolved on basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn't matter which is chosen next
Morgan's algorithm is thus also useful for identifying symmetry in molecules
Morgan's algorithm
Provides canonical numbering for the nodes in a graph that doesn't depend on any original numbering
Works by taking more of the graph into account at each iteration
  essence of relaxation technique is iteratively updating a value by looking at its immediate neighbours
It is not infallible
  some graphs are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent
There are many variations on it
  and several theoretical papers analysing it mathematically
O. Ivanciuc, Canonical numbering and constitutional symmetry, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003
Canonicalisation
Algorithms are applied to graphs, not chemical structures
Issues such as aromaticity, tautomerism and stereochemistry need to be addressed before canonical numbering of the graph
  Daylight's canonicalisation algorithm for SMILES perceives aromatic rings (using its own definition of aromaticity) as first step
Ring perception
How many rings are there in these structures and which ones are they?
rings are important features of chemical structures
  nomenclature generation
  aromaticity perception
  synthetic significance
  fragment descriptor generation
Rings and ring systems
A ring system is a subgraph in which every edge is part of a cycle
Ring perception
Euler Relationship
nodes + rings = edges + components
where rings is the number of edges that must be removed from the graph to turn it into a tree
rings is also called the Frèrejacque number or nullity
examples from the figure:
  6 + 1 = 6 + 1
  10 + 2 = 11 + 1
  7 + 2 = 8 + 1
  23 + 5 = 25 + 3
this is the minimum possible number of rings; it may be useful to identify others
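Rearranging the Euler relationship gives rings = edges - nodes + components, which is easy to evaluate from an edge list. The sketch below is one way to do it, assuming nodes are numbered 0 to n - 1; the union-find helper used to count components and the example ring systems are my own choices, not something prescribed by the lecture.

```python
def count_components(n, edges):
    """Count connected components with a simple union-find structure."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)})

def ring_count(n, edges):
    """Nullity from the Euler relationship: rings = edges - nodes + components."""
    return len(edges) - n + count_components(n, edges)

# A six-membered ring, and two six-membered rings fused along the 4-5 edge.
six_ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
fused = six_ring + [(4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
```

With these inputs `ring_count(6, six_ring)` reproduces 6 + 1 = 6 + 1 and `ring_count(10, fused)` reproduces 10 + 2 = 11 + 1.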
Which rings to perceive?
Usually the smallest set of smallest rings (SSSR)
  two 6-membered rather than one 6- and one 10-membered
  two 5-membered rather than one 5- and one 6-membered
But there may be more than one SSSR
  three different 6-membered rings: C-S-C-C-C-C, C-C-C-C-O-C, C-S-C-C-O-C
Which rings to perceive?
Sometimes a large envelope ring may be aromatic, when smaller rings are not
Ring perception is a complex area where there are no right answers
  there is a lot of literature on the subject
Ring perception by spanning tree
start at an arbitrary node
grow a spanning tree
  add neighbours of current node to a queue
    provided they are not already in it
  move to the next node in the queue
  repeat until queue is empty
those edges from original graph not in the spanning tree are ring closures
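The procedure above can be sketched directly with a queue. The following assumes a connected graph stored as a dictionary of neighbour lists; `ring_closures` and the example ring systems are illustrative names. Which particular edges come out as ring closures depends on the start node and on neighbour order, but their number is always the ring count given by the Euler relationship.

```python
from collections import deque

def ring_closures(adj, start=0):
    """Grow a breadth-first spanning tree; return the edges left out of it."""
    queued = {start}                 # nodes that have already been put in the queue
    queue = deque([start])
    tree_edges = set()
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in queued:    # provided it is not already in the queue
                queued.add(nbr)
                tree_edges.add(frozenset((node, nbr)))
                queue.append(nbr)
    all_edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return sorted(tuple(sorted(edge)) for edge in all_edges - tree_edges)

# A six-membered ring, and two six-membered rings fused along the 4-5 edge.
six_ring = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
fused = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5, 6],
         5: [4, 0, 9], 6: [4, 7], 7: [6, 8], 8: [7, 9], 9: [8, 5]}
closures_single = ring_closures(six_ring)
closures_fused = ring_closures(fused)
```

For the single ring started at node 0, the tree grows in both directions round the ring and the one edge left over joins nodes 3 and 4; the fused system leaves two edges over, one per ring.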
Substructure Fragments
Subgraphs can be identified in a structure graph corresponding to functional groups, rings etc.
  OH
  NH2
  COOH
  phenyl
this can be done by tracing appropriate paths in the graph
subgraphs may overlap
Substructure Fragments
More systematic subgraphs can also be identified (easier to do algorithmically)
  paths of connected atoms
  every atom and its immediate neighbours
  rings
Subgraphs can overlap
  (it's difficult to show pictures with atoms in several colours at once!)
Substructure fragments
fragments provide index terms for a chemical structure
  analogous to keywords in a text document
they can be used in searching for structures
  retrieved structures must contain the same fragments as the query
ambiguous representations
  many different structures can have the same fragments, connected together in different ways
fragments to be used may be a closed list
  controlled vocabulary (dictionary) of structural features
or an open-ended list (like free text searching)
  e.g. all unbranched paths of up to 6 atoms
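As an illustration of fragments used as index terms, the sketch below enumerates unbranched paths of atoms and uses set inclusion as the search test. It is a simplified example under stated assumptions: atoms are plain element symbols, bond types and hydrogens are ignored, and `path_fragments` and `may_contain` are hypothetical names.

```python
def path_fragments(atoms, adj, max_atoms=6):
    """All unbranched paths of 1..max_atoms atoms, written like 'C-C-O'."""
    found = set()

    def extend(path):
        symbols = [atoms[i] for i in path]
        # a path and its reverse are the same fragment: keep one spelling
        found.add(min("-".join(symbols), "-".join(reversed(symbols))))
        if len(path) < max_atoms:
            for nbr in adj[path[-1]]:
                if nbr not in path:
                    extend(path + [nbr])

    for start in range(len(atoms)):
        extend([start])
    return found

def may_contain(query_fragments, structure_fragments):
    """A retrieved structure must contain every fragment of the query."""
    return query_fragments <= structure_fragments

ethanol = path_fragments(["C", "C", "O"], [[1], [0, 2], [1]])
propan_1_ol = path_fragments(["C", "C", "C", "O"], [[1], [0, 2], [1, 3], [2]])
ether = path_fragments(["C", "O", "C"], [[1], [0, 2], [1]])

# With very short paths, different structures become indistinguishable.
short_1_ol = path_fragments(["C", "C", "C", "O"], [[1], [0, 2], [1, 3], [2]], max_atoms=2)
short_2_ol = path_fragments(["C", "C", "C", "O"], [[1], [0, 2, 3], [1], [1]], max_atoms=2)
```

Ethanol's fragments all occur in propan-1-ol but not in dimethyl ether, and the two propanol isomers share the same fragment set when only paths of up to 2 atoms are used, which is the ambiguity mentioned above.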
Fragment codes
many early chemical information systems were based on identifying fragments of this sort
  originally the fragments were identified manually and represented on punched cards
  special fragment codes (dictionaries of fragments) were devised for different systems
some of these are still in use, though with automated encoding of structures
  particularly important are the systems for Markush structures in patents (e.g. Derwent WPI code)
Fingerprints
the fragments present in a structure can be represented as a sequence of 0s and 1s
  00010100010101000101010011110100
  0 means fragment is not present in structure
  1 means fragment is present in structure (perhaps multiple times)
each 0 or 1 can be represented as a single bit in the computer (a bitstring)
for chemical structures often called structure fingerprints
Fingerprints
fingerprints are typically 150-2500 bits long
where a fixed dictionary of fragments is used there can be a 1:1 relationship between fragment and bit position in fingerprint
sometimes several related fragments will set the same bit
disadvantage is that if structure contains no fragments from the dictionary, no bits are set
can be avoided if generalised fragments are used (involving e.g. any atom, any ring bond types)
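A dictionary fingerprint with a 1:1 relationship between fragment and bit position can be sketched in a few lines. The six-entry dictionary below is made up for illustration (real dictionaries such as the ones named in this lecture are far larger), and the fingerprint is held in a Python integer used as a bitstring.

```python
# Hypothetical fragment dictionary: list position = bit position.
DICTIONARY = ["C-C", "C-O", "C-N", "C=O", "O-H", "N-H"]

def fingerprint(fragments, dictionary=DICTIONARY):
    """Set one bit for every dictionary fragment present in the structure."""
    bits = 0
    for position, fragment in enumerate(dictionary):
        if fragment in fragments:
            bits |= 1 << position
    return bits

def to_bitstring(bits, length=len(DICTIONARY)):
    """Write the fingerprint as 0s and 1s, bit position 0 first."""
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(length))

def passes_screen(query_bits, structure_bits):
    """Every bit set in the query must also be set in the structure."""
    return (query_bits & structure_bits) == query_bits

alcohol = fingerprint({"C-C", "C-O", "O-H"})
amino_alcohol = fingerprint({"C-C", "C-O", "O-H", "C-N", "N-H"})
unknown = fingerprint({"Si-F"})      # no dictionary fragment present
```

Here `to_bitstring(alcohol)` is "110010", the alcohol query passes the screen against the amino alcohol but not the other way round, and the last structure sets no bits at all, which is the disadvantage noted above.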
Fingerprints
if fragment set is open-ended, the fragment description (e.g. C-C-N-C-C-O) can be hashed to a number in fixed range (e.g. 1 to 1024) and this is the bit number to be set
disadvantages:
  different and unrelated fragments may collide at the same bit position
  difficult to work back from bit position to fragment
this usually causes only slight degradation in search performance (false hits), but can be more of a problem in other applications of fingerprints
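The hashing step can be sketched as follows. The use of SHA-256 from Python's `hashlib` is my own choice to get a hash that is stable between runs; it is not the function used by any of the commercial systems, and bit positions here run from 0 to `n_bits - 1` rather than from 1 to 1024.

```python
import hashlib

def bit_position(fragment, n_bits=1024):
    """Hash a fragment description such as 'C-C-N-C-C-O' to a bit number."""
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits

def hashed_fingerprint(fragments, n_bits=1024):
    """Set the bit selected by the hash of each fragment."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << bit_position(fragment, n_bits)
    return bits

def passes_screen(query_bits, structure_bits):
    return (query_bits & structure_bits) == query_bits

query = hashed_fingerprint({"C-C", "C-O"})
structure = hashed_fingerprint({"C-C", "C-O", "C-C-N-C-C-O"})
```

A query whose fragments are a subset of a structure's fragments always passes the screen, so no true hit is lost. Collisions only add false hits, and with a deliberately tiny `n_bits` every fragment lands on the same bit.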
Fingerprints
typically used in software from
Hashed fingerprints
  Daylight Chemical Information Systems Inc.
Dictionary fingerprints
  Chemical Abstracts Service
  MDL Information Systems Inc
    ISIS or MACCS keys (166 and 960 bits)
  Barnard Chemical Information Ltd
    customised dictionaries
2D structure depiction
if structures are stored without 2D display coordinates, we need to generate them
  e.g. SMILES
depiction algorithms are used for this
identify and lay out ring systems first
  complications over orientation of some systems
  Chemical Abstracts stores standard depictions of all ring systems it has encountered
then add side chains, avoiding collisions
many features can be added to improve appearance
3D structure depiction
much more complicated than 2D
  need to store standard bond lengths and angles
  need to distinguish atoms in different hybridisation states (sp2 vs sp3 carbon)
  need to rotate single bonds to avoid bumps
sophisticated conformation generation programs identify low-energy conformers
  very useful for identifying molecules with the correct shape to fit into biological receptor sites
J. Sadowski, 3D structure generation, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 231-261. Wiley, 2003
Nomenclature generation
most systematic nomenclature is based on ring systems
  need to identify/prioritise ring systems first
  identify standard numbering for system
    frequently need to store this
add side chains and substituents with appropriate locants
J. L. Wisniewski, Chemical nomenclature and structure representation: algorithmic generation and conversion, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003
Conclusions from Lecture 3
there are several important jargon terms used in graph theory, which crop up in chemical informatics
canonicalisation provides a unique numbering for the atoms in a molecule
  Morgan's algorithm can be used to achieve it
it's not always obvious how many rings there are, or which ones they are
fingerprints represent the presence or absence of substructure fragments in a molecule
  they are ambiguous representations of structure
Topic for Lecture 4: Structure searching
two main varieties of search
full structure search
  query is a complete molecule
  is this molecule in the database?
    or tautomers, stereoisomers etc. of it
substructure search
  query is a pattern of atoms and bonds
  does this pattern occur as a substructure (subgraph) of any of the molecules in my database?