
Chemical Structure Representation and Search Systems

Graph Terminology

degree of a node
number of edges meeting at it

leaf node
a node of degree 1

path
connected sequence of edges between two nodes

[Figure: example graph with each node labelled by its degree]


cycle
path which returns to its starting node

tree
graph with no cycles

subgraph
graph containing a subset of the nodes and edges of another graph


spanning tree
a tree subgraph that contains all the nodes (but not necessarily all the edges) of a graph


connected graph
graph in which there is a path between every pair of nodes

fully-connected graph
graph in which there is an edge between every pair of nodes (in a graph of n nodes, every node has degree n-1)


disconnected graph
graph in which some pairs of nodes have no path between them

component
subgraph in which all pairs of nodes are linked by a path, but no node has a path to a node in another component


forest
graph with two or more components, each of which is a tree
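
The terms above can be tried out on a small graph. The following is a minimal sketch in Python; the example graph (a triangle with a tail, plus a separate edge) and the helper names `build_adjacency` and `components` are assumptions made for illustration.

```python
from collections import deque

# Hypothetical example: triangle 0-1-2 with a tail 2-3, plus a separate edge 4-5.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (4, 5)]
nodes = range(6)

def build_adjacency(nodes, edges):
    """Map each node to the set of nodes it shares an edge with."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def components(adj):
    """Return the connected components as a list of sorted node lists."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        result.append(sorted(comp))
    return result

adj = build_adjacency(nodes, edges)
degree = {n: len(adj[n]) for n in adj}          # degree of each node
leaves = [n for n in adj if degree[n] == 1]     # leaf nodes have degree 1
comps = components(adj)
# A graph has no cycles exactly when edges == nodes - components.
is_forest = len(edges) == len(adj) - len(comps)
```

Here the graph is disconnected (two components) and is not a forest, because the triangle is a cycle.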

Canonicalisation

a given chemical structure (or graph) can have many valid and unambiguous representations

  different order of rows in connection table
  different order of atoms in SMILES

for comparison purposes it would be useful to have a single unique or canonical representation

process of converting input representation to canonical form is called canonicalisation or canonisation

  process of applying rules (i.e. an algorithm)

an obvious approach:

  generate all possible valid SMILES
  choose the one that comes first alphabetically

this would be very slow, but effective, and there is a danger of missing one

principle was used for canonicalising Wiswesser Line Notation
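
The "generate everything, keep the first alphabetically" idea can be sketched on a connection table rather than on SMILES strings. This is an illustration under stated assumptions, not the method used for any real notation: the function name `canonical_form` and the tuple representation are hypothetical, and the cost grows as n! with the number of atoms.

```python
from itertools import permutations

def canonical_form(symbols, bonds):
    """Try every renumbering; keep the lexicographically smallest description."""
    n = len(symbols)
    best = None
    for perm in permutations(range(n)):          # perm[old] = new number
        new_symbols = [None] * n
        for old, new in enumerate(perm):
            new_symbols[new] = symbols[old]
        new_bonds = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        candidate = (tuple(new_symbols), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best

# Two input orderings of the same structure, ethanol (C-C-O), hydrogens omitted.
form_a = canonical_form(["C", "C", "O"], [(0, 1), (1, 2)])
form_b = canonical_form(["O", "C", "C"], [(0, 1), (1, 2)])
# A different structure, dimethyl ether (C-O-C).
form_c = canonical_form(["C", "O", "C"], [(0, 1), (1, 2)])
```

Both orderings of ethanol reduce to the same form, while dimethyl ether gives a different one.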

most methods in use today involve renumbering the atoms in some unique and reproducible way

  can be used to number rows in connection table
  can determine order of atoms in SMILES

normally involve a node labelling technique called relaxation

example is Morgan's algorithm (1965)

Morgan's algorithm

1. Label each node with its degree
2. Count number of different values

3 different values { 1, 2, 3 }

[Figure: example structure with each node labelled by its degree]

3. Recalculate labels by summing label values at neighbour nodes
4. Count number of different values
5. Repeat from step 3

3 different values { 3, 5, 6 }

[Figure: structure with recalculated node labels]

repeating steps 3 to 5 gives, in successive passes:

8 different values { 5, 6, 10, 11, 12, 13, 14, 16 }

9 different values { 12, 13, 14, 18, 24, 25, 26, 30, 34 }

9 different values { 18, 24, 25, 42, 48, 51, 61, 68, 82 }

[Figures: structure with node labels after each pass]

5. Repeat from step 3 until there is no increase in the number of different values

10 different values { 42, 61, 68, 102, 109, 116, 127, 133, 138, 150 }

[Figure: structure with final node labels]
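
The relaxation steps can be sketched in a few lines of Python. The atom indices below are an assumption: they follow the heavy atoms of tyrosine in the order of the SMILES OC(=O)C(N)CC1C=CC(O)=CC=1. The names `adjacency` and `morgan_passes` are hypothetical. The number of passes is left to the caller, since on this example the count of different values stays at 3 after the first recalculation before rising to 8. Only the first two recalculations are checked against the slide values here.

```python
# Heavy-atom bonds of tyrosine; index assignment is an assumption of this sketch.
BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
         (7, 8), (8, 9), (9, 10), (9, 11), (11, 12), (12, 6)]

def adjacency(bonds):
    """Build node -> list of neighbours from a list of bonds."""
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj

def morgan_passes(adj, n_passes):
    """Return the label dicts for the initial degrees and n_passes recalculations."""
    labels = {n: len(nbrs) for n, nbrs in adj.items()}   # steps 1-2: degrees
    history = [labels]
    for _ in range(n_passes):
        # step 3: each new label is the sum of the current labels of the neighbours
        labels = {n: sum(labels[m] for m in adj[n]) for n in adj}
        history.append(labels)
    return history

history = morgan_passes(adjacency(BONDS), 2)
# step 4: the different values present after each pass
distinct = [sorted(set(h.values())) for h in history]
```

`distinct` holds one sorted list of label values per pass, so `len(distinct[i])` is the count used in step 4.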

most nodes now have different labels
choose node with highest label as node 1
number its neighbours in order of label values

[Figure: labelled structure with node 1 and its neighbours numbered]

move to node 2
number its remaining neighbours in order of label values
because label values are tied, choose one with higher bond order (green) first

move to node 3

[Figure: labelled structure with nodes 1 to 5 numbered]

continue till all nodes are numbered
we now have a numbering for the rows of the connection table

breadth-first trace
  nodes are dealt with in a queue (first in, first out)

[Figure: structure with all 13 nodes numbered breadth-first]
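
A breadth-first numbering of this kind can be sketched as follows. Several things are assumptions made for illustration: the atom indices (tyrosine heavy atoms in SMILES order), the `LABELS` values (taken after two recalculations rather than the final labels), and the tie-break by input index, which stands in for the bond-order rule and so does not by itself give a canonical result.

```python
from collections import deque

BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
         (7, 8), (8, 9), (9, 10), (9, 11), (11, 12), (12, 6)]
# Assumed input: labels after two recalculations of the degree values.
LABELS = {0: 5, 1: 12, 2: 5, 3: 14, 4: 6, 5: 12, 6: 16,
          7: 11, 8: 10, 9: 13, 10: 5, 11: 10, 12: 11}

def adjacency(bonds):
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj

def bfs_numbering(adj, labels):
    """Start at the highest label; number unnumbered neighbours in decreasing
    label order (ties broken by lower input index, an assumption)."""
    start = max(adj, key=lambda n: (labels[n], -n))
    number = {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()                       # first in, first out
        for nbr in sorted(adj[node], key=lambda n: (-labels[n], n)):
            if nbr not in number:
                number[nbr] = len(number) + 1
                queue.append(nbr)
    return number

number = bfs_numbering(adjacency(BONDS), LABELS)
```

`number` maps each input atom index to its new row number in the connection table.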

depth-first trace is also possible
  nodes are dealt with in a stack (last in, first out)

more suitable for assigning atom numbers in SMILES where we want consecutive numbers to form a path

OC(=O)C(N)CC1C=CC(O)=CC=1

[Figure: structure with all 13 nodes numbered depth-first]
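
The depth-first variant differs only in using a stack. This is a minimal sketch under the same kind of assumptions as before: the atom indices and `LABELS` values are illustrative inputs, the start atom is chosen by the caller, and ties are broken by input index rather than by bond order.

```python
BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
         (7, 8), (8, 9), (9, 10), (9, 11), (11, 12), (12, 6)]
# Assumed input: labels after two recalculations of the degree values.
LABELS = {0: 5, 1: 12, 2: 5, 3: 14, 4: 6, 5: 12, 6: 16,
          7: 11, 8: 10, 9: 13, 10: 5, 11: 10, 12: 11}

def adjacency(bonds):
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj

def dfs_numbering(adj, labels, start):
    """Number nodes depth-first using an explicit stack (last in, first out)."""
    number = {}
    stack = [start]
    while stack:
        node = stack.pop()
        if node in number:
            continue                      # reached earlier by another route
        number[node] = len(number) + 1
        # Push low-priority neighbours first so the highest label is popped next.
        for nbr in sorted(adj[node], key=lambda n: (labels[n], -n)):
            if nbr not in number:
                stack.append(nbr)
    return number

adj = adjacency(BONDS)
number = dfs_numbering(adj, LABELS, start=0)
order = sorted(number, key=number.get)    # atoms listed by their new number
```

In this run the first ten numbers follow one unbroken chain of bonded atoms, which is the property wanted when writing SMILES.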

Symmetry perception

if ties between label values cannot be resolved on basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn't matter which is chosen next

Morgan's algorithm is thus also useful for identifying symmetry in molecules

Morgan's algorithm

Provides canonical numbering for the nodes in a graph that doesn't depend on any original numbering

Works by taking more of the graph into account at each iteration
  essence of relaxation technique is iteratively updating a value by looking at its immediate neighbours

It is not infallible
  some graphs are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent

There are many variations on it
  and several theoretical papers analysing it mathematically

O. Ivanciuc, Canonical numbering and constitutional symmetry, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003

Canonicalisation

Algorithms are applied to graphs, not chemical structures

Issues such as aromaticity, tautomerism and stereochemistry need to be addressed before canonical numbering of the graph

Daylight's canonicalisation algorithm for SMILES perceives aromatic rings (using its own definition of aromaticity) as first step

Ring perception

How many rings are there in these structures and which ones are they?

rings are important features of chemical structures

  nomenclature generation
  aromaticity perception
  synthetic significance
  fragment descriptor generation

Rings and ring systems

A ring system is a subgraph in which every edge is part of a cycle


Ring perception

Euler Relationship

nodes + rings = edges + components

where rings is the number of edges that must be removed from the graph to turn it into a tree (or a forest, if there are several components)
rings is also called the Frèrejacque number or nullity

6 + 1 = 6 + 1
10 + 2 = 11 + 1
7 + 2 = 8 + 1
23 + 5 = 25 + 3

this is the minimum possible number of rings; it may be useful to identify others
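
Rearranging the Euler relationship gives the ring count directly. A minimal sketch, with the hypothetical function name `ring_count` and the four (nodes, edges, components) examples taken from the figures above:

```python
def ring_count(nodes, edges, components):
    """Rearranged Euler relationship: rings = edges - nodes + components."""
    return edges - nodes + components

# (nodes, edges, components) for the four example structures
examples = [(6, 6, 1), (10, 11, 1), (7, 8, 1), (23, 25, 3)]
rings = [ring_count(n, e, c) for n, e, c in examples]
```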

Which rings to perceive?

Usually the smallest set of smallest rings (SSSR)

  two 6-membered rather than one 6- and one 10-membered
  two 5-membered rather than one 5- and one 6-membered

But there may be more than one SSSR

three different 6-membered rings:

  C-S-C-C-C-C
  C-C-C-C-O-C
  C-S-C-C-O-C

[Figure: bicyclic structure containing S and O, with each of the three rings highlighted]

Sometimes a large envelope ring may be aromatic, when smaller rings are not

Ring perception is a complex area where there are no right answers
  there is a lot of literature on the subject

Ring perception by spanning tree

start at an arbitrary node
grow a spanning tree
  add neighbours of current node to a queue
    provided they are not already in it
  move to the next node in the queue
  repeat until queue is empty

those edges from original graph not in the spanning tree are ring closures

[Figure: structure with nodes numbered 1 to 13 in the order they join the spanning tree]
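
The spanning-tree procedure can be sketched as a breadth-first search that records which edges it used. The bond list (tyrosine heavy atoms in SMILES order) and the function names are assumptions for illustration. The sketch only covers the component containing the start node.

```python
from collections import deque

BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
         (7, 8), (8, 9), (9, 10), (9, 11), (11, 12), (12, 6)]

def adjacency(bonds):
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj

def ring_closures(adj, start):
    """Grow a breadth-first spanning tree; return (tree_edges, closure_edges)."""
    in_tree = {start}
    queue = deque([start])
    tree = set()
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in in_tree:            # only add nodes not already reached
                in_tree.add(nbr)
                tree.add(frozenset((node, nbr)))
                queue.append(nbr)
    all_edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return tree, all_edges - tree             # leftover edges close rings

tree, closures = ring_closures(adjacency(BONDS), start=0)
```

With 13 nodes and 13 edges in one component, the tree uses 12 edges and one edge is left over as the ring closure, in agreement with the Euler relationship.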

Substructure Fragments

Subgraphs can be identified in a structure graph corresponding to functional groups, rings etc.

  OH
  NH2
  COOH
  phenyl

this can be done by tracing appropriate paths in the graph
subgraphs may overlap

[Figure: example structure containing OH, NH2, COOH and phenyl groups]

More systematic subgraphs can also be identified (easier to do algorithmically)

  paths of connected atoms
  every atom and its immediate neighbours
  rings

(it's difficult to show pictures with atoms in several colours at once!)

Subgraphs can overlap
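
Enumerating paths of connected atoms is simple to do algorithmically. The following is a minimal sketch, not a production fragment generator: the function name `path_fragments` is hypothetical, bond types are ignored, and each path is written in whichever direction sorts first so that it is recorded only once.

```python
def path_fragments(symbols, adj, max_atoms):
    """Collect every simple path of 1..max_atoms atoms as a string such as 'C-C-N'."""
    found = set()

    def extend(path):
        text = "-".join(symbols[i] for i in path)
        rev = "-".join(symbols[i] for i in reversed(path))
        found.add(min(text, rev))          # same path read in either direction
        if len(path) < max_atoms:
            for nbr in adj[path[-1]]:
                if nbr not in path:        # simple paths only, no revisits
                    extend(path + [nbr])

    for atom in adj:
        extend([atom])
    return found

# Ethanolamine heavy atoms, N-C-C-O (example chosen for this sketch).
symbols = ["N", "C", "C", "O"]
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
fragments = path_fragments(symbols, adj, max_atoms=3)
```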

Substructure fragments

fragments provide index terms for a chemical structure
  analogous to keywords in a text document

they can be used in searching for structures
  retrieved structures must contain the same fragments as the query

ambiguous representations
  many different structures can have the same fragments, connected together in different ways

fragments to be used may be a closed list
  controlled vocabulary (dictionary) of structural features

or an open-ended list (like free text searching)
  e.g. all unbranched paths of up to 6 atoms
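
Using fragments as index terms amounts to a subset test: a structure is retrieved only if it contains every fragment of the query. A minimal sketch, in which the tiny `database` of path fragments and the name `passes_screen` are made up for illustration:

```python
def passes_screen(query_fragments, structure_fragments):
    """A structure can only match if it contains every fragment of the query."""
    return set(query_fragments) <= set(structure_fragments)

# Hypothetical index: unbranched paths of up to 3 atoms, bond types ignored.
database = {
    "ethanolamine": {"C", "N", "O", "C-C", "C-N", "C-O", "C-C-N", "C-C-O"},
    "ethanol": {"C", "O", "C-C", "C-O", "C-C-O"},
    "dimethyl ether": {"C", "O", "C-O", "C-O-C"},
}
query = {"C", "O", "C-C", "C-O"}
candidates = sorted(name for name, frags in database.items()
                    if passes_screen(query, frags))
```

Because fragments are an ambiguous representation, the candidates that pass still have to be checked in detail against the query.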

Fragment codes

many early chemical information systems were based on identifying fragments of this sort
  originally the fragments were identified manually and represented on punched cards

special fragment codes (dictionaries of fragments) were devised for different systems
  some of these are still in use, though with automated encoding of structures
  particularly important are the systems for Markush structures in patents (e.g. Derwent WPI code)

Fingerprints

the fragments present in a structure can be represented as a sequence of 0s and 1s

00010100010101000101010011110100

  0 means fragment is not present in structure
  1 means fragment is present in structure (perhaps multiple times)

each 0 or 1 can be represented as a single bit in the computer (a bitstring)
for chemical structures often called structure fingerprints

fingerprints are typically 150-2500 bits long

where a fixed dictionary of fragments is used there can be a 1:1 relationship between fragment and bit position in fingerprint
  sometimes several related fragments will set the same bit

disadvantage is that if structure contains no fragments from the dictionary, no bits are set
  can be avoided if generalised fragments are used (involving e.g. any atom, any ring bond types)
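
A dictionary-based fingerprint with a 1:1 relationship between fragment and bit position can be sketched as follows. The six-entry `DICTIONARY` is invented for illustration and is far smaller than any real key set.

```python
# Hypothetical fixed dictionary; the position in the list is the bit position.
DICTIONARY = ["C-C", "C-N", "C-O", "C-S", "C-C-N", "C-C-O"]

def dictionary_fingerprint(fragments, dictionary=DICTIONARY):
    """One bit per dictionary entry: 1 if the fragment is present, else 0."""
    return "".join("1" if frag in fragments else "0" for frag in dictionary)

fp = dictionary_fingerprint({"C", "O", "C-C", "C-O", "C-C-O"})
# A structure with no fragments from the dictionary sets no bits at all.
empty = dictionary_fingerprint({"Si-Si"})
```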

if fragment set is open-ended, the fragment description (e.g. C-C-N-C-C-O) can be hashed to a number in fixed range (e.g. 1 to 1024) and this is the bit number to be set

disadvantages:
  different and unrelated fragments may collide at the same bit position
  difficult to work back from bit position to fragment
  this usually causes only slight degradation in search performance (false hits), but can be more of a problem in other applications of fingerprints
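
Hashing a fragment description to a bit number can be sketched as below. The choice of SHA-256 and the function name `hashed_fingerprint` are assumptions made for illustration and do not reflect any particular vendor's scheme. Bit positions here run from 0 to `n_bits - 1` rather than 1 to 1024. A cryptographic hash is used only because Python's built-in `hash()` of strings varies between runs.

```python
import hashlib

def hashed_fingerprint(fragments, n_bits=1024):
    """Set one bit per fragment at a position derived from a hash of its text."""
    bits = 0
    for frag in fragments:
        digest = hashlib.sha256(frag.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        bits |= 1 << position          # colliding fragments share a bit
    return bits

fp = hashed_fingerprint({"C-C-N-C-C-O", "C-C", "C-O"})
sub = hashed_fingerprint({"C-C"})
# Every bit set by a subset of the fragments is also set in the full fingerprint.
contained = (sub & fp) == sub
```

The fingerprint is held as a Python integer; the bitwise test on the last line is the same containment check used for screening.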

Hashed fingerprints

typically used in software from
  Daylight Chemical Information Systems Inc.
  Chemical Abstracts Service
  MDL Information Systems Inc

Dictionary fingerprints

  ISIS or MACCS keys (166 and 960 bits)
  customised dictionaries
  Barnard Chemical Information Ltd

2D structure depiction

if structures are stored without 2D display coordinates, we need to generate them
  e.g. SMILES

depiction algorithms are used for this

identify and lay out ring systems first
  complications over orientation of some systems
  Chemical Abstracts stores standard depictions of all ring systems it has encountered

then add side chains, avoiding collisions

many features can be added to improve appearance

3D structure depiction

much more complicated than 2D
need to store standard bond lengths and angles
need to distinguish atoms in different hybridisation states (sp2 vs sp3 carbon)
need to rotate single bonds to avoid bumps
sophisticated conformation generation programs identify low-energy conformers
  very useful for identifying molecules with the correct shape to fit into biological receptor sites

J. Sadowski, 3D structure generation, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 231-261. Wiley, 2003

Nomenclature generation

most systematic nomenclature is based on ring systems
  need to identify/prioritise ring systems first

identify standard numbering for system
  frequently need to store this

add side chains and substituents with appropriate locants

J. L. Wisniewski, Chemical nomenclature and structure representation: algorithmic generation and conversion, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003

Conclusions from Lecture 3

there are several important jargon terms used in graph theory, which crop up in chemical informatics

canonicalisation provides a unique numbering for the atoms in a molecule
  Morgan's algorithm can be used to achieve it

it's not always obvious how many rings there are, or which ones they are

fingerprints represent the presence or absence of substructure fragments in a molecule
  they are ambiguous representations of structure

Topic for Lecture 4: Structure searching

two main varieties of search

full structure search
  query is a complete molecule
  is this molecule in the database?
  or tautomers, stereoisomers etc. of it

substructure search
  query is a pattern of atoms and bonds
  does this pattern occur as a substructure (subgraph) of any of the molecules in my database?