Beruflich Dokumente
Kultur Dokumente
4665
4666
Neugebauer et al.
Table 1. Confusion Matrices and True-Positive Rates of the Initial and
the Pruned Decision Trees
building.5
SHP2
nRCOOR
Mor11m
MW
a
SHP2
nRCOOR
Mor11m
MW
1
-0.598
0.180
-0.697
1
-0.229
-0.613
1
-0.592
No.
0 (initial tree)
0 (pruned tree)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
No. of
main
leafs descriptor TP
9
4
1
1
1
1
1
3
1
1
1
1
1
1
1
1
1
1
1
5
1
12
1
1
1
1
1
4
1
1
3
1
1
1
1
1
1
1
3
1
1
1
1
1
1
3
1
1
1
1
1
1
SHP2
SHP2
nRCO
X1A
Mor03m
nR09
S107
GGI1
9
22
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
true-positive
ratec
FN
FP
TN
16
3
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
25
23
16
3
0
3
0
3
1
0
2
0
0
0
0
4
1
1
0
0
2
0
8
0
0
0
0
0
3
0
2
2
1
4
1
2
1
0
0
7
3
0
1
0
3
1
0
0
1
0
0
0
1
1114
1121
1134
1137
1134
1137
1134
1136
1137
1135
1137
1137
1137
1137
1133
1136
1136
1137
1137
1135
1137
1129
1137
1137
1137
1137
1137
1134
1137
1135
1135
1136
1133
1136
1135
1136
1137
1137
1130
1134
1137
1136
1137
1134
1136
1137
1137
1136
1137
1137
1137
1136
0.997
0.986
0.997
1.000
0.997
1.000
0.997
0.999
1.000
0.998
1.000
1.000
1.000
1.000
0.996
0.999
0.999
1.000
1.000
0.998
1.000
0.993
1.000
1.000
1.000
1.000
1.000
0.997
1.000
0.998
0.998
0.999
0.996
0.999
0.998
0.999
1.000
1.000
0.994
0.997
1.000
0.999
1.000
0.997
0.999
1.000
1.000
0.999
1.000
1.000
1.000
0.999
0.360
0.880
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
a For an explanation of descriptors other than SHP2, see ref 8. b TP, truepositives; FN, false-negatives; FP, false-positives; TN, true-negatives. c See
Experimental Section.
4668
TP
TP + FN
tp(F) )
TN
TN + FP
Neugebauer et al.
built from all the available data, and no incremental decision tree
induction is used. A 10-fold stratified crossvalidation was performed. Random seeds were set arbitrarily.
Overfitting can be prevented by pruning the tree. Therefore, the
concept space is examined starting from the easiest complex
description toward more complex concept descriptions (simplestfirst ordering). The simplest-first search and breakup at a sufficient
complex concept description is a reliable way to avoid overfitting.9
The C4.5 algorithm J48 implemented in WEKA is robust concerning overfitting.10
References
(1) Yin, H.; Hamilton, A. D. Strategies for targeting protein-protein
interactions with synthetic agents. Angew. Chem., Int. Ed. 2005, 44,
4130-4163.
(2) Fry, D. C. Protein-protein interactions as targets for small molecule
drug discovery. Biopolymers 2006, 84, 535-552.
(3) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J.
Experimental and computational approaches to estimate solubility
and permeability in drug discovery and development settings. AdV.
Drug DeliVery ReV. 2001, 46, 3-26.
(4) Cullell-Young, M.; Leeson, P. A.; del Fresno, M.; Bayes, M.
Enfuvirtide. Drugs Future 2002, 27, 450-457.
(5) http://www.epa.gov/ncct/dsstox/sdf_fdamdd.html.
(6) Gasteiger, J.; Rudolph, C.; Sadowski, J. Automatic generation of 3Datomic coordinates for organic molecules. Tetrahedron Comp. Method
1990, 3, 537-547.
(7) Irwin, J. J.; Shoichet, B. K. ZINCsA free database of commercially
available compounds for virtual screening. J. Chem. Inf. Model. 2005,
45, 177-182.
(8) Todeschini, R. Talete srl, DRAGON (for Windows), software for
molecular descriptor calculations, version 5.4, 2006 (http://www.talete.mi.it/).
(9) Quinlan, J. R. C4.5: Programs for Machine Learning; Morgan
Kaufmann Publishers: San Francisco, CA, 1993.
(10) I. H. Witten, E. F. Data Mining. Practical Machine Learning Tools
and Techniques, 2nd ed.; Morgan Kaufmann Publishers: San
Francisco, CA, 2005.
(11) Randic, M. Molecular Shape Profiles. J. Chem. Inf. Comput. Sci.
1995, 35, 337-382.
(12) Schuur, J. H.; Selzer, P.; Gasteiger, J. The coding of the threedimensional structure of molecules by molecular transforms and its
application to structure-spectra correlations and studies of biological
activity. J. Chem. Inf. Comput. Sci. 1995, 36, 334-344.
JM070533J