Beruflich Dokumente
Kultur Dokumente
489
Multiple linear regression (MLR) and computational neural networks (CNN) are utilized to develop
mathematical models to relate the structures of a diverse set of 332 organic compounds to their aqueous
solubilities. Topological, geometric, and electronic descriptors are used to numerically represent structural
features of the data set compounds. Genetic algorithm and simulated annealing routines, in conjunction
with MLR and CNN, are used to select subsets of descriptors that accurately relate to aqueous solubility.
Nonlinear models with nine calculated structural descriptors are developed that have a training set rootmean-square error of 0.394 log units for compounds which span a -log(molarity) range from -2 to +12
log units.
INTRODUCTION
MITCHELL
AND JURS
PREDICTION
OF
AQUEOUS SOLUBILITY
OF
ORGANIC COMPOUNDS
Table 1. Compounds Used in the Aqueous Solubility Study with Their Experimental and Calculated -log(molarity) Valuesd
compound name
expt AqSol
calc AqSol
compound name
expt AqSol
calc AqSol
ethane
propane
butane
2-methylbutane
pentane
2,2-dimethylbutane
2,3-dimethylbutanea
2-methylpentaneb
3-methylpentane
hexane
2,4-dimethylpentane
3-methylhexaneb
heptane
2,3,4-trimethylpentanea
2,2,4-trimethylpentane
octane
nonane
decanea
isobutaneb
2,2-dimethylpentane
2,3-dimethylpentane
2-methylhexane
3,3-dimethylpentane
3-methylheptane
2,2,5-trimethylhexane
4-methyloctaneb
1-propene
1-octene
1-pentyne
1-hexyne
benzene
tolueneb
ethylbenzene
m-xylene
o-xylene
p-xylene
1,2,3-trimethylbenzene
(1-methylethyl)benzene
1,3,5-trimethylbenzene
propylbenzene
1,2,4-trimethylbenzene
naphthaleneb
1-naphthol
2-naphthol
butylbenzeneb
sec-butylbenzenea
tert-butylbenzene
1-methylnaphthalene
2-methylnaphthalene
dibenzofuran
dibenzo-p-dioxin
diphenyl
1,5-dimethylnaphthalene
2,3-dimethylnaphthaleneb
phenylhexane
fluorene
diphenylmethane
anthracene
phenanthrene
1-methylphenanthrene
fluoranthrene
pyrene
2,3-benzofluorene
1,2-benzanthracene
chrysene
naphthacene
triphenylenea
benzo[a]pyreneb
9,10-dimethyl-1,2-benzanthracene
3-methylcholanthrene
1,2:5,6-dibenzanthracene
coronene
thiophene
2-ethylthiopheneb
1,2,4,5-tetramethylbenzene
pentylbenzenea
1,3-dimethylnaphthalene
1,4-dimethylnaphthalene
1-ethylnaphthalene
2,6-dimethylnaphthalene
1,4,5-trimethylnaphthalene
2-methylanthracene
2.73
2.84
2.95
3.18
3.25
3.62
3.63
3.79
3.78
3.91
4.36
4.47
4.55
4.83
4.77
5.26
5.92
6.55
3.08
4.36
4.28
4.60
4.23
5.16
5.18
6.05
2.20
4.52
1.76
2.20
1.64
2.22
2.78
2.83
2.77
2.80
3.22
3.30
3.34
3.25
3.33
3.58
2.39
2.49
3.78
3.93
3.69
3.69
3.75
4.46
5.33
4.34
4.69
4.77
5.22
4.91
4.17
6.55
5.21
5.88
5.96
6.19
8.91
7.19
8.14
8.19
6.88
7.97
6.83
7.94
8.23
9.38
1.45
2.59
4.59
4.61
4.29
4.14
4.16
4.89
4.92
6.80
2.82
2.65
2.91
3.29
3.52
3.59
3.66
3.88
3.79
4.06
4.27
4.42
4.70
4.18
4.55
5.39
5.98
6.65
2.78
4.32
4.17
4.54
4.16
5.15
5.28
5.79
1.66
3.93
1.80
2.36
1.85
2.20
2.54
2.97
2.80
2.85
3.50
3.24
3.82
3.06
3.53
3.48
1.62
1.36
3.82
3.70
3.71
3.60
3.75
4.65
4.74
4.74
4.38
4.47
5.35
5.37
4.87
5.95
5.79
5.51
6.47
6.62
7.56
7.57
7.73
8.02
7.07
8.26
7.56
8.02
8.86
9.32
1.42
2.16
4.04
4.52
4.47
4.38
3.91
4.65
4.91
5.83
9-methylanthracene
9,10-dimethylanthracene
perylene
benz[g,h,i]perylene
1-butanol
1-pentanol
1-hexanol
D-mannitol
1-heptanol
1-octanol
isobutanol
2-methyl-1-butanola
3-pentanol
tert-pentanol
4-methyl-2-pentanol
2,4-dimethyl-3-pentanol
2-octanol
2-butanoneb
4-methyl-2-butanonea
3,3-dimethyl-2-butanone
oxalic acida
malonic acid
succinic acid
pentanoic acid
suberic acidb
dodecanoic acid
ethyl acetate
isopropyl acetate
isobutyl acetate
butyl acetatea
pentyl acetatea
diethyl ether
methyl tert-butyl ether
propyl isopropyl etherb
isopropyl tert-butyl ether
dibutyl ether
styrene
phenol
hydroquinone
m-cresol
o-cresol
2,4-dimethylphenola
o-phenylphenola
furfural
benzaldehyde
acetophenoneb
benzoic acid
phthalic acidb
phenylacetic acid
m-toluic acid
diphenyl ether
D-tartaric acid
DL-tartaric acid
citric acid
salicylic acidb
methylparaben
propylparaben
butylparabena
ethylparaben
p-hydroxybenzoic acid
D-mandelic acid
L-mandelic acid
salicylic acid acetate
cinnamic acid
propionyl-r-mandelic acida
benzoyl-r-mandelic acid
1-methylfluorene
1,2-benzofluoreneb
cyclopentane
cyclohexane
methylcyclopentanea
methylcyclohexane
1,1,3-trimethylcyclopentanea
propylcyclopentane
1,1,3-trimethylcyclohexane
cyclopentene
1,4-cyclohexadieneb
cyclohexene
cycloheptatriene
D-limonene
pentylcyclopentaneb
acenaphthene
5.87
6.57
8.80
9.03
0.0209
0.609
1.23
-0.0569
1.84
2.37
-0.0864
0.470
0.236
0.0273
0.785
0.889
1.69
-0.525
0.731
0.731
-0.389
-0.762
0.188
0.510
1.29
4.79
0.0227
0.602
1.24
1.18
1.88
-0.408
0.235
1.34
2.37
1.99
2.58
0.0334
0.168
0.721
0.635
1.22
2.39
0.0757
1.26
1.32
1.57
1.15
0.886
2.14
3.95
-0.837
-0.767
-0.896
1.81
1.82
2.62
2.86
2.31
1.43
-0.0414
0.185
1.65
2.46
1.60
1.51
5.22
6.68
2.52
3.14
3.30
3.81
4.48
4.74
4.85
1.73
1.97
2.48
2.15
4.00
6.09
4.54
5.48
5.98
8.60
8.92
-0.111
0.422
1.12
0.0136
1.78
2.52
-0.466
-0.0512
0.270
0.0985
0.942
1.35
1.92
-0.549
0.849
0.810
-0.885
-0.780
0.265
0.00511
2.22
4.65
-0.0369
0.125
1.04
0.743
1.31
0.339
0.556
1.32
2.06
2.01
2.14
0.471
0.555
0.704
0.645
1.37
2.07
0.736
1.12
1.64
1.07
1.27
1.34
1.19
3.77
-0.833
-0.432
-0.833
0.968
1.71
2.62
3.14
1.97
1.19
0.998
1.00
1.42
1.66
1.28
2.48
5.22
7.47
2.69
3.03
3.15
3.59
4.60
4.43
4.93
2.02
1.83
2.48
1.82
3.98
6.26
4.78
MITCHELL
AND JURS
Table 1. Continued
compound name
expt AqSol
raffinose
sucrosec
norethindroneb
norethindrone acetate
quinone
testosterone
progesterone
hydrocortisone acetate
hydrocortisone
methylprednisolone
acrylonitrilea
trichloromethane
dichloromethane
tetrachloromethane
tetrafluoromethanec
bromomethane
dichlorodifluoromethanea
1,1,2,2-tetrachloroethane
methoxychlor
trichloroethene
1,2-dichloroethene
tetrachloroethene
1,2,3,5-tetrachlorobenzene
1,2,4-trichlorobenzene
m-dichlorobenzene
o-dichlorobenzene
p-dichlorobenzene
chlorobenzene
2,2,3,3,4,4,5,5,6-nonachlorobiphenylb
2,2,3,3,5,5,6,6-octachlorobiphenyl
2,2,3,3,4,4-hexachlorobiphenyl
2,2,3,3,6,6-hexachlorobiphenyl
2,2,4,4,5,5-hexachlorobiphenyl
2,2,4,4,6,6-hexachlorobiphenyla
2,2,3,4,5-pentachlorobiphenyl
2,2,4,5,5-pentachlorobiphenyl
2,3,4,5,6-pentachlorobiphenyl
2,2,5,5-tetrachlorobiphenyl
2,3,4,5-tetrachlorobiphenyl
3,3,4,4-tetrachlorobiphenyl
2,4,5-trichlorobiphenyl
2,4,6-trichlorobiphenyl
2,2-dichlorobiphenyl
2,4-dichlorobiphenyl
2,5-dichlorobiphenyl
4,4-dichlorobiphenylb
2-chlorobiphenyl
4-chlorobiphenyl
decachlorobiphenyl
p,p-DDE
p,p-DDT
pentachlorobenzene
1,2,3,4-tetrachlorobenzene
1,2,4,5-tetrachlorobenzene
1,2,3-trichlorobenzene
1,3,5-trichlorobenzene
iodobenzene
2,2,3,3,4,5-hexachlorobiphenylb
2,3,3,4,5,6-hexachlorobiphenyl
2,2,5-trichlorobiphenyl
2,4,4-trichlorobiphenyl
2,4-dichlorobiphenyl
2,6-dichlorobiphenyla
3-chlorobiphenyl
hexachlorophene
dichlorophen
ioxynil
m-fluorobenzoic acid
o-fluorobenzoic acid
4-chlorophenoxyacetic acid
(2,4,5-trichlorophenoxy)acetic acid
(2,4-dichlorophenoxy)acetic acida
2-(2,4,5-trichlorophenoxy)propionic acid
3,6-dichloro-2-methoxybenzoic acid
4-(2,4-dichlorophenoxy)propionic acid
1,2,3,4-tetrachlorodibenzo-p-dioxin
2-chlorodibenzo-p-dioxin
1,2,4-trichlorodibenzo-p-dioxin
2,3-dichlorodibenzo-p-dioxin
2,7-dichlorodibenzo-p-dioxin
1-chlorodibenzo-p-dioxina
octachlorodibenzo-p-dioxinc
griseofulvin
o,p-DDT
0.410
-0.0719
4.67
4.79
0.879
4.05
4.53
4.60
3.08
2.99
-0.146
1.19
0.740
2.26
3.68
0.851
2.64
1.76
6.68
2.04
1.07
2.57
4.73
3.64
3.04
3.02
3.31
2.42
9.93
9.30
9.00
7.86
8.57
8.48
7.10
7.44
7.78
6.44
7.26
8.68
6.27
6.07
5.36
5.29
5.27
6.63
4.63
5.25
1.08
6.85
8.18
5.37
4.38
5.19
4.08
4.55
3.03
8.04
7.83
5.65
5.14
5.60
5.07
4.88
3.71
3.95
3.61
1.97
1.29
2.32
2.97
2.51
3.31
1.69
3.64
8.77
5.86
7.53
7.23
7.83
5.72
1.28
4.60
6.80
calc AqSol
0.328
4.17
4.85
0.986
3.84
4.55
4.58
3.58
2.72
-0.344
1.20
0.745
2.24
1.02
2.18
1.78
6.71
1.61
1.22
2.81
4.56
3.84
3.10
2.98
3.16
2.30
10.1
9.67
8.28
7.90
8.16
8.11
7.20
7.38
7.69
6.59
6.93
8.29
6.21
5.99
5.11
5.46
5.36
6.50
4.57
5.36
10.6
7.14
7.43
5.87
4.79
5.18
3.62
3.76
2.57
8.01
8.06
5.87
6.34
5.40
5.18
5.05
4.07
3.34
3.31
1.56
1.44
2.14
3.34
2.88
3.25
2.47
3.84
8.72
6.03
7.76
7.59
7.40
5.70
4.68
7.05
compound name
expt AqSol
calc AqSol
DDD
trichloroacetic acid
1,2,3,4,5,6-hexachlorocyclohexane
nitromethane
octylamine
nitrobenzene
aniline
benzidine
acridine
amitrole
m-dinitrobenzene
indole
o-phenanthroline
2,2-biquinolinea
13H-dibenzo(a,i)carbazole
picric acidb
p-nitrophenol
styphnic acid
m-nitrophenol
o-nitrophenol
metronidazole
L-tyrosine
salicylamide
o-nitrobenzoic acidb
L-phenylalanine
cinchomeronic acid
isocinchomeronic acid
lutidinic acid
quinolinic acid
L-tryptophan
hippuric acid
acetanilidea
N,N-dimethyl-N-phenylurea
urea
glycine
DL-alanineb
L-alanineb
R-aminobutyric acid
DL-leucine
norleucinea
L-leucine
L-aspartic acid
R-aminoisobutyric acid
L-norleucine
meprobamate
diphenamida
mebendazolea
carboxin
N-(2-methylcyclohexyl)-N-phenylurea
triallate
tetracyclinec
oxytetracyclinec
caffeine
barbitala
metharbital
butabarbital
butethal
thioopental
amobarbital
pentobarbital
thiamylalb
theophylline
alloxantin
vinbarbital
allobarbital
secobarbital
phenobarbital
riboflvain
lenacil
terbacilb
dicyanodiamide
CDEC
picloram
chloramben
diurona
linuron
monuron
fluometuron
neburonb
barbana
cyanazine
indomethacin
bromacil
diallatea
6.76
-1.57
4.60
-0.255
2.81
1.80
0.416
2.63
3.60
-0.522
2.37
1.21
1.82
5.40
7.41
1.26
0.900
1.66
1.01
1.75
1.22
2.57
1.79
1.35
0.804
1.86
2.14
1.83
1.19
1.23
1.68
1.35
1.67
-0.946
-0.493
-0.243
-0.250
-0.305
1.10
1.06
0.750
1.41
-0.210
0.975
1.71
2.98
3.88
3.14
4.11
4.88
3.12
3.14
0.947
1.42
1.98
2.18
1.83
3.36
2.55
2.54
3.68
1.39
2.23
2.43
2.07
2.23
2.29
3.68
4.59
2.48
0.311
3.37
2.75
2.47
3.76
3.52
2.92
3.42
4.76
4.37
3.15
4.62
2.52
4.08
6.65
-1.07
4.96
-0.777
2.94
1.66
0.593
2.28
3.18
0.0822
2.53
1.64
3.49
5.66
7.76
1.43
1.75
1.52
1.68
1.54
1.56
1.84
0.828
1.82
1.76
1.34
1.75
1.51
1.36
1.84
1.89
1.46
2.15
-0.532
-0.462
-0.475
-0.480
-0.189
1.25
2.01
1.26
0.697
-0.423
1.38
1.65
3.39
3.89
3.41
3.56
4.26
1.42
1.35
1.74
1.70
2.64
3.03
2.48
2.33
3.33
0.953
2.01
2.22
2.10
2.57
1.90
3.84
4.25
2.15
-0.0295
3.78
2.93
2.22
3.35
3.61
2.93
3.30
4.48
4.33
3.37
4.33
2.57
4.07
Prediction set compound. b Cross-validation set compound. c Outlier compound. d The calculated values are from the Type 3 model.
PREDICTION
ORGANIC COMPOUNDS
OF
AQUEOUS SOLUBILITY
OF
MITCHELL
coefficient
-0.0364
AND JURS
( 0.0083
label
descriptor
SHDW 3
0.0189
-0.00439
( 0.0014
( 0.00025
GRAV 3
ALLP 3
WTPT 4
2SP3
QNEG
PPSA 1
FPSA 3
0.0391
( 0.0033
WPSA 3
4.11
( 0.52
4.18
( 0.45
-4.60
-3.25
( 0.60
( 0.37
3.21
( 0.43
0.747
( 0.110
of the molecules. The cpsa descriptors encoded a combination of the partial positive and negative surface areas of the
molecules, including both partial atomic charge and solventaccessible surface area information. Two of the hydrogenbonding descriptors (SAAA 3 and CHAA 2) were from an
ADAPT program and included information about hydrogenbonding acceptor atoms, while the third hydrogen-bonding
descriptor (EHBB) was the electrostatic hydrogen bonding
basicity described by Lowrey et al.41 in the TLSER studies.
This Type 1 model had a tset rms error of 0.638 log units
after five compounds were flagged as outliers and removed
from the tset. Two hundred ninety-five compounds were
used in the development of the linear regression model. The
outliers were sucrose, tetrafluoromethane, octachlorodibenzop-dioxin, tetracycline, and oxytetracycline. The main reason
for these compounds being outliers probably was their
structural difference from the remainder of the set of
compounds. In the case of the two halogenated compounds,
the very high degree of halogenation may have contributed
to their outlier status. The external pset of 32 compounds
had an rms error of 0.556 log units. The aqueous solubility
of sucrose was the highest value in the data set.
The descriptors that were chosen based on linear feature
selection were then used as input for a CNN to develop a
Type 2 model. The architecture of the network was nine
input neurons, six hidden neurons, and one output neuron,
giving 67 adjustable parameters. The 295 compounds in the
tset from the linear regression experiments were split into a
265-compound tset and a 30-compound cvset for the neural
network experiments. Training of the network resulted in
tset, cvset, and pset rms errors of 0.460, 0.455, and 0.446
log units, respectively. The correlation coefficient, R,
between the experimental values and the calculated values
was 0.982 for this Type 2 model. The tset rms error
improved by 28% and the pset improved by 20% over the
Type 1 model results.
Nonlinear feature selection was also used to choose a
subset of descriptors to relate molecular structure to aqueous
solubility. The 9:6:1 architecture of the neural network was
retained from the Type 2 experiments. A new set of nine
descriptors, defined in Table 3, was selected as the best
a tset rms error ) 0.394 log units; cvset ) 0.358 log units; pset )
0.343 log units.
nonlinear model. A variety of descriptor types was represented in the model. A shadow projection descriptor (SHDW
3) was again present as well as a gravitation index (GRAV
3) which was another geometry-based descriptor. There were
three topological descriptors (ALLP 3, WTPT 4, and 2SP3)
that contained information about weighted paths and carbon
types, one pure electronic descriptor (QNEG), and three cpsa
descriptors (PPSA 1, FPSA 3, and WPSA 3). The partial
positive surface area was the only descriptor that appeared
in both the Type 1 and Type 3 models. Quite surprisingly,
there were no descriptors that directly encoded hydrogenbonding effects. Seemingly the combination of information
represented by the nine descriptors that were chosen was
enough to describe the physical interactions present in
controlling aqueous solubility.
The Type 3 model that was developed provided the most
accurate tool for the estimation of aqueous solubility. The
model had tset, cvset, and pset rms errors of 0.394, 0.358,
and 0.343 log units. The correlation coefficient, R, between
the experimental values and the calculated values was 0.987
for this Type 3 model. The rms errors represented a tset
improvement of 14%, a cvset improvement of 21%, and a
pset improvement of 23% compared to the previous neural
network results. It also represented a 38% improvement in
both the tset and pset over linear regression results. The
results of the highest quality model that was developed is
shown in Figure 1, a plot of calculated and predicted versus
observed aqueous solubility values.
At attempt was also made to develop linear models using
PCA. Models that were comparable to, but not better than,
the Type 1 model reported here were able to be developed
only by using the top 70 principal components that were
calculated from the entire pool of 210 descriptors that were
generated as the pool from which the model descriptors were
chosen. Many of these descriptors were highly correlated
among themselves or contained very little information, with
less than 10% of the values of a descriptor varying for the
compounds in the data set. The nine-descriptor principal
component model had a tset rms error of 0.600 log units, a
pset rms error of 0.557 log units, and R value of 0.969.
Taking into consideration the fact that the principal components that appeared in the model had a contribution from
all 210 descriptors as opposed to the nine pure descriptors
in the Type 1 model, the Type 1 model had equal predictive
power with many fewer adjustable parameters. Attempts
were also made to improve the results of the linear principal
PREDICTION
OF
AQUEOUS SOLUBILITY
OF
ORGANIC COMPOUNDS
MITCHELL
(38)
(39)
(40)
(41)
(42)
(43)
(44)
(45)
(46)
(47)
(48)
(49)
AND JURS
CI970117F