ABSTRACT
A statistical analysis is reported of 1,200 of the 1,404 nuclear magnetic resonance (NMR)-derived
protein and nucleic acid structures deposited in the Protein Data Bank (PDB) before 1999.
Excluded from this analysis were the entries not yet fully validated by the PDB and the more
than 100 entries that contained <95% of the expected hydrogens. The aim was to assess the
geometry of the hydrogens in the remaining structures and to provide a check on their
nomenclature. Deviations in bond lengths, bond angles, improper dihedral angles, and planarity
with respect to estimated values were checked. More than 100 entries showed anomalous
protonation states for some of their amino acids. Approximately 250,000 (1.7%) atom names
differed from the consensus PDB nomenclature. Most of the inconsistencies are due to swapped
prochiral labeling. Large deviations from the expected geometry exist for a considerable
number of entries, many of which are average structures. The most common causes for these
deviations seem to be poor minimization of average structures and an improper balance between
force-field constraints for experimental and holonomic data. Some specific geometric outliers
are related to the refinement programs used. A number of recommendations for biomolecular
databases, modeling programs, and authors submitting biomolecular structures are given.
Proteins 1999;37:404–416. © 1999 Wiley-Liss, Inc.
Key words: proteins; hydrogens; NMR; nucleic acids; PDB; stereochemistry; validation
INTRODUCTION
Biomolecular structures are at the basis of many studies
in a number of research fields such as drug design and
functional genomics. This research critically depends on
the quality of the coordinates. In the course of our work on
the validation of macromolecular structures1–3 we endeavored to assess the geometrical aspects of hydrogens of all
protein and nucleic acid coordinates that were solved with
nuclear magnetic resonance (NMR) and deposited in the
Protein Data Bank (PDB) before December 24, 1998.4,5
Protons provide the most important information for solution structure determination by NMR spectroscopy. Hence
the correct naming of hydrogen atoms and their local
geometry are of paramount importance for solving, refining, and comparing NMR-derived structures.
We recently analyzed the experimental data and the
coordinates of 97 NMR-solved proteins,1 and the software
used, called AQUA, is available from the world wide web
(WWW).6,7 Unfortunately, experimental data are only available for about one third of all NMR-related PDB files,
which precludes a rigorous validation of all NMR structures against their experimental data. Here, we concentrate on a series of nomenclature and geometrical checks
that are independent of the availability of experimental data.
The calculations were performed by using new routines
implemented in the WHAT IF program,8 which can be used
as WWW servers at http://swift.embl-heidelberg.de/servers2. The data underlying this study can be found at
http://swift.embl-heidelberg.de/service/counting/nmr.
In this study the nomenclature recommended by the
IUPAC by Markley et al.9,10 is used. Many software
packages use proprietary nomenclature rules, and also the
PDB does not adhere to the IUPAC rules. The new
nomenclature replaces the older IUPAC-IUB recommendations11 for peptides and proteins that were never widely
adopted by the protein structure community. This is
probably because no software that uses this nomenclature
was available. The 1998 recommendations detail the names
for hydrogens, whereas the nomenclature for the heavier
atoms has been described earlier.12,13 The nomenclature of
heavy atoms in proteins is routinely checked by programs
such as PROCHECK14 and PROCHECK_NMR.7 WHAT IF
can also check the recently recommended nomenclature of
hydrogens in proteins and nucleic acids.
Thorough studies of the geometrical aspects of heavy
(nonhydrogen) atoms in proteins and nucleic acids are
available in the literature. Engh and Huber15 and Parkinson et al.16 derived ideal bond lengths and bond angles
Abbreviations: CSD, Cambridge Structural Database; IUPAC, International Union of Pure and Applied Chemistry; IUB, International
Union of Biochemistry; NMR, nuclear magnetic resonance; NOE, nuclear Overhauser enhancement; PDB, Protein Data Bank; rms, root mean
square; WWW, world wide web.
Grant sponsor: BIOTECH program of DGXII of the Commission of
the European Union; Grant number: BIO4-CT960189.
*Correspondence to: Robert Kaptein, Bijvoet Center for Biomolecular Research, Utrecht University, Padualaan 8, 3584 CH Utrecht, The
Netherlands. E-mail: kaptein@nmr.chem.uu.nl
Received 22 February 1999; Accepted 28 June 1999
TABLE I
Residue or group | IUPAC name (Roman characters) | PDB name
Amino acid
Terminal amine | H1, H2, H3 | 1H, 2H, 3H
Terminal carboxyl | O', O'', H'' | O, O2, HXT
Alanine (c) | HN, HA, HB1, HB2, HB3 | H, HA, 1HB, 2HB, 3HB
Arginine | HB2, HB3, HG2, HG3, HD2, HD3, HE, HH11, HH12, HH21, HH22 | 1HB, 2HB, 1HG, 2HG, 1HD, 2HD, HE, 1HH1, 2HH1, 1HH2, 2HH2
Asparagine (d) | HB2, HB3, HD21, HD22 | 1HB, 2HB, 2HD2, 1HD2
Aspartate | HB2, HB3, HD2 | 1HB, 2HB, HD2
Cysteine | HB2, HB3, HG | 1HB, 2HB, HG
Glutamine (d) | HB2, HB3, HG2, HG3, HE21, HE22 | 1HB, 2HB, 1HG, 2HG, 2HE2, 1HE2
Glutamate | HB2, HB3, HG2, HG3, HE2 | 1HB, 2HB, 1HG, 2HG, HE2
Glycine | HA2, HA3 | 1HA, 2HA
Histidine | HB2, HB3, HD1, HD2, HE1, HE2 | 1HB, 2HB, HD1, HD2, HE1, HE2
Isoleucine | HB, HG12, HG13, HG21, HG22, HG23, HD11, HD12, HD13 | HB, 1HG1, 2HG1, 1HG2, 2HG2, 3HG2, 1HD1, 2HD1, 3HD1
Leucine | HB2, HB3, HG, HD11, HD12, HD13, HD21, HD22, HD23 | 1HB, 2HB, HG, 1HD1, 2HD1, 3HD1, 1HD2, 2HD2, 3HD2
Lysine | HB2, HB3, HG2, HG3, HD2, HD3, HE2, HE3, HZ1, HZ2, HZ3 | 1HB, 2HB, 1HG, 2HG, 1HD, 2HD, 1HE, 2HE, 1HZ, 2HZ, 3HZ
Methionine | HB2, HB3, HG2, HG3, HE1, HE2, HE3 | 1HB, 2HB, 1HG, 2HG, 1HE, 2HE, 3HE
Phenylalanine | HB2, HB3, HD1, HD2, HE1, HE2, HZ | 1HB, 2HB, HD1, HD2, HE1, HE2, HZ
Proline (d) | H2, H3, HB2, HB3, HG2, HG3, HD2, HD3 | H2, H1, 1HB, 2HB, 1HG, 2HG, 1HD, 2HD
Serine | HB2, HB3, HG | 1HB, 2HB, HG
Threonine | HB, HG1, HG21, HG22, HG23 | HB, HG1, 1HG2, 2HG2, 3HG2
Tryptophan | HB2, HB3, HD1, HE1, HE3, HZ3, HH2, HZ2 | 1HB, 2HB, HD1, HE1, HE3, HZ3, HH2, HZ2
Tyrosine | HB2, HB3, HD1, HD2, HE1, HE2, HH | 1HB, 2HB, HD1, HD2, HE1, HE2, HH
Valine | HB, HG11, HG12, HG13, HG21, HG22, HG23 | HB, 1HG1, 2HG1, 3HG1, 1HG2, 2HG2, 3HG2
Nucleic acid
5'- and 3'-terminal hydroxyl | HO5', HO3' | H5T, H3T
Phosphate (e) | OP1, OP2 | O1P, O2P
β-D-2-(Deoxy)ribose (f) | C1', H1', C2', H2', H2'', O2', HO2', C3', H3', O3', C4', O4', H4', C5', O5', H5', H5'' | C1*, H1*, C2*, 1H2*, 2H2*, O2*, 2HO*, C3*, H3*, O3*, C4*, O4*, H4*, C5*, O5*, 1H5*, 2H5*
Adenine (e) | H2, H61, H62, H8 | H2, 1H6, 2H6, H8
Cytosine (e) | H41, H42, H5, H6 | 1H4, 2H4, H5, H6
Guanosine (e) | H1, H21, H22, H8 | H1, 1H2, 2H2, H8
Thymine | H3, H6, C7, H71, H72, H73 | H3, H6, C5M, 1H5M, 2H5M, 3H5M
Uracil | H3, H5, H6 | H3, H5, H6
hydrogen and deviating heavy atom names are given. The backbone hydrogens of amino acids are only given for alanine. In cases in which
IUPAC names contain Greek or superscripted characters the name was repeated with Roman characters without superscripting.
(a) The stereochemistry is indicated as a reference: the pro-R/pro-S nomenclature27,28 for prochiral tetrahedral groups and the Z/E nomenclature for
planar groups.29
(b) Not present in any of the studied structures.
(c) According to the IUPAC nomenclature, H is a valid atom designator, but HN is generally preferred by the NMR community for clarity.
(d) The consensus PDB atom nomenclature deviates from the IUPAC nomenclature by interchange of the designators for the side-chain amide hydrogens of
Asn and Gln and the sec-amino hydrogens of N-terminal Pro.
(e) The amino hydrogens and the phosphate oxygen atoms in nucleic acids are labeled in the PDB naming scheme without clear stereochemical
preference.
(f) For brevity, only RNA is included. In DNA sugars the 2'-hydroxyl group (O2' and HO2') should be replaced by a hydrogen (H2'').
TABLE II
Selection step | No. of entries
NMR-derived entries in the PDB | 1,404
Layer 1 entries (a) | 79
Entries with <95% of the expected hydrogens (b) | 109
Entries with fewer than five common residues (c) | 12
Inadequately minimized average structures (d) | 4
Entries remaining for analysis | 1,200
The list was compiled by using the 3DB software from the PDB on December 24, 1998.
(a) Layer 1 entries had not been fully validated by the PDB to yield Layer 2 (normal) entries.
(b) Entries with <95% of the expected hydrogens were excluded.
(c) If a model contained fewer than five of the common residues (20 amino acids and 5 nucleic acid residues), the entry was excluded.
(d) Four entries (1BBA, 1COD, 1HDP, and 1NIL) were found to be inadequately minimized average structures as evidenced by an rms Z
score of the heavy-atom bond angles >7.
X-PLOR program (present as another keyword) and presented no significant differences with the other X-PLOR
entries. The five most commonly used refinement programs for NMR structures are X-PLOR,20 DISCOVER
(Molecular Simulations, San Diego, CA), AMBER,19
DIANA,22 and DGII.23 For some entries, the refinement
program was unavailable in the PDBFINDER database
because the PDB headers do not always contain this information, or the program was not scored as one of the keywords,
as is the case for the entries refined with DYANA. These
entries, along with those refined with less common refinement programs, were classified in a separate set.
RESULTS AND DISCUSSION
Test Set of Protein and Nucleic Acid Structures
We tried to be as inclusive as possible in our test set of
protein and nucleic acid structures but had to exclude 204
of the 1,404 entries from further analysis (see Table II).
Excluded were entries that were only partly validated by
the PDB (layer 1 release), entries with too many missing
hydrogens, entries containing inadequately minimized
structures, and those with fewer than 5 normal residues.
Of the remaining 1,200 entries, 820 were multimodel entries
that consisted on average of 19 ± 10 models, with a maximum of
80 models in one ensemble. Each model contains on
average 84 ± 55 amino acids (1,010 entries containing
amino acids) or 21 ± 8 nucleotides (234 nucleic acid-containing entries) and 2 ± 3 residues that differed from
the 20 common amino acids or 5 nucleotides (179 entries).
These nonstandard residues are not discussed here. Each
protein or nucleic acid model contains on average 1,230 ±
847 atoms. The total number of atoms in the data set was
just over 20 million, of which approximately half were
hydrogens. In the following paragraphs the results of the
individual checks are discussed.
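
The selection arithmetic behind the test set can be checked with a few lines of Python. This is only a sketch: the counts are the ones reported in Table II, whereas the category labels are paraphrased from the table footnotes and the function name is hypothetical.

```python
# Counts taken from Table II; the labels are paraphrased, not the authors' wording.
TOTAL_ENTRIES = 1404
EXCLUDED = {
    "layer 1 entries (not fully validated by the PDB)": 79,
    "entries with too many missing hydrogens": 109,
    "entries with fewer than five common residues": 12,
    "inadequately minimized average structures": 4,
}


def remaining_entries(total, excluded):
    """Return the number of entries left after all exclusions are applied."""
    return total - sum(excluded.values())


if __name__ == "__main__":
    print("excluded:", sum(EXCLUDED.values()))
    print("remaining:", remaining_entries(TOTAL_ENTRIES, EXCLUDED))
```

The four exclusion categories add up to the 204 rejected entries quoted in the text, which leaves the 1,200 entries that were analyzed.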
Missing hydrogens
Missing hydrogens present a serious problem, especially
when the local geometry deviates significantly from ex-
Labile hydrogens | Normal residues | Hydrogens lacking | Total residues | Anomalous entries (a)
5 | 41,929 | 1,644 (b) | 43,573 | 38
2 | 38,997 | | 38,997 |
2 | 35,998 | | 35,998 |
0/1 | 46,237/1,333 | | 47,570 |
0/1 | 63,870/1,925 | | 65,795 |
1 | 37,241 | 742 | 37,983 | 22
0/1/2 | 281/16,493/1,530 | | 18,304 |
3 | 69,385 | 3,224 | 72,609 | 46
1 | 51,013 | 79 | 51,092 | 6
1 | 48,174 | | 48,174 |
1 | 11,109 | | 11,109 |
1 | 28,711 | | 28,711 |
Total | 494,226 | 5,689 | 499,915 | 112
Only amino acids that contain a polar side-chain atom that can be protonated under physiological conditions are
listed. The N- and C-terminal groups have not been checked here.
(a) The number of entries containing one or more anomalous protonation states is listed.
(b) Of these 1,644 arginines, 62, 17, and 1,565 had only 1, 3, and 4 of the expected hydrogens, respectively.
(c) Cysteines for which the S is close to a cation or other sulfur atom were not considered to have a missing
hydrogen.
Fig. 1. Correlation between the percentage of residues with unexpected protonation states and the percentage of expected hydrogens
present. All deviating states have fewer hydrogens than expected, which
causes the two quantities to be correlated. The protonation state of
nucleotides has not been studied, and entries containing only nucleotides
are not shown in this figure. Polar atoms, such as Cys S, that are
coordinated by a cation other than H+ are counted as if a hydrogen is
present. Open circles and crosses represent entries that contain
proteins and protein-nucleic acid complexes, respectively. The labeled entry,
1MNB, is described in the text. A large number of entries cluster at the point
where all hydrogens are present and no protonation states deviate.
Nomenclature
Nomenclature differences between IUPAC and PDB consensus
The IUPAC nomenclature was described extensively
elsewhere.9,10 Some hydrogen nomenclature rules have
already been described in the older IUPAC study12 to
which the 1998 recommendations refer (e.g., the numbering of hydrogens within a methyl group). Sample coordinate files containing IUPAC atom designators for the
common amino acids and nucleotides are available from the
WWW address http://swift.embl-heidelberg.de/service/names.
There are two main differences between the IUPAC
nomenclature and the consensus naming scheme that is in
use at the PDB (see Table I). The first difference concerns
the numbering of the methylene hydrogens (in a CH2
group), which starts with 2 in the IUPAC nomenclature
and with 1 in the PDB nomenclature. The second difference is that the IUPAC always uses a suffix number, but
the PDB sometimes uses a prefix. For example, the first
methyl hydrogen on Cγ2 of threonine is called Hγ21 and
1HG2 in the IUPAC and PDB nomenclature, respectively.
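
The two main differences can be illustrated with a small conversion routine. The sketch below is not the WHAT IF implementation; `pdb_to_iupac` and its `methylene` flag are hypothetical names introduced here, and the caller is assumed to know whether the hydrogen belongs to a CH2 group. The routine covers only amino acid hydrogen names written in Roman characters and deliberately ignores the inverted designators of Asn, Gln, and N-terminal Pro as well as the nucleic acid names.

```python
import re


def pdb_to_iupac(name, methylene=False):
    """Convert a PDB-style hydrogen name (e.g. '1HG2') to the IUPAC suffix form.

    methylene=True shifts the index by one, because the IUPAC numbers the
    hydrogens of a CH2 group 2 and 3 where the PDB numbers them 1 and 2.
    """
    match = re.fullmatch(r"(\d)(H\w*)", name)
    if match is None:
        # No numeric prefix: the name is returned unchanged (e.g. 'HA', 'HE').
        return name
    index = int(match.group(1))
    stem = match.group(2)
    if methylene:
        index += 1
    return f"{stem}{index}"


if __name__ == "__main__":
    print(pdb_to_iupac("1HG2"))                 # threonine methyl hydrogen
    print(pdb_to_iupac("1HB", methylene=True))  # methylene hydrogen
```

Passing the methylene information explicitly keeps the sketch simple: a two-hydrogen amino group such as 1HH1/2HH1 of arginine keeps its numbering, so counting the hydrogens on a stem alone would not be enough to decide whether to shift the index.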
There are a few other differences between the IUPAC
and the PDB nomenclature. In the IUPAC nomenclature
the stereochemical numbering of the carboxylic oxygen
atoms depends on the presence of a hydrogen, the oxygen
atom bearing the hydrogen being numbered 2. When there
is no hydrogen, the original convention that the orientation of the group determines the numbering is applied. The
PDB nomenclature uses only this latter rule even if a
hydrogen is present. The IUPAC name for the methyl
carbon in thymine is C7, but this atom is named C5M in
PDB files, and the names of the methyl-bound hydrogens
differ accordingly. The oxygen atoms on the phosphorus
atom of the nucleic acid backbone are named OP1 and OP2
in the IUPAC nomenclature instead of the older names
O1P and O2P that are in use by the PDB. The PDB
nomenclature of the side-chain amide hydrogens of asparagine and glutamine and the sec-amino group of N-terminal
proline is inverted with respect to the IUPAC nomenclature. The amino hydrogens and the phosphate oxygen
atoms (OP1 and OP2) in nucleic acids display no clear
Type | Description
1 | Methylene
2 | Methyl
3 | Side-chain amide (Asn and Gln)
4 | Heavy atoms (Arg) (a)
5 | Hydrogens (Arg) (b)
6 | Amino (nucleic acids)
7 | Amino (Lys)
8 | Amino (N-terminal)
9 | Sec-amino N-terminal (Pro)
10 | Iso-propyl (Leu and Val)
11 | Aromatic δ- and ε-atoms (Phe and Tyr)
12 | Oxygen atoms on phosphate
13 | Oxygen atoms (Asp and Glu)
14 | Oxygen atoms (C-terminal)
Stereochemical (c)
Nonstereochemical (c)
Total inconsistencies
Total checked (d)
Affected atoms | Inconsistently labeled atoms | No. of entries involved
2
3
2
66,994
32,563
7,064
172
214
405
6
2
2
3
3
2
41,086
3,227
26,508
6,675
1,207
425
155
174
78
63
2,242
6,201
35
29,936
144
6,908
72
9,671
637
240,282
6,855
247,137
14,761,635
1,059
264
1,109
1,200
Fourteen
Geometry
Bond lengths
The rms Z scores over all bond lengths reveal interesting differences between types of molecules and refinement
programs. Because precise reference values are known only for
heavy atoms, the hydrogen values should be evaluated with care,
and the focus here is only on identifying outliers. Figure 4 shows the percentage of
highly distorted (>4σ) bond lengths and bond angles
involving hydrogens. The bond angles are discussed below.
The labeled entries, many of which are average structures,
contain various errors. In the worst case, the four average
structures 1SJL, 1SJK, 1AGU, and 1A9I contain thymine
methyl groups that have carbon-hydrogen bond lengths as
small as 0.6 Å. This bad local geometry is likely the result
of inadequate energy minimization after averaging coordinates. Entry 1D83 has the highest percentage of both bond
length and bond angle outliers of all nonaveraged structures. The outliers in this entry are the bond lengths of
C2'-H2' (sugar) and C8-H8 (Gua). In entry 1D69, Thy2
and Thy10 in both chains have three methyl hydrogens that
have exactly the same coordinates as the
pseudoatom M7. Entry 1RCS has a bond length of C3'-H3'
(sugar) of 1.2 Å in all occurrences, and entry 2NR1
contains - and -hydrogens (Met) with 1.2 Å bond lengths
as well. Similarly, entry 1D3X has bond lengths for C2'-H2' (sugar) of Cyt20 of 1.2 Å in all models but not in any of
the other residues. The entry 1NCV has few hydrogen
bond length outliers, but the bond lengths of Phe15 C1-H1
and Tyr28 C2-H2 are 1.4 Å in the last model, in which both
residues are highly distorted. The most common outlier is
found in many X-PLOR entries. X-PLOR uses a bond length
of 1.0 Å for the cysteine S-H bond, whereas 1.3 Å would
be more correct. This difference causes the only outliers for
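
As an illustration of how such bond-length outliers are flagged, the sketch below computes Z scores against a reference value and reports those beyond the 4σ cutoff used in this section. It is a minimal example and not the WHAT IF routine: the function names are hypothetical, and the standard deviation of 0.02 Å is an assumed value chosen only for the example. The reference length of 1.3 Å and the X-PLOR value of 1.0 Å are the cysteine S-H numbers quoted in the text.

```python
import math


def z_score(value, expected, sigma):
    """Deviation of an observed value from its reference, in units of sigma."""
    return (value - expected) / sigma


def rms_z(values, expected, sigma):
    """Root mean square of the Z scores of a set of observations."""
    scores = [z_score(v, expected, sigma) for v in values]
    return math.sqrt(sum(z * z for z in scores) / len(scores))


def outliers(values, expected, sigma, cutoff=4.0):
    """Return the observations whose absolute Z score exceeds the cutoff."""
    return [v for v in values if abs(z_score(v, expected, sigma)) > cutoff]


if __name__ == "__main__":
    # Hypothetical S-H bond lengths in angstrom; only the first is distorted.
    lengths = [1.0, 1.3, 1.31]
    # sigma = 0.02 is an assumption for illustration, not a value from the study.
    print(outliers(lengths, expected=1.3, sigma=0.02))
    print(rms_z(lengths, expected=1.3, sigma=0.02))
```

With these assumed numbers only the 1.0 Å bond is reported, because it lies far more than four standard deviations from the reference length.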
Tetrahedral geometry
Peptide planarity
ence between average structures and the other structures. Figure 8 shows the rms deviations from planarity
for the peptide bond versus the side-chain planarity discussed below. The X-PLOR entries and half of the entries
solved with DIANA are too restricted in the peptide
planarity with respect to the amide hydrogen. More than
90% of the X-PLOR entries have an rms deviation of the
hydrogen planarity of <0.05 Å. Entries solved with
AMBER, DISCOVER, or DGII show higher rms deviations.
In Figure 9, the percentage of peptide hydrogen outliers
(>0.25 Å) is shown. A deviation for a hydrogen of 0.25 Å, at
perfect planarity of the five heavy atoms, corresponds to a
rotation out of the plane by 14°. For the dihedral angle
this rotation is 2.5σ away from the mean, which is expected
for 1% of the residues. There are 188 and 95 entries that
have a percentage of outliers >1 and >5%, respectively.
Relatively many of these entries were solved with AMBER,
DISCOVER, or DGII. Thus, for a considerable number of
NMR-solved proteins the distribution of the amide hydrogen deviation from planarity differs substantially from the
distribution of the angle of the atomic resolution protein
structures.
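
The correspondence between a 0.25 Å deviation and a rotation of about 14° follows from simple trigonometry. In the sketch below the amide nitrogen is assumed to lie in the plane of the heavy atoms, so that the out-of-plane displacement of the hydrogen equals the N-H bond length times the sine of the rotation angle. The bond length of 1.01 Å is an assumed typical value, not a number taken from this study, and the function name is hypothetical.

```python
import math


def out_of_plane_angle_deg(deviation, bond_length=1.01):
    """Angle (degrees) between the N-H bond and the peptide plane.

    deviation: distance of the hydrogen from the plane, in angstrom.
    bond_length: N-H bond length in angstrom (assumed value).
    """
    return math.degrees(math.asin(deviation / bond_length))


if __name__ == "__main__":
    print(round(out_of_plane_angle_deg(0.25), 1))
```

The result is slightly above 14° for any N-H bond length close to 1 Å, in line with the rounded value given in the text.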
Planarity of amino acid side chains
The combined planarity distortions of side chains of the
amino acids Arg, Asn, Asp, Gln, Glu, His, Phe, Trp, and Tyr
are shown in Figure 7b. Hooft et al.18 derived the heavy-atom planarity standard deviations from the CSD with
values ranging from 0.0046 Å (His) to 0.037 Å (Arg). For
aromatic groups, these numbers were derived without the
Fig. 8. Planarity deviations of different refinement programs. Entry 1MHU, which was solved
with X-PLOR, is located at 0.005, 0.72 and is
off-graph to allow a better separation of the other
entries. The classification is as in Figure 3. Entries
whose refinement method was not classified or
was less common and entries without a peptide
bond or amino acid planar group were omitted.
the members of the Validation Project for a fruitful interaction on many facets of the project.
REFERENCES
1. Doreleijers JF, Rullmann JAC, Kaptein R. Quality assessment of NMR structures: a statistical survey. J Mol Biol 1998;281:149–164.
2. Hooft RWW, Vriend G, Sander C, Abola EE. Errors in protein structures. Nature 1996;381:272.
3. Wilson KS, Dauter Z, Lamzin VS, et al. Who checks the checkers? Four validation tools applied to eight atomic resolution structures. J Mol Biol 1998;276:417–436.
4. Sussman JL, Lin D, Jiang J, et al. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr 1998;D54:1078–1084.
5. Bernstein FC, Koetzle TF, Williams GJ, et al. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 1977;112:535–542.
6. Rullmann JAC. AQUA computer program. ftp://ftp-nmr.chem.uu.nl/pub/aqua, 1996.