Bmi 701 12 1 2015 PDF

Bioinformatics for discovery:
Introduction to GWAS and EWAS
BMI 701:Introduction to Biomedical Informatics
12/1/2015
Chirag J Patel
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
Complex traits are a function of genes and environment...
Phenotype = Genome + Environment
P=G+E
Phenotype (P): Type 2 Diabetes, Cancer, Alzheimer’s, Gene expression
Genome (G): Variants
Environment (E): Infectious agents, Nutrients, Pollutants, Drugs

G
We are great at G investigation!
over 2000 Genome-wide Association Studies (GWAS)
https://www.ebi.ac.uk/gwas/
>2,000 traits/diseases
>15,000 SNPs
>16,000 SNP-trait associations
Dissecting G in P:
What is a Genome-wide Association Study?
[Figure: diseased and non-diseased individuals, each genotyped genome-wide at many SNPs, from SNP(A)/SNP(a) through SNP(Z)/SNP(z)]
Hypothesis-free “search engine” for genetic variants
associated with a complex trait or disease
in unrelated populations
The road to GWAS...
A new paradigm of GWAS for discovery of G in P:
Human Genome Project to GWAS
Sequencing of the genome (2001)
Characterize common variation (2001-current day): HapMap project: http://hapmap.ncbi.nlm.nih.gov/
Measurement tools (~2003, ongoing): High-throughput variant assay, < $99 for ~1M variants
Comprehensive, high-throughput analyses: GWAS
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
The Wellcome Trust Case Control Consortium
Nature, Vol 447, 7 June 2007, doi:10.1038/nature05911
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at $P < 5 \times 10^{-7}$: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a ...
Number of raw publications with subject of “GWAS”
[Figure: number of publications per year, 1994 to 2013 (y-axis 0 to 3000); PubMed MeSH terms: human + GWAS]
Number of raw publications with subject of “GWAS”
[Figure: the same plot of publications per year, 1994 to 2013 (PubMed MeSH terms: human + GWAS), annotated with milestones: Risch + Merikangas (linkage vs. association); human genome sequenced; GWAS of age-related macular degeneration; WTCCC; mega-meta-GWAS]
GWAS is relevant today (even with NGS around the corner)

Why execute GWAS?
The Future of Genetic Studies of Complex Human Diseases
Neil Risch and Kathleen Merikangas
Science, 1996
Geneticists have made substantial progress in identifying the genetic basis of many human diseases, at least those with conspicuous determinants. These successes include Huntington's disease, Alzheimer's disease, and some forms of breast cancer. However, the detection of genetic factors for complex diseases, such as schizophrenia, bipolar disorder, and diabetes, has been far more complicated. There have been numerous reports of genes or loci that might underlie these disorders, but few of these findings have been replicated. The modest nature of the gene effects for these disorders likely explains the contradictory and inconclusive claims about their identification. Despite the small effects of such genes, the magnitude of their attributable risk (the proportion of people affected due to them) may be large because they are quite frequent in the population, making them of public health significance.
Has the genetic study of complex disorders reached its limits? The persistent lack of replicability of these reports of linkage between various loci and complex diseases might imply that it has. We argue below that ...
... linkage analysis we have chosen for this argument is a popular current paradigm in which pairs of siblings, both with the disease, are examined for sharing of alleles at multiple sites in the genome defined by genetic markers. The more often the affected siblings share the same allele at a particular site, the more likely the site is close to the disease gene. Using the formulas in (1), we calculate the expected proportion Y of alleles shared by a pair of affected siblings for the best possible case, that is, a closely linked marker locus (recombination fraction θ = 0) that is fully informative (heterozygosity = 1) (2), as
$Y = \frac{1 + w}{2 + w}$, where $w = \frac{pq(\gamma - 1)^2}{(p\gamma + q)^2}$
If there is no linkage of a marker at a particular site to the disease, the siblings would be expected to share alleles 50% of the time; that is, Y would equal 0.5. Values of Y for various values of p and γ are given in the third column of the table. For an allele of moderate frequency (p is 0.1 to 0.5) that ...
A new paradigm is needed for discovery!
How does a GWAS work?
Single nucleotide polymorphisms (SNPs):
How many SNPs are in the human genome?
>3,000,000,000 bases in human genome
SNPs appear about once every ~1000 bases

~3,000,000 SNPs
40-60% have minor allele frequency <5%
GWAS focus on frequency >5%

HapMap Consortium, 2010
Can’t measure everything:
Tag SNPs and Linkage Disequilibrium (LD)
LD = co-occurrence of SNPs in a contiguous region
Bush and Moore, 2012
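One common way to put a number on LD between two SNPs is the squared correlation r² between their alleles. The sketch below is illustrative only: it assumes phased haplotypes coded 0/1 at each SNP (1 = minor allele), and the function name `ld_r2` and the toy haplotypes are hypothetical, not taken from the lecture or from any particular tool.

```python
def ld_r2(haplotypes):
    """Squared allelic correlation (r^2) between two biallelic SNPs.

    haplotypes: list of (a, b) pairs, one per chromosome, where a and b
    are 0/1 indicators of the minor allele at SNP 1 and SNP 2.
    Assumes both SNPs are polymorphic (allele frequency not 0 or 1).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n        # minor allele freq, SNP 1
    p_b = sum(b for _, b in haplotypes) / n        # minor allele freq, SNP 2
    p_ab = sum(a * b for a, b in haplotypes) / n   # freq of both minor alleles together
    d = p_ab - p_a * p_b                           # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Toy data: alleles always travel together (a perfect tag) vs. independently.
perfect_tag = [(1, 1), (1, 1), (0, 0), (0, 0)]
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
r2_tag = ld_r2(perfect_tag)
r2_indep = ld_r2(independent)
```

An r² near 1 means one SNP is a good proxy (tag) for the other; an r² near 0 means genotyping one says little about the other.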

The phenomenon of LD makes GWAS possible:
How and why?: Indirect association
LD blocks
Figure 3. Indirect Association. Genotyped SNPs often lie in a region of high linka... will be statistically associated with disease as a surrogate for the disease SNP throu...
doi:10.1371/journal.pcbi.1002822.g003
Bush and Moore, 2012
a
Can’t measure everything:
Tag SNPs and Linkage Disequilibrium
SNPs are common proxies for other SNPs
2 alleles are correlated together because they are inherited
500K - 1M per chip
[Figure: panels a to d plotting –log10[P] (0 to 4) for SNPs (rs identifiers) across LD blocks at T2DM loci, including SLC30A8, IDE, EXT2 and ALX4, with tag SNPs marked; Sladek et al, 2007]
Digitizing SNPs:
e.g., Illumina Infinium Array
image: www.lifa-core.de/
image: illumina.com
Assessing Thousands of Factors Simultaneously:
Data-driven search for differences in SNP frequencies
GCAGGTACATG...GGTA...
GCAGGTACACG...GGTA...
disease cases GCAGGTACATG...GGTA...
healthy controls GCAGGTACATG...GGTA...
disease cases
healthy controls
~100,000 - ~1,000,000 association tests

Associating One SNP with Disease
Case-Control Study Design
SNP (A/a) --?--> Disease
                          A   a
cases (diseased)
controls (non-diseased)
What is an “Odds Ratio”?
SNP (A/a) --?--> Disease
                          A   a
cases (diseased)          c   d
controls (non-diseased)   x   y
Odds Ratio a vs A: odds of disease with allele a vs. odds of disease with allele A
Chi-squared test
1: equal odds (no difference)
>1: increased odds (increased risk)
<1: decreased odds (decreased risk)

Calculating the Odds Ratio
SNP (A/a) --?--> Disease
                          A   a
cases (diseased)          c   d
controls (non-diseased)   x   y
Chi-squared test
Odds with allele a: $\frac{d/(d+y)}{y/(d+y)} = \frac{d}{y}$
Odds with allele A: $\frac{c/(c+x)}{x/(c+x)} = \frac{c}{x}$
Odds Ratio a vs A: $\mathrm{OR} = \frac{d/y}{c/x} = \frac{d/c}{y/x} = \frac{dx}{cy}$
How would you interpret an OR of 2?
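The odds ratio calculation can be sketched in a few lines. This is a minimal illustration using the slide's cell labels (c, d for cases and x, y for controls, carrying alleles A and a respectively). The counts are made up so that the OR comes out to 2, and the Woolf (log-scale) confidence interval is an addition that the slides do not cover.

```python
import math


def odds_ratio(c, d, x, y):
    """Odds ratio for allele a vs. allele A.

    c, d: allele counts A and a among cases
    x, y: allele counts A and a among controls
    """
    odds_a = d / y          # odds of disease with allele a
    odds_big_a = c / x      # odds of disease with allele A
    return odds_a / odds_big_a   # algebraically equal to (d * x) / (c * y)


def woolf_ci(c, d, x, y, z=1.96):
    """Approximate 95% confidence interval for the OR (Woolf's method)."""
    log_or = math.log(odds_ratio(c, d, x, y))
    se = math.sqrt(1 / c + 1 / d + 1 / x + 1 / y)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical allele counts: cases A=400, a=200; controls A=400, a=100.
c, d, x, y = 400, 200, 400, 100
or_a_vs_A = odds_ratio(c, d, x, y)
ci_low, ci_high = woolf_ci(c, d, x, y)
```

With these counts the odds of disease are twice as high among chromosomes carrying allele a as among those carrying allele A. An interval that excludes 1 suggests the difference is unlikely to be chance alone.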

Cohort Study Design
?
SNP (A/a) Disease
vs.
SNP (A/a) Non-diseased
Cox survival regression
Relative Risk
•Direct measure of risk vs. odds ratio
•Need to wait!
•If incidence is low, N needs to be large!
Models to associate genotypes with disease
Examples for a case-control study
ND=4 NC=4
Disease Non-diseased
Aa AA aa Aa
Aa AA aa Aa
               A   a
diseased       6   2    OR A (vs a)
non-diseased   2   6    OR a (vs A)
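The allelic table above can be rebuilt directly from the eight genotypes on the slide. The sketch below counts alleles, computes both odds ratios, and adds a 1-df chi-squared test. The helper names are hypothetical, and with only eight individuals the chi-squared approximation is unreliable, so the p-value is for illustration only.

```python
import math


def allele_counts(genotypes, allele="A"):
    """Return (count of `allele`, count of the other allele) over all chromosomes."""
    n_allele = sum(g.count(allele) for g in genotypes)
    return n_allele, 2 * len(genotypes) - n_allele


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


diseased = ["Aa", "AA", "Aa", "AA"]        # ND = 4, from the slide
non_diseased = ["aa", "Aa", "aa", "Aa"]    # NC = 4, from the slide

dis_A, dis_a = allele_counts(diseased)         # 6, 2
non_A, non_a = allele_counts(non_diseased)     # 2, 6

or_A_vs_a = (dis_A * non_a) / (dis_a * non_A)  # cross-product form of the OR
or_a_vs_A = 1 / or_A_vs_a

chi2 = chi2_2x2(dis_A, dis_a, non_A, non_a)
p_value = math.erfc(math.sqrt(chi2 / 2))       # upper tail of chi-squared with 1 df
```

The two odds ratios are reciprocals of each other, so the choice of reference allele changes the direction of the estimate but not the test statistic.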
Genotypic Test (“2 or 1 df test”)
ND=4 NC=4
Diseased Non-diseased
Aa AA aa Aa
Aa AA aa Aa
               AA  Aa  aa
diseased        2   2   0    OR AA (vs. Aa)
non-diseased    0   2   2    OR aa (vs. Aa)
Associating One SNP with Quantitative Trait
(e.g., height, weight, cholesterol)
[Figure: two panels plotting trait (height) by genotype. SNP rs1234: GG, GC, CC. SNP rs123456: CC, CT, TT.]
Associating One SNP with Quantitative Trait
Linear Regression and Additive Risk Model
SNP rs123456
y=ɑ+βx+ε
T = risk allele
x=0 if individual is CC
x=1 if individual is CT
x=2 if individual is TT
height = ɑ+βx
β: change in height for 1 risk allele
[Figure: trait (height) by genotype CC (0), CT (1), TT (2), with the fitted line of intercept ɑ and slope β]
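The additive model can be sketched with ordinary least squares on the 0/1/2 genotype coding. The genotypes and heights below are invented for illustration and `fit_additive` is a hypothetical helper. A real analysis would also adjust for covariates and report a standard error and p-value for β.

```python
def fit_additive(genotypes, trait, risk_allele="T"):
    """Least-squares fit of trait = alpha + beta * x,
    where x is the number of risk alleles (0, 1 or 2)."""
    x = [g.count(risk_allele) for g in genotypes]   # CC -> 0, CT -> 1, TT -> 2
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(trait) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, trait))
    beta = sxy / sxx                    # change in trait per risk allele
    alpha = mean_y - beta * mean_x      # expected trait for CC individuals
    return alpha, beta


# Hypothetical data: six individuals, two per genotype.
genotypes = ["CC", "CC", "CT", "CT", "TT", "TT"]
height = [50, 54, 60, 64, 70, 74]
alpha, beta = fit_additive(genotypes, height)
```

Here β is the estimated change in the trait for each additional copy of the T allele, and ɑ is the expected trait value for a CC individual.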
Prototypical “Manhattan plot” to visualize
associations
[Figure 4 | Genome-wide scan for seven diseases. Panels: Bipolar disorder, Coronary artery disease, Crohn’s disease, Hypertension, Rheumatoid arthritis, Type 1 diabetes, Type 2 diabetes. x-axis: Chromosome (1 to 22, X); y-axis: −log10(P), 0 to 15. For each of seven diseases, −log10 of the trend test P value for quality-control-positive SNPs, excluding ... Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10^-5 highlighted in green. All panels are truncated at ...]
~100,000 - ~1,000,000 association tests
WTCCC, Nature, Vol 447, 7 June 2007
Type I Error:
False Positives!
what is a p-value?
probability of attaining a result at least as extreme as the one observed if there is no difference (H0)
Many tests: some can be significant (low p-value) by chance!
100 tests at a significance level of 0.05...

how many would be significant by chance?
Bonferroni “correction”:
Correct the 0.05 significance level by dividing it by the number of tests
e.g., 1M SNPs: $0.05 / (1 \times 10^{6}) = 5 \times 10^{-8}$
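The arithmetic behind both questions is short enough to spell out. This is a sketch with hypothetical function names, and it assumes every null hypothesis is true when computing the expected number of false positives.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests with p < alpha when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


by_chance = expected_false_positives(100)          # 100 tests at 0.05
gwas_threshold = bonferroni_threshold(1_000_000)   # 1M SNPs
```

So about 5 of 100 null tests would look significant at 0.05, and a scan of 1M SNPs needs p below roughly 5 × 10^-8 to be declared genome-wide significant.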

QQplot:
Distribution of observed p-values vs. H0 p-values
[Figure: two histograms of frequency vs. p-value (0.0 to 1.0). Left: runif(10000), p-values under H0 (random uniform distribution). Right: gwas$P.value, p-values of GWAS in Total Cholesterol (Global Lipids Consortium, 2012).]
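A QQ plot pairs each observed p-value with the p-value expected at the same rank under H0, where p-values are uniform on [0, 1]. The sketch below only computes the points to plot. The `(rank - 0.5) / n` plotting position is one common convention chosen here as an assumption, and the p-values are synthetic rather than real GWAS output.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p), smallest p-value first."""
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = (rank - 0.5) / n   # uniform quantile at this rank under H0
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


n = 1000
# Synthetic p-values that follow the null exactly: every point sits on the diagonal.
null_p = [(i - 0.5) / n for i in range(1, n + 1)]
null_points = qq_points(null_p)

# Replace the smallest p-value with one strong "hit": the top point rises off the diagonal.
signal_points = qq_points(null_p[1:] + [1e-12])
```

Points that follow the diagonal are consistent with no association. Departure only in the upper tail suggests true signals, while departure along the whole line suggests a systematic problem such as confounding.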

Which diseases show evidence of association?
Examining the QQplot of test statistics in WTCCC
[Figure 3 | Quantile-quantile plots for seven genome-wide scans. Panels: BD, CAD, CD, HT, RA, T1D, T2D. x-axis: Expected chi-squared value (0 to 20); y-axis: Observed test statistic (0 to 30). For each of the seven disease collections, a quantile-quantile plot of the results of the trend test is shown in black for all SNPs that pass the standard project filters, have a minor allele frequency >1% and missing data rate <1%. SNPs that ... 360,000 SNPs. SNPs at which the test statistic exceeds 30 are represented by triangles. Additional quantile-quantile plots, which also exclude all SNPs located in the regions of association listed in Table 3, are superimposed in blue (for BD, the exclusion of these SNPs has no visible effect on the plot, and ...]
Observational associations do not equal causation...
Confounding bias
What is a confounder?
Ice Cream --?--> Drowning
Summer! (related to both)
A confounder is correlated with both the “risk” factor and the disease, leading to invalid inference.
Confounding is a common source of bias in observational studies (e.g., case-control, cohort, etc.)
Population Stratification:
A source of possible confounding in GWAS
?
SNP Disease
race/ethnicity
Ancestry correlated with allele frequency and disease
GWAS are done on specific populations separately.
(most have been done in populations of European ancestry)
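A small numeric sketch shows how stratification manufactures an association. The allele counts below are invented: within each of two hypothetical populations the SNP has no effect (OR = 1), but the populations differ in both allele frequency and disease frequency, so pooling them yields an OR well above 1.

```python
def odds_ratio(case_a, case_A, ctrl_a, ctrl_A):
    """Odds ratio for allele a vs. A from allele counts (cross-product form)."""
    return (case_a * ctrl_A) / (case_A * ctrl_a)


# Hypothetical population 1: allele a is common (50%) and most sampled people are cases.
pop1 = dict(case_a=400, case_A=400, ctrl_a=100, ctrl_A=100)
# Hypothetical population 2: allele a is rare (10%) and most sampled people are controls.
pop2 = dict(case_a=20, case_A=180, ctrl_a=80, ctrl_A=720)

or_pop1 = odds_ratio(**pop1)
or_pop2 = odds_ratio(**pop2)

# Ignoring ancestry: add the two tables cell by cell, then compute the OR.
pooled = {cell: pop1[cell] + pop2[cell] for cell in pop1}
or_pooled = odds_ratio(**pooled)
```

The pooled odds ratio is about 3.3 even though the allele does nothing in either population, which is why analyses are run within populations or adjusted for ancestry.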

Mediation
SNPs indicative of a mediator factor?
Example: FTO and Type 2 Diabetes
FTO Body Mass
?
FTO Diabetes
Body Mass
Association between FTO and Type 2 Diabetes via BMI?
... or does FTO have an independent role in Type 2 Diabetes...?

PLINK:
(Standard) Whole Genome Analysis Software
•cited >9000 times since 2007
•allele frequency
•linkage disequilibrium (LD)
•data manipulation/filtering
•association: allelic, genotypic models
•chi-square
•logistic
•linear
http://pngu.mgh.harvard.edu/~purcell/plink/
Examples:
GWASs in Type 2 Diabetes

Type 2 Diabetes Mellitus:
A complex, multifactorial disease
•Insulin production vs. use
•beta-cell function
•insulin sensitivity (BMI)
•Insulin moves glucose from blood into cells
•Complications arise due to glucose in blood, hyperglycemia
•Diagnosed by blood glucose levels
body weight, diet, lifestyle, age
family history: 25%
CDC,

A genome-wide association study identifies novel risk loci for type 2 diabetes
Robert Sladek, Ghislain Rocheleau, Johan Rung, Christian Dina, Lishuang Shen, David Serre, Philippe Boutin, Daniel Vincent, Alexandre Belisle, Samy Hadjadj, Beverley Balkau, Barbara Heude, Guillaume Charpentier, Thomas J. Hudson, Alexandre Montpetit, Alexey V. Pshezhetsky, Marc Prentki, Barry I. Posner, David J. Balding, David Meyre, Constantin Polychronakos & Philippe Froguel
Nature, 2/2007
Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of which were hitherto unknown. A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935 single-nucleotide polymorphisms in a French case–control cohort. Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing β-cells, and two linkage disequilibrium blocks that contain genes potentially involved in β-cell development or function (IDE–KIF11–HHEX and EXT2–ALX4). These associations explain a substantial portion of disease risk and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits.

Genome-Wide Association Analysis Identifies Loci for Type 2 Diabetes and Triglyceride Levels
Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University, and Novartis Institutes for BioMedical Research
Published online 26 April 2007; 10.1126/science.1142358
New strategies for prevention and treatment of type 2 diabetes (T2D) require improved insight into disease etiology. We analyzed ...
... applying stringent quality-control filters, high-quality genotypes for 386,731 common SNPs were obtained (4). To extend the set of putative causal alleles tested for association, we developed 284,968 additional multimarker (haplotype) tests based on these SNP genotypes (5, 6). The 671,699 allelic tests capture (correlation coefficient $r^2 \geq 0.8$) 78% of common SNPs in HapMap CEU (3). Each SNP and haplotype test was assessed for association to T2D and each of 18 traits with the software package PLINK ...

Replication of Genome-Wide Association Signals in UK Samples ...
... Here, we describe how integration of data from the WTCCC scan and our own replication ...

A Genome-Wide Association Study of ...
(ref. 2), ENPP1
1467 matched 6
386,731 common
(ref. 3),
. In parallel,controls,
HNF4A (refs
candidate-gene
single-nucleotide
eachDiabetes
forms:with
studies Type 2 Diabetes in Finns Detects
characterized
Illumina
SNPs chosen using
similar
Genetics
polymorphisms
Infinium
information
forInitiative
ameasures
Human1generated
gene-centred
(DGI)
(SNPs) inwhich
BeadArrays,
of design;
glucose
(6) and
1464 harvard.edu/purcell/plink/). For T2D, a weighted
by the assay 109,365
and Human Hap300
the meta-analysis was used to combine results
(8). We ob
versusfor 31.6
Multiple Susceptibility Variants

studies have reported many T2DM-associated loci, with coding variants BeadArrays, which assay 317,503 SNPs chosen to tag haplotype
metabolism, lipids, obesity,
in the nuclear receptor PPARG (P12A) and the potassium channeland
7 blood pressure. With collaborators
Finland–United (FUSION
States and
Investigation
blocks identified by the Phase I HapMap . Of the 409,927 markers
WTCCC/UKT2D),
of 23 NIDDM the population-based and family-based subsam-
P values <
Reveals Risk Loci for Type 2 Diabetes we identified

KCNJ11 (E23K)8 being
and CDKN2B,
replicated. The strongest
andamong
in an
confirmed
known
the very few
intron (oddsofratio
three
that have
IGF2BP2,
was recently mapped to the transcription factor TCF7L2 and has been Laura
locibeen
< 1.7) T2DM
associated
convincingly
andassociation
with
an intronadditional
9
T2D—in
CDKAL1—and
oftypes
a
J. susceptibility
were
Scott,
obtained
noncoding
Genetics (FUSION) (7) has identified several
that passed quality
1
control
Karenreplicated
for an
L.each
region
(Supplementary
Mohlke,
averagefor
variants
near
Lori
associations
of 299.2%
T2D.
CDKN2A
Tables 2 and
L.a Bonnycastle,
(Human1)
3),
nearand 99.4%
ples
geno-
3
linear
(4).
Cristen
For quantitative
J. Willer,regression
or logistic 1
traits,
Yun Li,1 with or without
multivariable
against
with aco-
the
large
1was performed 3(4). Association results w consistent
William L. Duren, Michael
1study,We
R. Erdos, Heather
of 3490,032 M. Stringham, Peter S. Chines,
(Hap300) of markers for subject with reproducibility of
HHEX and in SLC30A8 found by a10–20 recent whole-genome In association
the (both WTCCC study. identified
analysis and variates
Eleftheria Zeggini, * Michael N. Weedon, confirmed
1,2 3,4
* Cecilia M. Lindgren, of*aTimothy1,2 populations
M. Frayling, 3,4
*glucokinase
consistently replicated in multiple . .99.9% platforms). Forty-three subjects were removed from SNPs that
association SNP in an intron of Anne
autosomal U. Jackson,
SNPs
regulatory Ludmila
protein Prokunina-Olsson,
1 16,179 samples yielded 3
in (GCKR) with serum Chia-Jen for Ding,
each 1
Amy haplotype
SNP, J. Swift,3 Narisutest, andNarisu, 3
phenotype are
Katherine S. Elliott, Hana Lango, Nicholas
2 3,4
J. Timpson, 2,5
John R. B. Perry, 3,4
analysis because of evidence of intercontinental admixture (Sup-
also sugges
Subjects and study design
triglycerides. The discovery of associated variants in Tianle
459,448
unsuspected
plementary Hu,
SNPs 1Fig.
Randall
that
genes passed
3) and and Pruim, 4
initial
outside
an additional Rui Xiao,
quality
coding
four because Xiao-Yi
regions Li,
1controltheir genotype- 1
Karen
available N. Conneely, 1
Nancy L. Riebow,
(www.broad.mit.edu/diabetes/).
3
Nigel W. Rayner,1,2 Rachel M. Freathy,3,4The Jeffrey
recentC. Barrett,of2 high-density
availability Beverley Shields, genotyping 4
arrays, which com- (5).Andrew determined
Wetoconsidered G. gender
Sprau, 3
Maurine
disagreed
only the with Tong,
393,453 3
clinical Peggy
records.
autosomal P. InWhite,
total,1T2DMKurtInN.genome-wide
Hetrick,5 Michael W. Barnhart, 5 trols by birt
Andrew P. Morris, Sian Ellard, Christopher
2 4,6 illustrates
J. power
Groves,
the 1 ability
Lorna W.
of genome-wide
Harries, 4 association studies provide potentially important clues to analysis involving hundreds
successful;
Craig W. Bark, minor5 Janet allele L.frequency
Goldstein, 5
Lee Watkins,
ex- of Fang Xiang,1 Jouko Saramies,6
bine the of association studies with the systematic nature of a association was tested for 100,764 (Human1) and 309,1635(Hap300)
SNPs with (MAF)
Jonathan L. Marchini,7 Katharine R. Owen, the
1
Beatrice
pathogenesis Knight, Lon additional
R. Cardon,
4of common diseases. 2
Mark Walker,8loci ceeding Thomas A. in Buchanan, 7
Richard M. Watanabe, 8,9
Timo
of thousands of statistical tests, modest levels
T. Valle,10 Leena Kinnunen,10,11
genomic of co
genome-wide search, led us to undertake a two-stage, genome-wide SNPs representing 392,935 unique loci (Fig. 1). Because unequal
1% both cases and controls and no
Graham A. Hitman,9 Andrew D. Morris,10 (Supplementary Alex S. F. Doney, Fig. 1). The
10
In theWellcome Trust Case we Control
association study to identify T2DM susceptibility male/female ratios in our cases and controls, we analysed the 12,666 Analysi
obtained extreme Gonçalo R. Abecasis,
departure
1
Elizabeth
fromseparately
Hardy-Weinbergfor W. eachPugh,
5
equi- Kimberly bias F. Doheny, imposed 5
Richard
on the N. null
Bergman, 9
distribution can over-
T
first stage of this study, sex-chromosome SNPs gender.
Consortium (WTCCC),† Mark I. McCarthy,1,2 1 ‡§ Andrew T. Hattersley
3,4 allowed us
ype 2 diabetes, obesity, ‡ and cardiovascular toUniversity,
Departments of Human Genetics, 2Medicine and 3Pediatrics, Faculty of Medicine, McGilllibrium
Jaakko(P
purifying Tuomilehto,
< selection,
Montreal
−4
10H3H and
in Canada.
1P3, cases 4or Francis
10,11,12 has been
controls)
McGill
S. (8).
Collins,
made
University and
3
Thispos-
Genome
* Michael whelm
Quebec Innovation
Boehnke 1
* number of true results. We
a small used
variation in
riskH3Afactors
Centre, Montreal are
1A4, Canada. 5
caused
CNRS by a
8090-Institute ofcombination sible
Biology, Pasteur Institute, by
T2D-specific
Lille 59019 genomic
Cedex, advances
data set shows
France. 6
Endocrinology no such
and as
evidence of
Diabetology, the human
sub- 9Ontario
University Hospital, three
Poitiers strategies to search for evidence of sys-
portion, w
86021 Cedex, France. 7INSERM U780-IFR69, Villejuif 94807, France. 8Endocrinology-Diabetology
The molecular mechanisms involved in the Institute
development of typesusceptibility,
2 diabetes areenvironment,
poorly genome Identifying the genetic
Unit, Corbeil-Essonnes
sequence, SNP variants
Hospital, that increase
Corbeil-Essonnes the risk
91100, France. of type 2 diabetes (T2D) in humans has
of genetic be-Research from11and HapMap databases, tematic bias from unrecognized population (8,struc-
Science, 6/2007
for Cancer Research, Toronto M5G 1L7, Canada. 10
Montreal Diabetes stantial confounding
Center, Montreal H2L 4M1, Canada. population
Molecular Nutritionsubstruc-
Unit and the Department of 13) that
understood. Starting from genome-wide genotype data for 1924 diabetic
the Centrecases and 2938 beenH3Ca 3J7,
formidable challenge.
Canada. 12Polypeptide Adopting
Laboratoryaand genome-wide association strategy, we genotyped 1161
havior,
and and
Nutrition, University
Cell Biology, chance.
Montreal H3A 2B2,Whole-genome
of Montreal and
Canada. 13Department of association
Hospitalier de l’Université
and
turegenotyping
de Montréal, Montreal
Epidemiology & Public Health, and genotyping
Finnish
Imperial T2D cases
College, Starrays (3).
biases
and
Mary’s Campus,
Hormone
(8). 14 ture,
Department of Anatomy
the
1174 Finnish normal glucose-tolerant (NGT) controls with >315,000
Norfolk Place, London W2 1PG, UK. Section of analytical approach, and genotyping
equilibrium
population controls generated by the Wellcome Trust
studies CaseImperial
(WGAS) Control Consortium,
offer a new weandset
approach outtoto genedetect We studied
London W12 1464 patients with fromT2D and genotypesartifacts (7, 8).additional
First, we>2examined
To distinguish true million the distribu-
Genomic Medicine, College London W12 0NN, Hammersmith Hospital, Du Cane Road, 0HS, UK. associations those Centre d’E
*These authors contributed equally to this work. single-nucleotide polymorphisms (SNPs) and imputed for an
replicated diabetes association signals through analysis of 3757 additional
discovery unbiased with regard to presumed 1467cases and 5346 controls reflecting fluctuations
controlsSNPs. from under the null or residual (Utah resid
autosomal WeFinland
carried out andassociation
Sweden,analysis each with tion of P-values
these in thegenetic
SNPs to identify population-based
variants sam-
and by integration of our findings with equivalent data from other international consortia. We errors arising from aberrant allele calling, we first
881
functions or locations of causal variants. ©2007 This characterized
Nature PublishingthatGroup
predispose for to18T2D,clinical
compared traits:ouranthropomet-
T2D association results ple, observing a close
with the results of two match to that expected
similar studies,
detected diabetes susceptibility loci in and around the genes CDKAL1, CDKN2A/CDKN2B, and submitted
and putative
genotyped signals
80 SNPs from
in anthe WTCCC
additional study
1215 Finnish T2D cases and 1258 Finnish NGT controls.
approach is based on Fisher’s theory for additive ric measures, glucose tolerance and insulin se- for a null distribution (genomic inflation1Department
IGF2BP2 and confirmed the recently described associations at HHEX/IDE and SLC30A8. Our findings to We
factor
additional
identify quality control,variants
T2D-associated including in andcluster-
an intergenic region of chromosome 11p12, contribute Genetics, Uni
effects at common alleles (1);
provide insight into the genetic architecture of type 2 diabetes, emphasizing the contribution of human heterozy- cretion, lipids and apolipoproteins, blood l = 1.05 for T2D). Second, we calculated
plot to visualization
the identification and of validation
T2D-associatedgenotyping variants on near the GC genes IGF2BP2 and CDKAL1 and the USA. 2Depar
ARTICLES
A genome-wide association study
identifies novel risk loci for type 2 diabetes
Robert Sladek1,2,4, Ghislain Rocheleau1*, Johan Rung4*, Christian Dina5*, Lishuang Shen1, David Serre1,
Philippe Boutin5, Daniel Vincent4, Alexandre Belisle4, Samy Hadjadj6, Beverley Balkau7, Barbara Heude7,
Guillaume Charpentier8, Thomas J. Hudson4,9, Alexandre Montpetit4, Alexey V. Pshezhetsky10, Marc Prentki10,11,
Barry I. Posner2,12, David J. Balding13, David Meyre5, Constantin Polychronakos1,3 & Philippe Froguel5,14
Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of
which were hitherto unknown. A systematic search for these variants was recently made possible by the development of
high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935
single-nucleotide polymorphisms in a French case–control cohort. Markers with the most significant difference in genotype
frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified
four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2
gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in
insulin-producing β-cells, and two linkage disequilibrium blocks that contain genes potentially involved in β-cell
development or function (IDE–KIF11–HHEX and EXT2–ALX4). These associations explain a substantial portion of disease risk
and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits.
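The abstract reports 392,935 SNPs tested in stage 1, so a nominal P < 0.05 is not meaningful on its own. A minimal sketch of the arithmetic, assuming a Bonferroni correction (one common convention, not necessarily the rule the authors applied; the function names are illustrative):

```python
# Multiple-testing arithmetic for a scan of 392,935 SNPs.
# Assumption: Bonferroni correction at a family-wise alpha of 0.05.

N_SNPS = 392_935   # number of SNPs tested in stage 1 (from the abstract)
ALPHA = 0.05       # conventional family-wise error rate


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of null SNPs passing an uncorrected cutoff of alpha."""
    return alpha * n_tests


threshold = bonferroni_threshold(ALPHA, N_SNPS)
false_hits = expected_false_positives(ALPHA, N_SNPS)
print(f"Bonferroni cutoff: {threshold:.3g}")
print(f"Expected false positives at P < 0.05: {false_hits:.0f}")
```

With no correction, roughly twenty thousand SNPs would be expected to pass P < 0.05 by chance alone, which is why a two-stage design with replication is used.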
How many SNPs (p-value?)
European-based; N ~ 1000
cases: high fasting blood glucose/non-obese
controls: non-obese
Sladek, 2007

The rapidly increasing prevalence of type 2 diabetes mellitus (T2DM) is thought to be due to environmental factors, such as increased availability of food and decreased opportunity and motivation for physical activity, acting on genetically susceptible individuals. The heritability of T2DM is one of the best established among common diseases and, consequently, genetic risk factors for T2DM have been the subject of intense research (ref. 1). Although the genetic causes of many monogenic forms of diabetes (maturity onset diabetes in the young, neonatal mitochondrial and other syndromic types of diabetes mellitus) have been elucidated, few variants leading to common T2DM have been clearly identified and individually confer only a small risk (odds ratio ≈ 1.1–1.25) of developing T2DM (ref. 1). Linkage studies have reported many T2DM-linked chromosomal regions and have identified putative, causative genetic variants in CAPN10 (ref. 2), ENPP1 (ref. 3), HNF4A (refs ...

genotypes for 392,935 single-nucleotide polymorphisms (SNPs) in 1,363 T2DM cases and controls (Supplementary Table 1). In order to enrich for risk alleles (ref. 21), the diabetic subjects studied in stage 1 were selected to have at least one affected first degree relative and age at onset under 45 yr (excluding patients with maturity onset diabetes in the young). Furthermore, in order to decrease phenotypic heterogeneity and to enrich for variants determining insulin resistance and β-cell dysfunction through mechanisms other than severe obesity, we initially studied diabetic patients with a body mass index (BMI) <30 kg m−2. Control subjects were selected to have fasting blood glucose <5.7 mmol l−1 in DESIR, a large prospective cohort for the study of insulin resistance in French subjects (ref. 22). Genotypes for each study subject were obtained using two platforms ...
Human Hap300 chip, showing no T2DM association in stage 1 (P > 0.01) and separated by at least 100 kb. Using the first principal component as a covariate for ancestry differences between cases and controls, we tested for association between rs932206 and disease status. Our result suggests that this apparent association is largely

BMI on the association between marker and disease, as it is asymptotically equivalent to the Armitage trend test used to detect association in stages 1 and 2. None of the associations (Supplementary Table 7) was substantially changed by considering the effects of these covariates.
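The Armitage trend test mentioned above asks whether the proportion of cases rises steadily with the number of copies of an allele. A minimal textbook sketch, assuming additive weights (0, 1, 2) and made-up genotype counts; it is not the MAX statistic the paper reports:

```python
import math


def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on a 2 x 3 genotype table.

    cases / controls: counts per genotype (0, 1, 2 copies of the allele).
    Returns (chi-square statistic with 1 df, two-sided p-value).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls

    sum_wr = sum(w * r for w, r in zip(weights, cases))        # weighted case count
    sum_wn = sum(w * t for w, t in zip(weights, totals))       # weighted total count
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))  # squared-weight total

    numerator = n * (n * sum_wr - n_cases * sum_wn) ** 2
    denominator = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    chi2 = numerator / denominator

    # Upper tail of a chi-square with 1 df, written with the error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: the allele is enriched in cases.
chi2, p = armitage_trend_test(cases=[100, 200, 150], controls=[200, 200, 50])
print(f"chi2 = {chi2:.2f}, p = {p:.3g}")
```

The additive weights encode the assumption that each extra copy of the allele shifts risk by the same amount on the test's scale; other weights correspond to dominant or recessive models.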
[Figure 1 image: stage 1 association results, one panel per chromosome (1 to 22 and X).]
Figure 1 | Graphical summary of stage 1 association results. T2DM association was determined for SNPs on the Human1 and Hap300 chips. The x axis represents the chromosome position from pter; the y axis shows −log10[pMAX], the P-value obtained by the MAX statistic, for each SNP (note the different scale on the y axis of the chromosome 10 plot). SNPs that passed the cutoff for a fast-tracked second stage are highlighted in red.
©2007 Nature Publishing Group
Sladek, 2007
NATURE | Vol 445 | 22 February 2007 ARTICLES
Table 1 | Confirmed association results

SNP | Chromosome | Position (nucleotides) | Risk allele | Major allele | MAF (case) | MAF (ctrl) | Odds ratio (het) | Odds ratio (hom) | PAR | λs | Stage 2 pMAX | Stage 2 pMAX (perm) | Stage 1 pMAX | Stage 1 pMAX (perm) | Nearest gene
rs7903146 | 10 | 114,748,339 | T | C | 0.406 | 0.293 | 1.65 ± 0.19 | 2.77 ± 0.50 | 0.28 | 1.0546 | 1.5 × 10^−34 | <1.0 × 10^−7 | 3.2 × 10^−17 | <3.3 × 10^−10 | TCF7L2
rs13266634 | 8 | 118,253,964 | C | C | 0.254 | 0.301 | 1.18 ± 0.25 | 1.53 ± 0.31 | 0.24 | 1.0089 | 6.1 × 10^−8 | 5.0 × 10^−7 | 2.1 × 10^−5 | 1.8 × 10^−5 | SLC30A8
rs1111875 | 10 | 94,452,862 | G | G | 0.358 | 0.402 | 1.19 ± 0.19 | 1.44 ± 0.24 | 0.19 | 1.0069 | 3.0 × 10^−6 | 7.4 × 10^−6 | 9.1 × 10^−6 | 7.3 × 10^−6 | HHEX
rs7923837 | 10 | 94,471,897 | G | G | 0.335 | 0.377 | 1.22 ± 0.21 | 1.45 ± 0.25 | 0.20 | 1.0065 | 7.5 × 10^−6 | 2.2 × 10^−5 | 3.4 × 10^−6 | 2.5 × 10^−6 | HHEX
rs7480010 | 11 | 42,203,294 | G | A | 0.336 | 0.301 | 1.14 ± 0.13 | 1.40 ± 0.25 | 0.08 | 1.0041 | 1.1 × 10^−4 | 2.9 × 10^−4 | 1.5 × 10^−5 | 1.2 × 10^−5 | LOC387761
rs3740878 | 11 | 44,214,378 | A | A | 0.240 | 0.272 | 1.26 ± 0.29 | 1.46 ± 0.33 | 0.24 | 1.0046 | 1.2 × 10^−4 | 2.8 × 10^−4 | 1.8 × 10^−5 | 1.3 × 10^−5 | EXT2
rs11037909 | 11 | 44,212,190 | T | T | 0.240 | 0.271 | 1.27 ± 0.30 | 1.47 ± 0.33 | 0.25 | 1.0045 | 1.8 × 10^−4 | 4.5 × 10^−4 | 1.8 × 10^−5 | 1.3 × 10^−5 | EXT2
rs1113132 | 11 | 44,209,979 | C | C | 0.237 | 0.267 | 1.15 ± 0.27 | 1.36 ± 0.31 | 0.19 | 1.0044 | 3.3 × 10^−4 | 8.1 × 10^−4 | 3.7 × 10^−5 | 2.9 × 10^−5 | EXT2

Significant T2DM associations were confirmed for eight SNPs in five loci. Allele frequencies, odds ratios (with 95% confidence intervals) and PAR were calculated using only the stage 2 data. Allele frequencies in the controls were very close to those reported for the CEU set (European subjects genotyped in the HapMap project). Induced sibling recurrent risk ratios (λs) were estimated using stage 2 genotype counts for the control subjects and assuming a T2DM prevalence of 7% in the French population. hom, homozygous; het, heterozygous; major allele, the allele with the higher frequency in controls; pMAX, P-value of the MAX statistic from the χ² distribution; pMAX (perm), P-value of the MAX statistic from the permutation-derived empirical distribution (pMAX and pMAX (perm) are adjusted for variance inflation); risk allele, the allele with higher frequency in cases compared with controls.
Sladek, 2007
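One way to get a feel for the odds ratios in Table 1 is to recompute a rough allelic odds ratio from the case and control allele frequencies. This is only a sanity check under the assumption that allele odds can be compared directly; the table's own values are genotype-specific (heterozygous and homozygous), so the numbers are not expected to match exactly. The function name is illustrative:

```python
def allelic_odds_ratio(risk_freq_cases: float, risk_freq_controls: float) -> float:
    """Odds of carrying the risk allele in cases divided by the odds in controls."""
    odds_cases = risk_freq_cases / (1.0 - risk_freq_cases)
    odds_controls = risk_freq_controls / (1.0 - risk_freq_controls)
    return odds_cases / odds_controls


# rs7903146 (TCF7L2): the risk allele T is the minor allele,
# so the MAF columns give its frequency directly.
tcf7l2_or = allelic_odds_ratio(0.406, 0.293)

# rs13266634 (SLC30A8): the risk allele C is the major allele,
# so its frequency is 1 - MAF in each group.
slc30a8_or = allelic_odds_ratio(1 - 0.254, 1 - 0.301)

print(f"TCF7L2 allelic OR:  {tcf7l2_or:.2f}")
print(f"SLC30A8 allelic OR: {slc30a8_or:.2f}")
```

An odds ratio above 1 means the allele is more common in cases than in controls; how far above 1 is the effect size, while the p-value only says how unlikely such a difference is under no association.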
Confirmed 8 SNPs with N ~ 1000
How would you interpret the p-values? Odds ratios?

Identification of four novel T2DM loci
Our fast-track stage 2 genotyping confirmed the reported association for rs7903146 (TCF7L2) on chromosome 10, and in addition identified significant associations for seven SNPs representing four new T2DM loci (Table 1). In all cases, the strongest association for the MAX statistic (see Methods) was obtained with the additive model.

The most significant of these corresponds to rs13266634, a non-synonymous SNP (R325W) in SLC30A8, located in a 33-kb linkage disequilibrium block on chromosome 8, containing only the 3′ end of this gene (Fig. 2a). SLC30A8 encodes a zinc transporter expressed solely in the secretory vesicles of β-cells and is thus implicated in the final stages of insulin biosynthesis, which involve co-crystallization

[Figure 2 image, panels a and b: −log10[P] across the SLC30A8 region (a) and the IDE, KIF11 and HHEX region (b).]
Scaling up discovery by combining populations:
meta-analyses
Meta-analysis of SNP rs10830963:
Combining findings from multiple cohorts

[The left-hand column of the page is cropped on the slide; the surviving line ends mention G6PC2, GCK and SNPs around MTNR1B.]

data from the WTCCC, DGI and FUSION scans) (ref. 10) (Supplementary Note). We found strong evidence that the minor G allele of rs10830963 was associated with increased risk of T2D (odds ratio = 1.09 (1.05–1.12), P = 3.3 × 10−7; Fig. 2 and Supplementary Table 6 online). The possibility that the fasting glucose association might

Study ID                          OR (95% CI)          Weight (%)
DGI                               1.12 (0.96, 1.30)      4.61
FUSION                            1.20 (1.03, 1.39)      4.89
WTCCC                             1.07 (0.95, 1.20)      8.03
deCODE                            1.14 (1.03, 1.27)      9.58
KORA                              1.00 (0.84, 1.19)      3.53
Rotterdam                         1.17 (1.04, 1.30)      8.75
CCC                               1.07 (0.88, 1.31)      2.69
ADDITION/ELY                      1.16 (1.02, 1.33)      6.04
Norfolk                           1.00 (0.90, 1.10)     10.56
UKT2DGC                           1.03 (0.96, 1.10)     23.18
OxGN/58BC                         0.91 (0.75, 1.10)      2.85
FUSION Stage 2                    1.15 (1.02, 1.30)      7.41
METSIM                            1.16 (1.03, 1.30)      7.90
Overall (I² = 26.6%, P = 0.176)   1.09 (1.05, 1.12)    100.00
Meta-analysis P value = 3.3 × 10−7
(x axis, odds ratio: 0.722, 1, 1.39)

Figure 2 Association of rs10830963 with type 2 diabetes (T2D) in 13 case-control studies.
Prokopenko, 2009
VOLUME 41 | NUMBER 1 | JANUARY 2009 NATURE GENETICS
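The pooled estimate in the forest plot can be approximated from the per-study odds ratios and confidence intervals. A minimal sketch, assuming a fixed-effect inverse-variance model and standard errors backed out of the published 95% intervals (the original analysis may differ in detail, and the inputs here are rounded to two decimals):

```python
import math

# (study, odds ratio, lower 95% limit, upper 95% limit) as printed in Figure 2.
STUDIES = [
    ("DGI", 1.12, 0.96, 1.30), ("FUSION", 1.20, 1.03, 1.39),
    ("WTCCC", 1.07, 0.95, 1.20), ("deCODE", 1.14, 1.03, 1.27),
    ("KORA", 1.00, 0.84, 1.19), ("Rotterdam", 1.17, 1.04, 1.30),
    ("CCC", 1.07, 0.88, 1.31), ("ADDITION/ELY", 1.16, 1.02, 1.33),
    ("Norfolk", 1.00, 0.90, 1.10), ("UKT2DGC", 1.03, 0.96, 1.10),
    ("OxGN/58BC", 0.91, 0.75, 1.10), ("FUSION Stage 2", 1.15, 1.02, 1.30),
    ("METSIM", 1.16, 1.03, 1.30),
]


def fixed_effect_meta(studies):
    """Inverse-variance pooling on the log odds ratio scale.

    Returns (pooled OR, lower 95% limit, upper 95% limit, two-sided p-value).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for _name, odds_ratio, lower, upper in studies:
        beta = math.log(odds_ratio)
        # A 95% interval spans 2 * 1.96 standard errors on the log scale.
        se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
        weight = 1.0 / se ** 2
        weighted_sum += weight * beta
        total_weight += weight
    pooled_beta = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return (math.exp(pooled_beta),
            math.exp(pooled_beta - 1.96 * pooled_se),
            math.exp(pooled_beta + 1.96 * pooled_se),
            p_value)


pooled_or, ci_low, ci_high, p = fixed_effect_meta(STUDIES)
print(f"pooled OR = {pooled_or:.2f} ({ci_low:.2f}, {ci_high:.2f}), p = {p:.2g}")
```

Only four of the thirteen intervals clearly exclude an odds ratio of 1 with much room to spare, yet the pooled interval does; that gain in precision is the point of combining cohorts.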

ARTICLES
Meta-analyses for T2D:
N>40K and 90K identifies >30 loci among 2,400,000 SNPs
Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis
By combining genome-wide association data from 8,130 individuals with type 2 diabetes (T2D) and 38,987 controls of
European descent and following up previously unidentified meta-analysis signals in a further 34,412 cases and 59,925 controls,
we identified 12 new T2D association signals with combined P < 5 × 10−8. These include a second independent signal at the
KCNQ1 locus; the first report, to our knowledge, of an X-chromosomal association (near DUSP9); and a further instance of
overlap between loci implicated in monogenic and multifactorial forms of diabetes (at HNF1A). The identified loci affect both
beta-cell function and insulin action, and, overall, T2D association signals show evidence of enrichment for genes involved in
cell cycle regulation. We also show that a high proportion of T2D susceptibility loci harbor independent association signals
influencing apparently unrelated complex traits.
Voight, 2010

Type 2 diabetes (T2D) is characterized by insulin resistance and deficient beta-cell function (ref. 1). The escalating prevalence of T2D and the limitations of currently available preventative and therapeutic options highlight the need for a more complete understanding of T2D pathogenesis. To date, approximately 25 genome-wide significant common variant associations with T2D have been described, mostly through genome-wide association (GWA) analyses (refs 2–13). The identities of the variants and genes mediating the susceptibility effects at most of these signals have yet to be established, and the known variants account for less than 10% of the overall estimated genetic contribution to T2D predisposition. Although some of the unexplained heritability will reflect variants poorly captured by existing GWA platforms, we reasoned that an expanded meta-analysis of existing GWA data would

the inverse-variance method (Online Methods, Fig. 1, Supplementary Tables 1 and 2 and Supplementary Note). We observed only modest genomic control inflation (λgc = 1.07), suggesting that the observed results were not due to population stratification. After removing SNPs within established T2D loci (Supplementary Table 3), the resulting quantile-quantile plot was consistent with a modest excess of disease associations of relatively small effect (Supplementary Note). Weak evidence for association at HLA variants strongly associated with autoimmune forms of diabetes (Supplementary Table 3 and Supplementary Note) suggested some case admixture involving subjects with type 1 diabetes or latent autoimmune diabetes of adulthood; however, failure to detect T2D associations at other non-HLA type 1 diabetes susceptibility loci (for example, INS, PTPN22 and
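The genomic control inflation factor quoted above (λgc = 1.07) compares the median association statistic across all SNPs with its expectation under the null. A minimal sketch, assuming 1-df chi-square statistics recovered from two-sided p-values; the helper names and the toy p-values are illustrative, not taken from the paper:

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def chi2_from_p(p: float) -> float:
    """1-df chi-square statistic for a two-sided p-value with 0 < p <= 1."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; the sign is dropped by squaring
    return z * z


def genomic_inflation(p_values) -> float:
    """Lambda: observed median chi-square over the expected median under the null."""
    return median(chi2_from_p(p) for p in p_values) / CHI2_1DF_MEDIAN


# Toy example: p-values spread evenly over (0, 1) behave like a clean null scan.
null_like = [(i + 0.5) / 1001 for i in range(1001)]
# Shifting every p-value downwards mimics stratification or other systematic bias.
inflated = [p ** 1.2 for p in null_like]

print(f"lambda (null-like): {genomic_inflation(null_like):.3f}")
print(f"lambda (inflated):  {genomic_inflation(inflated):.3f}")
```

Values close to 1 suggest the bulk of the test statistics follow the null distribution; values well above 1 point to population structure, relatedness or technical artefacts rather than thousands of true associations.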
Meta-analyses for T2D:
N>40K and 90K identifies >30 loci among 2,400,000 SNPs
ARTICLES
[Figure 1 image: Manhattan plots of −log10(P) by chromosome (1 to 22, X). Top panel: unconditional analysis. Bottom panel: conditional analysis.
Legend: locus established previously; locus identified by current study; locus not confirmed by current study; suggestive statistical association (P < 1 × 10−5); association in identified or established region (P < 1 × 10−4).
Labelled loci: TCF7L2, HHEX/IDE, KCNQ1 (2 signals), CDC123/CAMK1D, CHCHD9, KCNJ11, CDKAL1, CDKN2A/2B, CENTD2, MTNR1B, SLC30A8, ADAMTS9, IGF2BP2, HMGA2, ZFAND6, TP53INP1, BCL11A, PPARG, TSPAN8/LGR5, PRC1, WFS1, JAZF1, IRS1, ZBED3, HNF1A, FTO, THADA, HNF1B, DUSP9, KLF14, NOTCH2.]
Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-
analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those
taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and
should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously
established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered
conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10−5), whereas
secondary signals close to already confirmed T2D loci are shown in purple (P < 10−4).
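The height of each point in a Manhattan plot is −log10(P), and the thresholds quoted in the abstract and caption translate into horizontal cutoffs on that axis. A minimal sketch, assuming the P < 5 × 10−8 genome-wide cutoff and the P < 10−5 suggestive cutoff mentioned in the text; the labels and function names are illustrative:

```python
import math

GENOME_WIDE = 5e-8   # combined P threshold quoted in the abstract
SUGGESTIVE = 1e-5    # "suggestive" level used in the figure caption


def neg_log10(p: float) -> float:
    """Height of a SNP on the Manhattan plot y axis."""
    return -math.log10(p)


def classify(p: float) -> str:
    """Label a p-value against the two thresholds."""
    if p < GENOME_WIDE:
        return "genome-wide significant"
    if p < SUGGESTIVE:
        return "suggestive"
    return "not significant"


for p in (1.4e-22, 2.8e-8, 3e-6, 0.01):
    print(f"P = {p:.1e}  ->  y = {neg_log10(p):5.2f}  ({classify(p)})")
```

On this scale the genome-wide line sits at about 7.3, so a peak at 20 or 50 on the y axis is many orders of magnitude beyond the cutoff.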
13 autosomal loci exceeded the threshold for genome-wide significance (P ranging from 2.8 × 10−8 to 1.4 × 10−22) with allele-specific odds

(r² < 0.05), and conditional analyses (see below) establish these SNPs as independent (Fig. 2 and Supplementary Table 4). Further analysis
.609 SLC30A8 Region CDKN2A/B Region
100 10 rs3802177 stage 1 rs3802177 100 rs10965250 stage 1 rs10965250 100
● r^2: 0.8 − 1.0 10 ● r^2: 0.8 − 1.0 ●
●
● r^2: 0.6 − 0.8 ● r^2: 0.6 − 0.8
recombination rate (cM/Mb)

80 8 ● r^2: 0.4 − 0.6 ● 80 ● r^2: 0.4 − 0.6 ● 80
● r^2: 0.2 − 0.4
●
● 8 ● r^2: 0.2 − 0.4
− log10(P−value)
− log10(P−value)
● r^2: 0.0 − 0.2 ● r^2: 0.0 − 0.2 ●
60 6 ● r^2 missing 60 ● r^2 missing 60
6
●●
●
●●●
●
40 4 ●
● 40 ●
●
●
● ● 40
●
●
●
●●
●
●
4 ● ●●
●●●●
●
● ●
● ● ● ●
● ●
●●● ● ● ●●
●
● ● ● ● ● ●●
●
● ● ●
● ● ●●
● ● ● ● ● ● ●
●● ●●● ● ● ●
● ●●● ●●
● ●● ● ●● ● ●● ●● ● ●
20 2 20 20
●
● ●● ●●
●● ●●
● ●
● ● ● ●● ●●●
●
●●
●
●
●● ●
●
●●
● ● ●●
●●
● ●● ● ●
2
● ●
● ●● ●
● ● ●●●
● ●● ●● ●●
● ●
● ● ● ●●
●
● ●
●
●●
● ● ● ● ● ● ● ●● ●●
●● ●
●● ●
●
●●●● ●●
● ● ●● ●●● ● ●● ● ● ● ●●●● ● ● ● ● ● ●● ● ●
● ● ●●
●● ●●
●
● ● ●●●●●● ●
●●
● ●●
●
●●● ●
● ●● ●
●●
●●
●●
● ●
● ●● ●
●●
●●
●
● ● ● ●●
●● ●
●
●
●●●
●
● ●●
● ● ●●●
●
●●● ●●●
●
●
●
● ●● ●●●
●
● ●
●●
● ●●●
●
●●● ●
●●
●
●
●
●
●●●
●
●
● ●● ●
●
●
●
●
●
●●
●
● ● ●
●● ● ● ●
●
● ● ●●
●
●● ●
● ●● ●●●
● ●
●
● ● ● ● ●●● ● ●
●
●●
●
●
●●● ●
● ● ●
●
●●● ●
●
● ● ●
●●
●
●
●●●
●
●●●●
●
● ●
●●
●
●
●
●
●
● ● ● ●
●●● ● ●●
●
●●● ● ● ● ●
●● ● ●● ● ●●●
●
●
●
●
● ● ●● ●●
●
● ●
●
●●
●
●
● ●● ● ● ●
●●
●
●●●●
●
● ●●
●
●
●● ●● ●● ● ● ●●●● ●● ●●●●●
●
● ●●● ● ● ● ● ●●
●
● ● ● ●●
●
● ●
●●
●
●
●
●●● ● ●
● ●
●
● ● ●●● ●
● ●● ● ● ●●
● ● ●●●●●● ●
●●
●●
●●● ●
●
●
● ●
● ● ●
●
● ●●
●
●
●● ●●
●
●●
●●
●
● ● ●●●● ●
●
●
● ●
● ● ●
●
●●●●
●●
●
●●● ●
●●
● ●●●
●●● ●
● ●●
●
●
● ● ● ●●●
● ●●
●
●●●
● ● ● ● ●●
●●
● ●● ●
●
●●
● ●● ●
●
● ●
●● ●● ●
●
● ●
●
●●
●
●●●
●
● ● ●
● ●
●●
●●
●
●● ●
●● ●●
●● ●●●
●
● ●
●
●●
●●
●
●●●
●
●
●
●●●
● ●●●
● ●
●●●● ●●●●
●● ● ●●● ●
●●●●
●
●●
●● ●●
●●
●●●● ● ●
●●
●
●
●
●●
●●●
●●
●
●
● ●
●
● ●
●●
●●
● ●●● ● ●●●
● ●● ● ●●●● ●●●●●● ●
●
●●
●
●
●
●●
●●
●
●●●
●●● ●●●● ●●
● ●
●●
● ●
● ●● ● ●● ●●●
●
●
●● ●
●●
● ●
●●
●●● ● ● ● ●●
●● ●
●
●●
●●
●
● ●
●●
●
●●●●● ●●
● ●●
●
●
●
●
●
●
●
●● ●●●
●
●●●
●
● ●
●
●
●●
●
●●
●
●
●
●
●
●●
●●
●●
●●
●● ● ●
●
●
●
●●
●
●
●
●
● ● ●
●●●
●
● ●●●
●●
●
●●●
●
● ●
●
●●●●● ●
●●
● ● ●●
●●
●●
●
● ●●●● ● ●
●●
●●
●●
●●●
●●●●
●
●
●
●
●
●
●●
●●
●
●●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●●
●
●
●
● ●●●●
●●●●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●●●
●
●
● ●
●●●
●
●
●
●
●●●
●
●
●
●
●●●
●●
●
●
●
●
●
● ●
●
●●●●
●
●●●
●●
●●
● ●
●
●
●
●
●●
●●●
●●
●●●
●●
●
●●
●
●
●●
●●●
●
●●●
●
●
●●
●
● ●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●●
●
●
●● ● ●●●
● ●
●●
●●●● ●
●●●
●●
●
●●●
● ●●●●
●●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●●●
●●
●
●●●
●
●●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●●
●●●
● ●
●
●
●
● ●●
●
● ●
●
●
●
●
●●
●●●●●● ●
●●
●
●
●
●●●
●
● ●
●
●●●●
●●
●
●
●●
●●
●
●● ●● ●
●
● ●●●
●●●●
●●●
●
●
●
●
●
●
●
●
●
●●
● ●●
●●
● ● ●
●●●
● ●
●
●●
●
●
●●●
●●
● ●●●
● ●
●
●●
●
●
●●●●
●● ●
●
●●●
●
●
●
●
●●
●
●●
● ●●●●●● ●●●
●●
●
●●●
●
●
●●
● ●●● ●● ● ● ●●
●● ●
●●
● ● ●● ●●●
● ●●
● ● ● ●
●
● ●●● ● ● ● ● ● ●●
● ●●●
● ● ●● ● ● ●
● ●
●●●●●●●
● ●●
● ●
● ● ●●
● ●
●
● ● ● ●●●
● ●
●
●●●
●●
●
● ●
● ●
●●●
●●
●●
●●●
●●● ● ●● ●
● ● ●
● ●● ●● ●●●
●
● ● ●●● ●● ● ● ● ●●●●
●● ●●●● ●
● ●●● ● ●●●● ● ●● ● ●●
●● ● ● ●● ●● ●
● ●
●●● ●●
●
●
●
●
●●
●●●
●●●●
●●
●
●●
●●
● ●●●
● ● ●
●
●
●
●●
●
●
●
●
●
●●
●
●●●
●●
●
●●
●
●
●
●
●
●
●
● ●
●●
●●
●
●
●
●●
●
●●
●
●
●●
● ●
●
●●
●
●●
●●
●
●
●
●
●
●
●●
● ●●●
●●
●
●●
●
● ●
●
●●
●●
●
●●
●
●
●●●
●
●●●
●
●
●
●●
●●●
●
●
●●
●●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●
[Figure: regional association plots; x-axes: Position on chromosome 8 (Mb) and Position on chromosome 9 (Mb). Gene track labels include SLC30A8, RAD21, EIF3H, TRPS1, EXT1 and MED30 in one panel and CDKN2A, CDKN2B, MTAP, MLLT3, DMRTA1 and the IFNA/IFNB1/IFNW1 cluster in the other.]
In a gene... / Not in a gene...
[Figure: regional association plots for the CDC123/CAMK1D region (labeled SNP rs12779790, stage 1) and the HHEX/IDE region (labeled SNP rs5015480, stage 1). Points are colored by r^2 (bins 0.0-0.2, 0.2-0.4, 0.4-0.6, 0.6-0.8, 0.8-1.0, or r^2 missing); y-axes: log10(P-value) and recombination rate.]
~90% of GWAS hits are non-coding!
~90% of GWAS hits are non-coding!
Stamatoyannopoulos, Science 2012

RESEARCH ARTICLE
Systematic Localization of Common Disease-Associated Variation in Regulatory DNA
Matthew T. Maurano,* Richard Humbert,* Eric Rynes,* Robert E. Thurman, Eric Haugen, Hao Wang, Alex P. Reynolds, Richard Sandstrom, Hongzhu Qu, Jennifer Brody, Anthony Shafer, Fidencio Neri, Kristen Lee, Tanya Kutyavin, Sandra Stehling-Sun, Audra K. Johnson, Theresa K. Canfield, Erika Giste, Morgan Diegel, Daniel Bates, R. Scott Hansen, Shane Neph, Peter J. Sabo, Shelly Heimfeld, Antony Raubitschek, Steven Ziegler, Chris Cotsapas, Nona Sotoodehnia, Ian Glass, Shamil R. Sunyaev, Rajinder Kaul, John A. Stamatoyannopoulos†

Abstract: Genome-wide association studies have identified many noncoding variants associated with common diseases and traits. We show that these variants are concentrated in regulatory DNA marked by deoxyribonuclease I (DNase I) hypersensitive sites (DHSs). Eighty-eight percent of such DHSs are active during fetal development and are enriched in variants associated with gestational exposure–related phenotypes. We identified distant gene targets for hundreds of variant-containing DHSs that may explain phenotype associations. Disease-associated variants systematically perturb transcription factor recognition sequences, frequently alter allelic chromatin states, and form regulatory networks. We also demonstrated tissue-selective enrichment of more weakly disease-associated variants within DHSs and the de novo identification of pathogenic cell types for Crohn’s disease, multiple sclerosis, and an electrocardiogram trait, without prior knowledge of physiological mechanisms. Our results suggest pervasive involvement of regulatory DNA variation in common human disease and provide pathogenic insights into diverse disorders.

Excerpts from the article text shown on the slide:

Disease- and trait-associated genetic variants are rapidly being identified with genome-wide association studies (GWAS) and related strategies (1). To date, hundreds of GWAS have been conducted, spanning diverse diseases and quantitative phenotypes (2) (fig. S1A). However, the majority (~93%) of disease- and trait-associated variants emerging from these studies lie within noncoding sequence (fig. S1B), complicating their functional evaluation. Several lines of evidence suggest the involvement of a proportion of such variants in transcriptional regulatory mechanisms, including modulation of promoter and enhancer elements (3–6) and enrichment within expression quantitative trait loci (eQTL) (3, 7, 8).

Human regulatory DNA encompasses a variety of cis-regulatory elements within which the cooperative binding of transcription factors creates focal alterations in chromatin structure. Deoxyribonuclease I (DNase I) hypersensitive sites (DHSs) are sensitive and precise markers of this actuated regulatory DNA, and DNase I mapping has been instrumental in the discovery and census of human cis-regulatory elements (9). We performed DNase I mapping genome-wide (10) in 349 cell and tissue samples, including 85 cell types studied under the ENCODE Project (10) and 264 samples studied under the Roadmap Epigenomics Program (11). These encompass several classes [...]

In total, we identified 3,899,693 distinct DHS positions along the genome (collectively spanning 42.2%), each of which was detected in one or more cell or tissue types (median = 5).

Disease- and trait-associated variants are concentrated in regulatory DNA. We examined the distribution of 5654 noncoding genome-wide significant associations [5134 unique single-nucleotide polymorphisms (SNPs); fig. S1 and table S2] for 207 diseases and 447 quantitative traits (2) with the deep genome-scale maps of regulatory DNA marked by DHSs. This revealed a collective 40% enrichment of GWAS SNPs in DHSs (fig. S1C, P < 10^-55, binomial, compared to the distribution of HapMap SNPs). Fully 76.6% of all noncoding GWAS SNPs either lie within a DHS (57.1%, 2931 SNPs) or are in complete linkage disequilibrium (LD) with SNPs in a nearby DHS (19.5%, 999 SNPs) (Fig. 1A) (12). To confirm this enrichment, we sampled variants from the 1000 Genomes Project (13) with the same genomic feature localization (intronic versus intergenic), distance from the nearest transcriptional start site, and allele frequency in individuals of European ancestry. We confirmed significant enrichment both for SNPs within DHSs (P < 10^-59, simulation) and also including variants in complete LD (r^2 = 1) with SNPs in DHSs (P < 10^-37, simulation) (fig. S2).

In total, 47.5% of GWAS SNPs fall within gene bodies (fig. S1B); however, only 10.9% of intronic GWAS SNPs within DHSs are in strong LD (r^2 ≥ 0.8) with a coding SNP, indicating that the vast majority of noncoding genic variants are not simply tagging coding sequence. Analogously, only 16.3% of GWAS variants within coding sequences are in strong LD with variants in DHSs. SNPs on widely used genotyping arrays (e.g., Affymetrix) were modestly enriched within DHSs (fig. S2), possibly due to selection of SNPs with robust experimental performance in genotyping assays. However, we found no evidence for sequence composition bias (table S3).

To further examine the enrichment of GWAS SNPs in regulatory DNA, we systematically classified all noncoding GWAS SNPs by the quality [...]
There have been few, if any, similar bursts of discovery in the
history of medical research.
David Hunter and Peter Kraft, NEJM, 2007

Common claims about GWAS:
Despite its issues, GWAS has yielded many discoveries relative to its cost
~500,000 SNP chips x ~$500/chip = $250M
$250M / ~2,000 loci = $125K/locus
Candidate genes: >$250M!
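The back-of-the-envelope arithmetic on this slide can be checked in a few lines. This is only a sketch that reuses the slide's rounded figures; the variable names are illustrative and not from the lecture.

```python
# Cost-per-locus arithmetic using the slide's rounded estimates.
n_chips = 500_000        # ~500,000 SNP chips genotyped (slide estimate)
cost_per_chip = 500      # ~$500 per chip (slide estimate)
n_loci = 2_000           # ~2,000 associated loci (slide estimate)

total_cost = n_chips * cost_per_chip    # total spend in dollars
cost_per_locus = total_cost / n_loci    # dollars per discovered locus

print(f"total: ${total_cost / 1e6:.0f}M")          # total: $250M
print(f"per locus: ${cost_per_locus / 1e3:.0f}K")  # per locus: $125K
```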

For comparison: 100 NIH R01s; a fighter jet; Hadron Collider: $9B

Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10^-8 are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r^2 > 0.8 estimated from the entire HapMap samples are excluded.
[...] to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by significantly associated SNPs is usually low (typically less than 10%) for many complex traits, but for diseases such as CD and multiple sclerosis (MS [MIM 126200]), and for quantitative traits such as height and lipid traits, between [...]
Five years of GWAS Discovery (Visscher, 2012)
Complex traits are a function of genes and environment...
P = G + E
Phenotype (P): Type 2 Diabetes, Cancer, Alzheimer’s, Gene expression
Genome (G): Variants
Environment (E): Infectious agents, Nutrients, Pollutants, Drugs

E: ???
Nothing comparable to elucidate E influence!
We lack high-throughput methods and data to discover new E in P…
A similar paradigm for discovery should exist for E!
Why?
$\sigma^2_P = \sigma^2_G + \sigma^2_E$
Heritability ($H^2$) is the proportion of phenotypic variability attributed to genetic variability in a population:
$H^2 = \sigma^2_G / \sigma^2_P$
Indicator of the proportion of phenotypic differences attributed to G.
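To make the variance decomposition concrete, here is a minimal simulation sketch. It assumes that G and E are independent, normally distributed, and purely additive (P = G + E); the function name and parameter values are hypothetical and chosen only for illustration.

```python
import random
import statistics

def simulate_heritability(var_g, var_e, n=20000, seed=0):
    """Simulate P = G + E with independent G and E; return Var(G) / Var(P)."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, var_g ** 0.5) for _ in range(n)]   # genetic component
    e = [rng.gauss(0.0, var_e ** 0.5) for _ in range(n)]   # environmental component
    p = [gi + ei for gi, ei in zip(g, e)]                  # phenotype
    return statistics.pvariance(g) / statistics.pvariance(p)

# Equal genetic and environmental variance gives H^2 close to 0.5.
h2 = simulate_heritability(var_g=1.0, var_e=1.0)
print(round(h2, 2))
```

With equal variances the estimate lands near one half, which is the situation the later slides describe as "the other 50%".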
Height is an example of a heritable trait:
Francis Galton shows how it's done (1887)
“mid-height of 205 parents described 60% of variability of 928 offspring”
G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
[Figure: bar chart of heritability, Var(G)/Var(Phenotype), on a 0-100 scale, one bar per trait. Source: SNPedia.com]
Traits shown, in plot order: Stomach cancer, Leukemia, Lung cancer, Colon cancer, Bladder cancer, Sciatica, Cervical cancer, Testicular cancer, Gallstone disease, Type-2 diabetes, Longevity, Parkinson's disease, Osteoarthritis, Hypertension, Blood pressure (systolic), Asthma, Stroke, Hangover, Ovarian cancer, Breast cancer, QT interval, Prostate cancer, Heart disease, Menopause (age at), Insomnia, Depression, Body mass index, Blood pressure (diastolic), Autism, Thyroid cancer, Migraine, Crohn's disease, Rheumatoid arthritis, Lupus, Alcoholism, Sexual orientation, Nicotine dependence, Menarche (age at), Bone mineral density, Psoriasis, Anorexia nervosa, Alzheimer's disease, Obesity, Bipolar disorder, Attention deficit hyperactivity disorder, Polycystic ovary syndrome, Celiac disease, Graves' disease, Epilepsy, Schizophrenia, Height, Type-1 diabetes, Hair curliness, Eye color
Callouts: Type 2 Diabetes (25%); Heart Disease (25-30%); Autism (50%???)
G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
[Figure: the same heritability bar chart, Var(G)/Var(Phenotype) on a 0-100 scale, SNPedia.com, with the remaining variance annotated:]
$\sigma^2_E$: Exposome!
Meta-analysis of the heritability of human traits based on fifty years of twin studies
Tinca J C Polderman, Beben Benyamin, Christiaan A de Leeuw, Patrick F Sullivan, Arjen van Bochoven, Peter M Visscher & Danielle Posthuma
Nature Genetics, 2015

Slide callouts:
17,804 traits of the phenome
2,748 publications
14,558,903 twin pairs
Average $H^2$ (genome): 0.49
Exposome may play an equal role.

Abstract: Despite a century of research on complex traits in humans, the relative importance and specific nature of the influences of genes and environment on human traits remain controversial. We report a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all published twin studies of complex traits. Estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. For a majority (69%) of traits, the observed twin correlations are consistent with a simple and parsimonious model where twin resemblance is solely due to additive genetic variation. The data are inconsistent with substantial influences from shared environment or non-additive genetic variation. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. All the results can be visualized using the MaTCH webtool.

Excerpt from the article text shown on the slide:

Insight into the nature of observed variation in human traits is important in medicine, psychology, social sciences and evolutionary biology. It has gained new relevance with both the ability to map genes for human traits and the availability of large, collaborative data sets to do so on an extensive and comprehensive scale. Individual differences in human traits have been studied for more than a century, yet the causes of variation in human traits remain uncertain and controversial.

Specifically, the partitioning of observed variability into underlying genetic and environmental sources and the relative importance of additive and non-additive genetic variation are continually debated [1–5]. Recent results from large-scale genome-wide association studies (GWAS) show that many genetic variants contribute to the variation in complex traits and that effect sizes are typically small [6,7]. However, the sum of the variance explained by the detected variants is much smaller than the reported heritability of the trait [4,6–10]. This ‘missing heritability’ has led some investigators to conclude that non-additive variation must be important [4,11]. Although the presence of gene-gene interaction has been demonstrated empirically [5,12–17], little is known about its relative contribution to observed variation [18].

In this study, our aim is twofold. First, we analyze empirical estimates of the relative contributions of genes and environment for virtually all human traits investigated in the past 50 years. Second, we assess empirical evidence for the presence and relative importance of non-additive genetic influences on all human traits studied. We rely on classical twin studies, as the twin design has been used widely to disentangle the relative contributions of genes and environment, across a variety of human traits. The classical twin design is based on contrasting the trait resemblance of monozygotic and dizygotic twin pairs. Monozygotic twins are genetically identical, and dizygotic twins are genetically full siblings. We show that, for a majority of traits (69%), the observed statistics are consistent with a simple and parsimonious model where the observed variation is solely due to additive genetic variation. The data are inconsistent with a substantial influence from shared environment or non-additive genetic variation. We also show that estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. Our results are based on a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all twin studies of complex traits published between 1958 and 2012. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. All [...]
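The twin design contrasts the resemblance of monozygotic (MZ) and dizygotic (DZ) pairs. One commonly used way to turn that contrast into a heritability estimate is Falconer's approximation, H^2 ≈ 2(r_MZ - r_DZ). The formula is not stated on the slide, and the correlations below are made-up illustrative values, so treat this as a sketch rather than the paper's exact procedure.

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's approximation: heritability ~ 2 * (r_MZ - r_DZ)."""
    return 2.0 * (r_mz - r_dz)

def consistent_with_additive(r_mz, r_dz, tol=0.05):
    """Under a purely additive genetic model, r_MZ is about twice r_DZ."""
    return abs(r_mz - 2.0 * r_dz) <= tol

# Hypothetical twin correlations, chosen only for illustration.
h2 = falconer_h2(r_mz=0.80, r_dz=0.55)
print(round(h2, 2))                          # 0.5
print(consistent_with_additive(0.50, 0.25))  # True
```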
Explaining the other 50%:
A new data-driven paradigm for robust discovery of E
via EWAS and the exposome
what to measure? how to measure? how to analyze in relation to health?
“A more comprehensive view of environmental exposure is needed ... to discover major causes of diseases...”
Wild, 2005
Rappaport and Smith, 2010, 2011
Buck-Louis and Sundaram 2012
Miller and Jones, 2014
Patel CJ and Ioannidis JPA, 2014

PERSPECTIVES (excerpt shown on the slide; the left margin is cropped)

[...] critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly [...]

[...]some (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).

Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples.

With successful characterization of both [...]

[Figure: Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum.]
External environment: RADIATION, STRESS, LIFE-STYLE, INFECTIONS, DRUGS, DIET, POLLUTION
Internal chemical environment: Xenobiotics, Inflammation, Preexisting disease, Lipid peroxidation, Oxidative stress, Gut flora
Exposome: Reactive electrophiles, Metals, Endocrine disrupters, Immune modulators, Receptor-binding proteins
We still cannot “query” the environment like the genome...
Connecting Environmental Exposure with Disease:
Missing the “System” of Exposures?
[2x2 table: exposure (E+, E-) by disease status (diseased, non-diseased); association: ?]
We are exposed to many things, but studies do not assess the multiplicity.
Fragmented literature of associations.
Challenge to discover E associated with disease.

A maze of associations is one way to a fragmented literature and Vibration of Effects
Model specifications shown on the slide: univariate; sex; sex & race; sex & age; sex & race & age
P < 0.05
Young, 2011
JCE, 2015

Excerpt shown on the slide (left margin cropped; [...] marks text that cannot be recovered):

2. Publication bias
There is general recognition that a paper has a much better chance of acceptance if something new is found. This means that, for publication, the claim in the paper has to be based on a p-value less than 0.05. From Deming’s point of view (5), this is quality by inspection. The journals are placing heavy reliance on a statistical test rather than examination of the methods and steps that lead to a conclusion. As to having a p-value less than 0.05, some might be tempted to game the system (10) through multiple testing, multiple modelling or unfair treatment of [...] or some combination of the three that leads to a small p-value. Researchers can be [...] creative in devising a plausible story to [...] statistical finding.

Multiple modelling
This problem is akin to – but less well recognised and more poorly understood than – multiple testing. For example, consider the use of linear regression to adjust the risk levels of [...] treatments to the same background level [...] There can be many covariates, and [...] of covariates can be in or out of the [...] With ten covariates, there are over 1000 [...] models. Consider a maze as a metaphor for modelling (Figure 3). The red line traces the [...] path out of the maze. The path through the maze looks simple, once it is known.

[...] ways in the literature for dealing with model selection, so we propose a new, composite [...]
2 The data cleaning team creates a modelling data set and a holdout set and [...]

Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one can work towards a suitably small p-value. © ktsdesign – Fotolia
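The "over 1000 models" figure follows from each covariate being either in or out of the model, which gives 2^k adjustment sets for k covariates. A small sketch that enumerates them (the helper name is illustrative, not from the slides):

```python
from itertools import combinations

def all_models(covariates):
    """Return every subset of covariates; each subset is one candidate adjustment set."""
    covariates = list(covariates)
    return [subset
            for k in range(len(covariates) + 1)
            for subset in combinations(covariates, k)]

# Three adjustment variables, as in the sex / race / age example on the slide.
models = all_models(["sex", "race", "age"])
print(len(models))                  # 8 = 2**3, including the unadjusted (univariate) model
print(len(all_models(range(10))))   # 1024 = 2**10, i.e. "over 1000 models"
```

Each of these models can yield a different effect estimate and p-value for the same exposure, which is the spread the slide calls vibration of effects.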
Example of fragmentation:
Is everything we eat associated with cancer?
50 random ingredients from the Boston Cooking School Cookbook
Any associated with cancer?
Of 50, 40 were studied in relation to cancer risk
Weak statistical evidence: non-replicated, inconsistent effects, non-standardized
FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studie[...] outliers are not shown (effect estimates >10).
Schoenfeld and Ioannidis, AJCN (2012)

Environment-Wide Association Studies (EWAS):
A GWAS-like study for the environment
[Figure: a GWAS Manhattan plot (NATURE | Vol 447 | 7 June 2007; x-axis: Chromosome 1-22 and X; y-axis: −log10(P)) recast for the environment: cases and controls are compared for each environmental [factor] (e.g., β-carotene, 2-hydroxyfluorene), with factors grouped on the x-axis by Environmental Category.]
What specific environmental “loci” are associated with disease?
... but there is no “microarray” for environmental exposure...
Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey [1]
since the 1960s
now biennial (two-year cycles): 1999 onwards
10,000 participants per survey
>250 exposures (serum + urine)
GWAS chip
>85 quantitative clinical traits (e.g., serum glucose, lipids, BMI)
Death index linkage (cause of death)
[1] http://www.cdc.gov/nchs/nhanes.htm

Excerpt from the NHANES brochure shown on the slide:

• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bronchitis, emphysema)
• Sexually transmitted diseases
• Vision

The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.

Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.

All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.

Survey Operations
Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).

An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.

In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.

NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.

Uses of the Data
Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.

Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey.

NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.
Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey
Drugs: statins; aspirin
Infectious Agents: hepatitis, HIV, Staph. aureus
Nutrients and Vitamins: vitamin D, carotenes
Plastics and consumables: phthalates, bisphenol A
Physical Activity: steps
Pesticides and pollutants: atrazine; cadmium; hydrocarbons
EWAS Approach for Discovery
Training Survey or Cohort:
Classify diseased/non-diseased participants (e.g., diabetics and non-diabetics)
Environmental factors, for each of { bisphenol A, PCB199, β-carotene, cotinine, ... }: log transformed & z-standardized; reference groups: “negative”
Regression: disease on z_factor (coefficient β_factor), adjusted for other risk factors (age, sex, race, socioeconomic status, ...)
Significance tests (p-values): p-value(β_factor)

factor        p-value
bisphenol A   0.8
PCB199        0.1
β-carotene    0.01
cotinine      0.03
...           ...
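The scan described above (transform each factor, regress disease on it, record a p-value) can be sketched as follows. This is a simplified illustration and not the published pipeline: it fits a logistic regression with an intercept and one exposure only, so the adjustment for age, sex, race and socioeconomic status is deliberately left out. The function names and the synthetic data are hypothetical.

```python
import math
import random
from statistics import mean, pstdev

def zscore_log(values):
    """Log-transform positive exposure values, then z-standardize them."""
    logs = [math.log(v) for v in values]
    mu, sd = mean(logs), pstdev(logs)
    return [(x - mu) / sd for x in logs]

def logistic_fit(z, y, iters=25):
    """Fit logit P(y=1) = b0 + b1*z by Newton-Raphson; return (b1, Wald p-value)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * zi))) for zi in z]
        w = [pi * (1.0 - pi) for pi in p]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * zi for yi, pi, zi in zip(y, p, z))
        # Information matrix [[a, b], [b, c]] and its determinant.
        a = sum(w)
        b = sum(wi * zi for wi, zi in zip(w, z))
        c = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = a * c - b * b
        b0 += (c * g0 - b * g1) / det
        b1 += (a * g1 - b * g0) / det
    se_b1 = math.sqrt(a / det)                                  # standard error of b1
    p_value = math.erfc(abs(b1 / se_b1) / math.sqrt(2.0))       # two-sided Wald test
    return b1, p_value

def ewas(exposures, disease):
    """One regression per exposure; returns {factor: (beta per 1 SD, p-value)}."""
    return {name: logistic_fit(zscore_log(values), disease)
            for name, values in exposures.items()}

# Synthetic data for illustration only: one factor raises disease odds, one is noise.
rng = random.Random(1)
n = 2000
signal = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
noise = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
disease = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(-1.0 + 0.5 * math.log(s)))) else 0
           for s in signal]

results = ewas({"factor_signal": signal, "factor_noise": noise}, disease)
for name, (beta, p) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: beta={beta:.2f}, p={p:.3g}")
```

A real analysis would add the covariates to the model and, for survey data, account for the sampling design.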
EWAS Approach for Discovery
False Discovery Rate Estimation: the expected rate of false positives
$\mathrm{FDR}(\alpha) = \dfrac{\#\ \text{false positives} \le \alpha}{\#\ \text{findings} \le \alpha}$, e.g., $\dfrac{50\ \text{false positives} \le 0.05}{100\ \text{findings} \le 0.05} = 0.5$
# false positives (α)? “Shuffle” (permute) diseased and non-diseased participants (cases and controls), re-run EWAS, and repeat many times.

factor        FDR (p-value)
bisphenol A   1
PCB199        0.4
β-carotene    0.1
cotinine      0.2
...           ...
Validation Survey or Cohort: p-value < 0.05 in test survey?
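The permutation idea on this slide can be sketched as below. To keep the example short and self-contained, the per-factor test is a two-sample difference in means with a normal approximation, which is a stand-in for the adjusted regression used in an actual EWAS. All function names and the synthetic data are hypothetical.

```python
import math
import random

def mean_var(xs):
    """Mean and (population) variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

def two_sample_p(cases, controls):
    """Two-sided p-value for a difference in means (normal approximation)."""
    m1, v1 = mean_var(cases)
    m0, v0 = mean_var(controls)
    z = (m1 - m0) / math.sqrt(v1 / len(cases) + v0 / len(controls))
    return math.erfc(abs(z) / math.sqrt(2.0))

def scan(exposures, disease):
    """One association test per exposure; returns a list of p-values."""
    pvals = []
    for values in exposures:
        cases = [v for v, d in zip(values, disease) if d == 1]
        controls = [v for v, d in zip(values, disease) if d == 0]
        pvals.append(two_sample_p(cases, controls))
    return pvals

def count_hits(pvals, alpha):
    return sum(1 for p in pvals if p <= alpha)

def permutation_fdr(exposures, disease, alpha=0.05, n_perm=50, seed=0):
    """FDR(alpha) = average # of p <= alpha under shuffled labels / # of p <= alpha observed."""
    n_found = count_hits(scan(exposures, disease), alpha)
    if n_found == 0:
        return 0.0                      # no findings, so nothing can be a false discovery
    rng = random.Random(seed)
    shuffled = list(disease)
    null_hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)           # break any link between exposure and disease
        null_hits += count_hits(scan(exposures, shuffled), alpha)
    return min(1.0, (null_hits / n_perm) / n_found)

# The slide's worked example: 50 expected false positives among 100 findings.
slide_example = 50 / 100                # 0.5

# Synthetic data for illustration: 100 factors, the first 10 truly associated.
rng = random.Random(42)
n = 400
disease = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
exposures = []
for j in range(100):
    shift = 0.5 if j < 10 else 0.0
    exposures.append([rng.gauss(shift * d, 1.0) for d in disease])

fdr = permutation_fdr(exposures, disease)
print(f"estimated FDR at alpha=0.05: {fdr:.2f}")
```

Factors passing the FDR threshold in the training survey would then be retested in a separate validation survey, as the last line of the slide indicates.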

Which exposome factors are associated with Type 2 Diabetes?
EWAS in Type 2 Diabetes:
Searching >250 exposures for associations with FBG > 125 mg/dL
[Figure: Manhattan-style plot of −log10(pvalue) for each exposure, grouped on the x-axis by environmental category; cohort markers: 1999-2000, 2001-2002, 2003-2004, 2005-2006]
x-axis categories: nutrients (carotenoid, minerals, vitamin A, vitamin B, vitamin C, vitamin D, vitamin E), phytoestrogens, cotinine, hydrocarbons, volatile compounds, allergen test, viral infection, bacterial infection, latex, phenols, phthalates, polybrominated ethers, polyfluorochemicals, acrylamide, perchlorate, PCBs, dioxins, heavy metals, pesticides (atrazine, chlorophenol, organochlorine, organophosphate, pyrethroid), dialkyl, furans dibenzofuran
Slide callouts: Novel Findings; Known “replicated” factors; Interesting Patterns; Associations
Labeled factors and odds ratios: heptachlor epoxide (OR=3.2, 1.8); PCB170 (OR=4.5, 2.3); γ-tocopherol (vitamin E) (OR=1.8, 1.6); β-carotene (OR=0.6, 0.6)
Other plot labels: vitamin D; pesticides, PCBs
FDR(α<0.02) ~ 10%
What model is used to test for association?
Outcome: Fasting Blood Glucose ≥ 126 mg/dL
Adjusted for: BMI, SES, ethnicity, age, sex
OR: Δ 1 SD of exposure
N=500-2000 per cohort
Compare vs. GWAS?
PLoS ONE, 2010
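Because exposures are z-standardized, a logistic regression coefficient β translates directly into an odds ratio per 1 SD of exposure, OR = exp(β). The short sketch below shows that conversion along with the case definition from the slide; the function names and the example coefficient are illustrative, not values taken from the paper.

```python
import math

def is_case(fasting_glucose_mg_dl):
    """Case definition on the slide: fasting blood glucose >= 126 mg/dL."""
    return fasting_glucose_mg_dl >= 126

def odds_ratio_per_sd(beta):
    """Odds ratio for a 1 SD increase in a z-standardized exposure."""
    return math.exp(beta)

def beta_from_odds_ratio(odds_ratio):
    """Inverse conversion: logistic coefficient implied by a per-SD odds ratio."""
    return math.log(odds_ratio)

print(is_case(130), is_case(100))            # True False
print(round(odds_ratio_per_sd(0.59), 1))     # 1.8
print(round(beta_from_odds_ratio(0.6), 2))   # -0.51
```

An OR below 1 (such as 0.6) corresponds to a negative coefficient, meaning higher exposure goes with lower odds of disease.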
Exposome factors associated with serum lipids?
Triglycerides, LDL-Cholesterol, HDL-Cholesterol

EWAS on Serum Lipid Levels:
Triglycerides, LDL-Cholesterol, HDL-Cholesterol
Risk factors for coronary heart disease (CHD)
Targets for intervention (e.g., statins)
Influenced by smoking, physical activity, diet, genetics [1]
LDL-C Δ1%: 1% increased risk for CHD [2]
HDL-C Δ1%: 2% decreased risk for CHD [3]
Triglycerides: higher risk for CHD
1. Teslovich et al. Nature (2010)
2. Grundy et al. ATVB (2004)
3. Gotto et al. JACC (2004)

EWAS in HDL-C: 17 Validated Factors
[Figure: Manhattan-style plot by environmental category; cohort markers: 1999-2000, 2001-2002, 2003-2004, 2005-2006; labeled categories: minerals, carotenes, vitamins A, B, C, D, E, heavy metals, hydrocarbons, cotinine, organochlorine pesticides]
FDR < 5%
Outcome: log10(HDL-C), adjusted for BMI, SES, ethnicity, age, age^2, sex
N=1000-3000
IJE 2012.
EWAS in Triglycerides and LDL-C
22 factors: organochlorine pesticides, polychlorinated biphenyls, carotenoids, vitamin E, vitamin A
8 factors: carotenoids, vitamin E, vitamin A
IJE 2012.
Effect Sizes For Validated Factors: HDL-C
[Table: pollutants and nutrient factors; columns: survey, N, P-value, FDR, Effect (mg/dL)]
% change per Δ 1 SD in exposure
17 validated factors
IJE 2012.
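When the outcome is modelled as log10(HDL-C), a coefficient β per 1 SD of exposure corresponds to a percent change of (10^β − 1) × 100. The sketch below assumes a model of the form log10(outcome) = a + β·z_exposure + covariates; the coefficient and the 50 mg/dL baseline are made-up illustrative values, not results from the paper.

```python
def percent_change_per_sd(beta_log10):
    """Percent change in the outcome for a 1 SD increase in exposure,
    assuming log10(outcome) = a + beta * z_exposure + covariates."""
    return (10.0 ** beta_log10 - 1.0) * 100.0

def effect_mg_dl(baseline_mg_dl, beta_log10):
    """Absolute change in mg/dL at a given baseline level."""
    return baseline_mg_dl * (10.0 ** beta_log10 - 1.0)

print(round(percent_change_per_sd(0.01), 1))    # 2.3  (% higher per 1 SD)
print(round(percent_change_per_sd(-0.01), 1))   # -2.3 (% lower per 1 SD)
print(round(effect_mg_dl(50.0, 0.01), 2))       # 1.16 mg/dL at a 50 mg/dL baseline
```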
How do effect sizes compare between GWAS and EWAS?
GWAS: ARTICLES, NATURE | Vol 466 | 5 August 2010 (Teslovich, 2010)
Table 1 | Meta-analysis of plasma lipid concentrations in >100,000 individuals of European descent.
Columns: Locus, Chr, Lead SNP, Lead trait, Other traits, Alleles/MAF, Effect size, P, eQTL, CAD, Ethnic (blank cells are omitted)
LDLRAP1 1 rs12027135 TC LDL T/A/0.45 −1.22 4 × 10^−11 Y +++?
PABPC4 1 rs4660293 HDL A/G/0.23 −0.48 4 × 10^−10 Y ++++
PCSK9 1 rs2479409 LDL TC A/G/0.30 +2.01 2 × 10^−28 ++++
ANGPTL3 1 rs2131925 TG TC, LDL T/G/0.32 −4.94 9 × 10^−43 Y ++++
EVI5 1 rs7515577 TC A/C/0.21 −1.18 3 × 10^−8 +++?
SORT1 1 rs629301 LDL TC T/G/0.22 −5.65 1 × 10^−170 Y Y ++++
ZNF648 1 rs1689800 HDL A/G/0.35 −0.47 3 × 10^−10 +++−
MOSC1 1 rs2642442 TC LDL T/C/0.32 −1.39 6 × 10^−13 +++?
GALNT2 1 rs4846914 HDL TG A/G/0.40 −0.61 4 × 10^−21 ++++
IRF2BP2 1 rs514230 TC LDL T/A/0.48 −1.36 5 × 10^−14 +++?
APOB 2 rs1367117 LDL TC G/A/0.30 +4.05 4 × 10^−114 ++++
rs1042034 TG HDL T/C/0.22 −5.99 1 × 10^−45 +−++
GCKR 2 rs1260326 TG TC C/T/0.41 +8.76 6 × 10^−133 Y ++++
ABCG5/8 2 rs4299376 LDL TC T/G/0.30 +2.75 2 × 10^−47 ++++
RAB3GAP1 2 rs7570971 TC C/A/0.34 +1.25 2 × 10^−8 +−??
COBLL1 2 rs10195252 TG T/C/0.40 −2.01 2 × 10^−10 Y ++++
rs12328675 HDL T/C/0.13 +0.68 3 × 10^−10 ++?+
IRS1 2 rs2972146 HDL TG T/G/0.37 +0.46 3 × 10^−9 Y Y ++++
RAF1 3 rs2290159 TC G/C/0.22 −1.42 4 × 10^−9 +++?
MSL2L1 3 rs645040 TG T/G/0.22 −2.22 3 × 10^−8 ++−+
KLHL8 4 rs442177 TG T/G/0.41 −2.25 9 × 10^−12 ++++
SLC39A8 4 rs13107325 HDL C/T/0.07 −0.84 7 × 10^−11 Y +−?−
ARL15 5 rs6450176 HDL G/A/0.26 −0.49 5 × 10^−8 −??+
MAP3K1 5 rs9686661 TG C/T/0.20 +2.57 1 × 10^−10 ++++
HMGCR 5 rs12916 TC LDL T/C/0.39 +2.84 9 × 10^−47 +++?
TIMD4 5 rs6882076 TC LDL, TG C/T/0.35 −1.98 7 × 10^−28 +++?
MYLIP 6 rs3757354 LDL TC C/T/0.22 −1.43 1 × 10^−11 +−−+
HFE 6 rs1800562 LDL TC G/A/0.06 −2.22 6 × 10^−10 ++?+
HLA 6 rs3177928 TC LDL G/A/0.16 +2.31 4 × 10^−19 Y +++?
rs2247056 TG C/T/0.25 −2.99 2 × 10^−15 +++−
C6orf106 6 rs2814944 HDL G/A/0.16 −0.49 4 × 10^−9 Y +++−
rs2814982 TC C/T/0.11 −1.86 5 × 10^−11 Y −−+?
FRK 6 rs9488822 TC LDL A/T/0.35 −1.18 2 × 10^−10 Y +++?
CITED2 6 rs605066 HDL T/C/0.42 −0.39 3 × 10^−8 ++−+
LPA 6 rs1564348 LDL TC T/C/0.17 −0.56 2 × 10^−17 Y ++?+
rs1084651 HDL G/A/0.16 +1.95 3 × 10^−8 ++?+
DNAH11 7 rs12670798 TC LDL T/C/0.23 +1.43 9 × 10^−10 +++?
NPC1L1 7 rs2072183 TC LDL G/C/0.25 +2.01 3 × 10^−11 +−+?
TYW1B 7 rs13238203 TG C/T/0.04 −7.91 1 × 10^−9 +???
MLXIPL 7 rs17145738 TG HDL C/T/0.12 −9.32 6 × 10^−58 Y ++++
KLF14 7 rs4731702 HDL C/T/0.48 +0.59 1 × 10^−15 Y ++++
PPP1R3B 8 rs9987289 HDL TC, LDL G/A/0.09 −1.21 6 × 10^−25 Y ++++
PINX1 8 rs11776767 TG G/C/0.37 +2.01 1 × 10^−8 −+++
NAT2 8 rs1495741 TG TC A/G/0.22 +2.85 5 × 10^−14 Y −+++
LPL 8 rs12678919 TG HDL A/G/0.12 −13.64 2 × 10^−115 Y ++++
CYP7A1 8 rs2081687 TC LDL C/T/0.35 +1.23 2 × 10^−12 +++?
TRPS1 8 rs2293889 HDL G/T/0.41 −0.44 6 × 10^−11 ++++
rs2737229 TC A/C/0.30 −1.11 2 × 10^−8 ++−?
TRIB1 8 rs2954029 TG TC, LDL, HDL A/T/0.47 −5.64 3 × 10^−55 Y ++++
PLEC1 8 rs11136341 LDL TC A/G/0.40 +1.40 4 × 10^−13 ++++
TTC39B 9 rs581080 HDL TC C/G/0.18 −0.65 3 × 10^−12 +−++
EWAS: table columns survey, N, P-value, FDR, Effect (mg/dL)
EWAS uncovers persistent pollutants
in people with Type 2 Diabetes, Higher Lipids:
How are these factors linked with these diseases?
Persistent pollutants:
•organochlorine pesticides
•polychlorinated biphenyls
•dibenzofurans
•dioxins
capacitors, adhesives
Properties:
•found all over the world
•persist in food chain
Porta et al, Environ Int 2008
Linked conditions:
•arteriosclerosis
•T2D/insulin resistance
Porta et al, Lancet, 2006
Lee et al, Diabetes Care, 2006
Lee et al, Diabetologia, 2007
Everett et al, Environ Res, 2010
Lind et al, EHP, 2011
(Korea, Japan, Europe)
How can we study the elusive environment in larger scale for biomedical
discovery?
[Article screenshot: Chirag J. Patel, PhD; John P. A. Ioannidis, MD, DSc. Studying the Elusive Environment in Large Scale. Opinion, Viewpoint. Figure: Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004]
•evaluate new ‘omics technologies
high-throughput, non-targeted metabolomics
•data mining and informatics to tackle complexity
confounding
what causes what?
•longitudinal/linkable data & biorepositories
JAMA, 2014
JECH, 2014
There is no “microarray” for E...
NIH National Institute of Environmental Health: $34M in FY 2015:
new technologies for ascertaining the exposome in children
Exposome Laboratory Network
[Diagram: several E Laboratories connected to an E Data Center]
E Data Center:
•Data repository
•Analytic ecosystem
•Data standards
http://grants.nih.gov/grants/guide/rfa-files/RFA-ES-15-010.html
Possibilities of discovery with the exposome:
How do we proceed?
[Article screenshot: Patel CJ, Ioannidis JPA. Studying the Elusive Environment in Large Scale. Opinion, Viewpoint, JAMA, 2014, with the correlation globe figure (Cotinine, Mercury, Cadmium); highlighted: metabolomics]
JECH, 2014
Accelerating discoveries with publicly-accessible, population-scale data:
a dbGaP for environmental exposures?
758,000 individuals
>400 studies
>>1B datapoints (genotypes and phenotypes)
controlled-access (by application)

BD2K Patient-Centered Information Commons
NHANES exposome browser
40K participants
>1000 indicators of exposure
Data and API available now
http://nhanes.hms.harvard.edu
with Paul Avillach, Michael McDuffie, Jeremy Easton-Marks,
Cartik Saravanamuthu and the BD2K PIC-SURE team

Complexity of exposome-phenome associations:
Many more potential biases vs. GWAS
Reverse causality: Could the disease “lead” to exposure?
γ-tocopherol ? low HDL
tocopherol (vitamin E) supplements for CHD individuals?
Confounding bias: Ice cream and drowning deaths
Mercury and HDL-C: mercury ? high HDL
confounders: fish consumption
Independence of association: Web of exposure of the exposome?
β-carotene, hydrocarbons, γ-tocopherol (pairwise correlation ρ)
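The confounding example can be made concrete with a small simulation. This is a sketch on synthetic data, not NHANES: it assumes fish consumption raises both mercury and HDL-C while mercury has no direct effect on HDL-C. All variable names and effect sizes are hypothetical.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y, x):
    """Part of y left over after regressing out x (both centered)."""
    b = slope(x, y)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


rng = random.Random(42)
n = 5000
fish = [rng.gauss(0, 1) for _ in range(n)]     # confounder
mercury = [f + rng.gauss(0, 1) for f in fish]  # driven by fish only
hdl = [f + rng.gauss(0, 1) for f in fish]      # driven by fish only

# Crude association: mercury appears to "raise" HDL-C.
crude = slope(mercury, hdl)
# Adjusted association: regress fish out of both variables first.
adjusted = slope(residuals(mercury, fish), residuals(hdl, fish))

print(f"crude slope:    {crude:.3f}")
print(f"adjusted slope: {adjusted:.3f}")
```

In this setup the crude slope is clearly positive and the adjusted slope is near zero. Adjustment only helps when the confounder is measured, which is rarely guaranteed for exposures.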
Longitudinal Study:
“Gold Standard” for Validation
Exposure ? Disease
• exposure changing through time
• reverse causality bias
• compute disease risk
[Plot: Disease Risk (low to high) over time]
EWAS to search for
exposures and behaviors associated with all-cause mortality.
NHANES: 1999-2004
National Death Index linked mortality
246 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001: N=330 to 6008 (26 to 655 deaths), ~5.5 years of followup, False discovery rate < 5%
NHANES: 2003-2004: N=177 to 3258 (20 to 202 deaths), ~2.8 years of followup, p < 0.05
Cox proportional hazards
baseline exposure and time to death
IJE, 2013
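Screening a couple of hundred exposures means controlling for multiplicity. The slides do not say which FDR procedure was used, so the sketch below uses the Benjamini-Hochberg step-up procedure as a stand-in. The p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the test passes BH at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared a discovery.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten exposures.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
discoveries = benjamini_hochberg(pvals, alpha=0.05)
print(sum(discoveries), "of", len(pvals), "pass FDR < 5%")
```

Findings that pass this screen would then be tested in an independent cohort at p < 0.05, as on the slide.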
All-cause mortality:
253 exposure/behavior associations in survival
[Volcano plot: −log10(pvalue) vs. Adjusted Hazard Ratio (0.4 to 2.8); threshold: FDR < 5%; plot annotations: (11), (69)]
replicated factor:
1 Physical Activity
2 Does anyone smoke in home?
3 Cadmium
4 Cadmium, urine
5 Past smoker
6 Current smoker
7 trans-lycopene
sociodemographics:
1 age (10 year increment)
2 SES_1
3 male
4 SES_0
5 black
6 SES_2
7 SES_3
8 education_hs
9 other_eth
10 mexican
11 occupation_blue_semi
12 education_less_hs
13 occupation_never
14 occupation_blue_high
15 occupation_white_semi
16 other_hispanic
age, sex, income, education, race/ethnicity, occupation [in red]

IJE, 2013
EWAS (re)-identifies factors associated with all-cause mortality:
Volcano plot of 200 associations
[Volcano plot: −log10(pvalue) vs. Adjusted Hazard Ratio (0.4 to 2.8), with the same numbered legend as the previous slide]
Annotated points:
age (10 years)
male
black
income (quintile 1), income (quintile 2), income (quintile 3)
physical activity [low, moderate, high activity]*
anyone smoke in home?
past smoker?
current smoker?
serum and urine cadmium [1 SD]
serum lycopene [1 SD]
R² ~ 2%
age, sex, income, education, race/ethnicity, occupation [in red]
*derived from METs per activity and categorized by Health.gov guidelines
Correlation Structure of the Exposome?
Independence of association:
β-carotene, hydrocarbons, γ-tocopherol (pairwise correlation ρ)
How to untangle “web” of exposure?
Analogy: “Linkage Disequilibrium”
Correlation between occurrence of genetic loci
In GWAS, allows one to trace to the “causal” locus.
[Article screenshot: NATURE | Vol 445 | 22 February 2007. Table 1 | Confirmed association results: significant T2DM associations were confirmed for eight SNPs in five loci (TCF7L2, SLC30A8, HHEX, LOC387761, EXT2). Panels a-d: regional −log10[P] plots with linkage disequilibrium blocks for SLC30A8, IDE/KIF11/HHEX, and EXT2/ALX4/LOC387761]
Sladek et al., Nature (2007)
Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
for each pair of E:
Spearman ρ
(575 factors: 81,937 correlations)
permuted data to produce
“null ρ”
sought replication in > 1 cohort
Red: positive ρ
Blue: negative ρ
thickness: |ρ|
Pac Symp Biocomput 2015
JECH 2015
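The pairwise procedure listed above (Spearman ρ, a permuted "null ρ", then replication) can be sketched in plain Python for a single pair of exposures. This is an illustrative sketch, not the published pipeline: the number of permutations, the two-sided empirical p-value, and the toy data are all assumptions.

```python
import random


def average_ranks(values):
    """1-based ranks, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_null(x, y, n_perm=200, seed=0):
    """Null distribution of rho obtained by shuffling y against x."""
    rng = random.Random(seed)
    shuffled = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(spearman_rho(x, shuffled))
    return null


def empirical_p(observed, null):
    """Two-sided empirical p-value with the usual +1 correction."""
    extreme = sum(1 for r in null if abs(r) >= abs(observed))
    return (extreme + 1) / (len(null) + 1)


# Toy pair of exposures with a monotone but non-linear relationship.
x = list(range(1, 21))
y = [v ** 3 for v in x]
rho = spearman_rho(x, y)
p = empirical_p(rho, permutation_null(x, y))
print(f"rho = {rho:.3f}, empirical p = {p:.4f}")
```

Repeating this for every pair of exposures, and keeping only pairs that replicate in more than one cohort, yields the edges drawn in a correlation globe.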
Effective number of
variables:
500 (10% decrease)
Estimating the LD of the exposome:
Diabetes vs. death have distinct globes (PoPs vs. smoking?)...
Diabetes All-cause mortality
Pac Symp Biocomput. 2015

Browse these and 82 other phenotype-exposome globes!
http://www.chiragjpgroup.org/exposome_correlation
What nodes have the most connections?
(“hubs”)
What factor(s) is(are) correlated with many other exposures?
sex, age, and income

Lower income associated with 43 of 330 (>13%) exposures and biomarkers in the US population
[Volcano plot: overall income/poverty ratio effects (per 1SD), validated results; x-axis: Effect Size per 1SD of income/poverty ratio (−0.3 to 0.0); y-axis: −log10(pvalue) (10 to 30)]
Labeled points: Segmented neutrophils number; Cotinine; Blood Benzene; Cadmium; Lead; White blood cell count; C-reactive protein; 2-fluorene; Blood 2,5-Dimethylfuran; Blood Ethylbenzene; Blood Toluene; Mono-benzyl phthalate; 3-fluorene; Hepatitis A Antibody; Red cell distribution width; g-tocopherol; Homocysteine; Albumin, urine; Herpes I; Lead, urine; Glucose, serum; Lymphocyte number; 1-pyrene; Globulin; Gamma glutamyl transferase; Eosinophils number; Herpes II; Cadmium, urine; Blood m-/p-Xylene; Glucose, plasma; 2-phenanthrene; Triglycerides; Pulse rate; Floor, GFAAS; Alkaline phosphatase; Blood Styrene; 3-phenanthrene; Monocyte; Protoporphyrin; Blood 1,4-Dichlorobenzene; Glycohemoglobin
Higher income: lower levels of biomarkers
(Another 23 associated with higher levels=20%)
AJE, 2015
EWAS:
Possible to accelerate the pace of discovery of exposures
[Manhattan plot: −log10(pvalue) (0 to 4) by exposure category: nutrients minerals; nutrients vitamin A; nutrients vitamin B; nutrients vitamin C; nutrients vitamin D; nutrients vitamin E; phytoestrogens; cotinine; hydrocarbons; volatile compounds; allergen test; viral infection; bacterial infection; latex; phenols; phthalates; polyfluorochemicals; acrylamide; heavy metals; perchlorate; pcbs; dioxins; pesticides atrazine; pesticides carbamate; dialkyl; furans dibenzofuran]
• generalizable, comprehensive, transparent, and systematic study of environment
• Created hypotheses for T2D, CVD, death, and others
Effect sizes: HDL-C: 1-10 mg/dL; T2D: ~2-3 OR; mortality: ~1.5-2 HR
• What is LD of the environment?
• Needles among needles
• Confounding, reverse causality...

Can exposure enable re-classification of phenotypes?
The use of multiple molecular parameters to characterize disease [P] may lead to a more accurate and fine-grained classification of disease [P]…
Committee on A Framework for Developing a New Taxonomy of Disease
Board on Life Sciences
Division on Earth and Life Studies
NRC, National Academy of Sciences 2011
“multiple molecular parameters” must include E!

An icon for “precision medicine”?:
Linnaeus: classification of phenotypes (P) for treatment and prevention (18th century)
Class 5: MENTALES (mental disturbances)
Order 1: IDEALIS (faulty judgment)
Order 2: IMAGINI (imagination disorder)
Order 3: PATHETICI (irregular desires)
L-5-3: CITTA (eat the inedible)
L-5-3: TARANTISMUS (dancing via tarantula bite)
Cogn Behav Neurol 2012
signs (signa), symptoms
essensia (essence of symptoms; e.g., inflammation)
causa (what caused the disease; e.g., pathogen)
Related diseases: common cause and treatment.

Classification of phenotypes (P) and disease today via the International Classification of Diseases
We are many phenotypes simultaneously:
Can we better categorize these P?
Metabolic: Glucose, LDL-Cholesterol, Triglycerides
Body Measures: Body Mass Index, Height
Blood pressure & fitness: Systolic BP, Diastolic BP, Pulse rate, VO2 Max
Kidney function: Creatinine, Sodium, Uric Acid
Aging: Telomere length
Inflammation: C-reactive protein, white blood cell count
Liver function: Aspartate aminotransferase, Gamma glutamyltransferase
EWAS-derived phenotype-exposure association map:
A 2-D view of phenotype-exposure associations for re-classification
[Figure: schematic heatmap with phenotypes (Glucose, BMI, Height, Cholesterol) as rows and exposures as columns]
http://bit.ly.com/pemap
Creation of a phenotype-exposure association map:
A 2-D view of 83 phenotype by 252 exposure associations
252 exposures
Association Size:
83 phenotypes
>0
<0
Clusters of exposures associated with clusters of phenotypes?
252 biomarkers of exposure × 83 clinical trait phenotypes
NHANES 1999-2000, 2001-2002, 2005-2006
~21K regressions: replicated significant (FDR < 5%) in 2003-2004
adjusted by age, age², sex, race, income, chronic disease
Hugues Aschard, JP Ioannidis
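The FDR < 5% criterion above can be illustrated with the Benjamini-Hochberg procedure. This is a minimal sketch, assuming Benjamini-Hochberg is the FDR method (the slide does not name one); the function name `benjamini_hochberg` and the toy p-values are hypothetical and are not results from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values for four phenotype-exposure regressions.
toy_pvalues = [0.01, 0.04, 0.03, 0.005]
toy_qvalues = benjamini_hochberg(toy_pvalues)
significant = [i for i, q in enumerate(toy_qvalues) if q < 0.05]
print(toy_qvalues, significant)
```

In a setting like the one on the slide, the same function would be applied to the roughly 21K regression p-values, and only associations that also hold up in the held-out survey would be kept.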

[Figure: heatmap of association sizes (color key and histogram, values from −0.4 to 0.4) with 252 exposures on one axis and 83 phenotypes on the other. Exposure groups shown: dietary nutrients, serum nutrients, phytoestrogens, physical activity, smoking and cotinine, alcohol, polyaromatic hydrocarbons, phthalates, heavy metals, volatile compounds, dioxins, organochlorine pesticides, phenols, organophosphate metabolites, PCBs, perfluorochemicals, furans, and infectious agents. Phenotypes shown include insulin, glucose, lipids, body measures, bone mineral density, blood counts, kidney markers, and liver enzymes]
A 2-D view of connections between P and E
[Figure: the same 252-exposure by 83-phenotype heatmap with annotated clusters: BMI, weight, metabolic; renal function; BMD; blood parameters; nutrients; hydrocarbons; PCBs]
Toward a phenotype-exposure association map:
(Re)-categorizing phenotypes with E
[Figure: hierarchical clustering dendrogram of the phenotypes based on their exposure associations (axis: Distance, 7 to 0). Leaves are labeled by category (metabolic, blood pressure, liver, bone, blood, cancer, kidney, nutrition, heart, immunological, body measures). Annotated clusters: metabolic traits, kidney function, inflammation, adiposity, “bad” cholesterol, “good” cholesterol, height + BMD]
H² vs. σ²_E
σ²_E?
1 to 66 exposures identified for 81 phenotypes
Additive effect of E factors:
Describe < 20% of variability in P
(On average: 8%)
[Figure: bar chart of variance explained by the identified exposures, one bar per phenotype (x-axis: R² × 100, 0 to 40)]
Emerging technologies to ascertain the exposome will enable biomedical discovery
e.g., EWASs in complex disease through the life course
High-throughput E standards: mitigate fragmented literature of associations
Confounding, reverse causality: how to handle at large dimension?
Enable more precise definitions of P

Complex traits are a function of genes and environment...
P=GxE
...but what about interaction between these factors?
Does a combination of genetic and environmental factors impart different risk for disease than either alone?
Gene-Environment Interactions: Combination of G and E different than that of variant or factor alone
Example: Bladder Cancer; G: NAT2 variant; E: smoke?
[Figure: 2 × 2 table of G+ / G− by E+ / E−, with cancer vs. non-cancer counts]
Find additional disease risk (variance)
Posit biological mechanisms
Analytically complex
• How do you select which G and E to test???
• Need a lot of samples (power!)
Few studies exist that measure G & E together

Environmental Toxicology. 2012
Bioinformatics. 2012
Curr Op Env Health (in press)

Why not investigate genes and environment simultaneously:
Analytic complexity and large numbers of interactions!
[Figure: 10 × 10 grid pairing 10 genetic variants (rows 1-10) with 10 exposures (columns 1-10; the rotated column labels are only partly legible, e.g., vitamin, pesticide, smoke)]
10 genetic variants, e.g., rs13266634 (SLC30A8), rs7903146 (TCF7L2), ..., rs1801282 (PPARγ)
10 genetic variants × 10 exposures = 100 possible interactions
G genetic variants and E exposures = G × E possible pairs
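The growth in the number of tests can be made concrete with a short calculation. This is a sketch under stated assumptions: the helper name `interaction_tests` is hypothetical, a plain Bonferroni correction is used for illustration, and the second call pairs roughly 1M assayed variants with the 252 exposures mentioned earlier in the lecture purely as an example of scale.

```python
def interaction_tests(n_variants, n_exposures, alpha=0.05):
    """Count G x E pairs and return the Bonferroni per-test threshold."""
    n_pairs = n_variants * n_exposures
    return n_pairs, alpha / n_pairs


# The 10 x 10 example from the slide.
pairs_small, threshold_small = interaction_tests(10, 10)

# Illustrative genome-wide by exposome-wide scale (assumed sizes).
pairs_large, threshold_large = interaction_tests(1_000_000, 252)

print(pairs_small, threshold_small)
print(pairs_large, threshold_large)
```

The per-test threshold shrinks in proportion to the number of pairs, which is one reason the slides argue for selecting candidate pairs first rather than testing every combination.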




Combining EWAS and GWAS:
Select pairs by their main effects
[Figure: EWAS Manhattan plot (y-axis: −log10(pvalue); x-axis: exposure categories, from nutrients to furans). Labeled top exposures: γ-tocopherol, β-carotene, heptachlor, PCB170]
ex: PLOS ONE (2010)
[Figure: GWAS Manhattan plot of T2D loci (y-axis: –log10(P); unconditional and conditional analysis). Labeled loci: TCF7L2, HHEX/IDE, KCNQ1 (2 signals), CDC123/CAMK1D, CHCHD9, KCNJ11, CDKAL1, CDKN2A/2B, SLC30A8, CENTD2, MTNR1B, ADAMTS9, IGF2BP2, HMGA2, ZFAND6, TP53INP1, BCL11A, PPARG, TSPAN8/LGR5, PRC1, WFS1, JAZF1, IRS1, ZBED3, HNF1A, FTO, THADA, HNF1B, DUSP9, KLF14, NOTCH2. Legend: locus established previously; locus identified by current study; locus not confirmed by current study; suggestive statistical association (P < 1 × 10⁻⁵); association in identified or established region (P < 1 × 10⁻⁴)]
Highlighted variants: rs7903146 (TCF7L2), rs13266634 (SLC30A8), rs1801282 (PPARG)
ex: Voight et al., Nature Genetics (2010); WTCCC, Nature (2007); Sladek et al., Nature (2007)
Human Genetics. 2013
Prototype G-EWAS Methodology
GxE in association to T2D
5 EWAS factors: γ-tocopherol, cis-β-carotene, trans-β-carotene, PCB170, heptachlor
18 GWAS loci: rs10923931 (NOTCH2), rs7903146 (TCF7L2), rs13266634 (SLC30A8), rs7901695 (TCF7L2), rs2383208 (Unknown), rs1260326 (GCKR), rs780094 (GCKR), rs2237895 (KCNQ1), rs10811661 (Unknown), rs4712523 (CDKAL1), rs4607103 (Unknown), rs1111875 (Unknown), rs7578597 (THADA), rs4402960 (IGF2BP2), rs1801282 (PPARG), rs12779790 (Unknown), rs8050136 (FTO), rs864745 (JAZF1)
total: 90
Logistic Regression
Outcome: Fasting Blood Glucose ≥ 126 mg/dL
Adjusted for (age, BMI, sex, race)
[Figure: logit(diabetes) vs. z(γ-tocopherol), with one line per number of rs13266634 risk alleles: (0), (1), (2)]
Bonferroni Correction: Number of Effective Tests¹ ≅ 80; α = 0.05/80 = 0.0006
False Discovery Rate: Parametric Bootstrap of Null Model²
1.) Nyholt. AJHG 2004
2.) Bůžková et al. Annals of Human Genetics. 2010

Per-risk allele OR for rs13266634 (SLC30A8) Stratified by E
Increase or decrease up to 30-40% vs. marginal effect!
marginal OR=1.1
Per risk allele OR (95% CI), adjusted for race, sex, BMI, age:
rs13266634 (SLC30A8) by trans-β-carotene
  low (−1 SD): 1.8 [1.3, 2.6]
  mean: 1.1 [0.79, 1.5]
  high (+1 SD): 0.65 [0.4, 1.1]
  p-value: 5e-05; N (cases): 1702 (164); FDR = 2%
rs13266634 (SLC30A8) by γ-tocopherol
  low (−1 SD): 0.82 [0.52, 1.3]
  mean: 1.1 [0.87, 1.5]
  high (+1 SD): 1.6 [1.3, 2]
  p-value: 0.0094; N (cases): 2925 (274); FDR = 18%
Human Genetics. 2013
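Stratified odds ratios like these follow from a logistic model with a product term. A minimal sketch, assuming the model logit(T2D) = β0 + βG·G + βE·z + βGxE·G·z + covariates, where G is the risk-allele count and z is the standardized exposure; the per-allele OR at exposure level z is then exp(βG + βGxE·z). The function name `per_allele_or` is hypothetical, and the coefficients below are back-calculated from the rounded trans-β-carotene ORs on the slide for illustration; they are not the fitted values from the study.

```python
import math


def per_allele_or(beta_g, beta_gxe, z):
    """Per-risk-allele odds ratio at standardized exposure level z."""
    return math.exp(beta_g + beta_gxe * z)


# Illustrative coefficients implied by the rounded slide values.
beta_g = math.log(1.1)                              # OR at mean exposure (z = 0)
beta_gxe = (math.log(0.65) - math.log(1.8)) / 2.0   # slope between z = -1 and z = +1

strata = [("low (-1 SD)", -1.0), ("mean", 0.0), ("high (+1 SD)", 1.0)]
stratified = {label: per_allele_or(beta_g, beta_gxe, z) for label, z in strata}
for label, odds_ratio in stratified.items():
    print(label, round(odds_ratio, 2))
```

A negative interaction coefficient means each risk allele raises the odds of T2D at low exposure and lowers them at high exposure, which is the pattern reported for trans-β-carotene.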

It is possible to detect GxE by combining EWAS and GWAS
Detected interaction effect changes between EWAS and GWAS factors
What is the biological mechanism of interaction?
Need to replicate these results in diverse populations.
Re-capture GWAS “investment” by considering prevalent E-factors?!
Possible to utilize the XWAS approach for general purpose discovery…
PheWAS: Phenome-wide association study
Denny et al., Nature Biotech 2013
PheWAS: dissecting the shared genetic architecture (pleiotropy) of disease!
[Figure: PheWAS plots (y-axis: –log10(P); x-axis: phenotype categories) for panel c, the MI GWAS SNP rs4977574 (CDKN2BAS), and panel d, the RA GWAS SNP rs660895 (HLA-DRB1)]
Figure caption (left edge cropped in the original): … or four SNPs. Each panel represents 1,358 phenotypes … a particular SNP, using logistic regression assuming … adjusted for age, sex, study site and the first three principal … are grouped along the x axis by categorization within … The upper red lines indicate P = 4.6 × 10⁻⁶ (FDR = 0.1 … blue lines indicate P = 0.05; dashed lines … correction (P = 0.05/1,358). Diamonds encircling phenotype … NHGRI Catalog associations. (a) PheWAS associations for … previously associated with hair and eye color, freckling and … palsy. (b) PheWAS associations for rs2853676 in TERT, … glioma. (c) PheWAS associations for rs4977574 near … previously associated with myocardial infarction, and in … (d) PheWAS associations for rs660895 near HLA-DRB1, … rheumatoid arthritis. Results and plots for all SNPs … study are available at http://phewascatalog.org/.
Text excerpt (cropped in the original): Our study replicated the association between rheumatoid arthritis and rs660895 near HLA-DRB1 (Fig. 3d; OR = 1.56, P = 6.7 × 10⁻⁸). This SNP was also strongly associated with type 1 diabetes (OR = 1.44, P = 7.1 × 10⁻⁸) and potentially associated with inflammatory …
MWAS: Medication-wide association study
Ryan, PB., CPT 2013 (www.nature.com/psp)
[Figure: MWAS plot for the OMOP acute myocardial infarction outcome in MarketScan CCAE. One point per drug, grouped by atc1_concept_name, atc3_concept_name, and rxnorm_concept_name (e.g., alimentary tract and metabolism, antiinfectives for systemic use, cardiovascular system, musculo-skeletal system, nervous system, respiratory system, sensory organs); axis: p_full, from 1.0E-001 to 1.0E-012; color by atc1_concept_name; shape by GROUND_TRUTH (1 or 0); horizontal lines at P < 0.05 and at the Bonferroni adjustment]
In conclusion: on GWAS and EWAS
GWAS has been unparalleled in biological discovery...
... coupled with EWAS, will lead to precise and personal medicine.
[Figure: EWAS Manhattan plot (y-axis: −log10(pvalue); x-axis: exposure categories, from nutrients to furans)]
[Background excerpt from a review of GWAS discoveries, cropped in the original:]
Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10⁻⁸ are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r² > 0.8 estimated from the entire HapMap samples are excluded.
… to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by significantly associated SNPs is usually low (typically less than 10%) for many complex traits, but for diseases such as CD and multiple sclerosis (MS [MIM 126200]), and for quantitative traits such as height and lipid traits, between 10% and 20% of genetic variance has been accounted for (Table 1). In comparison to the pre-GWAS era, the proportion of genetic variation accounted for by newly discovered variants that are segregating in the population is large. It is clear that for most complex traits that have been investigated by GWAS, multiple identified loci have genome-wide statistical significance, and thus it is likely that there are (many) other loci that have not been identified because of a lack of statistical significance (false negatives). Recently, researchers have developed and applied methods to quantify the proportion of phenotypic varia…
Figure 2. Increase in Number of Loci Identified as a Function of Experimental Sample Size
(A) Selected quantitative traits.
(B) Selected diseases.
The coordinates are on the log scale. The complex traits were selected with the criteria that there were at least three GWAS papers published on each in journals with a 2010–2011 journal …
Thanks...
Chirag Lakhani Harvard HMS Stanford CDC/NCHS
Adam Brown Isaac Kohane
John Ioannidis
Ajay Yesupriya
Arjun Manrai Susanne Churchill

Atul Butte (UCSF)
Erik Corona Stan Shaw

Imperial
Nam Pho Nathan Palmer
U Queensland Ioanna Tzoulaki
Jenn Grandfield
Jian Yang
Paul Elliott
Sunny Alvear
Peter Visscher
Michal Preminger
Lund (Sweden)
Cochrane Jan Sundquist
Harvard Chan Belinda Burford Kristina Sundquist

Hugues Aschard
Francesca Dominici
NIH Common Fund
Big Data to Knowledge
Chirag J Patel
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org

