Case 1

Statistical Analysis in Case-Control studies
Liu Tian Genome Institute of Singapore

liut2@gis.a-star.edu.sg
Summer International Workshop Aug, 09, Beijing
Outline
Introduction
Basic Statistical Methods of Case-control Study
GWAS
A Novel Epistasis-testing Procedure
Aim of Genetic Studies

Dramatic variation exists within a single species.
Almost every biological phenomenon involves a genetic component.
There is always a keen need to find the genetic variation related to complex traits.
Different Design Strategies

Intervention studies
Clinical trials
Observational studies
1. Case-control studies
2. Cohort studies
Cohort Studies
A cohort study is a study in which a group of individuals is followed over time.
Cohort studies can be either prospective or retrospective.
[Diagram: a population is split into exposed and non-exposed groups, and each group is followed for disease status (+/-)]
Case-Control Studies
Case-control studies are used to identify factors that may contribute to a medical condition by comparing subjects who have that condition (the cases) with subjects who do not have the condition but are otherwise similar (the controls).
Case-control studies are retrospective and nonrandomized.
Case-Control Studies
[Diagram: subjects with the disease and subjects without it are drawn from the population, and each group is classified as exposed or non-exposed]
Selection of Cases
Population-based cases: include all subjects or a random sample of all subjects with the disease at a single point or during a given period of time in the defined population.
Hospital-based cases: All patients in a hospital department at a given time
Selection of Controls
Principles of Control Selection:
-- Study base: controls can be used to characterise the distribution of exposure
-- Comparable accuracy: equal reliability in the information obtained from cases and controls (to avoid systematic misclassification)
-- Overcome confounding: elimination of confounding through control selection (matching or stratified sampling)
Selection of Controls
General population controls: registries, households, telephone sampling
-- costly and time consuming
-- recall bias
-- eventually high non-response rate
Hospitalised controls: patients at the same hospital as the cases
-- easy to identify; less recall bias; higher response rate
Case-Control Studies vs. Cohort Studies

Cohort study
-- Suited to rare exposures
-- Can examine multiple effects of a single exposure
-- Minimizes bias in exposure determination
-- Direct measurement of the incidence of the disease
Case-control study
-- Quick and inexpensive
-- Well suited to the evaluation of diseases with a long latency period
-- Suited to rare diseases
-- Can examine multiple etiologic factors for a single disease
Case-Control Studies vs. Cohort Studies

Cohort study
-- Not suited to rare diseases
-- Prospective: expensive and time consuming
-- Retrospective: inadequate records
-- Validity can be affected by losses to follow-up
Case-control study
-- Not suited to rare exposures
-- Incidence rates cannot be estimated unless the study is population based
-- The retrospective, nonrandomized nature limits the conclusions that can be drawn
Data Structure of Case-Control Studies

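sample id    case/control   gender   genotypes
individual   affection      gender   SNP 1   SNP 2   ...   SNP n
1            1              F        2       1             2
2            1              M        2       2             1
3            0              F
4            1              F
Remaining extracted genotype values for individuals 3 and 4 (column assignment not recoverable): 1 1 2 1 2 2 -9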
Population-Based Case-Control Study

Individuals are unrelated.
Aim: to test whether marker genotypes are distributed differently between cases and controls.
By comparing cases with controls, we identify the genetic factors correlated with a pre-defined phenotype.
Coding Genotypes
For one marker with two alleles, there can be three possible genotypes:
Genotype   AA   Aa   aa
Coding     2    1    0
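The coding above can be sketched in R as a simple lookup. The genotype calls in `genotypes` are hypothetical and only illustrate the mapping; the coded value is the number of copies of allele A.

```r
# Hypothetical genotype calls for one marker (illustration only)
genotypes <- c("AA", "Aa", "aa", "Aa", "AA")

# Lookup table: number of copies of allele A carried by each genotype
coding <- c(AA = 2, Aa = 1, aa = 0)

# Map each genotype string to its numeric code
coded <- unname(coding[genotypes])
print(coded)
```

A named-vector lookup keeps the mapping explicit and makes it easy to switch to another coding (for example a dominant 1/1/0 coding) by changing only `coding`.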
Genetic Models and Underlying Hypotheses

Genotypic Model
Genotypic value is the expected phenotypic value of a particular genotype.
Genotype          AA           Aa           aa
Genotypic value   $\mu_{AA}$   $\mu_{Aa}$   $\mu_{aa}$
Hypothesis: all 3 different genotypes have different effects

AA vs. Aa vs. aa

Dominant Model
Genotype: AA, Aa, aa
Genotypic value: one shared value for AA and Aa, a separate value for aa
Hypothesis: the genetic effects of AA and Aa are the same (assuming A is the minor allele)
AA and Aa vs. aa

Recessive Model
Genotype: AA, Aa, aa
Genotypic value: one value for AA, one shared value for Aa and aa
Hypothesis: the genetic effects of Aa and aa are the same (A is the minor allele)
AA vs. Aa and aa

Allelic Model
Genotype          AA            Aa                      aa
Genotypic value   $2\alpha_A$   $\alpha_A + \alpha_a$   $2\alpha_a$
Hypothesis: the genetic effects of allele A and allele a are different

A vs. a
Pearson's Chi-squared Test

Genotypic Model: Null Hypothesis: Independence
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA     Aa     aa
cases       nAA    nAa    naa
controls    mAA    mAa    maa
df = 2

Dominant Model: Null Hypothesis: Independence
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA+Aa        aa
cases       nAA + nAa    naa
controls    mAA + mAa    maa
df = 1

Recessive Model: Null Hypothesis: Independence
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            AA     Aa+aa
cases       nAA    nAa + naa
controls    mAA    mAa + maa
df = 1

Allelic Model: Null Hypothesis: Independence
$H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$
            A             a
cases       2nAA + nAa    nAa + 2naa
controls    2mAA + mAa    mAa + 2maa
df = 1
Test Statistic
Chi-squared Test Statistic:
$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$
O is the observed cell count.
E is the expected cell count under the null hypothesis of independence:
$E = \frac{\text{row total} \times \text{column total}}{N}$
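As a minimal sketch, the statistic can be computed by hand in R and checked against the built-in `chisq.test`. The counts are the genotype counts of marker M from the Example slide; the variable names are illustrative choices.

```r
# Genotype counts of marker M (rows: cases, controls; columns: AA, Aa, aa)
observed <- matrix(c(36, 18, 100, 84, 64, 98), nrow = 2,
                   dimnames = list(c("cases", "controls"), c("AA", "Aa", "aa")))

# Expected counts under independence: row total * column total / N
N <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / N

# Pearson statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(c(chi_sq = chi_sq, df = df, p_value = p_value))

# The built-in test (no continuity correction) should report the same statistic
print(chisq.test(observed, correct = FALSE))
```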
Example
The following table summarizes the genotype counts of marker M:
            AA    Aa     aa
cases       36    100    64
controls    18    84     98
Different tests can be performed:
- Allelic test
- Dominant gene action
- Recessive gene action
- Genotypic test
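The four contingency tables can all be derived from the same genotype counts. The sketch below does this for marker M, following the cell definitions on the Pearson's chi-squared slides; the object names (`counts`, `tables`, `p_values`) are illustrative.

```r
# Genotype counts of marker M (rows: cases, controls)
counts <- matrix(c(36, 18, 100, 84, 64, 98), nrow = 2,
                 dimnames = list(c("cases", "controls"), c("AA", "Aa", "aa")))

# Dominant model: AA and Aa pooled versus aa
dominant <- cbind("AA+Aa" = counts[, "AA"] + counts[, "Aa"],
                  "aa"    = counts[, "aa"])

# Recessive model: AA versus Aa and aa pooled
recessive <- cbind("AA"    = counts[, "AA"],
                   "Aa+aa" = counts[, "Aa"] + counts[, "aa"])

# Allelic model: count alleles, two per individual
allelic <- cbind("A" = 2 * counts[, "AA"] + counts[, "Aa"],
                 "a" = counts[, "Aa"] + 2 * counts[, "aa"])

# Run Pearson's chi-squared test (no continuity correction) on each table
tables <- list(genotypic = counts, dominant = dominant,
               recessive = recessive, allelic = allelic)
p_values <- sapply(tables, function(tab) chisq.test(tab, correct = FALSE)$p.value)
print(p_values)
```

Deriving every table from one matrix avoids transcription errors when the pooled counts are typed in by hand.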
Example (Dominant Gene Action)

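Using R:
# Rows: cases, controls; columns: AA+Aa, aa (matrix() fills column by column)
# Counts pooled from the marker M table: 36 + 100 = 136 and 18 + 84 = 102
dominant_table <- matrix(c(136, 102, 64, 98), ncol = 2)
print(dominant_table)
# Pearson's chi-squared test without continuity correction
chisq.test(dominant_table, correct = FALSE)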
Example (Recessive Gene Action)

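Using R:
# Rows: cases, controls; columns: AA, Aa+aa (matrix() fills column by column)
recessive_table <- matrix(c(36, 18, 164, 182), ncol = 2)
print(recessive_table)
# Pearson's chi-squared test without continuity correction
chisq.test(recessive_table, correct = FALSE)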
Example (Genotypic Test)

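Using R:
# Rows: cases, controls; columns: AA, Aa, aa (matrix() fills column by column)
genotypic_table <- matrix(c(36, 18, 100, 84, 64, 98), ncol = 3)
print(genotypic_table)
# Pearson's chi-squared test, df = 2
chisq.test(genotypic_table, correct = FALSE)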
Example (Allelic Test)

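Using R:
# Rows: cases, controls; columns: allele A, allele a (matrix() fills column by column)
allelic_table <- matrix(c(172, 120, 228, 280), ncol = 2)
print(allelic_table)
# Pearson's chi-squared test without continuity correction
chisq.test(allelic_table, correct = FALSE)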
Logistic Regression Analysis

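A General Model:
$\operatorname{logit}(p_{disease}) = \log\left(\frac{p_{disease}}{1 - p_{disease}}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_J X_J$
Where:
$p_{disease}$ is the probability that an individual has a particular disease.
$\beta_0$ is the intercept.
$\beta_1, \beta_2, \ldots, \beta_J$ are the effects of the genetic factors.
$X_1, X_2, \ldots, X_J$ are the dummy variables of the genetic factors.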
Logistic Regression Analysis

Logistic regression describes the relationship between a dichotomous response variable and a set of explanatory variables.
The logit model is the only model under which $\beta$, the effect parameter, can be estimated in retrospective studies in the same way as in prospective studies.
Only the intercept is affected by the sampling scheme: if the sampling rate for cases is 10 times that for controls, the estimated intercept is log(10) ≈ 2.3 larger than the one estimated with a prospective study.
Inference and Interpretation

Significance tests focus on:
$H_0: \beta_i = 0 \quad (i = 1, \ldots, J)$
$e^{\hat{\beta}_i}$ is the estimated odds ratio for genetic factor i.
The sign of $\hat{\beta}_i$ determines whether $p_{disease}$ increases or decreases when genetic factor i is present.
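A minimal sketch of such a fit in R, assuming an additive coding (number of copies of allele A) and reusing the marker M counts from the chi-squared example. The data frame layout and the names `dat`, `fit`, and `odds_ratio` are illustrative choices, not taken from the slides.

```r
# Grouped data: one row per genotype, with case and control counts for marker M
dat <- data.frame(
  genotype = c(2, 1, 0),          # copies of allele A (AA, Aa, aa)
  cases    = c(36, 100, 64),
  controls = c(18, 84, 98)
)

# Logistic regression on grouped binomial data:
# the response is the two-column matrix (successes, failures)
fit <- glm(cbind(cases, controls) ~ genotype, family = binomial, data = dat)

# Wald tests of H0: beta_i = 0 appear in the coefficient table
print(summary(fit)$coefficients)

# exp(beta_hat) is the estimated odds ratio per additional copy of allele A
odds_ratio <- exp(coef(fit)[["genotype"]])
print(odds_ratio)
```

With individual-level data the same model would be fitted with a 0/1 affection status as the response and one row per subject.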
An Example of R output
Other Options
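Fisher's Exact Test:
When the sample size is small, the asymptotic approximation to the null distribution is no longer valid. Fisher's exact test gives the exact significance of the deviation from the null hypothesis.
For a 2 by 2 table
a   b
c   d
with n = a + b + c + d, the probability of the observed table under the null hypothesis is
$p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$
and the exact p-value is the sum of these probabilities over all tables with the same margins that are at least as extreme as the observed one.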
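As a rough illustration, the probability of a single 2 by 2 table can be computed from factorials and compared with R's hypergeometric density, while `fisher.test` returns the two-sided exact p-value. The small table below is hypothetical, and `table_prob` is a helper written for this sketch.

```r
# Probability of one 2 x 2 table with cells a, b (first row) and c, d (second row),
# computed on the log scale to avoid overflow in the factorials
table_prob <- function(a, b, c, d) {
  exp(lfactorial(a + b) + lfactorial(c + d) +
      lfactorial(a + c) + lfactorial(b + d) -
      lfactorial(a) - lfactorial(b) - lfactorial(c) - lfactorial(d) -
      lfactorial(a + b + c + d))
}

# Hypothetical small-sample table (illustration only)
small_table <- matrix(c(3, 1,
                        1, 3), nrow = 2, byrow = TRUE)

# Probability of exactly this table under the null hypothesis
p_observed <- table_prob(3, 1, 1, 3)
print(p_observed)

# Exact two-sided p-value from the built-in test
exact_p <- fisher.test(small_table)$p.value
print(exact_p)
```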
Other Options
Cochran-Armitage Trend Test
-- An advantage of the Cochran-Armitage test is that it does not assume Hardy-Weinberg equilibrium.
-- Typically used to test a 2 × k contingency table, when the effects of AA, Aa, and aa are thought to be ordered.
-- In genome-wide association studies, the additive (or codominant) version of the test is often used.
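One way to sketch the additive trend test in base R is `prop.trend.test`, which tests for a linear trend in proportions across ordered groups. The counts are those of marker M from the chi-squared example, and the scores 2, 1, 0 (copies of allele A) are an assumption corresponding to the additive version.

```r
# Case and control counts per genotype for marker M
cases    <- c(AA = 36, Aa = 100, aa = 64)
controls <- c(AA = 18, Aa = 84,  aa = 98)

# x: number of cases in each genotype group; n: group sizes;
# score: additive scores, i.e. number of copies of allele A
trend <- prop.trend.test(x = cases, n = cases + controls, score = c(2, 1, 0))
print(trend)
```

Other score choices (for example 1, 1, 0 or 1, 0, 0) would weight the test toward dominant or recessive gene action.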
Genome-wide Association Study

In genetic epidemiology, a genome-wide association study (GWAS) - also known as whole genome association study (WGA study) - is an examination of genetic variation across a given genome, designed to identify genetic associations with observable traits. In human studies, this might include traits such as blood pressure or weight, or why some people get a disease or condition.
From: http://en.wikipedia.org

Technology makes it feasible
-- Affymetrix: 500K chip; 1M chip arrived in early 2007 (randomly distributed)
-- Illumina: 550K chip (gene-based)
Makes few demands on the sample: case-control data or case-parent trio data are enough.
Good for moderate effect sizes (odds ratio < 1.5).
Particularly useful in finding genetic variations that contribute to common, complex diseases.
What Is A SNP?
Single Nucleotide Polymorphism: the same region on Chromosome 1 and Chromosome 2 differs at a single base pair.
TTCAGTCAGATCCTAGCCC
AAGTCAGTCTAGGATCGGG

TTCAGTCAGATCCCAGCCC
AAGTCAGTCTAGGGTCGGG
Handling GWAS
-- Storing and converting large amounts of genotype data
-- Quality control
-- Generating initial association analysis
-- Specialized analysis
Quality Control Of SNPs

Exclude SNPs that fail the Hardy-Weinberg test
-- Observed genotype proportions are not consistent with those expected from the observed allele frequencies
-- HWE p-value < 10^-4 to 10^-6
Exclude SNPs with genotyping success rate < 95%
Exclude SNPs with differential missingness between cases and controls
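A minimal sketch of a chi-squared Hardy-Weinberg test in R, assuming a biallelic marker and using the 1-df goodness-of-fit form. The function name `hwe_chisq` and the second set of counts are hypothetical; the first call uses the control counts of marker M from the earlier example.

```r
# Chi-squared goodness-of-fit test for Hardy-Weinberg equilibrium
hwe_chisq <- function(n_AA, n_Aa, n_aa) {
  n <- n_AA + n_Aa + n_aa
  p <- (2 * n_AA + n_Aa) / (2 * n)            # frequency of allele A
  observed <- c(n_AA, n_Aa, n_aa)
  expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)
  stat <- sum((observed - expected)^2 / expected)
  # 3 genotype classes - 1 - 1 estimated allele frequency = 1 df
  list(statistic = stat,
       p_value = pchisq(stat, df = 1, lower.tail = FALSE))
}

# Control counts of marker M: these match the HWE expectations exactly
in_hwe <- hwe_chisq(18, 84, 98)
print(in_hwe)

# Hypothetical counts with a strong heterozygote deficit
out_of_hwe <- hwe_chisq(50, 20, 30)
print(out_of_hwe)

# Flag the SNP for exclusion at the 10^-4 threshold
exclude_snp <- out_of_hwe$p_value < 1e-4
print(exclude_snp)
```

For low genotype counts an exact HWE test is generally preferred over this asymptotic version.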
Quality Control Of Samples

Poor quality samples
-- Sample genotype success rate < 95 to 97.5%
-- Greater proportion of heterozygous genotypes than expected
Related individuals (if independent samples are required)
-- Based on pair-wise comparisons of similarity of genotypes
Samples with misspecified gender
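The per-sample checks can be sketched on a small genotype matrix coded 0/1/2 with `NA` for failed calls. The matrix below is hypothetical and far smaller than real data, and the 95% threshold is one of the values quoted above.

```r
# Hypothetical genotype matrix: rows are samples, columns are SNPs, NA = failed call
geno <- matrix(c( 2,  1, NA,  0,
                  1,  1,  2,  0,
                 NA, NA,  1, NA),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("sample", 1:3), paste0("snp", 1:4)))

# Genotype success (call) rate per sample and per SNP
sample_call_rate <- rowMeans(!is.na(geno))
snp_call_rate    <- colMeans(!is.na(geno))

# Proportion of heterozygous calls per sample, ignoring missing genotypes
het_rate <- rowMeans(geno == 1, na.rm = TRUE)

# Samples falling below the 95% success-rate threshold
low_quality <- names(sample_call_rate)[sample_call_rate < 0.95]

print(sample_call_rate)
print(het_rate)
print(low_quality)
```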
Genetic Stratification
Assess population structure
Adjust both phenotypes and genotypes for possible stratification using
-- principal component analysis (Price's method)
-- cluster analyses (PLINK)
Genomic Control
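Genomic control is commonly described as rescaling the 1-df test statistics by an inflation factor estimated from their median. The sketch below assumes that formulation; the test statistics are simulated with an artificial 10% inflation rather than taken from real data, and all object names are illustrative.

```r
set.seed(1)

# Simulated 1-df chi-squared statistics with an artificial 10% inflation
chisq_stats <- rchisq(10000, df = 1) * 1.1

# Inflation factor: observed median divided by the median of a chi-squared(1)
lambda <- median(chisq_stats) / qchisq(0.5, df = 1)
print(lambda)

# Rescale the statistics (only when lambda exceeds 1) and recompute p-values
adjusted_stats <- chisq_stats / max(lambda, 1)
adjusted_p <- pchisq(adjusted_stats, df = 1, lower.tail = FALSE)

# After adjustment the median matches the null expectation
lambda_adjusted <- median(adjusted_stats) / qchisq(0.5, df = 1)
print(lambda_adjusted)
```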
Software Demonstration
PLINK
-- Case/control, TDT, quantitative traits
-- Developed by Shaun Purcell http://pngu.mgh.harvard.edu/~purcell/plink/
Results From GWAS
Results From GWAS
Software Demonstration
Haploview:
-- LD and haplotype block analysis
-- tag SNP selection algorithm
-- visualization and plotting GWAS results from PLINK
http://www.broadinstitute.org/haploview/haploview
