
About the course

Introduction to Biostatistics

To provide an overview of basic concepts in the design and analysis of biostatistical investigations; to unify the thought process, since many students take the course under different circumstances at various times; and to kindle imagination for your course work and refresh memory.

BIOSTATISTICS 600 Instructor: T. E. Raghunathan (Raghu) E-mail: teraghu@umich.edu

Grade

Letter grade will be based on a multiple-choice test on the last day of lecture.

What is biostatistics?

Biostatistics, as a field of science, is concerned with:

- The design and conduct of experiments (or studies) to collect observations (or data)
- Display and analysis of the data to infer about the population
- Duly acknowledging the uncertainty in the stated conclusions while inferring about the population

R. A. Fisher defined statistics as the study of populations, the study of variation, and the study of methods of data reduction.

Key Concepts

The goal of statistical analysis is to draw inferences or conclusions in an unbiased fashion. The target population should be clearly defined (that is, the population for which inference is drawn should be clearly stated). It is important to recognize the variation in the population, and this variation should be reflected in the inferences:

- The same experiment conducted on two different populations may yield different results (systematic component of variation).
- Two different conditions or experiments applied to the same population may yield different results (systematic component of variation).
- The same experiment replicated under the same conditions on the same population may yield different results (random component of variation).

Always find ways to succinctly describe the results using graphical and numerical summaries.

Examples

Example 1

Medi-Cal is California's medical assistance program funded by federal, state and county taxes. Roughly 50% of sick visits to a pediatric clinic are made by children covered under the Medi-Cal program; the remaining 50% are covered by other insurance programs or by private payment. The objective is to study whether the health care given to Medi-Cal patients at the clinic differs from the health care given to non-Medi-Cal patients.

Example 2

Observations have indicated that stimulation of the walking and placing reflexes in the newborn promotes increased walking and placing. That is, if a newborn infant is held under his arms and his bare feet are permitted to touch a flat surface, he will perform well-coordinated walking movements similar to those of an adult. If the dorsa of his feet are drawn against the edge of a flat surface, he will perform placing movements much like those of a kitten (Zelazo, Zelazo and Kolb (1972), Science). How do we conduct a study to test the generality of this observation?

Example 3

Increasing understanding of the etiology of disease has led to the development of new and improved drugs. Through clinical observation and experience, the evidence is mounting that a new drug is better than the current standard drug. How do we test the generality of these observations? What are the consequences of switching to the new drug? How much does the immediate cost of switching offset the benefits over the long haul?

Example 4

Improvements in medical technology and public health practice during the last one hundred years have increased longevity and improved health. Despite such progress, disparities in health among racial and ethnic groups continue to be a daunting problem. One of the goals of Healthy People 2010 is to eliminate health disparities. How do we measure health disparities? How do we decide that we have eliminated them?

Biostatistical Investigations

1. Design an experiment or study to collect information on a set of individuals from the target population to address a set of issues or hypotheses.
   - Randomized experiments
   - Observational studies

2. Explore the data by computing graphical and numerical summaries.
   - Numerical measures to quantify central tendency, spread and shape
   - Graphical summaries to study the distributions and associations

3. Translate these findings into inference about the target population.
   - Estimates, standard errors, confidence intervals and p-values
   - Do these inferences pertain to causation or simply to association?

We will discuss (2), (1) and then (3).

Graphical Displays

A picture is worth a thousand words.

Stem-and-Leaf

Random blood glucose levels (mmol/liter) from a group of first-year medical students


The stem is all the digits except the last; the leaf is the last digit. Multiply the numbers by 10 if there are 2 decimal places, by 100 for 3 decimal places, etc. Split the stems if the number of leaves is large.

4.7 3.6 3.8 2.2 4.7 4.1 3.6 4.0 4.4 5.1
4.2 4.1 4.4 5.0 3.7 3.6 2.9 3.7 4.7 3.4
3.9 4.8 3.3 3.3 3.6 4.6 3.4 4.5 3.3 4.0
3.4 4.0 3.8 4.1 3.8 4.4 4.9 4.9 4.3 6.0

Stem-and-Leaf Display

2 | 29
3 | 3334446666778889
4 | 0001112344456777899
5 | 01
6 | 0

An alternative Stem-and-Leaf Display (split stems)

2 | 29
3 | 333444
3 | 6666778889
4 | 00011123444
4 | 56777899
5 | 01
6 | 0

The stem-and-leaf is useful for small data sets, to look at the shape of the distribution and to identify any outliers; it displays the entire data set. For large data sets its close cousin, the histogram, is used (but it is coarser). The stem-and-leaf can also be used to compare distributions.
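A stem-and-leaf display is one call in R (the free software mentioned later in these notes). A minimal sketch for the glucose data:

```r
# Stem-and-leaf displays of the blood glucose data with base R's stem().
glucose <- c(4.7, 3.6, 3.8, 2.2, 4.7, 4.1, 3.6, 4.0, 4.4, 5.1,
             4.2, 4.1, 4.4, 5.0, 3.7, 3.6, 2.9, 3.7, 4.7, 3.4,
             3.9, 4.8, 3.3, 3.3, 3.6, 4.6, 3.4, 4.5, 3.3, 4.0,
             3.4, 4.0, 3.8, 4.1, 3.8, 4.4, 4.9, 4.9, 4.3, 6.0)
stem(glucose)             # plain display
stem(glucose, scale = 2)  # split stems, as in the alternative display
```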

Survival times (beyond day 10) for guinea pigs in the treatment group (high vitamin C):

14, 14, 16, 16, 20, 24, 24, 26, 28, 28, 28, 30, 31, 32

Survival times in the control group:

21, 23, 24, 25, 31, 33, 33, 54

Two Displays (back-to-back stem-and-leaf)

Treatment         Control
     6644 | 1 |
  8886440 | 2 | 1345
      210 | 3 | 133
          | 4 |
          | 5 | 4

Such displays are useful to visually inspect the extent to which distributions overlap as well as the magnitude of the differences.
Displays for qualitative variables

Distribution of Cause of Death

Cause                 Deaths
Circulatory system    137,165
Neoplasms              69,948
Respiratory system     33,223
Injury/Poisoning        6,427
Digestive system       10,779
Nervous system          5,990
Others                 30,695
Total                 294,227

[Figure: dot chart of these counts]

[Figure: bar chart of these counts]
You can create these and other types of plots using PROC GPLOT in SAS or in Excel. These graphs were produced using R, a free software package that can be downloaded from www.r-project.org.
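A minimal R sketch of both charts (the category labels are abbreviated here for readability):

```r
# Dot chart and bar chart of the cause-of-death distribution in base R.
deaths <- c(Circulatory = 137165, Neoplasms = 69948, Respiratory = 33223,
            "Injury/Poisoning" = 6427, Digestive = 10779, Nervous = 5990,
            Others = 30695)
dotchart(sort(deaths), xlab = "Number of deaths")
barplot(sort(deaths, decreasing = TRUE), las = 2, ylab = "Number of deaths")
```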

We will discuss some more graphical displays after introducing some numerical summaries.

Numerical Summaries

Central tendency: represents a typical or middle value.
Spread: extent of variability across observations.
Shape: the structure of the distribution.

Central tendency

Mean: sum all the observations and divide by the number of observations being summed (arithmetic mean).

Mean survival time of guinea pigs in the control group:
(21 + 23 + 24 + 25 + 31 + 33 + 33 + 54)/8 = 244/8 = 30.5

Median: the number such that 50% of the observations are less than it and 50% are greater.

Median survival time of guinea pigs in the control group:

21, 23, 24, 25 | 31, 33, 33, 54

Technically, any number between 25 and 31; usually the average (25 + 31)/2 = 28 is used.

Some not so popular measures

Mode: the most frequent value among the set of observations.

Geometric mean: the product of the numbers raised to the power 1/n; equivalently, the anti-log of the arithmetic mean of the logarithms of the observations.

Harmonic mean: the reciprocal of the arithmetic mean of the reciprocals (useful when rates are being averaged).

Spread: measures the extent of variability in the data.

Standard Deviation

Observations: x₁, x₂, ..., xₙ
Mean: x̄ = Σᵢ xᵢ / n
Variance: s² = Σᵢ (xᵢ − x̄)² / (n − 1)
Standard deviation: s = √(variance)
Observations more than 3·SD from the mean are used to detect outliers.

Quartiles: values such that 25% of observations lie below the first quartile, 25% lie between the first and second quartiles, 25% between the second and third quartiles, and 25% lie above the third quartile. The second quartile is the same as the median.

Interquartile range (IQR): third quartile − first quartile. Points more than 1.5·IQR beyond the quartiles are used to detect outliers.

Some other measures

Median absolute deviation (MAD): the median of the absolute deviations of the observations from the median.
M = median; u₁ = |x₁ − M|, u₂ = |x₂ − M|, ..., uₙ = |xₙ − M|; MAD = median of (u₁, u₂, ..., uₙ)
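A minimal R sketch of these summaries, using the guinea pig control-group survival times (note that R's mad() rescales by 1.4826 unless constant = 1 is given):

```r
x <- c(21, 23, 24, 25, 31, 33, 33, 54)
mean(x); median(x); var(x); sd(x)
quantile(x, c(0.25, 0.50, 0.75))  # quartiles
IQR(x)                            # interquartile range
mad(x, constant = 1)              # raw median absolute deviation
exp(mean(log(x)))                 # geometric mean
1 / mean(1 / x)                   # harmonic mean
```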

Box plot

Median, IQR and MAD are robust to outliers: they are less influenced by extreme observations. For example, if the observation 54 in the control group of the guinea pig example were 94, the median would remain unchanged, but the mean would change dramatically. Means and standard deviations are popular because they are tied to the normal distribution, a typical distribution used in many statistical analyses. If normality is not valid, more robust methods involving the median and quartiles are used. Some more graphical summaries are based on these additional summaries.

The box plot captures and displays the most important features of the data: central tendency, spread and shape. There are other versions of box plots with added features, such as marking the mean. Some box plots extend the lines from the edges of the box to the minimum and maximum values (M&M).

[Figure: annotated box plot marking the first quartile, the median (second quartile), the third quartile, whiskers at median ± 1.5·IQR, and points outside the 1.5·IQR range]

[Figure: box plot, histogram and normal curve for the same data]

Histogram

Divide the data into groups by choosing intervals (mutually exclusive, of equal width, and exhaustive). Some rules for the number of intervals: n^(1/3) or n^(1/2)·log₁₀(n). Count the observations in each group and draw a bar chart with the area of each bar proportional to the count. If you fix the width of the bars at 1, the height of each bar is proportional to the count.

Shape of the distribution:

Symmetric: mean, median and mode are the same; values on either side of the mean are equally likely.
Skewed: the mean is larger or smaller than the median.

Normal distribution

Characterized by the mean and standard deviation:
68% of observations lie within 1 SD of the mean
95% of observations lie within 2 SD
99.7% of observations lie within 3 SD
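A minimal R sketch for the glucose data, overlaying a fitted normal curve on the histogram:

```r
glucose <- c(4.7, 3.6, 3.8, 2.2, 4.7, 4.1, 3.6, 4.0, 4.4, 5.1,
             4.2, 4.1, 4.4, 5.0, 3.7, 3.6, 2.9, 3.7, 4.7, 3.4,
             3.9, 4.8, 3.3, 3.3, 3.6, 4.6, 3.4, 4.5, 3.3, 4.0,
             3.4, 4.0, 3.8, 4.1, 3.8, 4.4, 4.9, 4.9, 4.3, 6.0)
hist(glucose, freq = FALSE, xlab = "mmol/liter",
     main = "Random blood glucose")                       # density scale
curve(dnorm(x, mean(glucose), sd(glucose)), add = TRUE)   # normal curve
boxplot(glucose, horizontal = TRUE)                       # box plot of same data
```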

Key Concepts

Population: a collection of units in which we are interested.

- All people living in the United States
- In a study of a treatment for diabetes, all people with diabetes
- Blood pressure for a person: all possible measurements of blood pressure in that person

Generally, it is impossible to measure each and every unit in the population (if we could, it would be called a census). A practical approach: a sample, usually a very small subset of the units in the population. The sample is measured and studied to draw conclusions about the population. The method used to draw the sample is the key step in a biostatistical investigation; the sample should be representative of the population (probability or random sampling designs assure such unbiased representativeness). Because of sampling from the population there is uncertainty in the inferences; statistical analysis expresses these uncertainties in terms of probabilistic statements.

Probability

The meaning of probability is controversial.

Empirical definition: the probability of an event A is the relative frequency with which the event occurs in a long sequence of trials in which A is one of the possible outcomes. It only makes sense to talk about probability when the event in question can be thought of as the result of an experiment that could be performed repeatedly, such as tossing a coin or throwing a die. Suppose the median height in the population is 168 cm, and we keep drawing one individual at a time and measuring height: over the long run, as the sample size gets large, half the people in our sample will have heights below 168 cm. A random person chosen from this population will have height below 168 cm with probability 1/2.

Probability (contd.)

Subjective interpretation of probability: a degree of belief expressing the certainty with which the event is expected to occur. This broader definition allows probabilistic statements without necessarily contemplating a series of trials. Anything that is not known to you means that you are uncertain about it; the probability is simply an expression of that uncertainty.

Statistical inference based on the empirical definition of probability is called frequentist or repeated-sampling inference. Statistical inference based on the subjective interpretation of probability is called Bayesian inference. Fortunately, for large samples the numerical results under both systems of inference are very similar, but the interpretations differ. Frequentist inference is the focus of this course.

Key Concepts

A collection of all possible outcomes of an experiment is called the sample space.

- Tossing a coin: S = {H, T}
- Study on health insurance: a random sample of n subjects, assessing how many have health insurance: S = {0, 1, 2, ..., n}

An event is a subset of the sample space.

- Tossing a coin: E = {H}
- Study on health insurance: none have insurance (E = {0}); at least 60% have health insurance: E = {X ∈ S | X ≥ 0.6·n}

Probability Distribution

A probability distribution is a rule for assigning probability to all possible events.

Probability mass function: the probability assigned to each individual element of the sample space (discrete sample space):

Pr(X = x) = f(x) for x ∈ S;  Pr(X = x) = 0 for x ∉ S

An experiment may involve measuring a continuous variable on an individual. The sample space is then an interval on the real line; for mathematical convenience one assumes (0, ∞) or (−∞, ∞) with almost zero mass outside the appropriate interval. Example: X = systolic blood pressure. An event is a subset of the real (or positive real) line, e.g. E = {X > 140}.

Probability density function: the probability assigned to an arbitrarily small interval around each potential value of a continuous variable (continuous sample space):

Pr(X ∈ dx) = f(x)·dx

Distribution function:

F(u) = Pr(X ≤ u) = ∫ f(x) dx, integrating over all x ≤ u

Rules of Probability

A = an event; 0 ≤ Pr(A) ≤ 1.
Pr(A) = 0: A will not occur in the entire sequence of experiments.
Pr(A) = 1: only A will occur in the entire sequence of experiments.
Aᶜ (or Ā) = complement of A (not A): Pr(Aᶜ) = 1 − Pr(A).

Two events A and B are mutually exclusive when the occurrence of A rules out the occurrence of B in a trial: Pr(A or B) = Pr(A) + Pr(B).
Two events A and B are independent when the occurrence of A has no bearing on the occurrence or non-occurrence of B: Pr(A and B) = Pr(A)·Pr(B).

Example 1: The median height of the population is 168 cm. Two individuals are chosen at random, independently. What is the probability that both have height less than 168 cm?
A = first person's height is less than 168 cm; B = second person's height is less than 168 cm; A and B = both have height less than 168 cm.
Because of independence, Pr(A and B) = Pr(A)·Pr(B) = 1/2 · 1/2 = 1/4.

Example 2: Suppose that 10% of the population has height exceeding 180 cm. What is the probability that exactly one person's height exceeds 180 cm?
Two possible scenarios: C1: A ≤ 180 and B > 180; C2: A > 180 and B ≤ 180. C1 and C2 cannot both occur, so
Pr(C1 or C2) = Pr(C1) + Pr(C2) = (9/10)·(1/10) + (1/10)·(9/10) = 2·9/100 = 9/50.
Note: Pr(C1) was calculated using the independence of A and B.

Class exercise: What is the probability that at least one person's height exceeds 180 cm? What is the probability that at most one person's height exceeds 180 cm?

Extension

Suppose that 5 individuals are selected at random. What is the probability that exactly one person among the 5 has height exceeding 180 cm?

T = person's height exceeds 180 cm; F = person's height is less than or equal to 180 cm. The possible scenarios and their probabilities:

TFFFF: (1/10)·(9/10)·(9/10)·(9/10)·(9/10)
FTFFF: (9/10)·(1/10)·(9/10)·(9/10)·(9/10)
FFTFF: (9/10)·(9/10)·(1/10)·(9/10)·(9/10)
FFFTF: (9/10)·(9/10)·(9/10)·(1/10)·(9/10)
FFFFT: (9/10)·(9/10)·(9/10)·(9/10)·(1/10)
Total: 5·(9·9·9·9)/(10·10·10·10·10) = 0.328

Binomial Distribution

Suppose that a very large population contains an unknown proportion P of subjects who have a disease, and that a random sample of size n is drawn from this population. The number of diseased subjects in the sample will be x with probability

[n!/(x!·(n − x)!)]·P^x·(1 − P)^(n−x),  where r! = r·(r − 1)·(r − 2)···1

This is called the Binomial distribution. The example above is the case P = 1/10, n = 5, x = 1.
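The binomial calculation above is one line in R; a minimal sketch:

```r
# Exactly one of five sampled people exceeds 180 cm when P = 1/10.
dbinom(1, size = 5, prob = 0.1)          # 5!/(1!4!) * 0.1 * 0.9^4 = 0.328
sum(dbinom(1:5, size = 5, prob = 0.1))   # at least one exceeds 180 cm
```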

Properties of the Binomial distribution

- In a typical sample of size n, you may expect nP subjects to have the disease.
- If you take several samples of size n from this population and note down the number of diseased subjects, the variance among these numbers will be nP(1 − P).
- Inferential problem: given n and x, how do we infer about P? Intuitively the estimate of P is x/n. This estimate turns out to be a really good estimate. How do we decide that it is good? We will see later.

Poisson Distribution

The Poisson distribution is a close cousin of the Binomial distribution. If P is very small (rare disease) and n is very large, then the probability that a sample of size n contains x diseased subjects is

(nP)^x·e^(−nP)/x! = λ^x·e^(−λ)/x!,  where λ = nP

λ = the expected number of diseased people in a sample of size n. If you take a large number of very large samples and count the number of diseased people in each sample, the variance among these numbers will be approximately λ. Inferential problem: given x, how do we draw inference about λ?
Normal Distribution

A popular model for many continuous variables. It is a symmetric bell-shaped curve characterized by two parameters: the mean and the standard deviation. The mean is the center of the distribution (the same as the median and mode). 90% of observations lie between mean − 1.64·SD and mean + 1.64·SD; 95% of observations lie between mean − 1.96·SD and mean + 1.96·SD.

Conditional Probability

How likely is it that an event A will happen given that the event B has occurred?

Pr(A | B) = Pr(A ∩ B) / Pr(B)

If A and B are independent then

Pr(A | B) = Pr(A ∩ B)/Pr(B) = Pr(A)·Pr(B)/Pr(B) = Pr(A)

Inverse Problem (Bayes Rule)

Pr(B | A) = Pr(A ∩ B)/Pr(A) = Pr(A | B)·Pr(B)/Pr(A)

Measures used in Diagnostic Tests

A diagnostic test indicates T+ or T−; the true state of the disease is D+ or D−.

Properties of diagnostic tests:
Sensitivity: Pr(T+ | D+)
Specificity: Pr(T− | D−)

Usefulness or value of diagnostic tests:
Positive Predictive Value (PPV): Pr(D+ | T+)
Negative Predictive Value (NPV): Pr(D− | T−)

Key Concept in Statistical Inference: Sampling Distribution

How do we judge whether the estimator x/n (the sample proportion) is a good estimator of the population proportion P? Imagine that you draw several samples, each of size n. Each will give you a different estimate. The variation in the estimates from sample to sample is called the sampling variance; the square root of the sampling variance is called the standard error. Two important criteria:

- You would want the estimates to equal the estimand on average; such estimates are called unbiased.
- The sample-to-sample variation in the estimates should be as small as possible; that is, the standard error should be as small as possible.

The most desirable estimate is an unbiased estimate that has the smallest sampling variance. In this sense the sample proportion is the most desirable estimate of the population proportion.
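PPV and NPV follow from Bayes rule once a disease prevalence is specified. A minimal R sketch; the sensitivity, specificity and prevalence values here are hypothetical, not from the notes:

```r
sens <- 0.90; spec <- 0.95; prev <- 0.05   # assumed values for illustration
ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
npv <- spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
c(PPV = ppv, NPV = npv)  # even an accurate test has modest PPV for a rare disease
```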


Sample Proportion

The sampling variance of the sample proportion, p, is approximately p(1 − p)/n. The standard error is

SE = √(p(1 − p)/n)

Confidence intervals

Instead of using a single value to estimate the population proportion, it is sometimes desirable to provide a range of plausible values for the unknown population proportion with a reasonable degree of confidence. A confidence interval is a summary measure that provides such a set of plausible values. Usual confidence levels are 90%, 95% or 99%.

An approximate 95% confidence interval for the unknown population proportion is
p ± 1.96·SE = p ± 1.96·√(p(1 − p)/n)

90% confidence interval: p ± 1.64·SE
99% confidence interval: p ± 2.57·SE

Example: In a random sample of 2,837 children in the State of Michigan, 118 said they usually coughed first thing in the morning. What can you infer about the prevalence of this condition in the entire state? The sample prevalence, 118/2837 = 0.0416, is the estimated prevalence rate for the entire state. The uncertainty in the estimate is

SE = √(0.0416·(1 − 0.0416)/2837) = 0.0037

95% confidence interval: 0.0416 ± 1.96·0.0037 = (0.034, 0.049). With reasonable confidence one can conclude that the population prevalence rate is between 3.4% and 4.9%.
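A minimal R sketch reproducing the Michigan example:

```r
x <- 118; n <- 2837
p <- x / n                      # sample prevalence: 0.0416
se <- sqrt(p * (1 - p) / n)     # standard error: 0.0037
p + c(-1.96, 1.96) * se         # 95% confidence interval: (0.034, 0.049)
```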

Continuous Outcome Measure

The same principles apply when the outcome measure is continuous. Suppose that a population consists of a very large number of individuals and the objective is to infer about the mean glucose level across all subjects in the population. Suppose that the glucose levels in the population are reasonably normally distributed. A random sample of size n is taken and their glucose levels are measured. The mean of these n individuals is the best estimate of the population mean, and the standard error of the sample mean is SD/√n.

Confidence intervals:
90%: mean ± 1.64·SE
95%: mean ± 1.96·SE
99%: mean ± 2.57·SE

These results are valid even if the outcome measures in the population are not normal, thanks to a property called the Central Limit Theorem: suppose you take several samples of size n from the population and compute the mean of each sample. The histogram of these means will look normal as the sample size gets large, regardless of the distribution of values in the population.
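The central limit theorem can be seen by simulation; a minimal R sketch drawing repeated samples from a strongly skewed (exponential) population:

```r
set.seed(1)
means <- replicate(5000, mean(rexp(50, rate = 1)))  # 5000 sample means, n = 50
hist(means, breaks = 40, main = "Sampling distribution of the mean")
# The histogram looks normal even though the population is far from normal.
```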

Types of Studies

Observational studies: an existing situation is observed, as in a survey or clinical case reports. The data are used to infer how the observed state of affairs has come about.

Experimental studies: one or more conditions are manipulated under which the state of affairs can be observed. The state of affairs under each manipulated condition is then studied to infer how the population will change when the conditions change.

Causal inferences are direct in experimental studies, whereas association or correlational inferences are more natural in observational studies. Though a strong association, in the absence of any other explanation, can be construed as causal.

Framework for Causal Inference

There are two treatments, A and B, for a disease. Is a patient's survival longer if treatment A were administered instead of treatment B? There are two possible outcomes for each patient: survival time under treatment A (say, Y_A) and survival time under B (say, Y_B). Only one survival time can be observed, but we want to infer about Y_A − Y_B. More importantly, if there are N subjects in the population, the quantity of interest is the population average of the differences:

Population-averaged causal effect: Avg(Y_A) − Avg(Y_B)

A direct way to estimate this difference: take a random sample of n subjects from the population, administer A and observe the mean survival; take another random sample of n subjects from the same population, administer B and observe the mean survival. The difference between the two sample means is the estimate of the population-averaged causal effect.

An equivalent study: take a sample of 2n subjects from the population and assign treatment A at random to n subjects and treatment B to the remaining n subjects (completely randomized design).

Sometimes several factors other than treatment can affect the outcome. While drawing the 2n subjects, n pairs are selected so that the subjects in each pair are similar in terms of these other factors. A random subject in each pair is given treatment A and the other is given treatment B; that is, the members of each pair are alike except for treatment. The difference in outcome within each pair is then the effect of treatment (randomized block or matched design).

Before-after designs: a sample of n subjects is drawn from the population, the outcome is measured, the subjects are given a treatment, and the outcome is measured again. The difference between post- and pre-treatment outcomes is the effect of treatment; each person acts as his/her own control.

More elaborate before-after designs: cross-over designs. Suppose you have two treatments A and B. Take a sample of n subjects. For half the subjects, administer treatment A and measure the outcome; after a wash-out period, administer treatment B and measure the outcome. For the remaining half, administer treatment B first and, after the same wash-out period, treatment A. Question: why can't we simply administer A to all n subjects, observe the outcome, and after a wash-out period administer treatment B and measure the outcome again?

This design makes a number of assumptions: the effect of treatment is reversible, and the wash-out period is long enough that there is no carry-over effect of treatment A on B or of treatment B on A.

Matched or block designs, before-after designs and cross-over designs all try to control for factors other than treatment that could affect the outcome. An alternative approach is to measure all the factors that could affect the outcome and use statistical models to adjust for differences in these factors between treatment groups; regression analysis is useful for this. The actual design will usually be a mix of the two: some factors used in blocking and some variables used in adjustment. Nevertheless, thinking about extraneous factors or variables, other than the variable in question (treatment), is very important at the design stage. Non-compliance with the treatment regimen can be a problem. Randomization is the key step in experimental studies and is the justification for interpreting the observed mean difference, for example, as a causal effect.

Randomization

Randomization does not mean haphazard or arbitrary assignment. Randomization uses a formal chance or probabilistic mechanism to determine who gets treatment A and who gets treatment B.

- Random number tables (several books provide these as an appendix)
- Computer programs to generate random orderings

Suppose that you want to assign two treatments, A and B, to 20 patients at random, with 10 getting treatment A and the other 10 getting treatment B. Most statistical packages have routines to generate a random ordering of the numbers 1, 2, ..., 20. Assign one such ordering to the 20 subjects. One ordering generated using SAS is (1,5,9,3,13,18,16,12,4,2,11,10,8,17,6,15,14,7,20,19). Subjects assigned 1 to 10 get treatment A and subjects assigned 11 to 20 get treatment B. Alternatively, subjects assigned an odd number get treatment A and those assigned an even number get treatment B.
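A minimal R sketch of the same allocation, with sample() playing the role of the SAS routine (the seed is arbitrary):

```r
set.seed(42)                      # any seed gives a reproducible allocation
ordering <- sample(1:20)          # random permutation of 1..20
treatment <- ifelse(ordering <= 10, "A", "B")
data.frame(patient = 1:20, ordering, treatment)
```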

Observational study designs

Cross-sectional studies: based on a sample drawn from a population at one time point. These are useful to study associations between disease status (say, self-reported hypertension) and a risk factor (say, race/ethnicity). It is important that we select the sample to be a miniature of the population in several key aspects, and that we use a probabilistic mechanism to select which members of the population are included in the sample and which are not; this is like randomization in experimental studies. The population can be defined as a geographical area or a collection of geographical areas.

Cohort studies: these start with a sample of subjects from the population, who are then followed over time to assess one or more disease outcomes. Subjects in the study are periodically assessed on several key risk factors. These are also called prospective or longitudinal studies. If the sample is a probability sample, then the results are generalizable to the entire population. Sometimes special cohorts are used, such as nurses willing to participate (Nurses' Health Study) or medical professionals returning questionnaires (British Medical Doctors Study); the generalizability of results from such studies is questionable. People lost during follow-up can make the cohort lose its representativeness of the population. A cohort study is practically not feasible for rare diseases.

Case-Control studies: two samples. Sample diseased subjects (cases) from the population, take a sample of non-diseased subjects (controls) from the same population, and compare the exposure to risk factors in the two groups.

- Population-based case-control studies: sample diseased subjects from a well-defined geographical area and take a probability sample of controls from the same population. Typically a census of all cases is used in population-based case-control studies.
- Hospital-based case-control studies: sample diseased subjects from one or more hospitals and sample non-diseased subjects from the same hospitals who visit for some other reason. Bad idea!! This is one of the reasons for the notoriety of case-control studies.

Inference from experimental and observational studies

The statistical analysis technique is similar for experimental and observational studies.

Two sample or treatment problems

- Large versus small samples: if the sample size in each group is 50 or larger, we will call it a large sample.
- Unmatched versus matched or before-after designs (cross-over designs will be considered much later)
- Continuous (normal), binary and count outcomes
- Non-normal but continuous outcomes

Example

The following data were collected in a study of plasma magnesium in diabetic patients. The diabetic subjects were all insulin-dependent subjects attending a diabetic clinic over a 5-month period. The non-diabetic controls were a mixture of blood donors and people attending day centers for the elderly, to give a wide age distribution. Plasma magnesium follows a Normal distribution very closely. The summary data are as follows:

Number of diabetic subjects = 227; mean plasma magnesium = 0.719; standard deviation = 0.068
Number of non-diabetic controls = 140; mean plasma magnesium = 0.810; standard deviation = 0.057

Questions of Interest

Calculate an interval which would include 95% of plasma magnesium measurements from the control population. This is called the reference interval; it gives information about the distribution of plasma magnesium in the population.

Given that the distribution of plasma magnesium is normal, the mean and standard deviation completely specify the distribution. Thus we would expect 95% of the observations to lie between 0.810 − 1.96·0.057 and 0.810 + 1.96·0.057, that is, between 0.698 and 0.922.

What proportion of diabetic subjects do we expect to lie in the reference interval?

The plasma magnesium level for a diabetic subject is normal with mean 0.719 and standard deviation 0.068. What is the area under this normal curve between 0.698 and 0.922?

Pr(0.698 ≤ X ≤ 0.922) = Pr( (0.698 − 0.719)/0.068 ≤ Z ≤ (0.922 − 0.719)/0.068 )
                      = Pr(−0.31 ≤ Z ≤ 2.99)
                      = Pr(Z ≤ 2.99) − Pr(Z ≤ −0.31)
                      = 0.9986 − 0.3783 = 0.6203

Only about 62% of diabetic patients will lie in the reference interval.

What are the estimates of the population mean of plasma magnesium for the diabetic and non-diabetic populations? The estimate of the population mean for diabetic subjects is 0.719 mmol/liter; for non-diabetic subjects it is 0.810 mmol/liter.

What are the standard errors of these population mean estimates?

Diabetic population: n1 = 227, s1 = 0.068; SE1 = s1/√n1 = 0.068/√227 = 0.0045
Non-diabetic population: n2 = 140, s2 = 0.057; SE2 = s2/√n2 = 0.057/√140 = 0.0048

The sample-to-sample variation in the estimated mean is 0.0045 for the diabetic population and 0.0048 for the control population.

Find the 95% confidence interval for the population mean of the control population.

x̄2 = 0.810, SE2 = 0.0048
95% confidence interval: (x̄2 − 1.96·SE2, x̄2 + 1.96·SE2) = (0.810 − 1.96·0.0048, 0.810 + 1.96·0.0048) = (0.801, 0.819)

How does the confidence interval differ from the 95% reference interval? Why are they different?

Find the standard error of the difference in mean plasma magnesium between the diabetic and non-diabetic populations.

Estimated difference: 0.719 − 0.810 = −0.091
SE(diff) = √(SE1² + SE2²) = √(0.0045² + 0.0048²) = 0.0066

Find the 95% confidence interval for the difference in means between the diabetic and non-diabetic populations.

(−0.091 − 1.96·0.0066, −0.091 + 1.96·0.0066) = (−0.104, −0.078)

We are more than 95% confident that the difference in the population means is negative; that is, the mean magnesium level for diabetic subjects is smaller than that for non-diabetic subjects. Would plasma magnesium be a good diagnostic test for diabetes?

The method discussed so far can also be used to compare two population proportions. Note that a proportion is simply the average of 0s and 1s: the proportion is the mean of a binary variable.

Example: A study was conducted to determine to what extent children with bronchitis in infancy get more respiratory symptoms in later life than others. 273 children who had bronchitis before age 5 (group 1) were compared to 1,046 children who did not (group 2). The outcome was whether or not these children coughed during the day or night at age 14. 26 of 273 reported coughing in group 1 and 44 of 1,046 reported coughing in group 2.

p1 = 26/273 = 0.095
p2 = 44/1046 = 0.042
p1 − p2 = 0.095 − 0.042 = 0.053

SE(p1 − p2) = √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )
            = √( 0.095·(1 − 0.095)/273 + 0.042·(1 − 0.042)/1046 )
            = 0.0188

95% confidence interval: 0.053 ± 1.96·0.0188 = (0.016, 0.090)
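A minimal R sketch of the bronchitis comparison; prop.test() is a built-in alternative:

```r
p1 <- 26 / 273; p2 <- 44 / 1046
se <- sqrt(p1 * (1 - p1) / 273 + p2 * (1 - p2) / 1046)
(p1 - p2) + c(-1.96, 1.96) * se                      # (0.016, 0.090)
prop.test(c(26, 44), c(273, 1046), correct = FALSE)  # same comparison, built in
```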

Adjustments for small samples

When the sample size is large, the central limit theorem applies and the sample mean has a normal distribution regardless of the original distribution of the outcome variable. When the sample size is small, the sample mean divided by its estimated standard error no longer has a normal distribution, even if the outcome itself is normal; a small-sample adjustment based on the t-distribution is needed.

Example: Does increasing the amount of calcium in our diet reduce blood pressure? In a randomized experiment, 10 black men were given a calcium supplement for 12 weeks and 11 black men received a placebo that appeared identical. The experiment was double-blind. The outcome was the change in blood pressure over the 12-week period.

Data
Calcium group: n = 10, mean = 5, standard deviation = 8.743
Placebo group: n = 11, mean = −0.273, standard deviation = 5.901

Two situations arise, depending on whether the population standard deviations in the two groups can be assumed to be the same.

Suppose the population standard deviations are the same. Pooled standard deviation:

s_p² = [(n1 − 1)·s1² + (n2 − 1)·s2²] / [(n1 − 1) + (n2 − 1)]
     = [9·(8.743)² + 10·(5.901)²] / (9 + 10) = 54.536
s_p = √54.536 = 7.385

SE(x̄1 − x̄2) = √(s_p²/n1 + s_p²/n2) = 7.385·√(1/10 + 1/11) = 3.227

95% confidence interval: 5.273 ± t·3.227, with degrees of freedom = 9 + 10 = 19 and t = 2.093:
(5.273 − 2.093·3.227, 5.273 + 2.093·3.227) = (−1.48, 12.03)

What if the two population standard deviations are not the same?

SE(x̄1 − x̄2) = √(s1²/n1 + s2²/n2) = √((8.743)²/10 + (5.901)²/11) = √(7.65 + 3.17) = 3.29

df = (s1²/n1 + s2²/n2)² / [ (1/(n1 − 1))·(s1²/n1)² + (1/(n2 − 1))·(s2²/n2)² ]
   = [(8.743)²/10 + (5.901)²/11]² / [((8.743)²/10)²/9 + ((5.901)²/11)²/10]
   = (7.64 + 3.16)²/(6.49 + 1.00) = 116.64/7.49 = 15.57

Interpolating in the t-table between 15 and 16 degrees of freedom: t ≈ 2.131 + (2.120 − 2.131)·0.57 = 2.125
95% confidence interval: (5.273 − 2.125·3.29, 5.273 + 2.125·3.29) = (−1.72, 12.26)

Given the considerable uncertainty, no change (a zero difference in the population means between the calcium and placebo groups) is plausible. Based on these data, we are confident that the population mean difference is between about −1.5 and 12.0.
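t.test() gives both versions of this interval directly. The raw vectors below are not printed in the notes; they are values that reproduce the stated n, means and standard deviations exactly, so treat them as a reconstruction:

```r
calcium <- c(7, -4, 18, 17, -3, -5, 1, 10, 11, -2)      # n = 10, mean 5, sd 8.743
placebo <- c(-1, 12, -1, -3, 3, -5, 5, 2, -11, -1, -3)  # n = 11, mean -0.273, sd 5.901
t.test(calcium, placebo, var.equal = TRUE)  # pooled: 95% CI (-1.48, 12.03)
t.test(calcium, placebo)                    # Welch: df = 15.57, CI (-1.72, 12.26)
```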

Testing of Statistical Hypothesis

So far we have concentrated on estimation of population quantities such as proportions, means, and differences in two means or proportions. The sample is used to derive a single estimate (point estimate), and uncertainty is expressed through standard errors or confidence intervals (interval estimates). Point and interval estimates are very important quantities for communicating inference about a population or a scientific phenomenon. Sometimes, however, a decision to be made is explicitly tied to the inferential process.

Decisions

- The FDA has to decide whether to approve or disapprove a drug.
- A new intravenous procedure is touted to reduce the infection rate. The hospital has to decide whether to implement the new procedure or stick with the current one.
- A company that conducts coaching for an examination claims that its new method of learning results in higher scores than any of its competitors. A competitor disputes this claim. Who is right?

Based on the data in hand, the decisions are usually of the Yes/No type. Sometimes additional new information may be sought before making the decision, but ultimately either Yes or No must be chosen. An inferential technique that explicitly leads one to make a decision is called a test of statistical hypothesis or test of statistical significance. Use of this approach when no explicit decision-making process is involved is questionable; so is the tendency to frame every inferential problem as a decision-making process just so these techniques can be used. Note also the implicit adversarial nature of the problem.

A Tale of Two Hypotheses

Decisions are tied to statistical hypotheses.

Null hypothesis: the hypothesis against which one wants to find evidence; a candidate hypothesis for rejection. Usually the null hypothesis represents a null or no-effect state.

Alternative hypothesis: the opposite of the null hypothesis; the hypothesis that is favored in light of evidence against the null hypothesis.

Example

An experiment was conducted to assess the effect of pronethalol in the treatment of angina pectoris. A sample of 12 patients was assessed on the number of attacks during a two-week period. They were then put on pronethalol for the next two weeks, and the number of pain attacks was assessed while on pronethalol. The claim is that pronethalol reduces the number of pain attacks.

Data
Baseline:    71, 323,  8, 14, 23, 34, 79, 60, 2, 3, 17, 7
Pronethalol: 29, 348,  1,  7, 16, 25, 65, 41, 0, 0, 15, 2
Difference:  42, −25,  7,  7,  7,  9, 14, 19, 2, 3,  2, 5

Suppose the expected difference in the number of pain attacks in the population is Δ.
Null hypothesis: H0: Δ = 0
Alternative hypothesis: HA: Δ ≠ 0 (this is called a two-sided hypothesis)
One-sided alternatives: (1) HA: Δ > 0, (2) HA: Δ < 0

How can we check whether the population standard deviations are the same? It can be argued that the equality of population standard deviations can never be empirically verified, especially if the sample size is small. One should therefore always use the procedure that does not assume equality of the population standard deviations. We will discuss this further in the context of hypothesis testing.

How to make the decision?

General principle: check whether the data are consistent with the null hypothesis.

We will answer the question by assessing how likely the observed data would have been if the null hypothesis were true. If the null hypothesis were true, the differences between baseline and pronethalol values would average 0, with roughly half of them positive and the rest negative. That is, under the null hypothesis a negative sign occurs with probability 0.5. But only one negative value has occurred.

A simple procedure

Test statistic: the number of negative values.
P-value: the probability of observing a test statistic as extreme as or more extreme than the observed one, if the null hypothesis were true.

The probability of observing 1 negative and 11 positives is

C(12,1)·0.5¹·0.5¹¹ = 0.00293

An even more extreme observation would be 0 negatives and 12 positives, which has probability

C(12,0)·0.5⁰·0.5¹² = 0.00024

P-value = 0.00293 + 0.00024 = 0.00317

The above test is called the sign test. We performed a one-sided test because only small numbers of negative values were considered. A large number of negative values, say 11 or 12, would also be evidence against the null hypothesis; the probability of obtaining 11 or 12 negatives is also 0.00317 (verify!). The two-sided p-value is 0.00317 + 0.00317 = 0.00634. How do we define the extreme values that constitute evidence against the null hypothesis?
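A minimal R sketch of the sign test via the binomial distribution:

```r
diffs <- c(42, -25, 7, 7, 7, 9, 14, 19, 2, 3, 2, 5)
neg <- sum(diffs < 0)                  # one negative sign
pbinom(neg, size = 12, prob = 0.5)     # one-sided p-value: 0.00317
binom.test(neg, 12, p = 0.5)$p.value   # two-sided p-value: 0.00635
```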

A Tale of Two Errors

Type 1 error: rejecting the null hypothesis when it is actually true.
Type 2 error: failing to reject the null hypothesis when it is actually false (equivalently, accepting the null hypothesis when it is actually false).
The chance of making a type 1 error is called the significance level and is denoted by α.
The chance of making a type 2 error is denoted by β. 1 − β is called power: the chance of rejecting the null hypothesis when the alternative is true.

The objective is to control the chances of making either type of error. Strategy: for a fixed significance level, we define the extreme values.

Revisiting the pronethalol example

Suppose we specify that the chance of making a type 1 error is 0.05.
For two-sided alternatives: the extreme values are determined by choosing c and d so that the probability that the number of negative values is less than or equal to c or greater than or equal to d is 0.05.
For one-sided alternatives: the extreme value is determined by choosing c so that the probability that the number of negative values is less than or equal to c is 0.05.
Looking at the binomial table on page T-9 (last column, n = 12): if we choose c = 2 and d = 10, the probability of a type 1 error is 0.0161 + 0.0161 = 0.0322. It is not possible to choose c and d to achieve significance level exactly 0.05. For the one-sided hypothesis, choose c = 3 (this gives slightly more than the specified significance level 0.05).

Usually power is calculated instead of the chance of making a type 2 error; we need a specific alternative value as the assumed truth. Suppose that 1/12 (approximately 8%) is indeed the true probability of a negative difference. The chance of rejecting the null hypothesis is then 0.3677 + 0.3837 + 0.1835 + 0.0000 + 0.0000 + 0.0000 = 0.9349. Try calculating the power for alternatives 0.05, 0.10, 0.20, etc. A power curve is a plot of power against the alternative values.

The sign test considers only the sign of the difference and not its magnitude. Let us consider some alternatives.
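A minimal R sketch of this power calculation for the two-sided sign test, assuming the rejection region {≤ 2 or ≥ 10 negatives out of 12} chosen above:

```r
p <- c(0.05, 0.08, 0.10, 0.20)                      # candidate true values
power <- pbinom(2, 12, p) + (1 - pbinom(9, 12, p))  # both rejection tails
cbind(p, power)                                     # p = 0.08 gives 0.9349
```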

Suppose that the differences can be assumed to be normally distributed. The estimate of the mean difference in the population is 7.7 and the standard deviation is 15.1; the standard error is 4.4. If the null hypothesis is true, the sample mean should be distributed around 0. The extent to which the observed sample mean differs from 0 is the evidence against the null hypothesis. One way to measure the distance between the observed sample mean and the null-hypothesis value is in standard error units:

t = (x̄ − 0) / SE(x̄)

The calculated value of the t-statistic is 7.7/4.4 = 1.75. What is the probability of observing this extreme or even more extreme values of the test statistic under the null hypothesis?

If the null hypothesis were true, this statistic has a t-distribution with 11 degrees of freedom. The two-sided p-value (the area beyond ±1.75, computed using a computer) is 0.1079. For a fixed significance level, say 0.05, the value of the test statistic considered to be large is 2.201.

Two-Sample Tests

Revisit the plasma magnesium example. The data were collected in a study of plasma magnesium in diabetic patients: the diabetic subjects were all insulin-dependent subjects attending a diabetic clinic over a 5-month period, and the non-diabetic controls were a mixture of blood donors and people attending day centers for the elderly, to give a wide age distribution. Plasma magnesium follows a Normal distribution very closely. The summary data:

Diabetic:     n1 = 227, x̄1 = 0.719, s1 = 0.068
Non-diabetic: n2 = 140, x̄2 = 0.810, s2 = 0.057

Frame the question as a test of statistical hypothesis: are the means of plasma magnesium in the two populations (diabetic and non-diabetic) the same?

Mean for the diabetic population: μ1. Mean for the non-diabetic population: μ2.
Null hypothesis: H0: μ1 = μ2
Alternative hypothesis: HA: μ1 ≠ μ2

x̄1 − x̄2 is an estimate of μ1 − μ2. If the null hypothesis were true, x̄1 − x̄2 should be distributed around 0; the extent to which it is away from 0 is evidence against the null hypothesis. Test statistic:

t = [(x̄1 − x̄2) − (μ1 − μ2)] / SE(x̄1 − x̄2) = (x̄1 − x̄2)/SE(x̄1 − x̄2) = −0.091/0.0066 = −13.78

The sampling distribution is normal given the large sample sizes from each population. If the null hypothesis were true, 68% of samples would yield a test statistic between −1 and 1, 90% between −1.64 and 1.64, and 95% between −1.96 and 1.96. What we have observed (|t| = 13.78) is very unlikely under the null hypothesis; therefore the null hypothesis is suspect.
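A minimal R sketch of this large-sample test from the summary statistics:

```r
se <- sqrt(0.068^2 / 227 + 0.057^2 / 140)  # SE of the difference: 0.0066
t  <- (0.719 - 0.810) / se                 # about -13.8
2 * pnorm(-abs(t))                         # two-sided p-value: essentially 0
```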

Small sample example revisited

Does increasing the amount of calcium in our diet reduce blood pressure? In a randomized experiment, 10 black men were given a calcium supplement for 12 weeks and 11 black men received a placebo that appeared identical. The experiment was double-blind. The outcome was the change in blood pressure over the 12-week period.

Data
Calcium group: n = 10, mean = 5, standard deviation = 8.743
Placebo group: n = 11, mean = −0.273, standard deviation = 5.901

Population mean if everybody in the population were given the calcium supplement: μ1.
Population mean if everybody in the population were given only the placebo: μ2.
Null hypothesis: H0: μ1 = μ2
One-sided alternative: HA: μ1 > μ2 — a large positive mean difference x̄1 − x̄2 is evidence against the null hypothesis in favor of the alternative.
Two-sided alternative: HA: μ1 ≠ μ2 — a large positive or negative mean difference is evidence against the null hypothesis in favor of the alternative.

Two situations: the population variances are equal or unequal.

Equal variances: pooled standard deviation

s_p² = [(n1 − 1)·s1² + (n2 − 1)·s2²] / [(n1 − 1) + (n2 − 1)]
     = [9·(8.743)² + 10·(5.901)²] / (9 + 10) = 54.536
s_p = √54.536 = 7.385

Standard error of the difference in means:
SE(x̄1 − x̄2) = √(s_p²/n1 + s_p²/n2) = 7.385·√(1/10 + 1/11) = 3.227

Test statistic:
t = [(x̄1 − x̄2) − (μ1 − μ2)] / SE(x̄1 − x̄2) = 5.273/3.227 = 1.63
Degrees of freedom = 9 + 10 = 19; sampling distribution: t with 19 degrees of freedom.

P-value, one-sided alternative: from Table D on page T-11, the area beyond 1.63 is between 0.05 and 0.10; computer software gives 0.0598.
Two-sided alternative: p-value = 2·0.0598 = 0.1196 (the area beyond ±1.63).

Unequal variances

SE(x̄1 − x̄2) = √(s1²/n1 + s2²/n2) = √((8.743)²/10 + (5.901)²/11) = √(7.65 + 3.17) = 3.29

Test statistic: t = 5.273/3.29 = 1.603

Degrees of freedom:

df = (s1²/n1 + s2²/n2)² / [ (1/(n1 − 1))·(s1²/n1)² + (1/(n2 − 1))·(s2²/n2)² ]
   = [(8.743)²/10 + (5.901)²/11]² / [((8.743)²/10)²/9 + ((5.901)²/11)²/10]
   = (7.64 + 3.16)²/(6.49 + 1.00) = 116.64/7.49 = 15.57

P-value: one-sided 0.065; two-sided 0.13.
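The tail areas quoted above come from the t-distribution; a minimal R sketch (pt() accepts the fractional Welch degrees of freedom):

```r
1 - pt(1.63, df = 19)         # one-sided p-value, pooled: about 0.0598
2 * (1 - pt(1.63, df = 19))   # two-sided: about 0.12
1 - pt(1.603, df = 15.57)     # one-sided, unequal variances: about 0.065
```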

Paired Design

An experimenter was interested in dieting and weight losses among men and women. It was believed that in the first two weeks of a standard dieting program, women would tend to lose more weight than men. As a check on this notion, a random sample of 10 husband-wife pairs was put on the same strenuous diets. Their weight losses after two weeks are shown in the table.

Pair  Husband  Wife     Difference (d)
 1    5.0 lbs  2.7 lbs    2.3
 2    3.3      4.4       −1.1
 3    4.3      3.5        0.8
 4    6.1      3.7        2.4
 5    2.5      5.6       −3.1
 6    1.9      5.1       −3.2
 7    3.2      3.8       −0.6
 8    4.1      3.5        0.6
 9    4.5      5.6       −1.1
10    2.7      4.2       −1.5

There are numerous aspects of shared environment (extraneous factors) that could affect the weight loss of a husband-wife pair. Assuming these effects are additive, the difference between husband and wife within a pair represents the gender effect. (If, on the other hand, the extraneous effects are multiplicative, the ratio represents the gender effect.)

d̄ = Σ dₛ / 10 = −0.45
s_d² = Σ (dₛ − d̄)² / 9 = 3.88
t = d̄ / (s_d / √10) = −0.72;  t(0.025, 9) = 2.262
Fail to reject H0: μ_d = 0.
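A minimal R sketch of the paired analysis:

```r
husband <- c(5.0, 3.3, 4.3, 6.1, 2.5, 1.9, 3.2, 4.1, 4.5, 2.7)
wife    <- c(2.7, 4.4, 3.5, 3.7, 5.6, 5.1, 3.8, 3.5, 5.6, 4.2)
t.test(husband, wife, paired = TRUE)   # t = -0.72, df = 9: fail to reject
```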

Some issues

P-value: the probability of observing outcomes (or test statistics) that are more inconsistent with the null hypothesis, relative to the alternative hypothesis, than what was observed. It depends upon the null hypothesis, the alternative hypothesis and the test statistic.

The p-value is sometimes wrongly interpreted as the probability that the null hypothesis is true. Of course, one will never know whether any hypothesis is true or not; in fact, it is very unlikely that any null hypothesis is ever exactly true. Failure to reject the null hypothesis at a pre-specified significance level is to be used as guidance for acting as though the null hypothesis were true. One should choose the significance level small enough that failing to reject the null makes one comfortable acting as though the null hypothesis is true; 5% is one example and by no means the only number. If the sample size is very large, the standard error will be small. The test statistic, being the ratio of the difference in means to the standard error, can then be large even though the mean difference has no practical or clinical consequence. If the sample size is sufficiently large, any null hypothesis will be rejected!

Power Calculations

Consider the two-sample problem. Assume: significance level α = 0.05; equal variances; a two-sided alternative; sample sizes large enough to use the standard normal curve; the standard deviation σ is known; and the alternative is Δ = μ1 − μ2. The cut-off value of the test statistic is then 1.96: if the value of the test statistic for a data set falls outside the interval (−1.96, 1.96), the hypothesis is rejected.

The question is how likely we are to reject the null hypothesis when some alternative is actually true:

Power = Pr(test statistic > 1.96 or test statistic < −1.96 | alternative is true),

the area under the normal curve centered at the alternative. With Φ the standard normal distribution function,

Power = 1 − Φ(1.96 − Δ/(σ·√(1/n1 + 1/n2))) + Φ(−1.96 − Δ/(σ·√(1/n1 + 1/n2)))

Blood Pressure Example: σ = 7.4, n1 = n2 = n. Power (in %) by n and Δ:

  n    Δ = 2.5   Δ = 5   Δ = 7.5
 30      25.6     74.4     97.5
 40      32.7     85.6     99.4
 50      39.3     92.2     99.9
 60      45.6     95.9    100
 70      51.5     97.9    100
100      66.6     99.8    100

Analysis of Variance

Suppose now that we want to compare more than 2 populations. One could do pair-wise comparisons, but this is cumbersome and not easy to summarize when the number of populations compared is large. The analysis of variance frames the question as an in-depth investigation of the variation in the observed data: it partitions the overall variability into one or more assignable causes, and what is left unassigned is called residual variability. Based on this partition, the relative merits of the assignable causes are investigated. Generally, the variation due to an assignable cause, relative to the residual variability, is used as a yardstick for judging the importance of the assignable cause.
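A minimal R sketch reproducing the power table under these assumptions:

```r
power2 <- function(n, delta, sigma = 7.4) {      # two-sided alpha = 0.05
  se <- sigma * sqrt(2 / n)                      # SE of the mean difference
  1 - pnorm(1.96 - delta / se) + pnorm(-1.96 - delta / se)
}
round(100 * outer(c(30, 40, 50, 60, 70, 100), c(2.5, 5, 7.5), power2), 1)
```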

The assignable causes can be carefully planned or manipulated through an experimental design; in an observational study design, the assignable causes are based on substantive reasoning.

Example: A randomized study was conducted to test the generality of the observation that stimulation of the walking and placing reflexes in the newborn promotes increased walking and placing (Zelazo, Zelazo and Kolb (1972), Science, pages 314-315). A total of 23 one-week-old males were randomized to four groups — 1: active exercise, 2: passive exercise, 3: no exercise, and 4: an 8-week control group. The age at which the infant walked alone (in months) was the outcome variable of interest. The assignable cause is the level of exercise. Is the variation caused by this assignable cause substantial?

Data (age of walking alone, in months)

Active exercise:   9.00  9.50  9.75 10.00 13.00  9.50
Passive exercise: 11.00 10.00 10.00 11.75 10.50 15.00
No exercise:      11.50 12.00  9.00 11.50 13.25 13.00
Control:          13.25 11.50 12.00 13.50 11.50

y_ij = observation for subject j in group i; j = 1, 2, ..., nᵢ; i = 1, 2, ..., k
y₊₊ = overall mean
Total variation = Σᵢ Σⱼ (y_ij − y₊₊)²

Example: k = 4; n1 = n2 = n3 = 6, n4 = 5; y₊₊ = 261/23 = 11.34; total variation = 58.47

ANOVA

Decompose each observation's deviation from the overall mean into a between-groups part and a within-groups (between-subjects-nested-within-groups) part:

y_ij − y₊₊ = (y_ij − y_i₊) + (y_i₊ − y₊₊)

An alternative expression:

y_ij = y₊₊ + (y_i₊ − y₊₊) + (y_ij − y_i₊)
(observation = overall mean + between-groups deviation + residual)

As a model:

y_ij = μ + αᵢ + εᵢⱼ
μ = overall mean; αᵢ = deviation of the group-i mean from the overall mean; εᵢⱼ = s_{j(i)} = effect of subject j nested within group i

Squaring and summing,

Σᵢ Σⱼ (y_ij − y₊₊)² = Σᵢ Σⱼ (y_ij − y_i₊)² + Σᵢ Σⱼ (y_i₊ − y₊₊)²
Total SS = Within SS + Between SS
58.47 = 43.69 + 14.78

Degrees of freedom: the number of independent statistics used to compute the sum of squares. Let N = Σᵢ nᵢ be the total sample size.

Df(Total SS) = N − 1 = 22 (every observation is used, but the sum of deviations is zero)
Df(Within SS) = N − k = 19 (every observation is used, but the sum of deviations within each group is zero)
Df(Between SS) = k − 1 = 3 (four means are used, but the sum of deviations from the overall mean is zero)

ANOVA Example

To compare the sums of squares, the differences in degrees of freedom have to be taken into account: mean square = sum of squares / degrees of freedom.

MS(Between) = 14.78/3 = 4.93
MS(Within) = 43.69/19 = 2.30
F = MS(Between)/MS(Within) = 2.14

Is 2.14 large? Use the F-distribution with numerator df = 3 and denominator df = 19 to determine how likely an F of 2.14 or larger is when in actuality there are no differences among the four groups. P-value: 0.1228.
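A minimal R sketch of the infant-walking ANOVA:

```r
age <- c(9, 9.5, 9.75, 10, 13, 9.5,      # active exercise
         11, 10, 10, 11.75, 10.5, 15,    # passive exercise
         11.5, 12, 9, 11.5, 13.25, 13,   # no exercise
         13.25, 11.5, 12, 13.5, 11.5)    # 8-week control
group <- factor(rep(c("active", "passive", "none", "control"), c(6, 6, 6, 5)))
summary(aov(age ~ group))   # F = 2.14 on (3, 19) df, p = 0.12
```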

Regression Analysis

The bulk of scientific investigations are concerned with relationships.

- Causal relationship: if one changes the variable X by a certain amount, how much does the variable Y change?
- Association or correlational relationship: do subjects with different values of X also tend to have different values of Y?

What is the nature of these relationships in the population? How do you quantify them? How do you estimate the quantities describing these population relationships? How accurate are those estimates? How much uncertainty is there in assessing these relationships?

Terminology

X = independent variable: a variable that an investigator can change in an experiment; a variable amenable to intervention in an observational study; or simply the variable whose impact is to be assessed. There can be more than one independent variable of interest. Other names: predictors, correlates, right-hand-side variables, exogenous variables.

Y = dependent variable of interest: the variable for which you want to assess the effect of X. Other names: outcome, endogenous variable, left-hand-side variable.

The impact of different values of X on differences in Y, expressed in some meaningful terms, is of interest. The two-sample tests and ANOVA also fit into this category: are the population means related to the treatments assigned or to the observed grouping? We will later see that the two-sample t-tests and ANOVA F-tests are particular cases of the general regression framework.

Example

The following table gives data collected by a group of medical students in a physiology class. The objective is to assess the association between height and FEV1.

Height  FEV1    Height  FEV1    Height  FEV1
164.0   3.54    172.0   3.78    178.0   2.98
167.0   3.54    174.0   4.32    180.7   4.80
170.4   3.19    176.0   3.75    181.0   3.96
171.2   2.85    177.0   3.09    183.1   4.78
171.2   3.42    177.0   4.05    183.6   4.56
171.3   3.20    177.0   5.43    183.7   4.68
172.0   3.60    177.4   3.60

Scatter plot: a graphical device to assess the type of relationship. Each point is a pair (X, Y); the dependent variable goes on the vertical axis and the independent variable on the horizontal axis. Inspection of the graph suggests a linear relationship.

Linear relationship

Representation: y = a + b·x. (For a line through two points, the slope is (y1 − y2)/(x1 − x2).)

Clearly not every observation will satisfy this equation exactly, so write

yᵢ = a + b·xᵢ + eᵢ

where a + b·xᵢ is the line-value or expected value and eᵢ is the residual.

How to determine a and b? Method of least squares: find the a and b that minimize the residual sum of squares

Σᵢ eᵢ² = Σᵢ (yᵢ − a − b·xᵢ)²

The solution:

b = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
a = ȳ − b·x̄

Simplified formulas

Slope: b = (Σᵢ xᵢyᵢ − Σᵢ xᵢ·Σᵢ yᵢ/n) / (Σᵢ xᵢ² − (Σᵢ xᵢ)²/n)
Intercept: a = Σᵢ yᵢ/n − b·Σᵢ xᵢ/n

Needed quantities: Σᵢ xᵢ, Σᵢ yᵢ, Σᵢ xᵢyᵢ, Σᵢ xᵢ²

Example

y = FEV1, x = Height
Σ xᵢ = 3,507.6;  Σ yᵢ = 77.12;  Σ xᵢyᵢ = 13,568.18;  Σ xᵢ² = 615,739.24

b = (13,568.18 − 3,507.6·77.12/20) / (615,739.24 − (3,507.6)²/20) = 0.074389
a = 77.12/20 − 0.074389·3,507.6/20 = −9.19

Prediction equation: FEV1 = −9.19 + 0.0744·Height
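A minimal R sketch fitting the same line:

```r
height <- c(164, 167, 170.4, 171.2, 171.2, 171.3, 172, 172, 174, 176,
            177, 177, 177, 177.4, 178, 180.7, 181, 183.1, 183.6, 183.7)
fev1 <- c(3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
          3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68)
fit <- lm(fev1 ~ height)
coef(fit)                        # intercept about -9.19, slope about 0.0744
plot(height, fev1); abline(fit)  # scatter plot with the fitted line
```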

Interpretation

Slope: b = the expected difference in y for a unit positive difference in x. Consider two individuals, one with x = h and one with x = h + 1. Their line-values are a + b·h and a + b·(h + 1); the difference is b.

Intercept: a = the expected value of y when x = 0. It is not very interpretable in this particular problem: the value of FEV1 when height is 0!

Modification: centering. Write the line as

y = c + d·(x − x̄)

Then d = b, and c = the expected value of y for a person of average height.

Residuals

Residuals from the estimated line: eᵢ = yᵢ − a − b·xᵢ. Residuals represent deviations from the expected value; large residuals reflect unreliability or uncertainty. One way to measure this uncertainty is through the variance of the residuals (or their standard deviation).

Computational formulas:

s_e² = Σᵢ eᵢ² / (n − 2)
     = [Σᵢ (yᵢ − ȳ)² − b²·Σᵢ (xᵢ − x̄)²] / (n − 2)
     = (n − 1)·(s_y² − b²·s_x²) / (n − 2)

where s_y = SD of the y's, s_x = SD of the x's, and the covariance is

s_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1),  so that b = s_xy / s_x²

Example: s_x = 5.51, s_y = 0.71
s_e² = (19/18)·(0.71² − 0.0744²·5.51²) = 0.35

How useful is x in predicting y?


Total variance = s_y² = 0.71² = 0.50 (the residual variance from a horizontal line); degrees of freedom = 19.

Residual variance = s_e² = 0.35 (the residual variance from the regression line on x); degrees of freedom = 18.

R² = [(n − 1) s_y² − (n − 2) s_e²] / [(n − 1) s_y²] = (19 × 0.50 − 18 × 0.35) / (19 × 0.50) = 0.34, or 34% (in percentage).

Correlation Coefficient: Another measure of strength of linear relationship

r = s_xy / (s_x s_y) = 2.26 / (5.51 × 0.71) = 0.58

A measure of linear association between x and y. Note that r² = 0.58² ≈ 0.34 = R².
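These summaries can be verified from the raw data; the sketch below (illustrative, not from the slides) recomputes the covariance, residual variance, correlation, and R² with numpy.

import numpy as np

height = np.array([164.0, 167.0, 170.4, 171.2, 171.2, 171.3, 172.0, 172.0, 174.0, 176.0,
                   177.0, 177.0, 177.0, 177.4, 178.0, 180.7, 181.0, 183.1, 183.6, 183.7])
fev1 = np.array([3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
                 3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68])
n = len(height)

sx, sy = height.std(ddof=1), fev1.std(ddof=1)     # sample SDs: about 5.51 and 0.71
sxy = np.cov(height, fev1)[0, 1]                  # sample covariance: about 2.26
b = sxy / sx**2                                   # slope: about 0.0744
se2 = (n - 1) * (sy**2 - b**2 * sx**2) / (n - 2)  # residual variance: about 0.35
r = sxy / (sx * sy)                               # correlation: about 0.58
R2 = ((n - 1) * sy**2 - (n - 2) * se2) / ((n - 1) * sy**2)   # about 0.34
print(sxy, b, se2, r, R2)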

Another Form of R²

R² = Variance(ŷ's) / Variance(y's) = s_ŷ² / s_y²

where ŷ_i = a + b x_i are the fitted line values. R-square is a simple measure of how much of the variability in y is explained by the variation in x. Large values of R-square indicate that substantial variation in y is due to variation in x; small values indicate the opposite. This measure also has disadvantages, which we will discuss when we consider multiple predictors.

Inference

How much do the slope and intercept estimates vary from sample to sample? This is measured by the standard errors of the estimates. For the slope,

SE(b) = √[ s_e² / ((n − 1) s_x²) ]

95% confidence interval:

b ± t_{0.025, n−2} × SE(b)

To test the hypothesis H0: b = b_0 versus HA: b ≠ b_0, use

t = (b − b_0) / SE(b)

Under the null hypothesis H0: b = 0, this reduces to t = b / SE(b).

Estimated Line Value

Suppose x = f, where f is not one of the observed values in the data set. What would one expect y to be on average?

y_f = a + b f

SE(y_f) = √( s_e² [ 1/n + (f − x̄)² / ((n − 1) s_x²) ] )

Example: f = 175

y_f = −9.19 + 0.0744 × 175 = 3.83

SE(y_f) = √( 0.35 × [ 1/20 + (175 − 175.38)² / (19 × 5.51²) ] ) = 0.133
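A minimal sketch of these inference computations, plugging in the rounded summary numbers from the example (so the outputs are approximate):

import numpy as np
from scipy import stats

n, sx, se2 = 20, 5.51, 0.35
b = 0.074389

SEb = np.sqrt(se2 / ((n - 1) * sx**2))        # standard error of the slope, about 0.025
tcrit = stats.t.ppf(0.975, df=n - 2)
ci = (b - tcrit * SEb, b + tcrit * SEb)       # 95% confidence interval
tstat = b / SEb                               # test of H0: b = 0; about 3.0
pval = 2 * stats.t.sf(abs(tstat), df=n - 2)   # about 0.007
print(SEb, ci, tstat, pval)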

Prediction Interval

This refers to a confidence interval for a single observation on the outcome variable for a given value of the independent variable, x = f.

y_f = a + b f

PSE(y_f) = √( s_e² [ 1 + 1/n + (f − x̄)² / ((n − 1) s_x²) ] )

Prediction interval: y_f ± t_{1−α/2, n−2} × PSE(y_f)
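A sketch of the 95% prediction interval at f = 175, again plugging in the rounded summary numbers from the example:

import numpy as np
from scipy import stats

n, xbar, sx, se2 = 20, 175.38, 5.51, 0.35
a, b, f = -9.19, 0.074389, 175

yf = a + b * f                                # estimated line value, about 3.83
PSE = np.sqrt(se2 * (1 + 1/n + (f - xbar)**2 / ((n - 1) * sx**2)))
tcrit = stats.t.ppf(0.975, df=n - 2)
print(yf - tcrit * PSE, yf + tcrit * PSE)     # interval for a single new observation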

Discrete Predictors

In the example considered so far, the independent variable was a continuous variable. Suppose now that the independent variable is binary, coded as x = 0 or x = 1. Interpretation of the regression coefficients:

E(y | x) = a + b x

E(y | x = 0) = a
E(y | x = 1) = a + b, so b = E(y | x = 1) − E(y | x = 0)

a = mean for the reference group, defined as the subjects with x = 0
b = difference in the means between the two groups, x = 1 versus x = 0

The test for the significance of b is identical to the two-sample t-test.
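The claimed equivalence is easy to demonstrate numerically; the sketch below uses groups simulated purely for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y0 = rng.normal(3.5, 0.7, size=12)        # group with x = 0
y1 = rng.normal(4.0, 0.7, size=15)        # group with x = 1
x = np.concatenate([np.zeros(12), np.ones(15)])
y = np.concatenate([y0, y1])

reg = stats.linregress(x, y)
tt = stats.ttest_ind(y1, y0)              # pooled-variance two-sample t-test
print(reg.slope, y1.mean() - y0.mean())   # identical: b equals the difference in means
print(reg.pvalue, tt.pvalue)              # identical p-values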

Multiple Predictors

E(Y | X) = a + b X. Here b ignores the influence of C on X and Y.

E(Y | X, C) = c + d X + e C

E(Y | X = x + 1, C = f) − E(Y | X = x, C = f) = [c + d(x + 1) + e f] − [c + d x + e f] = d

d = the difference in the expected values of Y associated with a one-unit positive difference in X, holding C constant.

Often in practice several variables might influence the dependent variable. Here C is a confounding variable: part of the X, Y relationship is due to their common relationship with C.

b = the unadjusted estimate; d = the estimate adjusted for C. Usually d will be smaller than b (but it can be larger than b).
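A hypothetical simulation (not from the slides) illustrating the contrast between b and d when C confounds the X, Y relationship; all effect sizes below are made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
n = 10_000
C = rng.normal(size=n)                       # confounder
X = 0.8 * C + rng.normal(size=n)             # X shares variation with C
Y = 1.0 * X + 1.5 * C + rng.normal(size=n)   # true effect of X on Y is 1.0

b = np.polyfit(X, Y, 1)[0]                   # unadjusted slope, inflated (about 1.7)

design = np.column_stack([np.ones(n), X, C]) # intercept, X, C
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
d = coef[1]                                  # adjusted slope, near the true 1.0
print(b, d)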

E(Y | X) = a + b X

E(Y | X, I) = c + d X + e I

I: an intervening variable. In the unadjusted model, b partly represents the effect of X that actually operates through I. X may act through I as well as act directly on Y; if X acts on Y entirely through I, then d will be close to 0.

The statistical effect of adjusting for I or for C on the regression coefficient of X is the same. The conceptual understanding has to distinguish between confounding and intervening variables.

The effect of X depends on M: the effect of X is modified by the presence or absence of M. [Diagram: X is related to the outcome Y0 in the group with M = 0 and to Y1 in the group with M = 1.]

E(Y | X, M) = a + b X + c M + d (X × M)

E(Y | X, M = 0) = a + b X
E(Y | X, M = 1) = (a + c) + (b + d) X

d = the extent to which the effect of X is modified by the presence of M (that is, M = 1).

d = 0, c arbitrary: parallel lines for the two groups M = 1 and M = 0
d = 0, c = 0: coincident lines
c = 0, d arbitrary: same intercept, with the lines for M = 0 and M = 1 fanning out
c, d both arbitrary: different intercepts and different slopes

Other combinations are also possible.
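A sketch fitting the interaction model by least squares on simulated data (the coefficient values below are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=n)
M = rng.integers(0, 2, size=n)                      # 0/1 group indicator
Y = 1.0 + 0.5 * X + 0.3 * M + 0.8 * X * M + rng.normal(scale=0.5, size=n)

design = np.column_stack([np.ones(n), X, M, X * M])
(a, b, c, d), *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"M = 0 line: {a:.2f} + {b:.2f} x")           # slope b
print(f"M = 1 line: {a + c:.2f} + {b + d:.2f} x")   # slope b + d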

Analysis of Cross-classified Data

So far we have concentrated on analyzing relationships between a continuous dependent variable and continuous or discrete (categorical) independent variables; the tools were regression and ANOVA. Other combinations of variable types call for other techniques:

Discrete dependent variable (e.g., Yes/No for a disease) with continuous independent variables (age, BMI, etc.): logistic regression.

Number of events as the dependent variable (e.g., the number of seizures among epileptic patients over a fixed or variable period of time): Poisson regression.

Continuous dependent variable that is truncated (or censored), for example failure time, time to death, or time to symptoms; these may be known for some individuals, while for others it is only known to exceed some known value: survival analysis.

Many times the dependent and independent variables are both discrete. The categories may be qualitative (such as gender, race/ethnicity, geographical location, type of health insurance) or quantitative/ordered (low, medium, and high socioeconomic status; none, very low, low, medium, and high doses of environmental exposure).

Situation 1: Dependent and independent variables are discrete

Example:

A study was conducted to assess the relationship between socio-economic position (as measured by home ownership) and whether the mother had a preterm delivery. A sample of 1,443 births was chosen and the following table was created.

Housing Tenure        Preterm   Term   Total
Owner-occupier             50    849     899
Council tenant             29    229     258
Private tenant             11    164     175
Lives with parents          6     66      72
Other                       3     36      39
Total                      99   1344    1443

Such a table, based on classifying subjects on two or more variables, is called a contingency table or cross-classification. Each entry is a frequency: the number of individuals with those characteristics. If the 1,443 births are a representative sample from the population, 50/1443 is the estimated population proportion of births that are both to owner-occupiers and preterm; equivalently, 50/1443 can be viewed as the estimated probability that a subject is an owner-occupier and has a preterm birth.

A: home ownership; B: delivery status.

P(A = owner-occupier, B = preterm) = 50/1443

Similarly,

P(A = owner-occupier) = 899/1443
P(B = preterm) = 99/1443

Suppose that A and B are not associated or related. Then

P(A and B) = P(A) × P(B)

The observed estimate of P(A and B) is 50/1443, and the expected value under independence (no association) is (899/1443) × (99/1443). Equivalently, the observed and expected frequencies are 50 and 1443 × (899/1443) × (99/1443) = 61.68. If the observed and expected frequencies are discrepant, then the hypothesis of no association is questionable. One way to assess the reasonableness of the hypothesis of no association is to measure the distance between the expected and observed frequencies.

Observed frequencies

Housing Tenure        Preterm    Term   Total
Owner-occupier           50       849     899
Council tenant           29       229     258
Private tenant           11       164     175
Lives with parents        6        66      72
Other                     3        36      39
Total                    99      1344    1443

Expected frequencies

Housing Tenure        Preterm    Term   Total
Owner-occupier          61.7     837.3    899
Council tenant          17.7     240.3    258
Private tenant          12.0     163.0    175
Lives with parents       4.9      67.1     72
Other                    2.7      36.3     39
Total                     99      1344   1443

The chi-square statistic is one such distance measure:

T = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)² / E_ij

where r = number of rows, c = number of columns, O_ij = observed frequency in row i and column j, and E_ij = expected frequency in row i and column j. Under the hypothesis of no association, T has a chi-square distribution with df = (r − 1)(c − 1) degrees of freedom.

For these data,

T = (50 − 61.7)²/61.7 + (849 − 837.3)²/837.3 + (29 − 17.7)²/17.7 + … + (3 − 2.7)²/2.7 + (36 − 36.3)²/36.3 = 10.5

Here r = 5 and c = 2, so df = 4. The critical value at significance level 0.05 is 9.49. The data are therefore not consistent with the hypothesis of no association between housing tenure and time of delivery; that is, there is good evidence of an association between housing tenure and time of delivery.

Note that the chi-square statistic is not a measure of association: if we double the frequencies in each cell, the association remains unchanged but the chi-square doubles. Chi-square is also a large-sample test and is questionable if any expected frequency is less than 5. Alternatives are the Yates correction and Fisher's exact test.
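If the analysis is done in Python rather than by hand, scipy reproduces the observed-versus-expected comparison; a short sketch on the housing-tenure table:

import numpy as np
from scipy import stats

observed = np.array([[50, 849],   # owner-occupier
                     [29, 229],   # council tenant
                     [11, 164],   # private tenant
                     [6, 66],     # lives with parents
                     [3, 36]])    # other

T, p, df, expected = stats.chi2_contingency(observed)
print(T, df, p)                   # about 10.5 on 4 df, p about 0.03
print(expected.round(1))          # matches the expected-frequency table above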

Yates Correction

T_Y = Σ (|O − E| − 0.5)² / E

Fisher's exact test is based on computing the probability of observing a particular contingency table, or tables that are more inconsistent with the hypothesis of no association. It is a complicated algorithm and is usually performed using a computer.

Example: The following table is from a trial investigating the efficacy of streptomycin for the treatment of pulmonary tuberculosis. The data are for the subgroup of patients with an initial temperature of 100–100.9°F. The two variables are the radiological assessment of the disease 6 months later and the treatment.

Radiological assessment   Streptomycin   Control
Improvement                    13            5
Deterioration                   2            7
Death                           0            5

Pooled table

Radiological assessment      Streptomycin   Control
Improvement                      13 (a)       5 (b)
Deterioration or Death            2 (c)      12 (d)

Odds of improvement in the streptomycin group = 13/2 = a/c
Odds of improvement in the control group = 5/12 = b/d
Odds ratio = (13/2) / (5/12) = 15.6 = (a × d) / (b × c)

log(OR) = 2.75
var(log(OR)) = 1/a + 1/b + 1/c + 1/d = 1/13 + 1/5 + 1/2 + 1/12 = 0.86, so SE = 0.93

Confidence interval: exp[ log(OR) − z × SE(log(OR)) ] to exp[ log(OR) + z × SE(log(OR)) ]

95% confidence interval for the log odds ratio: (2.75 − 1.96 × 0.93, 2.75 + 1.96 × 0.93) = (0.93, 4.57)
95% confidence interval for the odds ratio: (e^0.93, e^4.57) = (2.53, 96.54)
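A sketch reproducing the odds-ratio confidence interval, with Fisher's exact test on the same pooled table via scipy:

import numpy as np
from scipy import stats

a, b, c, d = 13, 5, 2, 12
or_hat = (a * d) / (b * c)              # 15.6
se = np.sqrt(1/a + 1/b + 1/c + 1/d)     # about 0.93
lo, hi = np.exp(np.log(or_hat) + np.array([-1, 1]) * 1.96 * se)
print(or_hat, lo, hi)                   # about (2.5, 96.5)

oddsratio, p = stats.fisher_exact([[13, 5], [2, 12]])
print(p)                                # exact two-sided p-value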

The analysis so far involved only two variables. What do we do if we have more than two? For example, suppose we want to adjust for age and other confounding variables while assessing the association between treatment and outcome (or between home ownership and time of delivery). The technique for this is called logistic regression.

Matched study: Binary Outcome


A questionnaire was administered to 1,319 school children at ages 12 and 14. One question asked was whether severe colds had been reported, and the objective is to assess whether the prevalence of reported symptoms was different at the two ages. The following two-by-two table gives the results:

                          Severe colds at age 14
Severe colds at age 12       Yes      No
Yes                          212     144
No                           256     707

As in the paired t-test example, we want to exploit the fact that the same children were asked at age 12 and then again at age 14. The concordant pairs are (yes, yes) and (no, no); the discordant pairs are (yes, no) and (no, yes).

H0: the prevalence of reported symptoms is the same at the two ages. Under the null hypothesis, the proportion of subjects answering (yes, no) should be the same as the proportion answering (no, yes).

This is called McNemar's test.

Observed: f_yn = 144, f_ny = 256
Expected under H0: (f_yn + f_ny)/2 = 200

Chi-square = [f_yn − (f_yn + f_ny)/2]² / [(f_yn + f_ny)/2] + [f_ny − (f_yn + f_ny)/2]² / [(f_yn + f_ny)/2] = (f_yn − f_ny)² / (f_yn + f_ny) = (144 − 256)² / 400 = 31.4, with df = 1.

The chi-square value is highly significant. The proportions at the two ages are not the same.

The odds of a transition from No to Yes are 256/144 = 1.78. This is a conditional analysis (conditional on making a transition).
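McNemar's statistic needs only the two discordant counts, so a hand computation in Python is short (using the algebraically simplified form above):

from scipy import stats

f_yn, f_ny = 144, 256
T = (f_yn - f_ny) ** 2 / (f_yn + f_ny)   # (144 - 256)^2 / 400 = 31.4
p = stats.chi2.sf(T, df=1)               # p-value on 1 df, far below 0.05
print(T, p)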

Nonparametric Approaches

Most statistical approaches discussed so far assume some distribution for the population (mostly the normal). Approaches such as the one- and two-sample t-tests, linear regression, etc. remain valid unless the departure from normality is very severe. Nevertheless, it is useful to have a set of techniques that can be applied without any distributional assumptions. The sign test discussed earlier in the course is an example of a nonparametric test. However, that procedure can have low power because it uses only the signs of the observations and not their magnitudes. An alternative is to use the magnitudes in some way while still maintaining the nonparametric nature of the test. Rank-based procedures are quite popular.

Paired Designs

Revisit the husband-wife pair example


Wilcoxon signed rank procedure:
Step 1: Rank the absolute values of the differences.
Step 2: Take the difference between the sum of the ranks of the positive differences and the sum of the ranks of the negative differences.

Pair   Difference (d)   Rank of |d|
 1          2.3              7
 2         −1.1              4.5
 3          0.8              3
 4          2.4              8
 5         −3.1              9
 6         −3.2             10
 7         −0.6              1.5
 8          0.6              1.5
 9         −1.1              4.5
10         −1.5              6

w = (7 + 3 + 8 + 1.5) − (4.5 + 9 + 10 + 1.5 + 4.5 + 6) = 19.5 − 35.5 = −16

E(w) = 0; Var(w) = n(n + 1)(2n + 1)/6 = 10 × 11 × 21/6 = 385

z = (w − E(w)) / √Var(w) = −16 / √385 = −0.82
Null hypothesis: the median of the distribution of differences is zero. Nonparametric procedures formulate hypotheses in terms of medians rather than means.
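scipy implements the signed rank test directly; note that its statistic is reported as the smaller of the two rank sums (19.5 here) rather than their difference w.

import numpy as np
from scipy import stats

d = np.array([2.3, -1.1, 0.8, 2.4, -3.1, -3.2, -0.6, 0.6, -1.1, -1.5])
stat, p = stats.wilcoxon(d)
print(stat, p)   # the two-sided p-value is large, consistent with z = -0.82 above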

Two sample nonparametric tests


These are the analogs of the two-sample t-test. Mann-Whitney-Wilcoxon test:

Take a sample of size n from population 1 and a sample of size m from population 2. Rank all (n + m) units regardless of population. Sum the ranks of the subjects in sample 1 and call it T. Define U = T − n(n + 1)/2. Then

z = [U − mn/2] / √( mn(m + n + 1)/12 )

Alternatively, one can sum the ranks of subjects in sample 2 and then replace n by m

Null hypothesis: the distributions in the two populations are the same.

Example: The following data give biceps skinfold measurements for 20 patients with Crohn's disease and 9 patients with coeliac disease. The objective is to assess whether the distributions of the biceps measurements are the same.

Crohn's disease: 1.8, 2.2, 2.4, 2.5, 2.8, 2.8, 3.2, 3.6, 3.8, 4.0, 4.2, 4.4, 4.8, 5.6, 6.0, 6.2, 6.6, 7.0, 10.0, 10.4
Coeliac disease: 1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6

Rank all 29 observations, using average ranks for ties (values from the coeliac sample, sample 2, are marked with *):

Value  1.8  1.8*  2.0*  2.0*  2.0*  2.2  2.4  2.5  2.8  2.8  3.0*  3.2  3.6  3.8  3.8*
Rank   1.5  1.5    4     4     4     6    7    8   9.5  9.5   11   12   13  14.5 14.5

Value  4.0  4.2  4.2*  4.4  4.8  5.4*  5.6  6.0  6.2  6.6  7.0  7.6*  10.0  10.4
Rank   16  17.5 17.5   19   20   21    22   23   24   25   26   27     28    29

Rank sum for sample 2 = 1.5 + 4 + 4 + 4 + 11 + 14.5 + 17.5 + 21 + 27 = 104.5

U = 104.5 − 9 × 10/2 = 59.5

z = (59.5 − 9 × 20/2) / √( 9 × 20 × (9 + 20 + 1)/12 ) = −1.44

Two-sided p-value = 0.15


This is very similar to the result one obtains using a two-sample t-test.
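The same example in scipy, so the rank sum and U statistic above can be checked directly:

import numpy as np
from scipy import stats

crohns = np.array([1.8, 2.2, 2.4, 2.5, 2.8, 2.8, 3.2, 3.6, 3.8, 4.0,
                   4.2, 4.4, 4.8, 5.6, 6.0, 6.2, 6.6, 7.0, 10.0, 10.4])
coeliac = np.array([1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6])

U, p = stats.mannwhitneyu(coeliac, crohns, alternative="two-sided")
print(U, p)   # U = 59.5, two-sided p about 0.15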

Generalizations
What if you have more than two groups? Rank all the observations regardless of group and then perform a one-way analysis of variance on the ranks (this is essentially the Kruskal-Wallis test). The null hypothesis is that the distributions for the various populations defined by the groups are the same. You can obtain the ranks by using PROC RANK in SAS; see the handout for an example.
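In Python the same rank-based comparison of several groups is available as the Kruskal-Wallis test in scipy; the three groups below are hypothetical placeholders, not course data.

from scipy import stats

g1 = [3.2, 4.1, 2.8, 5.0, 3.9]
g2 = [4.8, 5.6, 6.1, 5.2, 4.9]
g3 = [2.9, 3.0, 3.7, 4.4, 3.3]
H, p = stats.kruskal(g1, g2, g3)   # H statistic and p-value on (groups - 1) df
print(H, p)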
