
# BIOSTATISTICS NOTES

by Dr PHUA Kai Lit Community Medicine Section International Medical University Kuala Lumpur, Malaysia

## DESCRIPTIVE STATISTICS
POPULATION: The universe of all units being studied. If we want to study lung cancer among Malaysians, then the study population will be all Malaysians. If we want to study lung cancer among Malaysian women, then the population will be all Malaysian women.

SAMPLE: A subset of the population.

RANDOM SAMPLE: Each member of the population has an EQUAL CHANCE of being chosen for the sample (simple random sample).

SAMPLING METHODS (#5 is nonrandom)

#1 Simple Random Sample

#2 Systematic Sample. Example: Rank 100 people by age. Beginning with the 5th person, choose every tenth person, i.e. choose the 5th, 15th, 25th ... 85th, 95th persons.

#3 Stratified Sample: The composition of the sample reflects the composition of the population (proportional stratified sample).

| Group | Malaysian population | Proportional stratified sample of 1000 Malaysians |
| --- | --- | --- |
| Malays | 60% | 600 (60%) |
| Chinese | 25% | 250 (25%) |
| Indians | 10% | 100 (10%) |
| Others | 5% | 50 (5%) |

#4 Cluster Sample

1. Divide the population into groups.
2. Choose a random sample of groups.
3. Count every unit in each and every group selected.

Example: Divide the entire city into "city blocks". Choose a random sample of blocks. Count every person in each city block selected.

#5 Nonrandom/Convenience Sample. Example: Interview people in a shopping mall. This is nonrandom because not everyone in the population goes shopping at the mall. People who are wheelchair-bound or who have problems walking are less likely to visit the mall. Also, the mall may be too far away for those who don't have cars. If the interview is done on a weekday, people who work are unlikely to be at the mall (housewives and retirees are more likely to be interviewed on weekdays).

VALIDITY: You are actually measuring what you want to measure. Example: if IQ really measures intelligence, then IQ is high in validity. If it does not actually measure intelligence, then it is low in validity.

RELIABILITY: Refers to stability of measurement. Example: A research instrument is high in reliability if it gives consistent readings when it is used on the same subject, even if the subject is measured at different times. An example of a "research instrument" is a survey questionnaire.

MEASURES OF CENTRAL TENDENCY (used to summarise a dataset)

1. Arithmetic Mean - affected by extreme values
2. Median
3. Mode

MEASURES OF DISPERSION (used to determine how spread out a dataset is)

1. Range - difference between the highest and lowest values
2. Inter-quartile range - difference between the third and first quartiles
3. Variance
4. Standard Deviation - the higher the Standard Deviation, the more spread out the data. The Standard Deviation is the square root of the Variance.
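The measures of central tendency and dispersion listed above can be sketched with Python's standard `statistics` module. The dataset is made up purely for illustration. The quartiles use the module's default method, and other quartile conventions can give a slightly different inter-quartile range.

```python
import statistics

# Hypothetical dataset, chosen only for illustration
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency
mean = statistics.mean(data)        # arithmetic mean
median = statistics.median(data)    # middle value of the sorted data
mode = statistics.mode(data)        # most frequent value

# Measures of dispersion
value_range = max(data) - min(data)            # highest minus lowest value
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles (module default method)
iqr = q3 - q1                                  # inter-quartile range
variance = statistics.variance(data)           # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)               # square root of the sample variance

# One extreme value shifts the mean far more than the median
with_outlier = data + [100]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print("mean, median, mode:", mean, median, mode)
print("range, IQR, variance, SD:", value_range, iqr, variance, round(std_dev, 2))
print("with outlier -> mean:", mean_outlier, "median:", median_outlier)
```

Adding the single value 100 moves the mean much further than the median, which is why the mean is described as "affected by extreme values".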
THE STANDARD DEVIATION IS A VERY IMPORTANT MEASURE - Under a Normal Curve, 68.3% of the data are found within 1 standard deviation of the mean, 95.4% of the data are found within 2 standard deviations of the mean, and 99.7% are found within 3 standard deviations of the mean.
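These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is introduced here for illustration only.

```python
from statistics import NormalDist

# Standardised normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k) * 100:.1f}%")
```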

LEVEL OF MEASUREMENT OF DATA

1. Nominal data: qualitative, categorical data. Example: ethnicity, sex, religion.
2. Ordinal data: Rank-ordered data. Data are grouped from low to high, but we cannot say how much lower or how much higher. Example: "low anxiety", "moderate anxiety" and "high anxiety".
3. Interval data: quantitative data with an arbitrary zero. Example of interval data: temperature measured using the Celsius scale.
4. Ratio level data: Similar to Interval Data but, in addition, it has an absolute zero (true zero), e.g. length, income, temperature measured using the Kelvin scale.

NOTE: For Ratio Data, we can use ratio level, interval level, ordinal level and nominal level statistical tests. For Interval Data, we can use interval level, ordinal level and nominal level tests. For Ordinal Data, we can use ordinal level and nominal level tests. But for Nominal Data, we can only use nominal level statistical tests.

## INFERENTIAL STATISTICS/STATISTICAL INFERENCE

If we are doing research on a large population, we need not study each and every individual in the population. All we need to do is choose a sample (RANDOM and REPRESENTATIVE) from the population. We can use our findings from the sample to infer (draw conclusions) about the population.

Research Hypothesis/Alternative Hypothesis: the hypothesis we wish to confirm. Also called H-one and written as H subscript one.

Null Hypothesis: opposite of the Research Hypothesis. Also called H-nought and written as H subscript zero.

Examples:

Research Hypothesis - there is a statistically significant association between X and Y

Null Hypothesis - there is no statistically significant association between X and Y. Any association seen is due to chance

Research Hypothesis - there is a statistically significant difference between the two population means

Null Hypothesis - there is no statistically significant difference between the two population means. Any difference seen is due to chance

Significance Level (denoted by the Greek letter alpha; the corresponding Confidence Level is 1 - alpha)

Usually set at 0.05 or 0.01. An alpha of 0.05 means we reject the Null Hypothesis only if a result at least as extreme as the one we see would occur by chance 5% of the time or less when the Null Hypothesis is true. An alpha of 0.01 means the same with a threshold of 1% or less. (An alpha of 0.05 can also be interpreted as a 5% probability of rejecting a true Null Hypothesis.)

Type 1 Error and Type 2 Error

Type 1 Error (its probability is alpha): reject a true Null Hypothesis

Type 2 Error (its probability is beta): accept (fail to reject) a false Null Hypothesis

Power of a test

Power = 1 - beta (i.e. 1 - probability of Type 2 Error). Normally, the power of a test should be at least 80%. With 80% power, if an effect really exists, the probability of detecting it is 80% and the probability of not detecting it is 20%. The power of a study can be raised by increasing the sample size.

Interpretation of Statistical Output

We reject the Null Hypothesis if the test statistic, e.g. the calculated chi-square or the calculated t, equals or exceeds the critical value. We accept (strictly: "fail to reject") the Null Hypothesis if the test statistic does not equal or exceed the critical value.

p-value: If p < 0.05, there is less than a 5% probability of seeing a result at least as extreme as ours by chance alone when the Null Hypothesis is true. If p < 0.01, that probability is less than 1%. Thus, the p-value is the probability of obtaining a relationship (e.g., between variables) or a difference (e.g., between means) in a sample at least as large as the one observed, if in the population from which the sample was drawn no such relationship or difference actually exists.

One-tail test and Two-tail test

Two-tail test: we test to see if there is a difference between X and Y

One-tail test: we test to see if there is a difference between X and Y in one particular direction. Example: We test to see if X > Y. Example: We test to see if X < Y
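The difference between a one-tail and a two-tail test can be illustrated with a short sketch. For simplicity it assumes the test statistic follows the standardised normal distribution (a z statistic), which these notes do not cover; the value 1.8 is hypothetical. The same statistic is significant at alpha = 0.05 in a one-tail test but not in a two-tail test.

```python
from statistics import NormalDist

ALPHA = 0.05
z = 1.8  # hypothetical test statistic (assumed to be standard normal under H0)

std_normal = NormalDist()

# One-tail test (H1: X > Y): area in the upper tail only
one_tail_p = 1 - std_normal.cdf(z)

# Two-tail test (H1: X differs from Y): area in both tails
two_tail_p = 2 * (1 - std_normal.cdf(abs(z)))

def decision(p, alpha=ALPHA):
    """Apply the decision rule: reject H0 when p is at or below alpha."""
    return "reject H0" if p <= alpha else "fail to reject H0"

print(f"one-tail p = {one_tail_p:.4f}: {decision(one_tail_p)}")
print(f"two-tail p = {two_tail_p:.4f}: {decision(two_tail_p)}")
```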

Standardised Normal Distribution: See description under "Standard Deviation". In a normal distribution, the mean, median and mode are equal. The curve is bell-shaped and symmetrical. The standardised normal distribution curve has a mean of zero and a standard deviation of 1.

Degree of freedom: the degree of freedom depends on the sample size or the number of categories. The critical value of a statistical test changes with changes in the degree of freedom.

COMMON STATISTICAL TESTS

1. Chi-square test of goodness-of-fit, single sample (NOMINAL DATA). Degree of freedom is n-1 (where n = number of categories).
2. Chi-square test of independence (NOMINAL DATA). Degree of freedom is (r-1) X (c-1) where r = number of rows and c = number of columns in a contingency table. To test if there is a statistically significant association between two variables.
3. t-test for two independent samples (INTERVAL DATA). Degree of freedom = (n1-1) + (n2-1). To test if there is a statistically significant difference between the means of the two populations from which the samples were drawn.
4. t-test for two matched samples (INTERVAL DATA). Degree of freedom is n-1 where n = the number of pairs. To test if there is a statistically significant difference between the two population means from the two matched samples.
5. If there are more than two samples, we use the ANOVA test (Analysis of Variance).

NOTE: IF THERE ARE MORE THAN TWO SAMPLES, IT IS INCORRECT TO USE THE T-TEST TO MAKE PAIRWISE COMPARISONS. Example: if there are 3 samples, it is incorrect to compare mean #1 with mean #2, mean #1 with mean #3, and mean #2 with mean #3, because each additional comparison raises the overall probability of a Type 1 Error. The ANOVA test should be done on the three means instead.

CHOOSING A TEST

1. What is the level of measurement? Nominal, ordinal or interval?
2. How many samples? One, two or more?
3. If two samples, are they independent or paired/matched?
4. Choose the test. Make sure the assumptions of the test are not violated.

ASSUMPTIONS OF CHI-SQUARE TEST OF INDEPENDENCE

1. Nominal data (ordinal data also OK)
2. 25 ≤ n ≤ 250 (preferably)
3. Random sample
4. Expected value of each cell is at least 5 (if not, you should combine some of the categories)

INTERPRETING RESULTS OF CHI-SQUARE TEST

H-nought is "There is no association between X and Y. Any association seen is due to chance alone"

H-one is "There is a statistically significant association between X and Y"

Reject H-nought if the calculated chi-square equals or exceeds the critical value

Reject H-nought if p is less than or equal to 0.05 (if testing at alpha = 0.05)

ASSUMPTIONS OF T-TEST FOR TWO INDEPENDENT SAMPLES

1. Random
2. Interval data
3. Normal distribution in both groups
4. Preferably n < 30 (for each sample)

INTERPRETING RESULTS OF T-TEST

H-nought is "There is no difference between population mean X and population mean Y. Any difference seen is due to chance alone"

H-one is "There is a difference between population mean X and population mean Y"

Reject H-nought if the calculated t statistic (its absolute value, for a two-tail test) equals or exceeds the critical value

Reject H-nought if p is less than or equal to 0.05 (if testing at alpha = 0.05)

IMPORTANT: What is STATISTICALLY SIGNIFICANT may not be CLINICALLY SIGNIFICANT.
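The two decision rules above can be sketched in plain Python. All data are hypothetical, the function names are introduced here for illustration, and the t statistic uses the pooled-variance (equal variances) form, which is an assumption since the notes do not say which form is intended. The critical values 3.841 (chi-square, 1 degree of freedom) and 2.306 (t, 8 degrees of freedom, two-tail) are taken from standard tables for alpha = 0.05.

```python
import math
import statistics

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count of a cell = row total x column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)  # (r-1) x (c-1)
    return chi2, df, expected

def pooled_t(sample1, sample2):
    """Return (t statistic, degrees of freedom) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    df = (n1 - 1) + (n2 - 1)
    pooled_var = ((n1 - 1) * statistics.variance(sample1)
                  + (n2 - 1) * statistics.variance(sample2)) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / std_error
    return t, df

# Hypothetical 2 x 2 contingency table (n = 100, every expected count is 25)
chi2, chi_df, expected = chi_square_independence([[30, 20], [20, 30]])
CHI2_CRITICAL = 3.841  # table value: df = 1, alpha = 0.05
print(f"chi-square = {chi2:.3f}, df = {chi_df}, reject H0: {chi2 >= CHI2_CRITICAL}")

# Hypothetical samples of 5 measurements each
t, t_df = pooled_t([12, 14, 15, 16, 18], [8, 10, 11, 12, 14])
T_CRITICAL = 2.306  # table value: df = 8, two-tail, alpha = 0.05
print(f"t = {t:.3f}, df = {t_df}, reject H0: {abs(t) >= T_CRITICAL}")
```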

## CORRELATION AND REGRESSION

Correlation: a measure of how two variables go together. Pearson's r (also called Pearson's correlation coefficient) is a measure of the linear relationship between two variables. A value of +1 means a perfect positive linear relationship. A value of -1 means a perfect negative linear relationship. A value of 0 means no linear relationship.

Assumptions for using Pearson's r:

1. Randomness
2. Linear relationship exists
3. Both variables are normally distributed
4. Variables measured at Interval level

It is incorrect to use r for variables measured at nominal or ordinal level.
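As a minimal sketch, Pearson's r can be computed directly from its definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name and datasets are hypothetical. The last example shows that r = 0 only rules out a linear relationship: there Y depends entirely on X, but along a curve.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfect positive linear relationship
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfect negative linear relationship
print(pearson_r(x, [2, 4, 5, 4, 5]))    # imperfect positive relationship

# Y = X squared: a perfect but nonlinear relationship
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```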

Correlations can also be nonlinear. For nonlinear correlations, we do not use Pearson's r but some other correlation coefficient.

NOTE: CORRELATION DOES NOT IMPLY CAUSATION. Just because two variables are correlated does not necessarily mean that one causes the other.

Regression: used to predict how independent variables (X1, X2 etc) affect a dependent variable (Y).

Simple Regression: Has only 1 dependent variable Y and 1 independent variable X

Multiple Regression: Has 1 dependent variable Y but two or more independent variables X1, X2 etc

Example:

Simple regression - predict INCOME (Y) from YEARS OF EDUCATION (X)

Multiple regression - predict INCOME (Y) from YEARS OF EDUCATION (X1) and YEARS OF WORKING EXPERIENCE (X2)

(Variables measured at the nominal level such as "Sex" can also be used as independent variables in regression. They are used as "dummy variables".)

r-square: An indicator of the "amount of variance of the dependent variable accounted for" by the regression equation. Also called the coefficient of determination. The higher the r-square, the better the regression line fits the data.

Regression coefficient: If we have a regression equation Y = 0.3X1 + 4X2, then the regression coefficient of X1 is 0.3 and the regression coefficient of X2 is 4. This means that when X1 increases by 1 unit (with X2 held constant), Y will increase by 0.3 units. Also, when X2 increases by 1 unit (with X1 held constant), Y will increase by 4 units.

Example: The diastolic blood pressures of a sample of men aged 30-50 are plotted against their ages.

Y = 40 + 1.5X (Y = diastolic blood pressure, X = age)

Interpretation: For these men, each additional year of age is associated with an increase of 1.5 mm Hg in diastolic bp. If a man is 50 years old, the predicted diastolic bp is 40 + 1.5(50) = 115 mm Hg.

NOTE: It is incorrect to extrapolate in regression analysis, i.e. if your sample consists of men aged 30 - 50, you should not use the regression model to predict the blood pressure of men whose ages are below 30 or above 50.
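The blood pressure example and the warning against extrapolation can be sketched as a small function. The function and constant names are hypothetical; the equation and the age range of 30 to 50 come from the example above.

```python
INTERCEPT = 40.0           # from Y = 40 + 1.5X
SLOPE = 1.5                # mm Hg per year of age
MIN_AGE, MAX_AGE = 30, 50  # age range of the sample used to fit the model

def predict_diastolic_bp(age):
    """Predict diastolic bp (mm Hg) from age, refusing to extrapolate."""
    if not MIN_AGE <= age <= MAX_AGE:
        raise ValueError(
            f"age {age} is outside the sampled range {MIN_AGE}-{MAX_AGE}; "
            "the model should not be used to extrapolate"
        )
    return INTERCEPT + SLOPE * age

print(predict_diastolic_bp(50))  # 40 + 1.5(50) = 115.0

try:
    predict_diastolic_bp(25)
except ValueError as error:
    print("refused:", error)
```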

WHAT TO LOOK FOR IN A GOOD REGRESSION MODEL

1. Are the dependent variable and independent variables properly "operationalised" (defined and measured)?
2. Do the relationships between the dependent variable and independent variables make sense?
3. Is the relationship between the dependent variable and each independent variable linear in nature? Do scatterplots.
4. Examine the r-square. The higher the r-square, the "better" the model.
5. Examine the sign of each regression coefficient. Do they make sense?
6. Check if each of the regression coefficients is statistically significant (p ≤ 0.05 or p ≤ 0.01).
7. If there is more than one regression model, compare and contrast them.

## OTHER TERMS TO KNOW

Confidence Interval: The interval within which something is likely to be found. A 95% Confidence Interval for the population mean indicates (loosely speaking) that there is a 95% probability that the population mean actually lies within that particular Confidence Interval. Strictly, if we drew many samples and calculated a 95% Confidence Interval from each one, about 95% of those intervals would contain the population mean.

Skew: If the longer tail of a curve is on the left (the peak leans to the right), it is skewed to the left. If the longer tail is on the right (the peak leans to the left), it is skewed to the right.

Nonparametric Tests: Statistical tests which make no assumptions about the form of the parent (population) distribution. Tests involving ranks of data are nonparametric.

Parametric Tests: Statistical tests which assume that the population distribution has a particular form, e.g. a normal distribution. The t-test is a parametric test as it assumes a normal distribution.

Standard Error of the Mean: We take many samples of the same size from a population. For each sample, we calculate its mean. We then plot the sample means and we will get a curve. The curve will have a standard deviation. This standard deviation is the standard error of the mean. It is used to calculate confidence intervals. The smaller the standard error of the mean, the more closely the sample mean estimates the true population mean.

Odds Ratio: The odds of an outcome in one group divided by the odds of the same outcome in another group. An Odds Ratio of 1 means the odds are the same in both groups. This is increasingly used in research nowadays.
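The link between the standard error of the mean and a confidence interval can be sketched as follows. This rests on two assumptions not stated in the notes: the standard error is estimated from a single sample as the sample standard deviation divided by the square root of n, and the sample is large enough for the normal approximation (for small samples a t critical value would be used instead). The function name and summary figures are hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Return (standard error, (lower, upper)) using the normal approximation."""
    sem = sample_sd / math.sqrt(n)                      # standard error of the mean
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sem
    return sem, (sample_mean - margin, sample_mean + margin)

# Hypothetical summary data: mean 120, standard deviation 15, n = 100
sem, (low, high) = mean_confidence_interval(120, 15, 100)
print(f"SEM = {sem:.2f}, 95% CI = ({low:.2f}, {high:.2f})")

# Quadrupling the sample size halves the standard error
sem_large, _ = mean_confidence_interval(120, 15, 400)
print(f"SEM with n = 400: {sem_large:.2f}")
```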
