
STATA WORKSHOP DAY 2 (BASIC ANALYSIS BY USING STATA SOFTWARE)

Introduction
This note is a guide to the STATA course. It works through a sample data set in STATA format
(student_analysis.dta), which continues from the previous workshop (Day 1). Having explored and
cleaned the data, we now focus on basic statistical analysis.

Analysis
1. One sample t-test
2. Independent t-test
3. Paired t-test
4. One-way Analysis of Variance
5. Categorical Test
6. Estimation of Risks
7. Correlation

Opening Log File (filename_date.log or filename_date.smcl)


Open a log file before doing anything else, because the log captures every command and all output
shown in the Results window from that point on.
1. Select: File > Log > Begin

2. A dialog box, Begin logging Stata output, is open


Notes:
There are two formats for saving a log file:
(1) *.smcl The results mimic those displayed in the Results window.
(2) *.log The results are plain text; the file can be opened in Notepad, including on computers
without the STATA software.
Command:
. log using "C:\Users\filename_date.log"
Or
. log using "C:\Users\filename_date.smcl"
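The logging workflow can be sketched in command form as below. The text and replace options and the closing log close are additions that do not appear in this note: text forces the plain-text format, replace overwrites an existing file of the same name, and log close stops logging. The file name is the placeholder used above, written here to the current working directory as an assumption.

```stata
* Close any log that is already open (no error if none is open).
capture log close
* Start a plain-text log in the current working directory.
log using "filename_date.log", text replace
display "commands and output from here on are captured"
* Stop logging when the session is finished.
log close
```

Leaving a log open is a common source of "log file already open" errors in the next session, which is why the sketch begins with capture log close.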
ANALYSIS 1: One Sample t-test
Research question: The researchers randomly recruited 438 students and assessed their systolic
pressure. Does their mean systolic pressure differ from the standard population mean of 120 mmHg?
Step 1: Hypothesis
H0: The mean systolic pressure is 120 mmHg.
HA: The mean systolic pressure is different from 120 mmHg.
Step 2: Level of significance
α = 0.05
Step 3: Checking assumptions
1. Random sample
2. Independent sample
Step 4: Statistical test
1. Select: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test (mean-comparison test)

2. A dialog box is open

Command: ttest variable == testvalue


3. Output:

Step 5: Interpretation
The 95% confidence interval of the mean systolic pressure (114.69, 116.60) does not include 120;
equivalently, the interval for the difference from 120 does not include zero.
p-value < 0.001, reject H0
Step 6: Conclusion
At the 5% level of significance, the mean systolic pressure is different from 120 mmHg. The mean
systolic pressure is 115.65 mmHg (95% CI: 114.69, 116.60), lower than the population mean.
Step 7: Presentation of results
Table 1: Comparison of mean systolic pressure to the population mean of 120 mmHg (n = 438)
Parameter           Mean (SD)        95% Confidence Interval   t-statistic^a (df)   p-value
Systolic pressure   115.65 (10.21)   114.69, 116.60            -8.93 (437)          < 0.001
a One sample t-test analysis was applied
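As a rough check, the summary statistics in Table 1 can be passed to Stata's immediate command ttesti, which runs the same test without the raw data. This is a sketch rather than part of the workshop procedure. Because the inputs are rounded to two decimals, the t statistic comes out near -8.92 instead of exactly -8.93.

```stata
* One-sample t test from summary statistics.
* Syntax: ttesti #obs #mean #sd #val
ttesti 438 115.65 10.21 120
* Show the stored results that correspond to Table 1.
display "t = " %6.2f r(t) "  df = " r(df_t) "  p = " %6.4f r(p)
```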

ANALYSIS 2: Independent t-test


Research question: The researchers wish to know whether mean BMI differs between males and females.

Step 1: Hypothesis
H0: The mean BMI of males and females is the same.
HA: The mean BMI of males and females is different.
Step 2: Level of significance
α = 0.05
Step 3: Assumptions
1. Random sample
2. Two samples are independent
3. Two populations are normally distributed
There are two ways of checking normal distribution:
(i) Histogram with an overlaid normal curve
a) Select: Graphics > Histogram

b) A dialog box is open

Command: histogram outcomevar, normal by(groupingvar)



c) Output:

d) Interpretation: Normality assumption is met.


(ii) Box and whisker plot
a) Select: Graphics > Box plot

b) A dialog box is open

Command: graph box outcomevar, by(groupingvar)

c) Output:

d) Interpretation: Normality assumption is met


4. Two populations have the same variances (homogeneity of variances)
Checking homogeneity of variances through the variance-comparison (F) test:
a) Select: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > Variance-comparison test

b) A dialog box is open

Command: sdtest outcomevar, by(groupingvar)

c) Output:
Notes:
sdtest performs the variance-ratio F test. Levene's test is a separate command:
robvar outcomevar, by(groupingvar)
Hypothesis of the variance-comparison test
H0: The variances between groups are the same
HA: The variances between groups are different (read the two-sided p-value, Ha: ratio != 1)
d) Interpretation:
p-value > 0.05 (do not reject H0)
The variances are equal. Thus, the assumption of homogeneity of variances is met.
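The variance comparison can also be sketched from summary statistics alone with sdtesti, the immediate form of sdtest. The group sizes and standard deviations below are those reported in Table 2. The means are not needed and are entered as missing (.). With rounded inputs the F ratio is about 1.06.

```stata
* Variance-ratio F test from summary statistics.
* Syntax: sdtesti #obs1 {#mean1 | .} #sd1 #obs2 {#mean2 | .} #sd2
sdtesti 196 . 3.59 242 . 3.49
* F is the ratio of the two sample variances; r(p) is the two-sided p-value.
display "F = " %5.2f r(F) "  two-sided p = " %5.3f r(p)
```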
Step 4: Statistical test
1. Select: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test (mean-comparison test)

2. A dialog box is open:

Command: ttest outcomevar, by(groupingvar)



3. Output:

Step 5: Interpretation
The 95% confidence interval of the mean difference in BMI does not include zero.
p-value < 0.001, reject H0
Step 6: Conclusion
At the 5% level of significance, the mean BMI differs between males and females. The mean BMI
(± standard error) of males, 23.97 ± 0.26, is higher than that of females, 20.89 ± 0.22.
Step 7: Presentation of results
Table 2: Mean comparison of BMI between gender (n = 438)
Group (n)      Mean (SD)      Mean difference (95% CI)   t-statistic^a (df)   p-value
Male (196)     23.97 (3.59)   3.08 (2.41, 3.75)          9.07 (436)           < 0.001
Female (242)   20.89 (3.49)
a Independent t-test was applied
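The figures in Table 2 can be checked with the two-sample form of ttesti, using only the group sizes, means and standard deviations. This is a sketch, not the workshop procedure. By default ttesti assumes equal variances, which matches the assumption checked in Step 3.

```stata
* Two-sample t test (equal variances) from summary statistics.
* Syntax: ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2
ttesti 196 23.97 3.59 242 20.89 3.49
* The t statistic and degrees of freedom correspond to Table 2.
display "t = " %5.2f r(t) "  df = " r(df_t)
```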

ANALYSIS 3: Paired t-test


Research question: The researchers conducted an extensive aerobic activity and wish to assess the
change in BMI from before to after the activity.
Step 1: Hypothesis
H0: There is no difference in mean BMI before and after the aerobic activity.
HA: The mean BMI before and after the aerobic activity are different.
Step 2: Level of significance
α = 0.05

Step 3: Checking assumptions


1. Random sample
2. The two samples are dependent (paired).
3. The differences between the paired observations are normally distributed.
Since this analysis involves paired samples, normality is checked on the difference between the two
measurements.
a) Select: Data > Create or change data > Create new variable

b) A dialog box is open

Command: generate newvardiff = prevar - postvar


c) Output:

There are two ways of checking normality of distribution:


(i) Histogram with an overlaid normal curve
a) Select: Graphics > Histogram

b) A dialog box is open

Command: histogram newvardiff, normal


c) Output:

d) Interpretation: Normality assumption of BMI difference is met.


(ii) Box and whisker plot
a) Select: Graphics > Box plot


b) A dialog box is open

Command: graph box newvardiff


c) Output:

d) Interpretation: Normality assumption of BMI difference is met.


Step 4: Statistical test
1. Select: Statistics > Summaries, tables, and tests > Classical tests of hypotheses > t test (mean-comparison test)


2. A dialog box is open

Command: ttest prevar == postvar


3. Output:

Step 5: Interpretation
The 95% confidence interval of the mean difference in BMI does not include zero.
p-value < 0.001, reject H0
Step 6: Conclusion
At the 5% level of significance, the mean BMI before and after the aerobic activity are different.
The mean BMI before the activity is higher than the mean BMI after it, which suggests that the
aerobic activity is effective in lowering BMI.
Step 7: Presentation of results
Table 3: Comparison of mean BMI before and after an extensive aerobic activity (n = 438).
Group    Mean (SD)      Mean difference (95% CI)   t-statistic^a (df)   p-value
Before   22.27 (3.85)   0.86 (0.78, 0.94)          22.01 (437)          < 0.001
After    21.41 (3.97)
a Paired t-test was applied.

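The paired t-test in Analysis 3 is equivalent to a one-sample t-test of the pre-minus-post difference against zero, which is why normality is checked on the difference. The sketch below illustrates this on bpwide.dta, a before/after blood pressure data set shipped with Stata. It stands in for student_analysis.dta, and its variable names (bp_before, bp_after) belong to that stand-in, not to the workshop data.

```stata
* Stand-in data: before/after blood pressure for the same patients.
sysuse bpwide, clear
* Difference between the paired measurements.
generate bp_diff = bp_before - bp_after
* Paired t test on the two measurements.
ttest bp_before == bp_after
scalar t_paired = r(t)
* One-sample t test of the difference against zero.
ttest bp_diff == 0
* The two t statistics agree.
display "paired t = " %6.3f t_paired "  one-sample t = " %6.3f r(t)
```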

ANALYSIS 4: One-way Analysis of Variance (One-way ANOVA)


Research question: The researchers wish to know whether mean height differs among four races
(Malay, Chinese, Indian, and Others).
Step 1: Hypothesis
H0: The mean height is the same among races (Malay, Chinese, Indian, and Others).
HA: The mean height differs for at least one race (Malay, Chinese, Indian, and Others).
Step 2: Level of significance
α = 0.05
Step 3: Checking assumptions
1. Random sample
2. Independent samples
3. The populations are normally distributed
There are two ways of checking normal distribution:
(i) Histogram with an overlaid normal curve
a) Select: Graphics > Histogram

b) A dialog box is open

Command: histogram outcomevar, normal by(groupingvar)



c) Output:

d) Interpretation: Normality assumption is met.


(ii) Box and whisker plot
a) Select: Graphics > Box plot

b) A dialog box is open

Command: graph box outcomevar, by(groupingvar)


c) Output:

d) Interpretation: Normality assumption is met


4. The populations have the same variances (homogeneity of variances)
Checking homogeneity of variances through Bartlett's test:
a) Select: Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA

b) A dialog box is open

Command: oneway outcomevar groupingvar



c) Output:
Notes:
Hypothesis of Bartlett's test
H0: The variances between groups are the same
HA: The variances between groups are different (one-tailed)
d) Interpretation:
p-value > 0.05 (do not reject H0)
The variances are equal. Thus, the assumption of homogeneity of variances is met.
Step 4: Statistical test
1. Select: Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA

2. A dialog box is open

Command: oneway outcomevar groupingvar


3. Output:

Step 5: Interpretation
p-value < 0.001, reject H0
Step 6: Conclusion
At the 5% level of significance, the mean height differs among races.
Additional analysis: Post-hoc analysis
Post-hoc analysis identifies which pairs of groups differ significantly in mean height.
There are six possible comparison pairs of height among races:
- Malay vs Chinese
- Malay vs Indian
- Malay vs Others
- Indian vs Chinese
- Indian vs Others
- Chinese vs Others
a) Select: Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA


b) A dialog box is open

Command: oneway outcomevar groupingvar, bonferroni


c) Output:

Note: the output reports the p-value of each comparison pair among races.

d) Interpretation:
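Malay vs Chinese: p-value = 0.001, reject H0
Malay vs Indian: p-value = 0.001, reject H0
Malay vs Others: p-value > 0.008, do not reject H0
Chinese vs Indian: p-value > 0.008, do not reject H0
Chinese vs Others: p-value > 0.008, do not reject H0
Indian vs Others: p-value > 0.008, do not reject H0

Bonferroni correction:
The alpha value per comparison (α/npair) = 0.05/6 = 0.008
Note that the p-values printed by the bonferroni option are already multiplied by the number of
comparisons, so judging them against 0.008 is stricter than the nominal 5% level.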
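The arithmetic behind the Bonferroni threshold can be sketched in Stata itself. The macro names k, npair and alpha are illustrative only. comb(k, 2) gives the number of pairs among k groups, and dividing 0.05 by that count gives the per-comparison alpha.

```stata
* Number of pairwise comparisons among k = 4 races.
local k = 4
local npair = comb(`k', 2)
* Overall significance level from Step 2.
local alpha = 0.05
display "comparison pairs = " `npair'
display "alpha per pair   = " %6.4f `alpha'/`npair'
```

With four groups this gives 6 pairs and a per-pair alpha of 0.0083, which the note rounds to 0.008.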

Step 6: Conclusion
At the 5% level of significance, mean height differs significantly among races. Significant
differences are found between Malay and Chinese, and between Malay and Indian.


Step 7: Presentation of results


1. Select: Statistics > Summaries, tables, and tests > Other tables > Tables of means, std. dev. and frequencies

2. A dialog box is open

Command: tabulate groupingvar, summarize(outcomevar)


3. Output:

Table 4: Comparison of mean height among races (n = 438)
Races     n     Mean (SD)       F-statistic^a (df)   p-value^b
Malay     261   162.85 (8.06)   9.59 (3)             < 0.001
Chinese   138   165.99 (7.85)
Indian    28    168.89 (9.19)
Others    11    169.75 (5.52)
a One-way ANOVA test was applied
b Malay vs Chinese: p-value = 0.001, reject H0; Malay vs Indian: p-value = 0.001, reject H0;
Malay vs Others: p-value > 0.008, do not reject H0; Chinese vs Indian: p-value > 0.05, do not
reject H0; Chinese vs Others: p-value > 0.05, do not reject H0; Indian vs Others: p-value > 0.95,
do not reject H0.
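The whole of Analysis 4 can be run in one command, sketched below on auto.dta, a data set shipped with Stata. It is a stand-in only: price and rep78 play the roles of outcomevar and groupingvar and have nothing to do with the height data. The tabulate option, which this note does not use, adds the per-group table of means, standard deviations and frequencies needed for a table like Table 4.

```stata
* Stand-in data shipped with Stata.
sysuse auto, clear
* One-way ANOVA; the output also reports Bartlett's test for equal variances.
* tabulate: per-group means, SDs and frequencies.
* bonferroni: pairwise comparison table.
oneway price rep78, tabulate bonferroni
display "F = " %5.2f r(F) "  Bartlett chi2 = " %5.2f r(chi2bart)
```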


ANALYSIS 5: Categorical Test


Research question: The researchers would like to determine whether there is an association between
hypertension (normal or hypertensive) and gender (male or female).
Step 1: Hypothesis
H0: There is no association between hypertension and gender.
HA: There is an association between hypertension and gender.
Step 2: Level of significance
α = 0.05
Step 3: Checking assumptions
1. Independent samples
2. Two variables are categorical.
3. If fewer than 20% of the cells have an expected frequency < 5, use the Chi-square test. If 20% or
more of the cells have an expected frequency < 5, use Fisher's exact test.
Checking expected frequency assumption:
1. Select: Statistics > Summaries, tables, and tests > Frequency tables > Two-way table with
measures of association

2. A dialog box is open

Command: tabulate indepvar depvar, expected



3. Output:

Note: 0 cells (0%) have an expected frequency less than 5; therefore the Chi-square test will be applied.

4. Interpretation: Assumption is met.


Step 4: Statistical test (Pearson Chi-square test)
1. Select: Statistics > Summaries, tables, and tests > Frequency tables > Two-way table with
measures of association

2. A dialog box is open

Command: tabulate indepvar depvar, chi2 row


3. Output:

Step 5: Interpretation
p-value < 0.05, reject H0
χ² statistic = 4.29
Step 6: Conclusion
There is an association between hypertension and gender. The proportion of hypertensive students is
higher among males (7.1%) than among females (2.9%).
Step 7: Presentation of results
Table 5: Association between hypertension and gender (n = 438)
Gender   Hypertension, n (%)          χ² statistic^a (df)   p-value
         Hypertensive   Normal
Male     14 (7.1)       182 (92.9)    4.29 (1)              0.038
Female   7 (2.9)        235 (97.1)
a Pearson Chi-square test was applied; Level of significance was set at 5%.
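The test can be reproduced from the four cell counts in Table 5 with tabi, the immediate form of tabulate. In this sketch rows are separated by a backslash. The expected option prints the expected frequencies used to check the assumption in Step 3, and row adds the row percentages.

```stata
* Pearson chi-square test from cell counts.
* Rows: male, female. Columns: hypertensive, normal.
tabi 14 182 \ 7 235, chi2 expected row
* Stored results correspond to the statistic and p-value in Table 5.
display "chi2 = " %5.2f r(chi2) "  p = " %5.3f r(p)
```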

ANALYSIS 6: Estimation of Risk


Research question: The researchers wish to estimate the risk of hypertension by gender.
Step 1: Hypothesis
H0: The risk of hypertension is equal between males and females (OR = 1)
HA: The risk of hypertension is increased (OR > 1) or reduced (OR < 1) in males compared with
females.
Step 2: Level of significance
α = 0.05
Step 3: Checking assumption
1. There are cases and non-cases (disease: hypertension).
2. There are exposed and non-exposed observations (factor: gender).

Step 4: Statistical test


1. Select: Statistics > Epidemiology and related > Tables for epidemiologists > Cohort study risk-ratio etc. calculator

2. A dialog box, csi - Cohort studies, is open.

Command: csi a b c d, or
where a = exposed cases, b = unexposed cases, c = exposed non-cases, d = unexposed non-cases
3. Output:

Step 5: Interpretation
OR = 2.58 (95% CI: 1.05, 6.35). The 95% CI does not include 1, reject H0.


Step 6: Conclusion
The odds of hypertension among males are 2.58 times the odds among females.
Step 7: Presentation of results
Table 6: Association between hypertension and gender (n = 438)
Gender   Hypertension, n (%)          OR (95% CI)         χ² statistic^a (df)   p-value
         Hypertensive   Normal
Male     14 (7.1)       182 (92.9)    2.58 (1.05, 6.35)   4.29 (1)              0.038
Female   7 (2.9)        235 (97.1)
a Pearson Chi-square test was applied; Level of significance was set at 5%.
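As a sketch of how the counts in Table 6 map onto the csi arguments, the example below treats male as the exposed group (an assumption made to match the reported odds ratio). The order of the four numbers matters: cases come first, then non-cases, each given as exposed then unexposed.

```stata
* Odds ratio from the counts in Table 6, with male as the exposure.
* Syntax: csi #a #b #c #d
*   a = exposed cases         (hypertensive males)   = 14
*   b = unexposed cases       (hypertensive females) = 7
*   c = exposed non-cases     (normal males)         = 182
*   d = unexposed non-cases   (normal females)       = 235
csi 14 7 182 235, or
display "OR = " %4.2f r(or)
```

The odds ratio is (14 × 235) / (7 × 182) = 2.58, in line with Table 6.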

ANALYSIS 7: Correlation
Research question: The researchers wish to identify the relationship between weight and systolic pressure.
Step 1: Hypothesis
H0: There is no relationship between weight (kg) and systolic pressure (mmHg).
HA: There is a relationship between weight (kg) and systolic pressure (mmHg).
Step 2: Statistical test
1. Distribution of weight by histogram
a) Select: Graphics > Histogram

b) A dialog box is open

Command: histogram var, normal


c) Output:

d) Interpretation:
Distribution of weight is approximately normal.
2. Distribution of weight by box and whisker plot
a) Select: Graphics > Box plot

b) A dialog box is open

Command: graph box var


c) Output:

d) Interpretation:
Distribution of weight is approximately normal.
3. Distribution of systolic pressure by histogram and box and whisker plot.

Interpretation: Distribution of systolic pressure is normal.


4. Relationship between weight and systolic pressure
a) Select: Graphics > Twoway graph (scatter, line, etc.)


b) A dialog box is open

Command: twoway (scatter var1 var2)


c) Output:

d) Interpretation: The scatter plot is elliptical in shape.


5. Direction of relationship between weight and systolic blood pressure
a) Select: Graphics > Twoway graph (scatter, line, etc.)


b) A dialog box is open

Command: twoway (scatter vary varx) (lfit vary varx)
c) Output:

d) Interpretation: There is a positive correlation between weight and systolic blood pressure.
6. Strength of relationship between weight and systolic blood pressure
a) Select: Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Pairwise correlations


b) A dialog box is open

Command: pwcorr var1 var2, sig star(5)

c) Output:
Notes:
Correlation coefficient (r)
r ≤ 0.25: poor relationship
0.26 ≤ r ≤ 0.50: fair
0.51 ≤ r ≤ 0.75: good
0.76 ≤ r ≤ 1.00: excellent

Step 3: Conclusion
There is a significant, positive and fair correlation between weight and systolic blood pressure
(r=0.47, p<0.001).
Step 4: Presentation of results
Table 7: Correlation between weight and systolic blood pressure (n = 438)
Variable                                   r^a    p-value
Weight (kg) and Systolic pressure (mmHg)   0.47   < 0.001
a Pearson correlation test was applied; Level of significance was set at 5%.
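The correlation step can be sketched on auto.dta, a data set shipped with Stata. It is a stand-in only: weight and length here are car weight and car length, not the student measurements. The sig option prints the p-value under each coefficient, and star(.05) marks coefficients significant at the 5% level.

```stata
* Stand-in data shipped with Stata.
sysuse auto, clear
* Pearson correlation with its p-value.
pwcorr weight length, sig star(.05)
* r(rho) holds the coefficient for the first pair of variables.
display "r = " %5.2f r(rho) "  n = " r(N)
```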


Saving History of Commands in filename.do


1. Go to Review window > Select all the commands > right click > Send to Do-file Editor

2. A Do-file Editor window is open with the list of commands

3. Select: File > Save

4. A dialog box of Save Stata Do-file is open


Organising a filename.do
1. Open analysis_date.do > Create comments and notes as below

Notes:
Green text: Comments/notes
/* : opening symbol for commenting out selected commands
*/ : closing symbol for commenting out selected commands

2. Click on Execute (do) to rerun all commands or only the selected commands

3. STATA reruns the previous work
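A commented do-file might look like the sketch below. It uses auto.dta, shipped with Stata, as a stand-in, so the commands are illustrative and are not the workshop analyses. Lines beginning with * are comments, and everything between /* and */ is skipped when the file is executed.

```stata
* analysis_date.do (sketch)
* Load the stand-in data.
sysuse auto, clear

* This command runs.
summarize mpg

/* This block is commented out, so the command inside is skipped:
tabulate foreign
*/

* This command runs.
ttest mpg == 20
```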
