Data
Nadia Akseer
MSc. Candidate
Brock University
Agenda
a. Examine Individual Variable Distributions
   i. Continuous data:
      Proc Univariate: distributions, tests for normality, plots
      Proc Means
   ii. Categorical data:
      Proc Freq: distributions
b. Examine Relationships Between Variables
   i. Continuous data:
      Scatter plots
      Correlation: Spearman, Pearson
   ii. Categorical data:
      Freq tables, probabilities, Chi-square, Fisher’s Exact
   iii. Continuous and Categorical data:
      Proc ttest (t-test)
Example
Height, Weight, BMI, sex and activity level
measurements are available for a group of
physically active students
**Note: 5 activity questions asked ‘1=none…4=very active’ **
Example Cont’d
• What are the mean, median and mode of BMI?
• How dispersed are the BMI data?
• Is BMI normally distributed?
Proc Univariate
• Provides information on:
  • Measures of central tendency (mean, median, mode etc.)
  • Measures of dispersion (standard deviation, range, IQR etc.)
• Allows us to visualize data (stem-and-leaf, normality & box plots)
• Used for a continuous variable
Proc Univariate Syntax
Proc univariate data=bmi plot normal;
Var bmi;
Histogram / normal;
Run;
Proc Univariate Output
• Are the data normally distributed?
• If not, which way are they skewed?
Variable is treated as normally distributed if p-value > 0.05 (we fail to reject the null hypothesis of normality)
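To read the normality tests without scrolling through the full Proc Univariate output, one option is to restrict the printed tables with ODS. The sketch below assumes the same bmi data set and bmi variable as the syntax slide; TestsForNormality is the ODS name of the table that the normal option produces.

```sas
/* Assumes a data set named bmi with a numeric variable bmi */
ods select TestsForNormality;        /* print only the normality tests */
proc univariate data=bmi normal;
  var bmi;
run;
```

The table lists the Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling tests, each with its own p-value.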
Example Cont’d
• How many individuals have complete data for height, weight and BMI?
• What is the range of data for all three variables?
• What are the means and standard deviations?
Proc Means
Used to obtain mean, standard deviation, min and max for multiple continuous variables
Proc means data=bmi;
Var bmi ht wt;
Run;
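Proc Means reports N separately for each variable, so it does not directly say how many individuals have all three measurements. A minimal sketch of one way to check this, assuming the bmi data set and the variables bmi, ht and wt from the slides; the data set name complete is hypothetical.

```sas
/* Per-variable counts of present (n) and missing (nmiss) values */
proc means data=bmi n nmiss mean std min max range maxdec=2;
  var bmi ht wt;
run;

/* Keep only rows where none of the three variables is missing */
data complete;                       /* hypothetical data set name */
  set bmi;
  if nmiss(bmi, ht, wt) = 0;
run;

/* N here is the number of individuals with complete data */
proc means data=complete n;
  var bmi ht wt;
run;
```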
Example Cont’d
• What proportion of the sample are boys?
• What proportion are girls?
• What proportion of the sample are not physically active in the first activity question?
• What are the physical activity trends in all 5 activity questions?
Proc Freq
• Looks at distribution of categorical variables
• Gives information about frequency and proportions
• Can look at multiple variables at a time
Proc Freq data=bmi;
Table sex active1-active5;
Run;
Correlation
• Two variables are considered to be correlated when there is a relationship between them
• ρ (rho) a.k.a. “Correlation Coefficient (r)”
  • Used to express the strength of the association between the two variables
  • Has a range of values: −1 ≤ ρ ≤ 1
  • |ρ| = 1 → perfect linear relationship
  • ρ → 0 → weak linear relationship
  • |ρ| → 1 → strong linear relationship
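For reference, the sample correlation coefficient r that estimates ρ is usually computed as below, where (x_i, y_i) are the n paired observations and x̄, ȳ are their means. This is the standard Pearson formula, given here as background rather than taken from the slides.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```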
Correlation
• Hypotheses
  • What is our H0 in correlation?
    ρ = 0 → There is no linear correlation
  • What is our HA in correlation?
    ρ ≠ 0 → There is a linear correlation
Correlation
• Procedure for determining if there is a correlation between two variables
1. Run a scatter plot
2. Check Assumptions (Normal distribution)
3. Run either a Pearson or a Spearman
4. Determine if you reject/fail to reject H0
5. If you reject, look at the correlation coefficient: how strong is the relationship?
Review
5. If H0 is rejected, determine the strength of the relationship
|ρ|           Relationship
> 0.7         Strong
0.4 to 0.7    Medium
< 0.4         Weak
Example
• Both the Pearson Correlation and the Spearman Correlation will be used on the same example data to show the differences between the two methods
Table 9.1. Lengths and Weights of Male Bears
x Length (in.)   53.0   67.5   72.0   72.0   73.5   68.5   73.0   37.0
y Weight (lb)      80    344    416    348    262    360    332     34
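The later steps refer to variables named weight and length, so the table first has to be read into a data set. A minimal sketch, assuming the table holds eight length/weight pairs matched by position; the data set name bears is hypothetical.

```sas
data bears;                          /* hypothetical data set name */
  input length weight @@;            /* @@ reads several pairs per line */
  datalines;
53.0 80  67.5 344  72.0 416  72.0 348
73.5 262  68.5 360  73.0 332  37.0 34
;
run;
```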
Example
1. Run a Scatter Plot
/* general form */
proc plot;
plot y*x;
title '….';
run;
/* bear data */
proc plot;
plot weight*length;
run;
Can you see an association?
Example
2. Check Assumptions
☑ Random sample
☑ Points approximately on a straight line
☑ Outliers examined
☒ Normal distribution for both variables (p-values < 0.05 for Weight and Length)
Example
3. Decide Between a Pearson and a Spearman
• Only 3/4 assumptions were met, therefore we should proceed with a….
*Normal distribution most important*
Example
Pearson
Proc corr;
Var weight length;
Run;
Example
Correlation coefficient (r), p-value
• Is there a linear relationship?
• Strength?
Example
Spearman
Proc corr spearman;
Var weight length;
Run;
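Both coefficients can also be requested in a single call, which makes the comparison easier to read. A sketch, assuming the bear measurements sit in a data set called bears (a hypothetical name) with the variables weight and length:

```sas
/* Assumes a data set bears with numeric variables weight and length */
proc corr data=bears pearson spearman;
  var weight length;
run;
```

Each coefficient is printed in its own table, with the p-value for H0: ρ = 0 underneath it.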
Example
4. Determine the fate of H0
5. Determine the strength of the relationship
• Spearman → r=0.35929, p=0.3821
  ∴ FTR H0: there is no significant relationship between the weight and length of a bear
• Pearson → r=0.897, p=0.0025
  ∴ Reject H0: there is a strong linear relationship between the weight and length of a bear
Chi-square Tests
• Chi-square testing is generally used to test claims about categorical data consisting of frequency counts for different categories
• Uses the Chi-square distribution
• Many different types of tests:
  e.g. Independence, Homogeneity, Goodness of fit, Fisher’s exact, McNemar’s
Example: Test of Independence
• Let’s do a test of independence between sex (M or F) and BMI group (Normal, Overweight, Obese)
• H0: Sex and BMI group are independent
• HA: Sex and BMI group are not independent
SAS Syntax
proc freq data=mydata.newbmi;
table sex*owt/nopercent norow nocol expected
chisq;
run;
Explanation of Syntax:
• expected = prints the expected frequency of each cell, calculated under the assumption of independence
• chisq = chi-square test
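The expected option and the chisq statistic rest on the standard formulas below, given here as background. O_ij is the observed count in row i and column j, R_i and C_j are the row and column totals, and n is the overall sample size.

```latex
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
\text{df} = (\text{rows} - 1)(\text{columns} - 1)
```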
What is our conclusion?
P-value > 0.05 → FTR H0
Fisher’s Exact Test
• When an expected cell count is <5, the Chi-square test may not be valid
• In this case, we use Fisher’s Exact test
• Example: Association between wearing helmets and getting face injuries?
                Helmet
Face Injury     yes    no
yes               2    13
no                6    19
SAS Syntax
Data helmet;
Input helmet $ faceinj $ count @@;
Datalines;
yes yes 2 no yes 13
yes no 6 no no 19
;
run;
proc freq order=data;
weight count;
table faceinj*helmet / nopercent norow nocol expected;
exact chisq;
run;
This tells us the Chi-square test is not valid, therefore use Fisher’s exact p-value
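The expected counts behind that warning can be checked by hand from the helmet table, using expected = row total × column total / n. The face injury row totals are 15 (yes) and 25 (no), the helmet column totals are 8 (yes) and 32 (no), and n = 40. Subscripts are written as (face injury, helmet):

```latex
E_{\text{yes},\text{yes}} = \frac{15 \times 8}{40} = 3, \quad
E_{\text{yes},\text{no}} = \frac{15 \times 32}{40} = 12, \quad
E_{\text{no},\text{yes}} = \frac{25 \times 8}{40} = 5, \quad
E_{\text{no},\text{no}} = \frac{25 \times 32}{40} = 20
```

One cell (face injury = yes, helmet = yes) has an expected count below 5.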
Using the two-sided p-value and significance level=0.05, what is our conclusion?
T-Test
• Used to compare a continuous variable between two populations or groups of a categorical variable
• Assess difference between the two means
• Assumptions:
1. Equal variance for both populations
2. The sample data need to be randomly sampled
3. The two samples are independent
4. A small sample size (<30) is acceptable if the data are normally distributed (ND), or
5. A larger sample size is needed if the data are not ND
Example
• Let’s examine if the systolic blood pressure is different between a normal blood pressure group (n=15) and a hypertensive group (n=10)
• Ho: µNormal − µHypertensive = 0
• Ha: µNormal − µHypertensive ≠ 0
Input Data into SAS
Normal BP (mmHg), Hypertensive BP (mmHg)
114 117 130
155 115 115
125 138 148
132 121 100
115 122 156
122 162
140 151
110 156
122 162
130 158
data bp;
input SYM $ sbp @@;
datalines;
.
.
.
.
;
Run;
Mean SBP For Both Groups
Proc sort;
By sym;
Run;
proc means;
Var sbp;
By sym;
Run;
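A sketch of an alternative that avoids the sort: with a Class statement, Proc Means groups the output itself. It assumes the bp data set with the grouping variable sym and the outcome sbp from the input step.

```sas
/* Assumes data set bp with character variable sym and numeric variable sbp */
proc means data=bp n mean std min max;
  class sym;                         /* one row of statistics per group */
  var sbp;
run;
```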
• Assumption: Check to see if the groups are normally distributed
Proc univariate normal plot;
Var sbp;
by sym;
Run;
Normality Check
• Is the normal group ND?
• Is the hypertensive group ND?
• Assumption: Are the variances equal?
  • Yes → use the pooled method (t)
  • No → use Satterthwaite’s method (t’)
Proc ttest;
Class sym;
Var sbp;
Run;
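For reference, the two methods differ in how the standard error and the degrees of freedom are computed. These are the standard textbook formulas, given here as background, with x̄, s² and n denoting the mean, variance and size of each group:

```latex
% Pooled method (equal variances)
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\quad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},
\quad
\text{df} = n_1 + n_2 - 2

% Satterthwaite's method (unequal variances)
t' = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},
\quad
\text{df} = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}
                 {\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}
```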
Difference between means is significant
CI does not include 0 (p<0.05) so REJECT NULL
F value>1 and p-value<0.05; variances not equal
Example
• Variances are not equal (p=0.0499<0.05)
• Satterthwaite p=0.0276 <0.05
  → Reject null
  → Blood pressure between the normal and hypertensive groups is significantly different
*Interpret with caution since normal distribution assumption not met*
Further Readings
• Step-By-Step Basic Statistics Using SAS: Exercises
  Author: Larry Hatcher
• Data analysis using SAS for Windows: Basic
  Author: Mirka Ondrack