DATA MANAGEMENT & Statistical Analysis
for Social Science Research
using IBM-SPSS Statistics ver. 20
TABLE OF CONTENTS
Data Management & Statistical Analysis using IBM-SPSS Statistics
by Maritess D. Villanueva
GSO & GS PROGRAMS
CapSU Pontevedra Research & Extension Services
August 2014 Bailan, Pontevedra, Capiz
VISION
MISSION
GOALS
VMG of RDE
VISION
MISSION
The University, through its RDE activities, shall generate and extend
quality technical information, products and services in various
disciplines using appropriate approaches for sustained agro-industrial
development to improve the quality of life.
GOALS
Training Rationale
Training Objectives
Module 1
Basic Concepts and Categories of Statistics
& Statistical Packages/Software
Learning Objectives
Statistics
(Singular sense) is a science which deals with the collection,
organization, presentation, analysis, and interpretation of data
a study of variation
(plural sense) is an actual number derived from the data
a collection of facts and figures
processed data (e.g., population statistics, statistics on births,
statistics on enrollment)
Types of data
Primary data – acquired directly from the source
Ex: data obtained by measuring wt. of 500 one-day old chicks from
Farm XYZ
Secondary data – non-primary data
Ex: Phil. Rice Production (tons/ha) data by province from 1990-2014
taken from publications of the Phil. Bureau of Agricultural Statistics
Categories of Statistics
Descriptive statistics- methods of organizing, summarizing, presenting data
and their interpretation.
Inferential statistics – concerned with making generalizations about a larger
set of data where only a part is examined.
Role of Statistics
A tool for data analysis (e.g. standard drug vs. new drug…. which is
more effective?)
Opinion poll survey (Do you think the Philippines is ready for ASEAN
integration 2015?)
Some Basic Terms:
Universe – set of all entities or individuals under consideration/
subject of the study.
2 types:
Finite – when the elements of the universe can be counted for a
given time period
Infinite – when the number of elements of the universe is
unlimited
Example 1.
Variable                Possible data values
1. Sex                  Male, Female
2. Hair Color           Black, Brown, Reddish Brown…
3. Cellphone network    Smart, Talk n Text, Globe, Sun cellular…
Ordinal
data collected are labels or classes with an implied ordering in
these labels;
the difference between two labels cannot be quantified;
a level of measurement higher than nominal;
only ordering or ranking can be done on the data;
Example 2.
Variable            Possible data values
1. Military rank    Sergeant, Lieutenant, Captain, General
2. Job Position     President, Vice-President, Manager
3. Sibling Rank     1st, 2nd, 3rd, 4th, 5th, …
Interval
data collected can be ordered or ranked, added and subtracted, but
not divided nor multiplied;
differences between any two data values can be determined;
the unit of measurement is constant (but arbitrary), and the zero
point is arbitrary;
a level of measurement higher than ordinal
Example 3.
Variable                         Possible data values
1. Baking temperature            172°C to 178°C
2. Intelligence Quotient (IQ)    80 to 140
3. Grades                        1.0, 1.25, 1.5, …
Ratio
data collected has all the properties of the interval scale and in
addition, can be multiplied and divided;
has a true zero point;
is the highest level of measurement.
Example 4.
Variable     Possible data values
1. Height    4’ to 7’
2. Width     0” to 5”
3. Weight    20 g to 50 kg
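The difference between the interval and ratio levels can be checked numerically. The sketch below is an illustration only: the two temperatures are made-up values, and the weights are the end points from Example 4. It shows that a ratio of two temperatures changes when the unit changes (the zero point is arbitrary), while a ratio of two weights does not (the zero point is true).

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval scale: two illustrative temperatures (assumed values, not from the manual).
low_c, high_c = 10.0, 20.0
low_f, high_f = celsius_to_fahrenheit(low_c), celsius_to_fahrenheit(high_c)

ratio_c = high_c / low_c  # ratio computed in Celsius
ratio_f = high_f / low_f  # the same two temperatures, ratio computed in Fahrenheit
print("Temperature ratios:", ratio_c, ratio_f)  # the two ratios disagree

# Ratio scale: the weights 20 g and 50 kg from Example 4.
light_g, heavy_g = 20.0, 50_000.0
ratio_g = heavy_g / light_g                     # ratio computed in grams
ratio_kg = (heavy_g / 1000) / (light_g / 1000)  # ratio computed in kilograms
print("Weight ratios:", ratio_g, ratio_kg)      # the two ratios agree
```

Because the temperature ratio depends on the unit chosen, statements such as "twice as hot" are not meaningful for interval data, whereas "2,500 times as heavy" is meaningful for ratio data.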
Statistical Software
is a specialized computer program used for data management and
statistical analysis
Statistical Packages
CS Pro (Census and Survey Processing System)
SAS (Statistical Analysis System)
Stata
Minitab
R
STAR (Statistical Tool for Agricultural Research)
IRRISTAT
CROPSTAT
ITSM 2000
E-Views
SPSS (Statistical Package for the Social Sciences)
CS Pro
a software package for editing, tabulating, and disseminating data
from censuses and surveys
a public domain software
Advantages:
Can improve the data management and analysis of large scale
surveys
Can be downloaded without any cost (free)
Can run on a computer with very basic specifications
Disadvantages:
Too many files are being generated
Only a single user can access and write to a file at any given
time
Modifying an item affects the existing file
SAS
proprietary software that enables users to implement data
management, statistical analysis, data mining, forecasting, etc.
a popular statistical software for medical research and pharmaceutical
industry
Advantages:
Powerful, especially in implementing analysis of experimental
designs and analysis of variance (ANOVA)
Has a wide range of statistical procedures
Disadvantages:
Difficult to learn
Expensive
Requires annual license
Note: a free version for professors and students, called SAS
University Edition, was recently launched (www.sas.com)
Stata
proprietary software that is widely used in the fields of economics,
sociology and medicine.
executes data management and transformation, parameter
estimations, graphics, statistical measure computations and other
related mathematical calculations.
when the program is run, its time series, statistics and graphics
components are loaded.
Minitab
a statistical software package originally intended for teaching
statistics.
Suitable for moderate-size datasets
Advantages:
Easy to learn and easy to use
Impressive quality of graphs
Cheaper compared to SAS and SPSS
Requires less disk space
Disadvantages:
Poor compatibility with other statistical programs
Less efficient for complex procedures
R
A free programming language based on the S programming
language
A software environment for statistical computing and graphics
Advantages:
Freely available online
Has powerful and customizable graphics
Can be integrated with other statistical packages
ITSM 2000
permits easy execution of data processing, graphical display,
estimation, and diagnostic testing for univariate and multivariate time
series models in the time and frequency domains
provides easy to use estimation and forecasting tools for spectral
analysis
particularly, the dynamic graphics allow the user to instantly see the
effect of data transformations and model changes on a wide variety of
features such as the sample, residual, and model autocorrelation
functions and spectra.
E-Views
offers an extensive array of powerful features for data handling,
statistics and econometric analysis, forecasting and simulation, data
presentation, and programming.
IRRISTAT
a set of microcomputer programs designed to assist agricultural
researchers in developing experimental lay-outs and undertaking plot
sampling, data collection, data and file management, statistical
analysis of data and presentation of results
STAR
a freeware developed by Biometrics and Breeding
Informatics, Plant Breeding, Genetics and Biotechnology Division of
the International Rice Research Institute
a computer program for data management and basic statistical
analysis of experimental data.
Module 2
IBM-SPSS Introduction
Learning Objectives
Option 1
Option 2
Option 3
Option 4
3 Move the scroll bar to the end point.
4 Click the IBM-SPSS Statistics 20 icon displayed on the monitor.
SPSS Interface
SPSS Windows
Module 3
Entering, Saving and Opening SPSS data
Learning Objectives
In the first row of the first column, type origin. Then press ENTER
key. In the second row, type age. Then ENTER. In the third row, type
num_sib. Press ENTER.
New variables are automatically given a Numeric data type
Note: A variable name must start with a letter and must not contain spaces.
Begin entering data in the first row starting at the first column.
Move the cursor to the second row of the first column to add the next
subject’s data.
Check Read variable names from the first row of data.
Put the worksheet number/name where you typed your data.
8
Put Barrio in the Old Value, and 1 in the New Value then Click Add.
City in the Old Value, and 2 in the New Value then Click Add.
Town in the Old Value, and 3 in the New Value then Click Add.
Click OK
Note that the entries for variable Place of Origin were replaced by
codes 1, 2 and 3.
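Outside the SPSS dialog, the same recoding idea can be sketched in a few lines of Python. The coding scheme (Barrio = 1, City = 2, Town = 3) comes from the steps above; the five entries in `origin` are hypothetical values used only for illustration.

```python
# Coding scheme taken from the recode steps: old value -> new value.
origin_codes = {"Barrio": 1, "City": 2, "Town": 3}

# Hypothetical entries for the Place of Origin variable (not real data).
origin = ["Barrio", "Town", "City", "Barrio", "City"]

# Replace every text entry by its numeric code.
origin_recoded = [origin_codes[value] for value in origin]
print(origin_recoded)
```

Keeping the scheme in one dictionary also documents which code stands for which category, which is the role value labels play in SPSS.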
Click OK.
Click Continue.
Click OK.
Compute Variable
Supply a name for the target variable, say Annual_salary, then type
in the Numeric Expression box: 12 * Monthly_Salary
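As a rough equivalent of the Compute Variable step, the sketch below applies the expression 12 * Monthly_Salary to each case. The salary figures are hypothetical and serve only to show the calculation.

```python
# Hypothetical monthly salaries for three respondents (not real data).
monthly_salary = [8000, 12500, 9500]

# Target variable: annual salary = 12 * monthly salary, computed case by case.
annual_salary = [12 * salary for salary in monthly_salary]
print(annual_salary)
```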
Module 4
Generating Descriptive Statistics
Learning Objectives
Output
Interpretation
Measure of Location
Minimum – smallest observed value in the data
Maximum – largest value observed in the data
Measure of Dispersion
Standard deviation – a measure of variability of the data points
from the mean value
Variance – average squared differences of the data points from the
mean value
Range – the simplest measure of variation computed as the
difference between the highest and lowest value of the data set
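The measures defined above can be sketched with Python's standard `statistics` module. The ages below are hypothetical. Note one assumption: `statistics.variance` and `statistics.stdev` divide by n - 1 (the sample formulas), which is also what SPSS reports in its descriptive output.

```python
import statistics

# Hypothetical ages of seven respondents (not real data).
ages = [23, 35, 41, 29, 35, 52, 47]

# Measures of location
minimum = min(ages)
maximum = max(ages)

# Measures of dispersion
value_range = maximum - minimum       # highest value minus lowest value
variance = statistics.variance(ages)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(ages)      # square root of the sample variance

print("Minimum:", minimum, "Maximum:", maximum, "Range:", value_range)
print("Variance:", round(variance, 2), "Std. deviation:", round(std_dev, 2))
```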
TO DO:
Open your recently saved SPSS data: Desktop > SPSS Training > Data
Sets > Exercise1c.sav
Generate the Descriptive Statistics of the data for the variables Age and
Annual Salary (Minimum, Maximum, Range, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis)
Module 5
Generating Frequency Tables and Graphs
Learning Objectives
Frequency Table – a table that lists the number of occurrences of each item in
the data
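A frequency table is simply a count per category plus the share of the total. The following sketch builds one with `collections.Counter`; the eight Place of Origin entries are hypothetical and are not the training data set.

```python
from collections import Counter

# Hypothetical Place of Origin entries (not the Exercise1c.sav data).
origin = ["Barrio", "City", "Barrio", "Town", "Barrio", "City", "Town", "Barrio"]

counts = Counter(origin)      # number of occurrences of each category
total = sum(counts.values())  # total number of cases

# Category -> (frequency, percent of total rounded to one decimal place)
frequency_table = {
    category: (n, round(100 * n / total, 1)) for category, n in counts.items()
}

for category, (n, percent) in sorted(frequency_table.items()):
    print(f"{category:<8}{n:>4}{percent:>8.1f}")
```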
Consider your recently saved SPSS data: Desktop > SPSS Training >
Data Sets > Exercise1c.sav
Place of Origin)
at the center.
Expected Output
Place of Origin
Sample Interpretation
About 51.4% of the total number of DH respondents are from
Barrio.
More than half (51.4%) of the total number of DH respondents
came from Barrio.
In every 10 DH respondents, five originated from Barrio.
Generating Graphs
Charts or graphs are visual representations of the data
Pie Charts
Bar Charts
Consider your recently saved SPSS data: Desktop > SPSS Training >
Data Sets > Exercise1c.sav
Move the Place of Origin variable to the x-axis. Click OK to create the
chart.
Output
Generated Pie Chart (double-click the chart to edit and enhance
it)
TO DO:
Open your recently saved SPSS data: Desktop > SPSS Training > Data
Sets > Exercise1c.sav
Generate pie chart for Previous Occupation and bar graph for Number
of Siblings (recoded).
Module 6
Detecting Data Outliers
Learning Objectives
Click Plots
Sd
Click OK
Expected Output
TO DO:
Open the SPSS data: Desktop > SPSS Training > Data Sets > Exercise4
(senior citizens).sav
Test if there are outliers for the variables age and income using
histogram and box-and-whiskers plot.
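A box-and-whiskers plot flags points that lie far outside the middle half of the data. A common rule, sketched below, marks a value as an outlier when it is more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. This is an illustration under stated assumptions: the ages are hypothetical, and `statistics.quantiles` may compute quartiles slightly differently from the hinges SPSS uses, so borderline points can differ.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical ages of ten senior citizens (not the Exercise4 data).
ages = [60, 62, 63, 65, 66, 67, 68, 70, 71, 99]

outliers = iqr_outliers(ages)
print("Possible outliers:", outliers)
```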
Module 7
Inferential Statistics : Steps in Testing Statistical
Hypothesis
Learning Objectives
REVIEW:
Inferential Statistics – concerned with estimating population parameters from sample statistics.
Statistical hypothesis
A conjecture about….
⇒ The value of a parameter of the population or
⇒ The distribution of the population
• Examples of Statistical Hypothesis:
• The mean height of students enrolled in Statistics is
5’2” (H: µ = 5’2”).
• The grain length of a variety of rice (IR-8) is normally
distributed (H: Y~normal)
⇒ Conclusions are stated subject to uncertainty
Null Hypothesis – the conjecture which is being tested, denoted by Ho.
- Generally, this is a statement of equality or status quo or no
difference.
Alternative Hypothesis – the complementary statement that will be accepted
in the event that the null hypothesis is rejected. It is
denoted by Ha or H1.
If µ represents the average bag mass of the population, then the following
possibilities exist:
Possible value Action
_________________ _____________________________________
_________________ _____________________________________
_________________ _____________________________________
Decision Rule → Rule which specifies the region for which the test statistic
leads to the rejection of Ho in favor of Ha.
One-tailed or Two-tailed?
Parametric or Nonparametric?
TO DO:
A. Consider each of the following situations and indicate for each of the four
actions whether it is a CORRECT DECISION, a TYPE I error or a TYPE II
error.
Ho : A large manufacturing firm is being charged with discrimination in its hiring practices.
5. The jury gave an innocent verdict to the guilty firm. - ___________________________
6. The jury gave a guilty verdict to a not innocent firm. - ___________________________
7. The jury gave a guilty verdict to an innocent firm. - ___________________________
8. The jury gave a “not guilty” verdict to an innocent firm. - ___________________________
Module 8
Testing Assumptions
Learning Objectives
TEST ON ASSUMPTIONS
In most situations, the satisfaction of assumptions for certain parametric
methods ensures the validity of the results and the appropriateness of the test
employed. It is for this reason that a number of methods have been designed to
test certain assumptions of parametric methods.
Example: Three sections of the same Mathematics course are taught by
three instructors. The final exam score of the students in the
three sections are recorded as follows:
Section 1: 95, 32, 47, 75, 83, 84, 73, 68
Section 2: 85, 90, 79, 50, 32, 84, 78, 95, 65, 80
Section 3: 79, 92, 63, 68, 76, 20, 37, 74, 86
Is the distribution of final exam scores the same in three sections? Test
for α = 5%.
PROCEDURE: Analyze > Compare Means > Oneway ANOVA > Options > Homogeneity of Variance Test
PROCEDURE: Analyze > Nonparametric Tests > Legacy Dialogs > RUNS
Test of Hypothesis:
1. Ho: The distribution of data is normal.
Ha: The distribution of data is not normal.
2. TEST PROCEDURE: Shapiro-Wilk Test for Normality
3. α = 5%
4. Decision Rule: Reject Ho if sig < α; Otherwise, fail to
reject Ho.
5. Computations:
sig = 0.438 (for section 1)
sig = 0.088 (for section 2)
sig = 0.172 (for section 3)
α = 0.05
6. DECISION: Since sig > α = 0.05 for all three sections, we fail
to reject Ho.
7. CONCLUSION: At α = 5%, the distribution of data is
normal among three sections.
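The decision rule in step 4 is a plain comparison, which the sketch below applies to the three sig values reported in step 5. The function name `decide` is a hypothetical helper introduced here for illustration; the sig values themselves are copied from the computations above and are not recomputed.

```python
ALPHA = 0.05  # level of significance from step 3

# sig values reported in step 5 for the normality test.
sig_by_section = {"Section 1": 0.438, "Section 2": 0.088, "Section 3": 0.172}

def decide(sig, alpha):
    """Apply the decision rule: reject Ho if sig < alpha, otherwise fail to reject."""
    return "reject Ho" if sig < alpha else "fail to reject Ho"

decisions = {section: decide(sig, ALPHA) for section, sig in sig_by_section.items()}
for section, decision in decisions.items():
    print(section, "->", decision)
```

All three sections give sig values above 0.05, so the rule leads to "fail to reject Ho" in each case.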
PROCEDURE: Analyze > Descriptive Statistics > Explore > Plots > Normality plots with tests.
TO DO:
Parametric tests
Nonparametric tests
Module 9
Test on Single Population
Learning Objectives
Example:
Test of Hypothesis:
1. Ho: The median level of histamine content does not exceed 20 mg/100 g of sample
Ha: The median level of histamine content exceeds 20 mg/100 g of sample
2. TEST PROCEDURE: Binomial test
3. α = 1%
4. Decision Rule: Reject Ho if sig < α; Otherwise, fail to reject Ho.
5. Computations:
Using SPSS, solve the following problem and perform a complete test of
statistical hypothesis.
1) One-tailed or Two-tailed:____________________________________
2) Parametric or Nonparametric:________________________________
a) Ho:
_________________________________________________________
Ha:
_________________________________________________________
e) Computation:
α= _________
= _________
f) Decision:_________________________________________________
g) Conclusion: _______________________________________________
Module 10
Test of Hypothesis: Case of Two Population
Means – Related Samples
Learning Objectives
Test of Hypothesis:
1. Ho: There is no difference between the scores of a control group and their
matched individuals.
Ha: There is a difference between the scores of a control group and their
matched individuals.
2. TEST PROCEDURE: Wilcoxon Signed-Rank Test
3. α = 5%
4. Decision Rule: Reject Ho if sig < α; Otherwise, fail to reject Ho.
5. Computations:
Ranks
                           y - x
Z                          -1.604 (b)
Asymp. Sig. (2-tailed)     .109
a. Wilcoxon Signed Ranks Test
b. Based on positive ranks.
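The asymptotic two-tailed sig value in the output is the area in both tails of the standard normal curve beyond the reported Z. As a check, the sketch below recovers it from Z = -1.604 using only the standard library; `two_tailed_p` is a hypothetical helper name.

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value of a standard normal test statistic z."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the area in both tails.
    return math.erfc(abs(z) / math.sqrt(2))

p_value = two_tailed_p(-1.604)
print(round(p_value, 3))  # → 0.109
```

Since 0.109 is greater than α = 0.05, the decision rule in step 4 leads to failing to reject Ho.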
Using SPSS, solve the following problem and perform a complete test of
statistical hypothesis.
It is claimed that a new diet will reduce a person’s weight in a period of two
weeks. The weights of 7 women who followed this diet were recorded before
and after a 2-week period.
Woman
1 2 3 4 5 6 7
1) One-tailed or Two-tailed:__________________________________________
2) Parametric or Nonparametric:______________________________________
a) Ho: __________________________________________________________
Ha: __________________________________________________________
e) Computation:
α= _________
= _________
f) Decision:_______________________________________________________
g) Conclusion: _____________________________________________________
Module 11
Test of Hypothesis: Case of Two Population
Means – Independent Samples
Learning Objectives
Using SPSS, solve the following problem and perform a complete test of
statistical hypothesis.
Production line quantities for two managers in two plants of a large company
are compared. Each data value represents the amount of production during
randomly selected 1-hour periods over a whole week.
Manager A:
15 13 8 16 12 15 12 18 11 12
9 10 7 9
Manager B:
14 15 10 16 11 13 15 12 14 11
1) One-tailed or Two-tailed:____________________________________
2) Parametric or Nonparametric:________________________________
a) Ho: _____________________________________________________
Ha: _____________________________________________________
e) Computation:
α= _________
= _________
f) Decision:__________________________________________________
g) Conclusion: _______________________________________________
Module 12
Test of Hypothesis: Case of Two or More
Population Means – One-way Classification
Learning Objectives
One-Way ANOVA
Using SPSS, solve the following problem and perform a complete test of
statistical hypothesis.
1) Response Variable:____________________________________
3) Parametric or Nonparametric:________________________________
a) Ho: __________________________________________________________
Ha: __________________________________________________________
e) Computation:
α= _________
= _________
f) Decision:_______________________________________________________
g) Conclusion: _____________________________________________________
Module 13
Test of Hypothesis: Case of Two or More
Population Means – Two-way Classification
Learning Objectives
At the end of this module, the participants should be able to:
Decide whether to use parametric or nonparametric on two or more
population means, two-way classification
Perform a statistical test of hypothesis on two or more population
means, two-way classification.
Features:
1. It employs one-directional blocking of experimental units; the units within a block are more or
less homogeneous.
2. Each block is a complete replication of the entire set of treatments.
3. The number of experimental units in a block should be equal to the number of
treatments, or some multiple of it.
Randomization
1. Group or stratify the experimental units into r blocks, with each block having t (or
some multiple of t) experimental units.
2. Allocate the treatments to the experimental units in a block at random, and do this
from block to block, independent of the results of randomization in other blocks.
Computation of Sums of Squares
$CF = (Y_{..})^2 / tr$
$TSS = \sum\sum (Y_{ij})^2 - CF$
$TrSS = \sum (Y_{i.})^2 / r - CF$
$RSS = \sum (Y_{.j})^2 / t - CF$
$ESS = TSS - TrSS - RSS$
Analysis of Variance Table:
SV           df                 SS      MS      Fc
Treatment    t – 1              TrSS    MSTr
Block        r – 1              RSS     MSR
Error        (t – 1)(r – 1)     ESS     MSE
Total        tr – 1             TSS
Test of Hypothesis
1. To test for difference among treatment means (effects)
Test statistic: Fc = MSTr/ MSE ~ F[t – 1,(t – 1)(r – 1)]
ns – not significant
For treatments: sig = 0.408 > α = 0.05, we fail to reject Ho.
7. Conclusion: At α = 5%, there are no significant differences among treatment means.
For blocks: sig = 0.295 > α = 0.05, we fail to reject Ho.
7. Conclusion: At α = 5%, there are no significant differences among block means.
Example:
A clothing manufacturer conducted an experiment to study the effect on
productivity of increases in its employee’s hourly wages. 4 treatments
were used and 12 employees were selected and grouped according to the
length of time they had been with the company. The employees were
observed for 3 weeks, and their productivity was measured as the average
number of nondefective garments each produced per hour. The resulting
productivity measures appear in the table:
TREATMENTS
No increase in Increase hourly Increase hourly Increase hourly
hourly wage wage by $0.50 wage by $1.00 wage by $1.50
Group 1 (less than 1 year) 2.4 3.0 3.1 3.2
Group 2 (1-5 years) 4.8 6.1 5.9 5.7
Group 3 (over 5 years) 5.1 7.0 7.2 7.3
a) Is there evidence that the mean productivity levels differ among the
four pay programs? Use α=0.01
b) Is there evidence that the mean productivity levels differ among the 3
groups? Use α=0.05
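The sums-of-squares formulas given earlier in this module can be applied by hand to the productivity table. The sketch below does so with t = 4 treatments (wage programs) and r = 3 blocks (length-of-service groups). It is a check on the arithmetic, not a replacement for the SPSS output: the block statistic MSR/MSE is assumed by analogy with the treatment statistic MSTr/MSE, and sig values are not computed here.

```python
# Rows are blocks (Groups 1-3); columns are treatments (the four wage programs).
productivity = [
    [2.4, 3.0, 3.1, 3.2],
    [4.8, 6.1, 5.9, 5.7],
    [5.1, 7.0, 7.2, 7.3],
]
r = len(productivity)     # number of blocks
t = len(productivity[0])  # number of treatments

grand_total = sum(sum(row) for row in productivity)
cf = grand_total ** 2 / (t * r)  # correction factor CF

treatment_totals = [sum(row[i] for row in productivity) for i in range(t)]
block_totals = [sum(row) for row in productivity]

tss = sum(y ** 2 for row in productivity for y in row) - cf  # total SS
trss = sum(total ** 2 for total in treatment_totals) / r - cf  # treatment SS
rss = sum(total ** 2 for total in block_totals) / t - cf       # block SS
ess = tss - trss - rss                                         # error SS

ms_tr = trss / (t - 1)
ms_r = rss / (r - 1)
ms_e = ess / ((t - 1) * (r - 1))

f_treatment = ms_tr / ms_e  # compared with F[t - 1, (t - 1)(r - 1)]
f_block = ms_r / ms_e       # assumed analogue for blocks, F[r - 1, (t - 1)(r - 1)]

print("TrSS:", round(trss, 3), "RSS:", round(rss, 3), "ESS:", round(ess, 3))
print("Fc (treatment):", round(f_treatment, 2), "Fc (block):", round(f_block, 2))
```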
sig = 0.122
α = 0.05
sig = 0.018
α = 0.05
Using SPSS, solve the following problem and perform a complete test of
statistical hypothesis.
A food chain sells a particular item at all its stores. Each store carries three
brands, two of which are economy brands. The management decides to
discontinue selling one of the economy brands. It has decided to look at the turn
time of each brand – i.e., the average time between successive purchases of
the same brand. Five of the stores in the chain are selected, and an
employee in each store reports the turn time (in min) for each brand.
STORE BRAND
1 4.1 3.9
2 5.2 5.1
3 5.0 5.0
4 4.9 4.7
5 6.1 5.9
Is there a difference in the mean turn times for the two economy brands?
Use α = 0.01
Is there a difference in the mean turn times for the 5 stores? Use α =
0.05
1) Treatment:____________________________________
2) Block: ______________________________________
3) Parametric or Nonparametric:________________________________
a) Ho: __________________________________________________________
Ha: __________________________________________________________
e) Computation:
α= _________
= _________
f) Decision:_______________________________________________________
g) Conclusion: _____________________________________________________
a) Ho: __________________________________________________________
Ha: __________________________________________________________
e) Computation:
α= _________
= _________
f) Decision:_______________________________________________________
g) Conclusion: _____________________________________________________
Module 14
Using SPSS to find Simple Random Samples
Learning Objectives
At the end of this module, the participants should be able to:
Draw simple random samples from the constructed frame in SPSS
data
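Drawing a simple random sample means every unit in the frame has the same chance of selection and no unit is picked twice. A minimal sketch, assuming a hypothetical frame of 100 numbered units, a sample size of 10, and an arbitrary seed:

```python
import random

# Hypothetical sampling frame: 100 units identified by ID numbers 1 to 100.
frame = list(range(1, 101))

# A fixed seed (arbitrary choice) makes the draw reproducible.
rng = random.Random(2014)

# Simple random sample without replacement.
sample = rng.sample(frame, k=10)
print(sorted(sample))
```

Recording the seed alongside the sample lets another researcher reproduce the same draw from the same frame.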
Module 15
Measures of Correlations and Relationships
Learning Objectives
At the end of this module, the participants should be able to:
compute the correlation coefficient & test its significance.
compute the rank correlation coefficient & test its significance.
perform an appropriate test for categorical data – the chi-square test
(χ2) for independence.
Using SPSS, solve the following problem and perform a complete test of
statistical hypothesis.
A random sample of 400 married men, all retired or at least 65 years old, was classified according to
educational attainment and number of children.
Educational Attainment    Number of Children
                          0-2    3-5    Over 5
None                      12     22     26
Elementary                14     59     37
High school               20     80     34
College                   26     31     19
Test the hypothesis that the number of children is independent of the level of education attained by
the father at α = 0.05.
1) Independent Variable:____________________________________
3) Parametric or Nonparametric:________________________________
a) Ho: __________________________________________________________
Ha: __________________________________________________________
e) Computation:
α= _________
= _________
f) Decision:_______________________________________________________
g) Conclusion: _____________________________________________________
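For checking SPSS output on this exercise, the chi-square statistic can be computed directly from the table. The sketch below is an illustration under stated assumptions: it uses the cell counts exactly as printed, which add up to 380 rather than the 400 mentioned in the problem, and the helper `chi2_sf_even_df` relies on a closed form that holds only when the degrees of freedom are even (here df = 6).

```python
import math

# Observed counts as printed: rows = None, Elementary, High school, College;
# columns = 0-2, 3-5, Over 5 children.
observed = [
    [12, 22, 26],
    [14, 59, 37],
    [20, 80, 34],
    [26, 31, 19],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid for even df only."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

sig = chi2_sf_even_df(chi_square, df)
print("n:", n, "chi-square:", round(chi_square, 3), "df:", df, "sig:", sig)
```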
Module 16
Regression Analysis
Learning Objectives
At the end of this module, the participants should be able to:
formulate predicting equation and test its significance
perform at least simple linear regression analysis
The nearer its value is to 1, the better the fit of the regression line.
Note: If the model is not significant, do not use it for prediction, because the
relationship might not be linear.
Using SPSS, solve the following problem and perform a complete test of
statistical hypothesis.
A young economist wants to verify if wage is related to the educational background of an
individual. He interviewed 20 randomly chosen individuals and obtained the following
results:
Observation No. of Years in Monthly Observation No. of Years in Monthly
No. School Wage (P) No. School Wage (P)
1 0 300 11 15 1600
2 3 400 12 10 900
3 6 600 13 17 2000
4 10 800 14 8 700
5 1 400 15 14 1250
6 11 950 16 17 2500
7 11 950 17 10 850
8 7 650 18 13 1200
9 14 1000 19 9 600
10 2 450 20 14 1500
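To cross-check SPSS regression output, the least-squares line can be fitted by hand. The sketch below takes the 20 observations from the table (years in school as x, monthly wage as y) and computes the slope, the intercept, and the coefficient of determination. It covers only the fitted line; the significance test of the model is left to the SPSS output.

```python
# Observations 1-20 from the table: years in school (x) and monthly wage in pesos (y).
years = [0, 3, 6, 10, 1, 11, 11, 7, 14, 2, 15, 10, 17, 8, 14, 17, 10, 13, 9, 14]
wage = [300, 400, 600, 800, 400, 950, 950, 650, 1000, 450,
        1600, 900, 2000, 700, 1250, 2500, 850, 1200, 600, 1500]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(wage) / n

# Corrected sums of squares and cross-products.
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in wage)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, wage))

slope = sxy / sxx                        # change in wage per extra year of school
intercept = mean_y - slope * mean_x      # predicted wage at zero years of school
r_squared = sxy ** 2 / (sxx * syy)       # share of wage variation explained by the line

print(f"Predicted wage = {intercept:.2f} + {slope:.2f} * years")
print(f"R squared = {r_squared:.3f}")
```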