By Reza Beebeejaun
This section deals with how to code, input and clean the data. We will focus on preliminary data analysis techniques such as frequency distributions, and also discuss hypothesis testing using various analysis techniques.
Data editing
The purpose of editing is to generate data which is:
accurate
consistent with the intent of the question and other information in the survey
uniformly entered
complete
arranged to simplify coding and tabulation

One of the major editing problems concerns the faking of interviews. One of the best ways to detect fraudulent interviews is to add a few open-ended questions within the questionnaire.
Once the data are collected, it is important to use editing and coding procedures to input the data into the appropriate statistical software. Several steps are required to prepare the data for analysis: data editing and coding, data entry, and data cleaning.
Coding
Coding involves assigning numbers or other symbols to answers so that the responses can be grouped into a limited number of classes or categories; it entails the assignment of a numerical value to each individual response for each question within the survey. Coding closed-ended questions is much easier, as they are structured questions whose possible responses are predetermined. For open-ended questions, content analysis is used, which provides an objective, systematic and quantitative description of the responses.
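As a sketch, coding a closed-ended question amounts to mapping each predetermined response onto a number; the question and the code values below are hypothetical examples, not taken from any particular survey.

```python
# Hypothetical coding scheme for a closed-ended question:
# "What is your marital status?" -> numeric codes for analysis.
marital_codes = {"single": 1, "married": 2, "divorced": 3, "widowed": 4}

responses = ["married", "single", "married", "widowed"]
coded = [marital_codes[r] for r in responses]
print(coded)  # [2, 1, 2, 4]
```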
Data entry

Once the questionnaire is coded appropriately, researchers input the data into a statistical software package. There are various methods of data entry. Manual data entry, or keyboarding, remains a mainstay for researchers who need to create a data file immediately and store it in minimal space on a variety of media.
Data cleaning
Data cleaning focuses on error detection and consistency checks, as well as the treatment of missing responses. The first step in the data cleaning process is to check each variable for data that are out of range, also called logically inconsistent data. Such data must be corrected, as they can hamper the overall analysis.
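An out-of-range check can be sketched as follows; the variable names and valid ranges are hypothetical, chosen only for illustration.

```python
# Flag out-of-range (logically inconsistent) values for cleaning.
# Hypothetical coding: "gender" is 1 or 2, "age" is plausible at 18-99.
valid_ranges = {"gender": (1, 2), "age": (18, 99)}

records = [
    {"id": 1, "gender": 1, "age": 34},
    {"id": 2, "gender": 3, "age": 29},   # gender out of range
    {"id": 3, "gender": 2, "age": 150},  # age out of range
]

def out_of_range(record):
    """Return the names of variables whose values fall outside the valid range."""
    return [var for var, (lo, hi) in valid_ranges.items()
            if not lo <= record[var] <= hi]

flagged = {r["id"]: out_of_range(r) for r in records if out_of_range(r)}
print(flagged)  # {2: ['gender'], 3: ['age']}
```

Flagged cases would then be corrected against the original questionnaires, using the original data as the reference point.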
In this section we will focus on the first stage of data analysis, which is mostly concerned with descriptive statistics. Descriptive statistics include the mean, standard deviation, range of scores, skewness and kurtosis. These statistics can be obtained using the Frequencies, Descriptives or Explore commands in SPSS. For analysis purposes, researchers group the primary scales of measurement (nominal, ordinal, interval and ratio) into two categories: nominal- and ordinal-scale variables are called categorical variables (gender, marital status and so on), while interval- and ratio-scale variables are called continuous variables (height, length, distance, temperature and so on).
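These descriptive statistics can also be computed directly. The sketch below uses plain moment-based formulas on a hypothetical sample; SPSS reports slightly different bias-adjusted skewness and kurtosis values, so treat this as an illustration of the concepts rather than a reproduction of SPSS output.

```python
import statistics as st

# Hypothetical sample of scores, including one extreme value (45).
scores = [12, 15, 14, 10, 18, 22, 13, 16, 15, 45]

n = len(scores)
mean = st.mean(scores)
sd = st.stdev(scores)              # sample standard deviation
rng = max(scores) - min(scores)    # range of scores

# Moment-based skewness and excess kurtosis.
m2 = sum((x - mean) ** 2 for x in scores) / n
m3 = sum((x - mean) ** 3 for x in scores) / n
m4 = sum((x - mean) ** 4 for x in scores) / n
skewness = m3 / m2 ** 1.5          # positive: long right tail
kurtosis = m4 / m2 ** 2 - 3        # positive: higher peak than a normal curve

print(mean, sd, rng, skewness, kurtosis)
```

The single extreme value (45) produces positive skewness and positive excess kurtosis, matching the interpretation given later for the normality checks.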
Inconsistencies or unexpected results should be investigated using the original data as the reference point.
Frequency Distributions

The Frequencies command produces frequency tables and histograms.
Normality test
The main things to look for are: (a) the 5% trimmed mean (if there is a big difference between the original mean and the 5% trimmed mean, there are many extreme values in the dataset); (b) the skewness and kurtosis values, which are also provided by this technique: a positive skewness value indicates a long right tail, and a positive kurtosis value indicates a high peak.
(c) The test of normality
SPSS produces two Sig. values: the first is for the Kolmogorov-Smirnov test, the second for the Shapiro-Wilk test. For sample sizes greater than 50 (n > 50), use the result from the Kolmogorov-Smirnov test; for sample sizes less than 50 (n < 50), use the result from the Shapiro-Wilk test. A Sig. value less than or equal to 0.05 means the data are not normally distributed; a Sig. value greater than 0.05 means the data are normally distributed.
The ratio of skewness to its standard error can be used as a test of normality: you can reject normality if the ratio is less than -2 or greater than +2. The same rule applies to the ratio of kurtosis to its standard error.
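The trimmed-mean comparison and the skewness ratio can be sketched as follows. The data are hypothetical, and sqrt(6/n) is the common large-sample approximation to the standard error of skewness, not the exact formula SPSS uses.

```python
import math

# Hypothetical sample with one extreme value (45).
data = sorted([12, 15, 14, 10, 18, 22, 13, 16, 15, 45,
               14, 16, 17, 13, 15, 14, 16, 15, 17, 18])

n = len(data)
k = int(0.05 * n)                  # cases trimmed from each end (5%)
trimmed = data[k:n - k] if k else data
mean = sum(data) / n
trimmed_mean = sum(trimmed) / len(trimmed)

# Moment-based skewness and its ratio to the approximate standard error.
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
skewness = m3 / m2 ** 1.5
se_skew = math.sqrt(6 / n)         # large-sample approximation
ratio = skewness / se_skew         # reject normality if outside [-2, +2]

print(mean, trimmed_mean, ratio)
```

Here the trimmed mean falls well below the mean and the skewness ratio exceeds +2, both pointing to extreme values and a non-normal distribution.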
(d) The histograms provide a visual representation of the data distribution (ideally a bell-shaped curve).
Bivariate analysis
Bivariate analysis explores relationships/associations and differences between two variables.

Measuring association: bivariate correlations, partial correlation, multiple correlation (multiple regression), crosstabs.
Measuring differences: t-tests, ANOVA.
Chi-square test

The chi-square test is used to test hypotheses about relationships between two nominal- or ordinal-level variables.
A chi-square test is used when you want to see whether there is a relationship between two categorical variables. Chi-square is simply an extension of a cross-tabulation that gives you more information about the relationship; however, it provides no information about the direction of the relationship (positive or negative) between the two variables. In SPSS, the chisq option is used on the statistics subcommand of the crosstabs command to obtain the test statistic and its associated p-value. To conduct a chi-square test, the following conditions must be met: there must be at least 30 observations (people) in the table, and each cell must contain an expected count of 5 or more. To conduct the test, we compare the observed data (from the study results) with the data we would expect to see.
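The observed-versus-expected comparison can be illustrated on a hypothetical 2x2 table; the expected counts are those implied by independence of the two variables (row total x column total / grand total).

```python
# Hypothetical 2x2 cross-tabulation, e.g. gender by smoking status.
observed = [[20, 30],   # female: smoker, non-smoker
            [30, 20]]   # male:   smoker, non-smoker

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)  # 100 observations, comfortably above 30

# Chi-square: sum of (observed - expected)^2 / expected over all cells.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_sq += (obs - expected) ** 2 / expected

print(round(chi_sq, 2))  # 4.0
```

Every expected count here is 25, so the cell-count condition is met; the statistic would then be compared against the chi-square distribution to obtain the p-value.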
8/23/2011
We use cross-tabulation when: we want to demonstrate a relationship between two categorical variables (nominal or ordinal), or we want a descriptive statistical measure that tells us whether differences among groups are large enough to indicate some sort of relationship among the variables.
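Building a cross-tabulation from raw categorical data can be sketched with a counter over value pairs; the two variables below are hypothetical.

```python
from collections import Counter

# Hypothetical categorical data: gender and a yes/no response.
rows = ["female", "female", "male", "male", "female", "male"]
cols = ["yes", "no", "yes", "yes", "yes", "no"]

# Each cell of the crosstab is the count of one (row, column) pair.
crosstab = Counter(zip(rows, cols))
for gender in ("female", "male"):
    print(gender, [crosstab[(gender, ans)] for ans in ("yes", "no")])
# female [2, 1]
# male [2, 1]
```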
Correlation
Correlation examines relationships among variables, showing the strength and the direction of a relationship. A correlation tells you how, and to what extent, two variables are linearly related. There are two types: (1) rank correlation, for ordinal variables; (2) linear correlation, for interval variables.
Ordinal scale: Spearman's rho
Interval scale: Pearson's r
Rank correlation

Rank correlation is used for ordinal variables; the test used is Spearman's rho.
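For data without tied ranks, Spearman's rho reduces to the classic difference-of-ranks formula; the two rankings below are hypothetical.

```python
# Spearman's rho for two ordinal rankings with no ties:
# rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference.
x = [1, 2, 3, 4, 5]   # e.g. rank on satisfaction
y = [2, 1, 4, 3, 5]   # e.g. rank on loyalty

n = len(x)
d_sq = sum((a - b) ** 2 for a, b in zip(x, y))
rho = 1 - 6 * d_sq / (n * (n ** 2 - 1))
print(rho)  # 0.8
```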
The correlation coefficient (r) may range from -1 to 1, where -1 or 1 indicates a perfect relationship. The further the coefficient is from 0, whether positive or negative, the stronger the relationship between the two variables.
r > 0.70 indicates a strong correlation/relationship; r between 0.30 and 0.70 indicates a moderate correlation; r < 0.30 indicates a weak correlation.
Linear correlation
Linear correlation is used for interval (continuous) variables; the test used is Pearson's r. For example, the correlation between height and weight, where the hypothesis is that tall people are heavier than short ones.
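Pearson's r for the height/weight example can be computed from its definition; the measurements below are made up for illustration.

```python
import math

# Hypothetical height (cm) and weight (kg) measurements.
height = [150, 160, 165, 170, 180, 185]
weight = [52, 60, 63, 68, 77, 83]

n = len(height)
mx, my = sum(height) / n, sum(weight) / n
cov = sum((x - mx) * (y - my) for x, y in zip(height, weight))
sx = math.sqrt(sum((x - mx) ** 2 for x in height))
sy = math.sqrt(sum((y - my) ** 2 for y in weight))
r = cov / (sx * sy)   # close to +1: taller people are heavier
print(round(r, 3))
```

Because r is well above 0.70, this would be reported as a strong positive correlation.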
Linear Regression

Linear regression estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable.

Partial correlations

Partial correlation describes the linear relationship between two variables while controlling for the effects of one or more additional variables, e.g. height and weight controlling for BMI. Testing the hypothesis: if the p-value is greater than the significance level (0.05), do not reject the null hypothesis; if the p-value is less than the significance level (0.05), reject the null hypothesis.
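The first-order partial correlation can be sketched with the standard formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)); the height, weight and BMI values below are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Hypothetical data: x = height, y = weight, z = BMI (control variable).
height = [150, 160, 165, 170, 180, 185]
weight = [52, 60, 63, 68, 77, 83]
bmi    = [23.1, 23.4, 23.1, 23.5, 23.8, 24.3]

r_xy = pearson(height, weight)
r_xz = pearson(height, bmi)
r_yz = pearson(weight, bmi)

# Correlation of height and weight after controlling for BMI.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
print(round(partial, 3))
```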
Independent-Samples T Test
A test procedure that compares the means of two groups of cases to determine whether there is a significant difference between them, e.g. groups coded 1 = female and 2 = male, comparing mean cholesterol between the two groups.
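The pooled-variance t statistic for the cholesterol example can be sketched as follows; the cholesterol values for the two groups are hypothetical.

```python
import math
import statistics as st

# Hypothetical cholesterol levels for two independent groups.
female = [180, 195, 210, 205, 190]   # group 1
male   = [220, 235, 225, 240, 230]   # group 2

n1, n2 = len(female), len(male)
m1, m2 = st.mean(female), st.mean(male)
v1, v2 = st.variance(female), st.variance(male)

# Pooled-variance independent-samples t statistic.
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
print(round(t, 2))  # -5.31
```

A t statistic this far from zero would give a p-value well below 0.05, so the two group means would be judged significantly different.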
Paired-Samples T Test

The Paired-Samples T test is used to test whether one continuous variable has a significantly higher mean value than another for the same cases in the same data file.
Repeated measures are obtained from one group of participants, such as measuring participants before a treatment is applied and again after the treatment. Thus, each person serves as his/her own control, and because the two sets of scores to be compared are obtained from the same people, the two groups of scores are not independent.
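The paired t statistic is computed from the difference scores of each participant; the before/after measurements below are hypothetical.

```python
import math
import statistics as st

# Hypothetical before/after treatment scores for the same six participants.
before = [120, 132, 128, 140, 125, 130]
after  = [115, 128, 126, 133, 121, 127]

# Each participant's difference score; the test is a one-sample t on these.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
t = st.mean(diffs) / (st.stdev(diffs) / math.sqrt(n))
print(round(t, 2))  # 5.93
```

Because every difference is positive and consistent, the t statistic is large, suggesting a significant before/after change.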
ANOVA
Simple or one-way ANOVA is used to determine whether there is a significant difference between two or more means. Most statistical software programs will calculate ANOVA, though the output varies slightly between programs. When comparing continuous variables between groups of study subjects, use a t-test for comparing 2 groups and an F-test for comparing 3 or more groups; both tests result in a p-value. Example: testing age differences between 2 groups. If the groups have similar average ages and a similar distribution of age values, the t-statistic will be small and the p-value will not be significant; if the average ages of the 2 groups are different, the t-statistic will be larger and the p-value smaller (a p-value < 0.05 indicates the two groups have significantly different ages).
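The one-way ANOVA F statistic is the ratio of between-group to within-group variability; the three groups of scores below are hypothetical.

```python
import statistics as st

# Hypothetical scores for three groups.
groups = [
    [20, 22, 19, 24],
    [28, 30, 27, 31],
    [35, 33, 36, 34],
]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total observations
grand_mean = sum(sum(g) for g in groups) / n

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (st.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - st.mean(g)) ** 2 for x in g) for g in groups)

# F = mean square between / mean square within.
f = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f, 1))  # 53.6
```

With group means this far apart relative to the spread within groups, the F statistic is large and the p-value would be well below 0.05.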
Multivariate Analysis
Multivariate analysis provides a simultaneous analysis of multiple independent and dependent variables. It allows us to estimate the effects of each independent variable on the dependent variable.
Univariate Techniques
Multiple regression is very similar to simple regression, except that in multiple regression you have more than one predictor variable in the equation. In the SPSS Univariate (GLM) procedure, the dependent variable is continuous and the independent variables are entered as factors or covariates: factors should be categorical variables and covariates should be continuous variables.
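The least-squares idea behind regression can be sketched in the one-predictor case; multiple regression solves the analogous normal equations for several predictors at once. The x and y values below are hypothetical.

```python
# Simple least-squares regression: find slope and intercept that best
# predict y from x (hypothetical data, roughly y = 2x).
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# slope = Sxy / Sxx; intercept makes the line pass through the means.
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx
print(round(slope, 2), round(intercept, 2))  # slope near 2, intercept near 0
```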