
Chapter 7

Planning the Data Analysis


PRESENTED BY:

GORHAN KAROLIN, MUAJA MARGARETH, NATALIA CHRISTY


This chapter deals with

1. A brief description of data processing and analysis packages for computerised analysis.
2. Common rules for adapting data for computerised analysis, including coding.
3. Some analytical approaches for univariate, bivariate and multivariate analysis.
4. The three factors which determine the analytical technique to be selected for a problem.
5. The concept of hypothesis testing, and
6. How to perform a 't' test using the computer.
Statistical and Data Processing Packages

1. Today, in most cases, the computer is used for data processing and analysis.

2. Many readers will be familiar with packages like Excel and FoxPro, which are essentially spreadsheet and database management packages.

3. But for the types and quantum of data generated by a field survey, there is another set of packages available, and the student can choose from several which are commercially available.

4. Some of these packages are SPSS, SAS, STATISTICA and SYSTAT.

5. The new versions of these packages are usually WINDOWS-based.


Types of Analysis

Packages like SPSS, STATISTICA, etc. can be used for two major types
of applications in Marketing Research:

 Data Processing – General


 Statistical Analysis – Specialised (Univariate, Bivariate and
Multivariate)
Data Processing

This application includes coding and entering data for all respondents, for all
questions on a questionnaire. For example, there may be a question which
asks for the education level of a participant. The choices may be 12th or
below, Graduate, Post-Graduate and any other.
The first step in data processing is to assign a code for each of the options –
for instance, 1 for 12th or below, 2 for Graduate, 3 for Post-Graduate and 4
for any other.

Next, depending on the option ticked by each respondent, the respective code is entered against his row (usually, the data for one respondent is entered in a row assigned to him in the data set), in the column assigned to the question in the data matrix.
The end result of data processing for this question would be to be able to tell the
researcher how many of the sample of respondents were of education level 12th or
below (Code 1), how many were Graduates (Code 2), how many Post-Graduates
(Code 3) and how many were in any other category (Code 4). For example, it
could be that out of a sample of 500 respondents, 100 were in Code 1 category, 200
in Code 2, 150 in Code 3, and 50 in Code 4 (Any other).

Similarly, all other questions on the questionnaire are processed, and totals for each
category of answers can be computed.

The menu commands used for such data processing are called FREQUENCIES,
SUMMARY STATISTICS, DESCRIPTIVE STATISTICS, or TABLES depending
on the software package used.
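As a package-neutral sketch of what such a FREQUENCIES-style command produces (written in Python rather than any of the packages named above, with the assumed counts from the education example expanded into individual coded responses):

```python
from collections import Counter

# Hypothetical coded answers to the education question for 500 respondents:
# 1 = 12th or below, 2 = Graduate, 3 = Post-Graduate, 4 = Any other
education = [1] * 100 + [2] * 200 + [3] * 150 + [4] * 50

# A FREQUENCIES-style count: how many respondents fall under each code
freq = Counter(education)
for code in sorted(freq):
    print(code, freq[code])
```

The output is simply the tally per code, which is exactly what the data processing step described above delivers to the researcher.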
Data Input Format

Most of the above-mentioned packages have a format similar to spreadsheet packages for data entry. Readers familiar
with any spreadsheet package like Excel can easily handle the data entry (input) part of these statistical packages.

The input follows a matrix format, where the variable name/number appears on the column heading and data for one
person (respondent or record, also called a case in statistical terminology) is entered in one row.

For example, the data for respondent no. 1 is entered in row 1. The answer given by respondent no. 1 to Question 1 is entered in Row 1 and Column 1. The answer given by respondent no. 1 to Question 2 is entered in Row 1 and Column 2. The input matrix looks like the following:
              Var 1   Var 2   Var 3   …   Var k
Respondent 1    x       x       x         x
Respondent 2    x       x       x         x
Respondent 3    x       x       x         x
…
Respondent n    x       x       x         x
Here, n would be the sample size of the marketing research study, consisting of k variables. Sometimes, each question on a questionnaire can generate more than one variable.
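The same n x k layout can be mimicked in plain Python as a minimal sketch (the variable names and values below are invented for illustration, not taken from any actual questionnaire):

```python
# A stand-in for the n x k data matrix: each key is a variable (column),
# each list holds one value per respondent (row), in respondent order
data = {
    "Var1": [1, 2, 2],  # answers of respondents 1..3 to Question 1
    "Var2": [3, 3, 1],  # answers to Question 2
    "Var3": [5, 4, 2],  # answers to Question 3
}

k = len(data)           # number of variables (columns)
n = len(data["Var1"])   # sample size (rows)

# The answer given by respondent no. 1 to Question 2 sits in row 1, column Var2
answer_r1_q2 = data["Var2"][0]
```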
Coding

One limitation of doing analysis on the computer with these statistical packages is that all data must be
converted into numerical form. Otherwise, it cannot be counted or manipulated for analysis. So, all data
must be coded and converted to numbers, if it is non-numerical.

We saw one example of coding in the previous section, where we gave numerical codes of 1, 2, 3 and 4 to the
education level of the respondent.

Similarly, any non-numerical data can be converted into numbers. Usually, all nominal scale variables
(categorical variables) need to be coded and entered into the packages.

An important aspect of coding is to remember which code stands for what. Most software packages have a
facility called definition of Value Labels for each variable, which should be used to define the codes for every
value of a variable. This is illustrated in a section labelled "value labels" a little later.
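Converting non-numerical answers into codes is a simple mapping step; a sketch in Python, using the hypothetical education coding scheme from the earlier example:

```python
# Hypothetical coding scheme for the education question
codes = {"12th or below": 1, "Graduate": 2, "Post-Graduate": 3, "Any other": 4}

# Raw (non-numerical) answers as they might come off the questionnaires
raw_answers = ["Graduate", "12th or below", "Post-Graduate", "Graduate"]

# Convert every answer to its numeric code before entry into the data matrix
coded = [codes[a] for a in raw_answers]
```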
Variable

Usually, a question on the questionnaire represents a Variable in the package. This is not always the case, because sometimes we may create more than one variable out of the answers to a question.

For example, it could be a ranking question which requires respondents to rank 5 brands on a scale of 1 to 5.
We may define Ranking given to Brand X as variable 10, and ranks given to it could be any number from
1 to 5. Similarly, Ranking of Brand Y could be defined as variable 11, and again, the responses could be
from 1 to 5.

Therefore, we may end up with 5 variables from that single ranking question on the questionnaire. It all depends on how we want the output to look, and how we want to analyse it.
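A sketch of how one ranking question spawns several variables (the brand and variable names below are hypothetical, echoing the Brand X / Brand Y example above):

```python
# One ranking question generating five variables, one per brand:
# each respondent ranks all five brands from 1 (best) to 5 (worst)
responses = [
    {"RANK_A": 1, "RANK_B": 3, "RANK_C": 2, "RANK_D": 5, "RANK_E": 4},
    {"RANK_A": 2, "RANK_B": 1, "RANK_C": 4, "RANK_D": 3, "RANK_E": 5},
]

variables = sorted(responses[0])  # five variables from a single question

# Sanity check: each respondent must use the ranks 1..5 exactly once
valid = all(sorted(r.values()) == [1, 2, 3, 4, 5] for r in responses)
```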

One very useful provision that all the packages have is the variable name. For instance, if the particular
question (variable) represents the respondent’s Income, then the Variable Name can be INCOME on the
column representing this variable.
Variable Label and Format

There is a provision to give a longer name to each variable if required (usually called a Variable Label) in each one of the packages.

There is also a provision by which the user can define in these packages the type of variable (numeric or non-numeric), and the number of digits it will have.

A non-numeric variable can be defined, but no mathematical calculations can be performed with it. For a numerical variable, you can also define the number of decimal points (if applicable).

SPSS Commands for Defining Variable Labels

In SPSS, you can double click on the column heading of the variable and fill out the Variable Name, format, etc. in the dialog box/table which opens up. In SPSS version 10.1, a table opens up where the Variable Name is filled in the first column, and the Label in another column, etc. In older versions of SPSS, a dialogue box opens when you double click on a variable (column heading) in the data file, and you have to fill up the relevant Variable Label, format, etc. in the dialogue box.
Value Labels/Codes

Sometimes, the different values taken by the variable are continuous numbers. But sometimes, they are categories. For
example, income categories could be

Below 5,000 per month
5,001 to 10,000 per month
10,001 to 20,000 per month
More than 20,000 per month

Each of these could be given numerical codes such as 1, 2, 3 or 4. To save these codes along with their meanings (labels) in
the computer, we have to use a feature called “Value Labels”. We can use the feature and label 1 as “Below Rs. 5,000 p.m.”, 2
as “Rs. 5,001 to 10,000 p.m.”, 3 as “Rs. 10,001 to 20,000 p.m.”, and 4 as “More than Rs. 20,000 p.m.”. The words used in
quotes are called Value Labels, and can be defined for each variable separately.

For each categorical variable that we have allotted codes to, we need to record the codes along with the Variable Name and
Question Number for our records in a separate coding sheet also.

Definition of Value Labels simplifies the problems while interpreting the output. The value labels are generally printed along
with the codes when a table is printed involving the given variables (for example, income).
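A value-label table is, in effect, a lookup from code to label; a minimal Python sketch using the income categories above (codes and sample data hypothetical):

```python
# Hypothetical value labels for a coded income variable
value_labels = {
    1: "Below Rs. 5,000 p.m.",
    2: "Rs. 5,001 to 10,000 p.m.",
    3: "Rs. 10,001 to 20,000 p.m.",
    4: "More than Rs. 20,000 p.m.",
}

# Coded income answers for five respondents
income_codes = [2, 1, 4, 2, 3]

# When printing output, show the label alongside (or instead of) each code
labelled = [value_labels[c] for c in income_codes]
```

This mirrors what the package does automatically once Value Labels have been defined: tables are printed with meaningful labels rather than bare numbers.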
SPSS commands for Defining Value Labels

In SPSS, the same procedure described earlier for defining a Variable Label also gives the opportunity to define Value Labels.

That is, double click on the column heading of a variable. In the table or dialog box which opens up, go to the relevant space
for Value Labels, and define a label for each value of a variable, one after another.

In SPSS 10.1, a table opens up when you double click. You have to then go to a column labeled VALUES, select the cell in
the relevant row, and click to open a Value Labels dialogue box.

In the Value Labels dialogue box, type the value labels: for example, 1 as the value and “Below Rs. 5,000” as the label, then click ADD; then 2 as the value, followed by “Rs. 5,001 to 10,000” as its label, and so on. Do this for all value labels for a variable.

Repeat the process for other variables where value labels have to be defined.
Record Number / Case Number

Every row is called a “case” or “record”, and represents data for one respondent. In rare
cases, the respondent may occupy two rows, if the number of variables is too large to be
accommodated in one row. We may not encounter such cases in our examples, but these are
sometimes encountered in commercial applications of Marketing Research. The manual for
the package being used (SPSS, SAS, SYSTAT etc.) can be referred to for an explanation of
how to use two or more rows for representing a single case (respondent).
Missing Data

Frequently, respondents do not answer all the questions asked. This leaves some blanks on the questionnaire.
There are two approaches for handling this problem.

Pairwise Deletion: The computer can be asked to use pairwise deletion, which means that if one respondent’s data is missing for one question, then the package simply treats the sample size as one less than the given number of respondents for that question alone, and computes the information asked for. All other questions are treated as usual.

Listwise Deletion: This instruction to the computer results in the entire row of data being deleted, even if there is only one missing (blank) piece of data in the questionnaire. This may result in a large reduction in sample size, if there is a lot of missing data on different questions.
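The difference between the two approaches can be sketched in a few lines of Python (the data rows below are hypothetical, with None standing in for an unanswered question):

```python
# Hypothetical respondent rows; None marks an unanswered question
rows = [
    [1, 7, 3],
    [2, None, 4],   # respondent 2 skipped Question 2
    [1, 6, None],   # respondent 3 skipped Question 3
]

# Listwise deletion: any row containing a missing value is dropped entirely
listwise = [r for r in rows if None not in r]

# Pairwise deletion: each question keeps every respondent who answered it,
# so the effective sample size can differ from question to question
pairwise_n = [sum(r[q] is not None for r in rows) for q in range(3)]
```

With listwise deletion only one complete row survives here, while pairwise deletion keeps all three respondents for Question 1 and two each for Questions 2 and 3, illustrating why listwise deletion can shrink the sample drastically.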
Statistical Analysis

We have so far discussed general data processing applications of statistical packages. But these packages are capable of a lot of statistical tests, like the chi-squared test, the 't' test and the 'F' test.
They can also be used to perform analyses such as Correlation and Regression Analysis, ANOVA or Analysis of
Variance, Factor Analysis, Cluster Analysis, Discriminant Analysis, Multidimensional Scaling, Conjoint Analysis
and many other advanced statistical analyses. The packages we have mentioned (SPSS, SAS, SYSTAT) generally
perform most of these analyses. In addition, the statistical packages also have varying graphical capabilities for
drawing graphs.
 Most of the important statistical analysis techniques typically used by a marketing researcher are described in detail in later chapters. The exact commands used will vary depending on which statistical package is used by the reader. But in most of the current packages, a pull-down menu is used, and a Help feature is available online, so a user can easily perform most of these analyses if he is slightly familiar with the WINDOWS operating system and general data entry into packages like EXCEL. For details, the manual for whichever package is being used should be consulted.
Hypothesis Testing and Probability Values
(p values)

In manual forms of hypothesis testing, we generally compute the value of a statistic (the z, the t, or the F
statistic, for example), and compare it with a table value of the same statistic for a given constraint (sample size,
degrees of freedom, etc.).

But in the computer output for any analysis involving a statistical test, a more convenient way is to interpret the p-value printed for that test. Typically, if the p-value is less than the significance level we have set, the null hypothesis of the test is rejected.

 But what is a null hypothesis? In general, a null hypothesis is the opposite of the statistical relationship between variables that we expect to prove. In other words, if we want to check if variables x and y are related to each other, the null hypothesis would be that there is no significant relationship between x and y.

 This method of proving or disproving a hypothesis is very simple to understand and use
in the context of computers doing the testing. This is what we will use throughout this
book.
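The computerised decision rule used throughout this chapter can be written as a one-line helper (a sketch, not any package's API; the example p-value of .011 is the one reported later for the chapter's independent sample 't' test):

```python
def reject_null(p_value: float, significance: float = 0.05) -> bool:
    """Reject the null hypothesis when the test's p-value falls
    below the significance level set before the test was run."""
    return p_value < significance

# e.g. a reported p-value of .011 at the conventional .05 level
decision = reject_null(0.011, 0.05)
```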
Approaches to Analysis

Analysis of data is the process by which data is converted into useful information.
Raw data as collected from questionnaires cannot be used unless it is processed in
some way to make it amenable to drawing conclusions.
Three Types of Analysis

Broadly, we can classify analysis into three types:

1. Univariate, involving a single variable at a time


2. Bivariate, involving two variables at a time, and
3. Multivariate, involving three or more variables simultaneously.
Fig. 1 lists out the various options available to the analyst who wants to do univariate or bivariate analysis.

UNIVARIATE TECHNIQUES

Non-parametric Statistics
  One Sample:
    · chi-square
    · Kolmogorov-Smirnov
    · Runs
  Two or more samples:
    Independent:
      · chi-square
      · Rank Sums
      · Kolmogorov-Smirnov
    Dependent:
      · Sign
      · Wilcoxon
      · McNemar
      · Cochran Q

Parametric Statistics
  One Sample:
    · 't' test
    · Z test
  Two or more samples:
    Independent:
      · 't' test
      · Z test
      · ANOVA
    Dependent:
      · Paired sample 't' test
Fig. 2 lists out a roadmap for selecting appropriate multivariate analysis techniques.

MULTIVARIATE TECHNIQUES

Dependence Techniques
  One Dependent Variable:
    · ANOVA
    · Multiple Regression
    · Discriminant Analysis
    · Conjoint Analysis
  Multiple Dependent Variables:
    · MANOVA
    · Canonical Correlation

Interdependence Techniques
  Focus on Variables:
    · Factor Analysis
  Focus on Objects:
    · Cluster Analysis
    · Multidimensional Scaling
The choice of which of the above types of data analysis to use
depends on at least three factors:

1) the scale of measurement of the data,


2) the research design, and
3) assumptions about the test statistic being used, if one is used.
Scale of Data

If the variables being measured are nominally scaled or ordinally scaled, there are severe limitations on the usage of parametric multivariate statistics. Mostly, univariate or bivariate analysis can be used on nominal/ordinal data. For example, a ranking of 5 brands of audio systems by a sample of consumers may produce ordinal scale data consisting of these ranks.
Research Design

The second determinant of the analysis technique is the Research Design. For
example, whether one sample is taken or two, and whether one set of
measurements is independent of the other or dependent on the other determine
the analysis technique.


Assumptions About the Test Statistic or Technique

The third factor affecting the choice of analytical technique is the set of
assumptions made while using a particular test statistic.


The next chapter describes how simple tabulation and crosstabulation of data can be done. These two are the most widely
used analysis techniques in survey research.

A detailed coverage of the non-parametric techniques mentioned on the left side of Fig. 1 is beyond the scope of this book. Of these non-parametric tests, we will discuss only the chi-squared test for crosstabulations in the next chapter, because it is the most widely used in practice.
For the univariate and bivariate analysis of metric data (interval scale or ratio scale), we use 't' tests of different types, or the Z test. We will illustrate the use of two types of 't' tests, which are shown in the right half of Fig. 1. These are

 The independent sample 't' test and


 The paired sample 't' test

These two are the most likely tests which a marketing researcher would encounter.

The major focus of this book will be on simple and crosstabulations for univariate and bivariate analysis (used mainly for
non-metric data), and a variety of multivariate analysis techniques for special applications (using primarily metric data,
with a few exceptions).
Hypothesis for the t-Test

Before we illustrate the use of the independent sample 't' test and the paired sample 't' test, we will again discuss the concept
of hypothesis testing, in the context of the 't' test.

Suppose, as marketers of a brand of jeans, we wanted to find out whether a set of customers in Delhi and a set of customers
in Mumbai thought of our brand in the same way or not. Suppose we conducted a small survey in both cities and got
Ratings on an interval scale (assume it was a seven point scale with ratings 1 to 7) from our customers.

We now want to do a statistical test to find out if the two sets of Ratings are "significantly different" from each other or
not. We have to now set a level of "statistical significance" and select a suitable test. We also need to specify a null
hypothesis.

The 'null hypothesis' represents a statement to be used to perform a statistical test to prove or to disprove (reject) the
statement. In the above example, the null hypothesis for the 't' test would be "There is no significant difference in the
ratings given by customers in Mumbai and Delhi". In other words, the null hypothesis states that the mean (average) rating
from these two places is the same.
Now, we have to set a significance level for the test. This represents the chance that we may be making a mistake of a certain
type. It can also be set as (100 minus confidence level desired in the test, divided by 100). For example, if we desire that the
confidence level for the test should be 95 percent, then (100-95)/100, or .05, becomes the significance level.

We can think of it as a .05 probability that we are making a certain type of error (called Type I error) in our decision-making
process. Type I error is the error of rejecting the null hypothesis (wrongly, of course) when it is true.

Commonly used values of significance used in marketing research are .05 (corresponding to a confidence level of 95 percent) or
0.10 (corresponding to a confidence level of 90 percent). But there is no hard and fast rule, and the significance level can be set
at a different level if necessary.

Let us assume that we take the conventional value of .05 for our hypothesis test.
Now, a suitable test for the problem discussed above has to be found. In this case, from Fig. 1, we know that the independent
sample 't' test is required.

What do we expect to achieve from this test? We will either reject the null hypothesis (that is, prove that the Delhi and Mumbai
ratings are significantly different), or fail to reject it (conclude that there is no difference between the Delhi and Mumbai ratings).
The independent sample 't' test

Let us proceed with the same example and set up an independent sample 't' test as discussed above, at a
significance level of .05. Table 1 presents the input data (assumed) for the test. This assumes that 15 customers
of our brand each in Mumbai and Delhi were asked to rate our brand on a 7 point scale. The responses of all the 30 customers are in the column labelled 'Ratings' in the table. The column labelled 'City' indicates the city from which the ratings came, with a code of 1 for Mumbai and 2 for Delhi.
Table 2 presents the output from the independent sample 't' test performed on the above data. The decision
rule for the test (for any computerised output which gives a 'p' value for the test) at .05 significance level
is this -
If the 'p' value is less than the significance level set up by us for the test, we reject the null hypothesis.
Otherwise, we accept the null hypothesis. In this case, we find that the 'p' value for the 't' test is .011
assuming unequal variances in two populations. This value of .011 being less than our significance level
of .05, we reject the null hypothesis and conclude that the Ratings of Mumbai and Delhi are different. If
the 'p' value had been larger than .05, we would have accepted the null hypothesis that there was no
difference between the two ratings.
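The same test can be sketched outside SPSS; here is a hedged Python version using scipy (the ratings below are invented stand-ins, since Table 1 is not reproduced here, so the p-value will differ from the chapter's .011; equal_var=False requests the unequal-variances version the chapter's output assumes):

```python
from scipy import stats

# Hypothetical 7-point ratings from 15 customers in each city
mumbai = [6, 7, 5, 6, 7, 6, 5, 7, 6, 6, 5, 7, 6, 7, 5]
delhi  = [4, 3, 5, 4, 3, 4, 5, 3, 4, 4, 3, 5, 4, 3, 4]

# Independent sample 't' test, assuming unequal variances (Welch's version)
result = stats.ttest_ind(mumbai, delhi, equal_var=False)

# Decision rule at the .05 significance level
reject = result.pvalue < 0.05
```

With these (deliberately well-separated) ratings the p-value falls below .05, so the null hypothesis of equal mean ratings would be rejected, matching the chapter's conclusion for its own data.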
Manual Versus Computer-based Hypothesis Testing

Please note that conventional hypothesis testing would have required us to do a manual computation of the t value from the
data, compare it with a value from the 't' tables and arrive at the same kind of conclusion that we did.

The advantage of using the computer is that the test is performed by the package automatically, and we get the 'p' value for the
test in the computer output. All that we need to do is to compare the p-value from the computer output with our significance
level (usually .05), and reject the null hypothesis when the computer gives us a value less than the one set by us (less than .05
if we have set it at .05).

We are going to use this approach (computerised testing) throughout this book for all the tests and analytical procedures. This
removes the need for tedious manual calculations, and leaves the student to do managerial jobs like interpreting computer
outputs rather than waste time in manual computation.

This is the modern approach, because managers can increasingly delegate mundane tasks to the computer, and add more value to their own jobs by concentrating on the design and interpretation of Marketing Research studies.
Paired Sample 't' test

In some cases, we may not have independent samples; the same sample could be used in a research study involving two measurements. For instance, we may measure respondents' attitude towards a brand before it is advertised, and after it is advertised, to try and find out if their attitude has changed due to the ad campaign. In such cases, a paired sample 't' test is the appropriate statistical test.

We will illustrate using the example mentioned above. Assume that we used a sample of 18 respondents whom we asked to rate on a 10 point
interval scale, their attitude towards say, Tamarind brand of garments, before and after an ad campaign was released for this brand. A rating of 1
represents "Brand is Highly Disliked" and a rating of 10 represents "Brand is Highly Liked", with other ratings having appropriate meanings.

The assumed data are in Table 3. The first column contains ratings given by respondents Before they saw the ad campaign, and the second
column represents their ratings After they saw the ad campaign.
Table 4 contains the resultant computer output for a paired sample 't' test. Assume that we had set the significance level at .05, and that the null hypothesis is that "there is no difference in the ratings given by respondents before and after they saw the ad campaign".

The output table shows that the 2-tailed significance of the test is .000, from the last column. This is the 'p' value,
and it is less than the level of .05 we had set. Therefore, as per our decision rule specified in the earlier example, we
have to reject the null hypothesis at a significance level of .05, and conclude that there is a significant difference in
the ratings given by respondents Before and After their exposure to the ad campaign. The mean rating after the ad
campaign is 5.7778 and before the campaign, it is 3.2778, and the difference of 2.5 is statistically significant.
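A paired test of the same kind can be sketched in Python with scipy (the Before/After ratings below are hypothetical stand-ins for the chapter's Table 3 of 18 respondents, so the numbers will not match the 3.2778/5.7778 means reported above):

```python
from scipy import stats

# Hypothetical Before/After ratings for 10 respondents on a 10-point scale
before = [3, 4, 2, 3, 5, 3, 4, 2, 3, 4]
after  = [6, 6, 5, 5, 7, 6, 6, 5, 6, 7]

# Paired sample 't' test: each respondent contributes one Before/After pair
result = stats.ttest_rel(after, before)

# Mean improvement in rating after the ad campaign
mean_shift = sum(after) / len(after) - sum(before) / len(before)

reject = result.pvalue < 0.05
```

Because every respondent's rating rises, the p-value falls below .05 and the null hypothesis of no Before/After difference would be rejected, just as in the chapter's example.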
Large Sample Sizes

If we have a sample size larger than 30 for the independent sample 't' test, we can use the 'Z' test instead of the 't' test. The statement of the null hypothesis, etc. will remain the same in the case of a Z test also.

Proportions

Even though we have tested for differences in mean values of variables in this section, we could also test in the same way for differences in proportions. The procedure is the same, and a Z test or a 't' test is used, depending on whether the sample size is more than or less than 30.
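As a hedged illustration of the Z test for a difference in proportions (the counts below are hypothetical; this is a hand-rolled sketch of the standard two-sample formula, not any package's routine):

```python
import math

def z_test_proportions(x1, n1, x2, n2):
    """Two-sample Z test for a difference in proportions (two-tailed).
    x1, x2 are counts of 'successes'; n1, n2 are the sample sizes."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled proportion under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical: 60 of 100 Mumbai vs 40 of 100 Delhi customers prefer the brand
z, p = z_test_proportions(60, 100, 40, 100)
reject = p < 0.05
```

The null hypothesis (no difference in the two proportions) and the decision rule are exactly as in the mean-comparison tests above.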
THANK YOU
