
CHAPTER 8

VARIABLES AND DATA CODING


Variables and levels of measurement
The constructs studied in quantitative research are referred to as variables. Numbers or values are
assigned to variables. A variable can be defined as "a property that takes on different values" (F.
Kerlinger, Foundations of Behavioral Research, 1973, p. 29). In other words, it "varies." For example, a
demographic variable such as "sex" has two values: male and female. Each respondent to a survey
would have a value of male or female for this variable. For data analysis purposes, one would assign a
numeric code to each value (e.g., male = 1; female = 2). Another demographic variable, "age", could take
on many possible values. The numeric code assigned to a respondent would simply be his or her
reported age. [Alternatively, a survey question on age could have a response set made up of "groupings";
the respondent would check the grouping into which his/her age falls (e.g., 18 - 24 years). In this case,
numeric codes would be assigned to each age grouping.]
In the above example, "sex" is referred to as a dichotomous variable, since it has two values. Age
groupings would be an example of a polytomous variable, since the variable takes a limited number of
values greater than two. Both "sex" and "age groups" are examples of categorical (or discrete)
variables. When "age" is not collapsed into groupings, it is considered a continuous variable (it can take
any value within a specified range such as 18, 27, 42, etc.). Why are these distinctions between types of
variables important?
Data analysis is predicated on applying the appropriate statistical procedure to variables. An important
consideration is a variable's level of measurement. Certain statistical procedures are limited to higher
levels of measurement. For certain variables, the numeric value assigned does not imply order or
importance; it is simply used to classify respondents into groups. In other instances, the numbers
assigned have more meaning. A higher or lower number implies order.
There are four levels of quantification or measurement:
Nominal Level
Categorical variables belong to the nominal level of measurement. Thus, the nominal level of
measurement can be thought of as a category system in which numbers are assigned to categories that
have qualitative rather than quantitative differences. It is the lowest level of measurement. Consider the
demographic variable "sex." It is simple and obvious to note the categories for this variable: females and
males. All female respondents are considered the same and assigned the same value ("2"). All male
respondents are considered the same and are assigned the same value ("1"). There is no implication of
"rank order" -- a value of "2" is not considered "higher" or "twice as much" as a value of "1." The numbers
are merely labels for various categories. In other words, the numbers assigned to the values are not
treated as true numbers; they cannot be ordered or added. Further, the categories of a nominal variable
should be mutually exclusive (no individual would fit into more than one category) and exhaustive (all
reasonable categories are present). Many statistical procedures, as you will learn in statistics courses, are
not appropriate for nominal level variables. Other examples of nominal variables include political party,
marital status, and religion.
Ordinal Level
A variable for which values can only be rank-ordered is at the ordinal level of measurement. [Note: A
higher level of measurement possesses the characteristics of lower levels of measurement. Thus, an
ordinal level variable also possesses the qualities of a nominal variable.] This level goes beyond labeling
categories; the categories have an intrinsic order. That is, the numeric codes imply an order. The numbers
do not represent absolute qualities (e.g., saying you rank "X" above "Y" does not reveal the strength of
your liking of either "X" or "Y"), and the intervals between the numbers are not presumed to be equal.
Returning to the example of age groupings, here is a typical set of values for such a variable (since many
surveys query individuals 18 years and older, the first age grouping has a lower bound of 18):

Age Grouping     Numeric Code
18 - 24          1
25 - 34          2
35 - 49          3
50 - 64          4
65 and higher    5

A value of "4" is higher ("older") than a value of "2" but it does not equal twice the value of "2".
Obviously, questions which ask respondents to rank order a set of responses are ordinal measures (e.g.,
being asked to rank order the importance attached to various sources of political information). Other
examples of ordinal level variables include: social class (upper to lower), political philosophy (very liberal
to very conservative), and year in school (freshman to senior).
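The age-grouping scheme above can be sketched in a few lines of Python. This is only an illustration: the names AGE_GROUPS and code_age_group are hypothetical, and returning None for ages under 18 is an assumption, since the scheme itself starts at 18.

```python
# Age groupings and numeric codes taken from the table above.
# Each entry is (lower bound, upper bound, numeric code).
AGE_GROUPS = [
    (18, 24, 1),
    (25, 34, 2),
    (35, 49, 3),
    (50, 64, 4),
    (65, None, 5),  # "65 and higher" has no upper bound
]


def code_age_group(age):
    """Return the ordinal code for a reported age in whole years."""
    for low, high, code in AGE_GROUPS:
        if age >= low and (high is None or age <= high):
            return code
    return None  # assumption: ages under 18 fall outside the scheme


codes = [code_age_group(a) for a in (19, 30, 52, 70)]
print(codes)  # [1, 2, 4, 5]
```

The resulting codes can be compared (4 is "older" than 2), but as the text notes, the code 4 does not mean "twice" the code 2.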
Only non-parametric statistics can be used to analyze nominal and ordinal data. Parametric statistics can
be used to analyze data at the two highest levels of measurement: Interval and Ratio.
Interval Level
Interval level variables possess the qualities of nominal and ordinal level variables. In addition, the
distances between values on an interval level scale are equal. Interval level variables lack a true (absolute) zero point.
Intelligence quotient is an oft-cited example of an interval scale. No one (we hope!) has no (zero)
intelligence. The distance between an IQ of 110 and an IQ of 120 is equal to the distance between IQs of
120 and 130. Another example is the Fahrenheit temperature scale. It has a zero point, but this does not
represent total absence of heat, and below-zero temperatures are possible (just ask a Minnesotan in
winter). You will encounter research in which attitude scales (e.g., the well-known and often-used Likert
(pronounced "Lick-ert") scale: Strongly Agree, Agree, Unsure, Disagree, Strongly Disagree) are treated as
interval-level variables. A bi-polar adjective scale is another example of an interval scale:
Sad   1 - 2 - 3 - 4 - 5 - 6 - 7   Happy

In the Sad-Happy scale, the value "4" is two units above "2", but "4" is not twice as happy as "2". You can
perform addition/subtraction, but not multiplication/division on values in an interval scale.
Ratio Level
The highest level of measurement, ratio, possesses the qualities of the three previously described levels.
As Anderson (Communication Research: Issues and Methods, 1987) notes: ratio scales make use of all
the properties of numbers - classification, order, distance and quantity. In addition, they contain an
absolute or natural zero. A value of zero on a ratio scale means complete absence of a particular property
or quality. Because of the presence of a zero on a ratio scale, multiplication and division of values are
possible. For example, age is a ratio scale (when it is a continuous variable). Someone who is 48 years
old is twice the age of someone who is 24 years old. Ratio level variables are more typical in the physical
sciences.
Anderson, in defining a ratio scale, touches upon many of the qualities of levels of measurement: (a ratio
scale is) "an exhaustive and mutually exclusive set of categories within some domain of interest arrayed
in an order of equal intervals along a dimension which has a point of origin (zero point)" (p. 137).
Other examples of ratio-level variables include years of education, income, years of teaching experience,
and number of hours one listens to radio in a week.
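As a rough illustration of why the level of measurement matters, the sketch below picks a summary statistic by level. It follows a common rule of thumb (mode for nominal, median for ordinal, mean for interval and ratio) that is not spelled out in this chapter, and the data and the function name summarize are hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical coded responses from five respondents.
sex = [1, 2, 2, 1, 2]        # nominal: 1 = male, 2 = female
age_group = [1, 3, 2, 5, 3]  # ordinal: codes 1 - 5 from the age groupings
age = [22, 41, 30, 67, 48]   # ratio: reported age in years


def summarize(values, level):
    """Return a summary statistic suited to the level of measurement."""
    if level == "nominal":
        return mode(values)    # codes are labels, so only counting makes sense
    if level == "ordinal":
        return median(values)  # order is meaningful, distances are not
    if level in ("interval", "ratio"):
        return mean(values)    # equal intervals allow addition
    raise ValueError("unknown level of measurement: " + level)


print(summarize(sex, "nominal"))        # 2
print(summarize(age_group, "ordinal"))  # 3
print(summarize(age, "ratio"))          # 41.6
```

Only for the ratio variable does a statement such as "48 is twice 24" carry meaning; the same arithmetic on the sex or age-group codes would be meaningless.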
Coding the questionnaire
Many of the examples in this manual derive from survey research. The heart of a survey is the
questionnaire. Mail, telephone and face-to-face surveys all rely on a questionnaire made up of carefully
worded and ordered questions. Fixed-response or closed-end questions are questions that give the
respondent a limited set of possible responses. The data and the examples used in this manual all begin
with fixed-response questions. The first step in the data analysis process is to turn the respondent's
answers into numeric data. This process is referred to as data coding or "coding" a questionnaire.
One point to keep in mind is that a question on a survey is not always one variable. For example,
a question on a survey for a statewide physicians' association asked "Which of the following procedures
do you routinely perform?" The question was followed by a list of 40 different procedures. Thus, this one
question is in fact 40 variables.
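A minimal sketch of how one "check all that apply" question becomes many variables is shown below. The procedure names are placeholders (the actual survey listed 40), and the codes 1 = checked and 2 = not checked are an assumption that mirrors the Yes (1) / No (2) coding used later in this chapter.

```python
# Placeholder procedure names; the real question listed 40 procedures.
PROCEDURES = ["procedure_a", "procedure_b", "procedure_c"]

CHECKED, NOT_CHECKED = 1, 2  # assumed codes, mirroring Yes (1) / No (2)


def code_checklist(checked_items):
    """Turn one multi-response question into one coded variable per item."""
    return {
        name: (CHECKED if name in checked_items else NOT_CHECKED)
        for name in PROCEDURES
    }


# A respondent who checked only the second procedure.
print(code_checklist({"procedure_b"}))
# {'procedure_a': 2, 'procedure_b': 1, 'procedure_c': 2}
```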
The process of data coding begins by identifying each variable in the questionnaire (this discussion is
limited to fixed-response questions). Each variable can be named at this point. The most basic (and
fastest) method of naming variables is to name the first variable on the questionnaire "Q1", the second
"Q2" and so on [see discussion of variable naming rules below]. After a variable is identified and named,
the numeric codes to be assigned to each value of the variable are determined.
The following questions are from a local media survey. Can you identify the variables? How would you
code each one (what numeric codes would you assign to the answers or set of responses)?
1. On an average day, how much time do you spend watching/listening to the radio and TV? Check your best estimate of
   the time you spend with each medium:

   Radio        __None   __Less than 1 hour   __1 - 2 hrs.   __2 - 3 hrs.   __4 + hrs.
   Television   __None   __Less than 1 hour   __1 - 2 hrs.   __2 - 3 hrs.   __4 + hrs.

2. Do you subscribe to cable TV?   __Yes   __No

3. Do you watch local early evening news (5:00, 5:30 or 6:00 p.m.)?   __Yes   __No

   (If yes) Which local early evening newscast do you watch most often?
   __WCTV-TV (Channel 6, Comcast Cable channel 9)
   __WTWC-TV (Channel 40, Comcast Cable channel 12)
   __WTXL-TV (Channel 27, Comcast Cable channel 7)

4. How often do you read the following local publications? (Check the ONE best answer for each publication.)

                      Every Issue   Once in While   About Once a Year   Never
   Break
   FSView
   Osceola
   Tallahassee Bull
   Tall. Democrat
   Tall. Magazine

Here is how the questions might be coded (variable names appear in the left column; values for each
possible response to a variable are also shown):
      1. On an average day, how much time do you spend watching/listening to the radio and TV? Check your best
         estimate of the time you spend with each medium:

                   (1)      (2)                  (3)            (4)            (5)
V1    Radio        __None   __Less than 1 hour   __1 - 2 hrs.   __2 - 3 hrs.   __4 + hrs.
V2    Television   __None   __Less than 1 hour   __1 - 2 hrs.   __2 - 3 hrs.   __4 + hrs.

V3    2. Do you subscribe to cable TV?   __Yes (1)   __No (2)

V4    3. Do you watch local early evening news (5:00, 5:30 or 6:00 p.m.)?   __Yes (1)   __No (2)

V5    (If yes) Which local early evening newscast do you watch most often?
      __WCTV-TV (Channel 6, Comcast Cable channel 9)     (1)
      __WTWC-TV (Channel 40, Comcast Cable channel 12)   (2)
      __WTXL-TV (Channel 27, Comcast Cable channel 7)    (3)

      4. How often do you read the following local publications? (Check the ONE best answer for each publication.)

                         Every Issue (1)   Once in While (2)   About Once a Year (3)   Never (4)
V6    Break
V7    FSView
V8    Osceola
V9    Tallahassee Bull
V10   Tall. Democrat
V11   Tall. Magazine

In the above example, a coding scheme has been established. The next step would be to code each
questionnaire according to this scheme. That is, one would write the value for each variable in the left
margin of the questionnaire. If, for instance, on the first questionnaire "1 - 2 hours" (V1 or time spent
listening to the radio) was checked, then a value of "3" would be written to the left of the question in the
margin. This process would be repeated for every variable in the questionnaire.
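The coding scheme for the media survey can be written down as a lookup table, sketched below. This is an illustration only; the names CODEBOOK and code_questionnaire are hypothetical, and the response labels are copied from the questionnaire.

```python
# Response sets and their numeric codes, as established in the scheme above.
TIME_CODES = {"None": 1, "Less than 1 hour": 2, "1 - 2 hrs.": 3,
              "2 - 3 hrs.": 4, "4 + hrs.": 5}
YES_NO = {"Yes": 1, "No": 2}
STATION = {"WCTV-TV": 1, "WTWC-TV": 2, "WTXL-TV": 3}
FREQUENCY = {"Every Issue": 1, "Once in While": 2,
             "About Once a Year": 3, "Never": 4}

# One entry per variable: V1 - V2 time spent, V3 - V4 yes/no,
# V5 newscast, V6 - V11 publications.
CODEBOOK = {"V1": TIME_CODES, "V2": TIME_CODES,
            "V3": YES_NO, "V4": YES_NO, "V5": STATION}
for var in ("V6", "V7", "V8", "V9", "V10", "V11"):
    CODEBOOK[var] = FREQUENCY


def code_questionnaire(answers):
    """Map each checked response to its numeric code."""
    return {var: CODEBOOK[var][answer] for var, answer in answers.items()}


print(code_questionnaire({"V1": "1 - 2 hrs.", "V3": "Yes", "V11": "Never"}))
# {'V1': 3, 'V3': 1, 'V11': 4}
```

The first entry reproduces the example in the text: "1 - 2 hours" checked for radio (V1) is coded as 3.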
The above coding scheme does not take into consideration some undesirable possibilities:
1. What if the respondent leaves a variable blank (does not give an answer)? This is referred to as
item nonresponse.
What to do: Assign a "missing value." For instance, a value of "9" could be written in the left
margin for a variable left blank by the respondent. It is very important that the value assigned to
"missing" values is not being used for any legitimate value. If a question asked "How many hours
a week do you spend watching television?", a value of "9" could not be used for "missing"
because it is very possible that some respondents watch nine hours of TV in a week. In this case,
a value of "99" could be assigned to missing. Why "9" or "99" and not "8" and "88"? There is no
"correct" number to use for missing; it simply has to be a value that would not be encountered
among the responses to a variable. In fixed response items, this will be very easy to determine
(e.g., in variable 11 above (Tallahassee Magazine) it is clear that only values 1 - 4 are used, so
any number above four could serve as the missing value). SPSS manuals have for years used "9",
"99", "999", etc. This manual follows the SPSS example.
Why do you assign a value for missing? Later, you will learn that, by having missing values coded
with a specific value, you can exclude them from analysis.

2. What if a respondent checks (or circles) more than one answer to a variable?
What to do: If a respondent gives multiple answers to a variable that is intended to have only one
answer, it may be because the directions are missing or unclear, the question is poorly worded,
the response set is not mutually exclusive, or the respondent cannot (or will not) follow directions.
Regardless of the reason, it will happen from time to time. You cannot "know" what the correct
response should be; you cannot know what the respondent meant to answer. A variable with
multiple responses where only one is requested should be coded as "missing."

3. What if a respondent writes in his/her own answer to a fixed response question?
What to do: If a respondent writes in an answer to a fixed-response item, it may be because the
question is poorly worded, the response set is not exhaustive, or the respondent is not following
directions. If the author of a questionnaire left out one or more plausible answers, the respondent
is forced to (a) choose an available answer that is not really the best answer, (b) leave it blank (no
response), or (c) write in an answer that is most appropriate. The questionnaire writer can avoid this
problem by using a hybrid fixed-response format wherein a set of possible responses is
provided plus an "other" choice. Often, the "other" choice is followed by a blank line in
which the respondent can write in his/her answer.
When an answer is written in (and there is not an "other" choice provided), the coder should
treat it as missing and assign the missing value code. Alternatively, if numerous respondents write
in an answer, it is better to assign new values to these answers and code them accordingly.
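The three rules above (blank, multiple answers, write-in) can be sketched as one small coding function, together with a frequency count that excludes the missing value. This is a sketch under assumptions: the function names are hypothetical, a respondent's marks are represented as a list, and a write-in is represented as a text string.

```python
from collections import Counter

MISSING = 9                 # must not collide with any legitimate code
VALID_CODES = {1, 2, 3, 4}  # e.g., V11 (Tallahassee Magazine) uses 1 - 4


def code_item(marked):
    """Code one fixed-response item.

    `marked` holds whatever the respondent marked: numeric codes for
    checked boxes, or a string for a written-in answer.
    """
    if len(marked) != 1:
        return MISSING  # blank (item nonresponse) or multiple answers
    answer = marked[0]
    if answer not in VALID_CODES:
        return MISSING  # write-in with no "other" choice provided
    return answer


def frequencies(codes):
    """Count each legitimate code, leaving missing values out."""
    return dict(Counter(c for c in codes if c != MISSING))


coded = [code_item(m) for m in ([1], [], [4], [2, 3], [4], ["weekly"])]
print(coded)               # [1, 9, 4, 9, 4, 9]
print(frequencies(coded))  # {1: 1, 4: 2}
```

Because every missing answer carries the same distinct code, excluding it from the analysis is a single comparison.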
There are additional coding problems one might encounter that are not relevant to the media
questionnaire example above. A few of these problems:
4. What if a respondent writes in a range (e.g., 5 - 10) when one number is requested?
What to do: Code as missing.

5. What if a question asks for a whole number and the respondent writes in a decimal or fraction?
What to do: Round the answer up to the next whole number. Be consistent in your rounding.

6. What if a question asks for a whole number and the respondent's answer is illegible?
What to do: Code as missing. Don't guess!
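Rules 4 through 6 can be sketched as follows for a question that asks for a whole number. The function name code_whole_number is hypothetical, the missing value of 99 assumes that legitimate answers stay below 99, and anything the coder cannot read as a single number (a range, an illegible scrawl, a blank) is treated as missing.

```python
import math
import re

MISSING = 99  # assumption: legitimate answers are always below 99


def code_whole_number(raw):
    """Code a written-in answer to a question asking for a whole number."""
    text = raw.strip()
    if re.fullmatch(r"\d+", text):
        return int(text)                # already a whole number
    if re.fullmatch(r"\d+\.\d+", text):
        return math.ceil(float(text))   # decimal: round up
    fraction = re.fullmatch(r"(\d+)\s*/\s*(\d+)", text)
    if fraction and int(fraction.group(2)) != 0:
        numerator, denominator = (int(g) for g in fraction.groups())
        return math.ceil(numerator / denominator)  # fraction: round up
    return MISSING  # range, illegible, or blank: do not guess


for answer in ("12", "2.5", "1/2", "5 - 10", "???"):
    print(answer, "->", code_whole_number(answer))
# 12 -> 12
# 2.5 -> 3
# 1/2 -> 1
# 5 - 10 -> 99
# ??? -> 99
```

Always rounding up, rather than deciding case by case, is one way to satisfy the requirement to be consistent.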

The question below asks the respondent to rank his/her three favorite department stores.
From the following list, please rank your three favorite department stores. Please rank your most favorite #1, your next-most
favorite #2, and your third choice store #3.

__ BURDINES
__ DILLARD'S
__ J.C. PENNEY'S
__ SEARS
__ STEINMART
__ PARISIAN

7. What if the respondent ranks two department stores #1, three are given a rank of #2, and one
receives a rank of #3? Or, as will occasionally happen, a respondent simply checks three stores,
but does not rank them using the numbers 1 - 3?
What to do: Code as missing.

8. What if the respondent ranks all six stores using ranks #1 through #6?
What to do: Code the stores ranked #1, #2 and #3 and ignore ranks #4 through #6.

How is a variable like the department store ranking item coded? Below is a copy of the question
with a hypothetical respondent's ranks typed in. In the margin are the values. NOTE: Each
department store is one variable. Thus, this question contains six variables. Each one receives
a code of 1, 2, 3 or 9 for a rank of 1, rank of 2, rank of 3 or blank, respectively.

From the following list, please rank your three favorite department stores. Please rank your most favorite #1, your next-most
favorite #2, and your third choice store #3.

                        CODE
1    BURDINES           1
__   DILLARD'S          9
3    J.C. PENNEY'S      3
2    SEARS              2
__   STEINMART          9
__   PARISIAN           9
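The ranking rules can be sketched as below, with one variable per listed store. The function name code_ranking is hypothetical, a check mark is represented by the string "X", and two choices are assumptions not stated in the text: ignored ranks (#4 and beyond) are coded like blanks, and a respondent who ranks fewer than three stores is coded as given.

```python
STORES = ["BURDINES", "DILLARD'S", "J.C. PENNEY'S",
          "SEARS", "STEINMART", "PARISIAN"]
MISSING = 9
TOP_RANKS = (1, 2, 3)


def code_ranking(written):
    """Code one respondent's store ranking, one variable per store.

    `written` maps a store name to what the respondent wrote next to it:
    an integer rank, or "X" for a check mark. Blank stores are left out.
    """
    marks = [written.get(store) for store in STORES]
    has_check_marks = any(
        m is not None and not isinstance(m, int) for m in marks
    )
    top = [m for m in marks if isinstance(m, int) and m in TOP_RANKS]
    has_ties = len(top) != len(set(top))
    if has_check_marks or has_ties:
        # Rule 7: tied ranks or unranked check marks are coded as missing.
        return {store: MISSING for store in STORES}
    # Rule 8: keep ranks 1 - 3; ranks beyond 3 and blanks receive 9.
    return {
        store: (written[store] if written.get(store) in TOP_RANKS else MISSING)
        for store in STORES
    }


example = {"BURDINES": 1, "J.C. PENNEY'S": 3, "SEARS": 2}
print(list(code_ranking(example).values()))  # [1, 9, 3, 2, 9, 9]
```

The printed codes match the CODE column of the worked example above.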

The process of data coding is time-consuming and repetitive. It demands accuracy and precision.
Following the established coding scheme, every questionnaire is carefully coded. Once all questionnaires
are coded, it is time to begin data entry using the SPSSWIN Data Editor.
Special issues
1. How much of a questionnaire can be blank (not filled in by the respondent) and still be considered
usable?
Sometimes, respondents tire of completing a mail questionnaire after one or two pages. They
may send back the partially-completed questionnaire (most people who do not wish to complete
the questionnaire simply toss it into the wastebasket). In a telephone survey, some respondents
may break off the interview for a variety of reasons, leaving the questionnaire incomplete from the
point of the breakoff. If the respondent does not answer questions that are central to the purpose
of the survey, the questionnaire is of little value. If one or more pages are left completely blank, it
is best to consider that questionnaire unusable. In some survey situations, attempts will be made
to reach respondents in an effort to complete the questionnaire. Do not treat a questionnaire as
unusable if there is only occasional item nonresponse. Respondents may refuse to answer
certain questions (e.g., personal questions such as income) or may judge other questions to be
inapplicable.

2. Adding an identification number for each questionnaire
As part of the coding process, it is advisable to assign each questionnaire a unique identification
number (ID). Write the ID number (beginning with "1" on the first questionnaire) on the top of the
first page of each questionnaire. The ID number provides a critical link between the
questionnaires and the data file containing the numeric codes for all the variables for each
questionnaire. Often, there are errors in either coding or entering the data. Verifying the correct
response requires an examination of the original questionnaire. The ID number facilitates this
process.
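A minimal sketch of how the ID number links the data file back to the paper questionnaires is shown below. The function names and the sample data are hypothetical; the check simply lists the IDs of questionnaires whose code for a variable is not a legitimate value, so that the originals can be pulled and re-examined.

```python
def assign_ids(coded_questionnaires):
    """Attach sequential ID numbers, beginning with 1, to coded questionnaires."""
    return {qid: codes for qid, codes in enumerate(coded_questionnaires, start=1)}


def find_out_of_range(data, variable, valid_codes):
    """Return the IDs of questionnaires whose code for `variable` is not valid."""
    return [qid for qid, codes in data.items()
            if codes.get(variable) not in valid_codes]


# Hypothetical data file: V3 (cable TV) allows 1 = Yes, 2 = No, 9 = missing.
data = assign_ids([{"V3": 1}, {"V3": 7}, {"V3": 2}])
print(find_out_of_range(data, "V3", {1, 2, 9}))  # [2]
```

Here the code 7 on questionnaire 2 cannot be correct, so the coder would go back to the questionnaire marked with ID 2 to verify the response.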

Student Survey: You will be asked to conduct an intercept survey. The "Student Survey" questionnaires are to be given
to students at various locations around campus. The collected questionnaires are to be coded in
preparation for data entry. Review this year's Student Survey and familiarize yourself with the variables
and the values to be assigned. The coding process will first be discussed in class. Then, after you are
comfortable with the coding scheme, code your set of questionnaires.
Exercise 1
A set of questionnaires (from various surveys) that were completed by respondents has been scanned
and placed on the course website. You are to print out the questionnaires and code them.
