
Analysis of categorical response data

Topics covered in lecture 1:
- What is categorical data
- Response and explanatory variables
- Measurement scales for categorical data
- Course coverage
- Tabulated count data and related questions
- Non-tabulated categorical data
- Sampling design for tables
- Links with other methods

1.1

What is categorical data?
The measurement scale for the response consists of a number of categories.

Variable        Measurement scale
Farm system     Dairy, Beef, Tillage etc.
Mortality       Dead, Alive
Food texture    Very soft, Soft, Hard, Very hard
Litter size     0, 1, 2, 3 and >3

Types of data discussed in this course
- Response variable(s) is categorical
- Explanatory variable(s) may be categorical or continuous

Example 1: Does post-operative survival (categorical response) depend on the explanatory variables?
  Sex (categorical)
  Age (continuous)

Example 2: In a random sample of Irish farmers, is there a relationship between attitude to the EU and farm system?
  Farm system (categorical)
  Attitude to EU (categorical/ordinal?)
  (Two response variables, no explanatory variables.) Could one of these be regarded as explanatory?

1.2

Measurement scales for categorical data


Nominal - no underlying order

Variable        Measurement scale
Farm system     Dairy, Beef, Tillage etc.
Weed species    Stellaria media, Poa annua, etc.

Ordinal - underlying order in the scale

Variable            Measurement scale
Food texture        Very soft, Soft, Hard, Very hard
Disease diagnosis   Very likely, Likely, Unlikely
Education           Primary, Secondary, Tertiary

Interval - underlying numerical distance between scale points

Variable        Measurement scale
Litter size     0, 1, 2, 3 and >3
Age class       <1, 1-2, 2-3.5, 3.5-5, >5
Education       Years in education

1.3

Tabulated count data and questions


Single-level table
Example 1: A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.

Wild type   Mutant   Total
80          10       90

Is there evidence that the wild type is dominant, giving on average a 3:1 ratio of offspring phenotypes in its favour?

Two-way table
Example 1: A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

               Outcome
               Dead   Alive   Total   % dead
+ antiserum    19     65      84      23
- antiserum    18     22      40      45
Total          37     87      124

Is there an association between mortality and treatment? Is the mortality rate the same for both treatments?
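For the single-level crossing table (80 wild type, 10 mutant), one way to put the 3:1 question into numbers is a chi-squared goodness-of-fit calculation. The sketch below is in Python rather than SAS and uses only the standard library; the function names are illustrative and not part of the course material. It assumes the usual chi-squared reference distribution with 1 degree of freedom.

```python
import math

def chi_square_gof(observed, expected_ratio):
    """Expected counts under a hypothesised ratio, and the Pearson statistic."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, statistic

def p_value_df1(statistic):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Observed F2 progeny: 80 wild type, 10 mutant; hypothesised ratio 3:1.
expected, x2 = chi_square_gof([80, 10], [3, 1])
p = p_value_df1(x2)
print(expected, round(x2, 3), round(p, 4))
```

Under 3:1 the expected counts are 67.5 and 22.5, and the statistic comes out at about 9.26 on 1 degree of freedom, so the observed split of 80:10 sits some way from the hypothesised ratio.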

1.4

Example 2 - Categorical response and categorical explanatory variable: The opinion poll after the Good Friday Agreement with respondents classified by religion (R - Catholic or Protestant)

            Catholic   Protestant   Total   % Cath
Favour      258        149          407     63
Oppose      32         91           123     26
Undec.      62         208          270     23
Total       352        448          800
% Favour    73         33           51

1. Evidence that a majority of decided voters (all voters) support the agreement? 2. Support pattern the same for Protestants and Catholics?
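The two questions turn on which denominator is used: all respondents, or only those who have decided. A small Python sketch (standard library only; the helper name `percent_favour` is made up for illustration) computes the percentage in favour from the poll counts both ways, overall and by religion. It gives descriptive percentages only and is not a formal test.

```python
# Poll counts from the table above: response -> religion -> count.
counts = {
    "Favour": {"Catholic": 258, "Protestant": 149},
    "Oppose": {"Catholic": 32, "Protestant": 91},
    "Undecided": {"Catholic": 62, "Protestant": 208},
}

def percent_favour(counts, groups, decided_only):
    """Percentage in favour within the given religion groups."""
    favour = sum(counts["Favour"][g] for g in groups)
    oppose = sum(counts["Oppose"][g] for g in groups)
    undecided = sum(counts["Undecided"][g] for g in groups)
    base = favour + oppose if decided_only else favour + oppose + undecided
    return 100.0 * favour / base

both = ["Catholic", "Protestant"]
all_voters = percent_favour(counts, both, decided_only=False)
decided_voters = percent_favour(counts, both, decided_only=True)
catholic_all = percent_favour(counts, ["Catholic"], decided_only=False)
protestant_all = percent_favour(counts, ["Protestant"], decided_only=False)

print(round(all_voters, 1), round(decided_voters, 1))
print(round(catholic_all, 1), round(protestant_all, 1))
```

Among all voters the share in favour is just under 51%, while among decided voters it is close to 77%, which is why the slide distinguishes the two bases.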

1.5

Example 3 (Snedecor and Cochran): Categorical response and interval categorical explanatory variable. The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate. Has the higher concentration given a significantly different percentage kill? Is there a relationship between concentration and mortality?

Concentration of
sodium oleate (%)   Dead   Alive   Total   % Dead
0.65                55     22      77      71.4
1.10                62     13      75      82.7
1.6                 100    12      112     89.3
2.1                 72     5       77      93.5
Total               289    52      341     84.8

Is mortality related to sodium oleate concentration?

1.6

Example 4 (Cornfield 1962): Categorical response and interval categorical explanatory variable. Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period. The data were classified by BP (interval categorical variable in 8 classes) and CHD (CHD or No CHD).

BP           CHD   No CHD   Total   % CHD
<117         3     153      156     1.9
117 - 126    17    235      252     6.7
127 - 136    12    272      284     4.2
137 - 146    16    255      271     5.9
147 - 156    12    127      139     8.6
157 - 166    8     77       85      9.4
167 - 186    16    83       99      16.2
>186         8     35       43      18.6
Total        92    1237     1329

1. Is the incidence of CHD independent of BP?
2. Simple relationship between the probability of CHD and the level of BP?
1.7

Multiway table - relationship between categorical responses, or between a categorical response and several categorical explanatory variables

Example 1: The NI opinion poll with respondents further classified by where they lived in Northern Ireland (L) (ARL table):
  West - rural and strong nationalist/Catholic
  Belfast - mixed population
  North East - industrial and Unionist/Protestant

Region       Religion     Favour   Oppose   Undecided
West         Catholic     73       20       20
West         Protestant   47       34       69
Belfast      Catholic     90       9        21
Belfast      Protestant   54       23       66
North East   Catholic     95       3        21
North East   Protestant   48       34       73
Total                     407      123      270

1. Evidence that a majority of decided voters (all voters) support the agreement?
2. Difference in support pattern between Protestants and Catholics?
3. Difference in support pattern between Protestants and Catholics consistent over region?
4. Within the Catholic (Protestant) population does the strength of support change with region?
etc.
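The three-way table contains the earlier two-way religion table as a margin: summing over region should give back the Catholic and Protestant counts. The Python sketch below (standard library only; function names are illustrative) collapses the table over region and also computes the percentage in favour by region within one religion, which is the kind of comparison question 4 asks about.

```python
from collections import defaultdict

# (region, religion) -> (favour, oppose, undecided), from the table above.
three_way = {
    ("West", "Catholic"): (73, 20, 20),
    ("West", "Protestant"): (47, 34, 69),
    ("Belfast", "Catholic"): (90, 9, 21),
    ("Belfast", "Protestant"): (54, 23, 66),
    ("North East", "Catholic"): (95, 3, 21),
    ("North East", "Protestant"): (48, 34, 73),
}

def collapse_over_region(table):
    """Sum the counts over region, leaving a religion-by-response table."""
    totals = defaultdict(lambda: [0, 0, 0])
    for (_region, religion), cell in table.items():
        for k, count in enumerate(cell):
            totals[religion][k] += count
    return {religion: tuple(v) for religion, v in totals.items()}

def percent_favour_by_region(table, religion):
    """Percentage in favour (of all respondents) per region for one religion."""
    return {
        region: 100.0 * cell[0] / sum(cell)
        for (region, rel), cell in table.items()
        if rel == religion
    }

two_way = collapse_over_region(three_way)
catholic_by_region = percent_favour_by_region(three_way, "Catholic")
print(two_way)
print({k: round(v, 1) for k, v in catholic_by_region.items()})
```

The collapsed counts match the two-way poll table (258, 32, 62 for Catholics and 149, 91, 208 for Protestants), which is a useful check that the flattened three-way table has been read correctly.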
1.8

Example 2: Grouped binomial data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60).

Sex   Age group   Psych. case   On drugs   Total
M     1           No            9          531
M     2           No            16         500
M     3           No            38         644
M     4           No            26         275
M     5           No            9          90
M     1           Yes           12         171
M     2           Yes           16         125
M     3           Yes           31         121
M     4           Yes           16         56
M     5           Yes           10         26
F     1           No            12         588
F     2           No            42         596
F     3           No            96         765
F     4           No            52         327
F     5           No            30         179
F     1           Yes           33         210
F     2           Yes           47         189
F     3           Yes           71         242
F     4           Yes           45         98
F     5           Yes           21         60

Is psychotropic drug use affected by gender, age or psychological state, and are there interactions among these effects?
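In grouped binomial form each row is a number of "events" (on drugs) out of a number of "trials" (total in the group). As a first descriptive look, the rows can be pooled over age to compare usage rates by sex and psychiatric-case status. This Python sketch uses only the standard library; `pooled` is an illustrative helper, and pooling over age is a simplification that ignores any age effect.

```python
# (sex, age group, psychiatric case, on drugs, total), from the table above.
rows = [
    ("M", 1, "No", 9, 531), ("M", 2, "No", 16, 500), ("M", 3, "No", 38, 644),
    ("M", 4, "No", 26, 275), ("M", 5, "No", 9, 90),
    ("M", 1, "Yes", 12, 171), ("M", 2, "Yes", 16, 125), ("M", 3, "Yes", 31, 121),
    ("M", 4, "Yes", 16, 56), ("M", 5, "Yes", 10, 26),
    ("F", 1, "No", 12, 588), ("F", 2, "No", 42, 596), ("F", 3, "No", 96, 765),
    ("F", 4, "No", 52, 327), ("F", 5, "No", 30, 179),
    ("F", 1, "Yes", 33, 210), ("F", 2, "Yes", 47, 189), ("F", 3, "Yes", 71, 242),
    ("F", 4, "Yes", 45, 98), ("F", 5, "Yes", 21, 60),
]

def pooled(rows, sex=None, case=None):
    """Total (on drugs, group size) over rows matching the given filters."""
    on_drugs = total = 0
    for r_sex, _age, r_case, r_on, r_total in rows:
        if sex is not None and r_sex != sex:
            continue
        if case is not None and r_case != case:
            continue
        on_drugs += r_on
        total += r_total
    return on_drugs, total

rates = {}
for sex in ("M", "F"):
    for case in ("No", "Yes"):
        on_drugs, total = pooled(rows, sex=sex, case=case)
        rates[(sex, case)] = 100.0 * on_drugs / total

for key, rate in rates.items():
    print(key, round(rate, 1))
```

The pooled rates rise from about 5% (male non-cases) to about 27% (female cases), which motivates the slide's question about gender, psychological state and their interaction.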
1.9

Non-tabulated data and questions


Example 1: Individual plants of Legousia were monitored in an experiment to see whether they survived after 3 months. Survived - yes is scored 1 and Survived - no is scored 0. Also recorded were:
- CO2 treatment (2 levels: low and high)
- Density of Legousia
- Density of companion species
- Height of the plant (mm) two weeks after planting
Most individuals will have a unique profile in these four additional variables, so tabulation of the data by them is not feasible. The individual data are presented.

                             Density
Subject   Surv   CO2   Ht    Leg.   Comp
1         0      L     35    20     30
2         1      L     68    22     27
3         1      H     43    16     33
4         0      L     27    4      16

1. Is survival related to the explanatory variables (CO2, height, density-self, density-companions)?
2. Can the probability of survival be predicted from the subject's profile?
1.10

Example 2: A sample of 62 patients who had angioplasty for coronary artery disease were followed to see if they reblocked (restenosed) after 6 months. RS - yes is scored 1 and RS - no is scored 0 (a binary categorical response variable). Also recorded were:
- Age in years - continuous variate
- Blood pressure (BP) - continuous variate
- Sex - nominal categorical (?)
- Cholesterol - continuous
Most individuals will have a unique profile in these four additional variables, so tabulation of the data by them is not feasible. The individual data are presented.

Subject   RS   Age   BP    Sex   Cholest.
1         0    35    117   m     1
2         1    68    154   f     5
3         1    43    123   f     2
4         0    27    110   m     3

1. Is RS related to the explanatory variables (Age, BP, Sex and Cholesterol)?
2. Can the probability of RS be predicted from the subject's profile?

1.11

Sampling designs - two and multiway tables


Single sample (no margin fixed) simultaneously classified by several categorical variables. Used in cross-sectional studies.

Example: A simple random sample of 200 students was classified by gender and attitude to EU integration.

EU integration   Male   Female   Total
Favour           43     61       104
Oppose           53     33       86
Total            96     104

This is a snapshot of opinion at a moment in time, hence cross-sectional.

1.12

One margin fixed: Samples of fixed size are selected for one category and individuals are classified by the other category(s).

Example 1 (Clinical trial - a prospective study): Of 400 HIV-positive pregnant women, 200 are assigned at random to each of breast feeding (BF) or formula feeding (FF). Two years after birth the child's HIV status is determined.

        Child's status
        HIV+   HIV-   Total
BF      62     138    200
FF      45     155    200

Example 2 (Cohort study - a prospective study): 400 HIV-positive pregnant women are asked to select either breast feeding (BF) or formula feeding (FF). Two years after birth the child's HIV status is determined. Here the sample totals are determined by the mothers' choices.

Example 3 (Case-control or retrospective study): A sample of 200 HIV+ and another of 200 HIV- two-year-old children are selected and classified by whether they were BF or FF. Here the HIV outcome numbers are controlled, so the percentage HIV+ cannot be computed for BF and FF.
1.13

[Figure: timeline running from past through present to future, showing where cohort, case-control (cases and controls) and cross-sectional studies sit in time.]

1.14

Notes on sampling designs
In more complex studies more than one margin may be fixed.

Example 1: Any replicated factorial experiment where the response is binary.

Example 2: Physicians' health study. NEJM 1988, 262-264. Four treatments:

Treatment       A     B     C     D
Aspirin         No    Yes   No    Yes
Beta carotene   No    No    Yes   Yes

Example 3: 2x2 table with both margins fixed?

The statistical properties differ considerably between sampling schemes; nevertheless the methods to be discussed below apply, with some modifications, to data collected using any of these sampling schemes.

1.15

Relationships with regression methods. Traditionally categorical data analysis has been viewed as completely distinct from and unconnected with regression and ANOVA methods. We show that there are many strong links and that many concepts transfer naturally between the methods.

1.16

SAS Analysis of example 1


A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

               Outcome
               Dead   Alive   Total   % dead
+ antiserum    19     65      84      23
- antiserum    18     22      40      45
Total          37     87      124

Is there an association between mortality and treatment? Is the mortality rate the same for both treatments?

1.17

SAS program for analysis of example 1 data: PROC FREQ

OPTIONS LINESIZE=72 PAGESIZE=59 NOCENTER;

/* One record per cell of the 2x2 table: treatment, outcome, cell count */
DATA ANTISER;
  INPUT ANTISER $ MORTALI $ COUNT;
  CARDS;
A_plus  Dead  19
A_plus  Alive 65
A_minus Dead  18
A_minus Alive 22
;

/* WEIGHT COUNT tells PROC FREQ that COUNT holds the cell frequencies */
PROC FREQ;
  TABLES ANTISER*MORTALI / CHISQ EXPECTED DEVIATION CELLCHI2
                           NOROW NOCOL NOPERCENT NOCUM;
  WEIGHT COUNT;
RUN;

1.18

Table of ANTISER by MORTALI

ANTISER         MORTALI

Frequency      |
Expected       |
Deviation      |
Cell Chi-Square|  Alive |   Dead |  Total
---------------+--------+--------+
A_minus        |     22 |     18 |     40
               | 28.065 | 11.935 |
               | -6.065 | 6.0645 |
               | 1.3105 | 3.0814 |
---------------+--------+--------+
A_plus         |     65 |     19 |     84
               | 58.935 | 25.065 |
               | 6.0645 | -6.065 |
               |  0.624 | 1.4673 |
---------------+--------+--------+
Total                87       37      124

Statistics for Table of ANTISER by MORTALI

Statistic                      DF    Value     Prob
Chi-Square                      1    6.4833    0.0109
Likelihood Ratio Chi-Square     1    6.2846    0.0122

Sample Size = 124
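The two statistics in the PROC FREQ listing can be reproduced by hand from the four observed counts. The Python sketch below (standard library only, not the SAS implementation; function names are illustrative) computes the Pearson and likelihood-ratio chi-squared statistics and their 1-degree-of-freedom p-values. The p-value formula used here applies to 1 degree of freedom only.

```python
import math

def pearson_and_lr(table):
    """Pearson X^2 and likelihood-ratio G^2 for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    lr = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G^2
                lr += 2.0 * observed * math.log(observed / expected)
    return pearson, lr

def p_value_df1(statistic):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Rows: A_minus, A_plus; columns: Alive, Dead (same order as the listing).
table = [[22, 18], [65, 19]]
pearson, lr = pearson_and_lr(table)
p_pearson = p_value_df1(pearson)
p_lr = p_value_df1(lr)
print(round(pearson, 4), round(p_pearson, 4))
print(round(lr, 4), round(p_lr, 4))
```

The values agree with the listing (6.4833 and 6.2846) to the precision shown, which confirms that the expected counts in the listing come from row total times column total divided by 124.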

1.19

Observed counts

               Outcome
               Dead   Alive   Total   % Dead
+ antiserum    19     65      84      23
- antiserum    18     22      40      45
Total          37     87      124     30

Expected counts if outcome is independent of treatment

               Outcome
               Dead            Alive           Total   % Dead (observed)
+ antiserum    .3*84 = 25.2    .7*84 = 58.8    84      23
- antiserum    .3*40 = 12.0    .7*40 = 28.0    40      45
Total          37              87              124     30

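Is there a discrepancy between observed and expected? The chi-squared statistic sums the discrepancy over all cells of the table:

$$X^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$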
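As a worked instance of the statistic for the antiserum table, the expected counts can be written with row and column totals. The subscript notation ($n_{i+}$ for row totals, $n_{+j}$ for column totals) is an assumption introduced here, not the slides' own. The numbers use the exact expected counts from the SAS listing rather than the rounded 0.3 and 0.7 rates.

```latex
E_{ij} = \frac{n_{i+}\, n_{+j}}{n},
\qquad
X^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

% In a 2x2 table every cell has the same absolute deviation, here 6.0645:
X^2 = 6.0645^2 \left( \frac{1}{25.065} + \frac{1}{58.935}
      + \frac{1}{11.935} + \frac{1}{28.065} \right) \approx 6.48
```

With the rounded expected counts (25.2, 58.8, 12.0, 28.0) the result differs slightly, because 0.3 only approximates 37/124.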

1.20

SAS Analysis of example 3


The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate. Has the higher concentration given a significantly different percentage kill? Is there a relationship between concentration and mortality?

         Concentration of sodium oleate (%)
         0.65   1.10   1.6   2.1   Total
Dead     55     62     100   72    289
Alive    22     13     12    5     52
Total    77     75     112   77    341

Is mortality independent of sodium oleate concentration?

1.21

SAS program for analysis of insecticide data: PROC FREQ

OPTIONS LINESIZE=72 PAGESIZE=59 NOCENTER;

/* SODOL = sodium oleate concentration (%); D_AL: 1 = dead, 2 = alive */
DATA INSECT;
  INPUT SODOL D_AL COUNT;
  CARDS;
0.65 1 55
1.10 1 62
1.6  1 100
2.1  1 72
0.65 2 22
1.10 2 13
1.6  2 12
2.1  2 5
;

/* WEIGHT COUNT tells PROC FREQ that COUNT holds the cell frequencies */
PROC FREQ;
  TABLES D_AL*SODOL / CHISQ EXPECTED DEVIATION CELLCHI2
                      NOROW NOCOL NOPERCENT NOCUM;
  WEIGHT COUNT;
RUN;

1.22

Output from SAS PROC FREQ.


TABLE OF D_AL BY SODOL

D_AL      SODOL

FREQUENCY|
EXPECTED |
DEVIATION|
CELL CHI2|    0.65|     1.1|     1.6|     2.1|  TOTAL
---------+--------+--------+--------+--------+
1        |     55 |     62 |    100 |     72 |    289
         |   65.3 |   63.6 |   94.9 |   65.3 |
         |  -10.3 |   -1.6 |    5.1 |    6.7 |
         |1.61249 |.038436 |.271785 |.696522 |
---------+--------+--------+--------+--------+
2        |     22 |     13 |     12 |      5 |     52
         |   11.7 |   11.4 |   17.1 |   11.7 |
         |   10.3 |    1.6 |   -5.1 |   -6.7 |
         |8.96172 |.213617 | 1.5105 |3.87106 |
---------+--------+--------+--------+--------+
TOTAL          77       75      112       77      341

1.23

STATISTICS FOR TABLE OF D_AL BY SODOL

STATISTIC                      DF    VALUE     PROB
------------------------------------------------------
CHI-SQUARE                      3    17.176    0.001
LIKELIHOOD RATIO CHI-SQUARE     3    16.633    0.001
MANTEL-HAENSZEL CHI-SQUARE      1    16.157    0.000
PHI                                   0.224
CONTINGENCY COEFFICIENT               0.219
CRAMER'S V                            0.224
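Several lines of this listing follow directly from the Pearson statistic. The Python sketch below (standard library only; names are illustrative, and this is a hand check rather than how SAS computes its output) recomputes the chi-squared value for the 2 x 4 aphid table, then derives the contingency coefficient and Cramer's V from it. The p-value uses a closed form that holds for 3 degrees of freedom only.

```python
import math

def pearson_chi_square(table):
    """Pearson X^2 for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    return x2, n

def p_value_df3(x):
    """Upper-tail probability of a chi-squared variable with 3 df."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

# Rows: dead, alive; columns: concentrations 0.65, 1.10, 1.6, 2.1 (%).
aphid_table = [[55, 62, 100, 72], [22, 13, 12, 5]]
x2, n = pearson_chi_square(aphid_table)
p = p_value_df3(x2)

smaller_dim = min(len(aphid_table), len(aphid_table[0])) - 1
cramers_v = math.sqrt(x2 / (n * smaller_dim))
contingency_coefficient = math.sqrt(x2 / (x2 + n))

print(round(x2, 3), round(cramers_v, 3), round(contingency_coefficient, 3))
```

The exact p-value is a little below the 0.001 that the listing prints after rounding to three decimals.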

Conclusion: Insect mortality is not independent of dose. Mortality is not constant as dose changes.

1.24

Sodium oleate (%)   Dead   Alive   Total   % Dead
0.65                55     22      77      71.4
1.10                62     13      75      82.7
1.6                 100    12      112     89.3
2.1                 72     5       77      93.5
Total               289    52      341     84.8

Group two lowest and two highest levels
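One reading of this note is to collapse the four concentrations into a low group (0.65 and 1.10) and a high group (1.6 and 2.1) and compare the two. That reading is an assumption, since the slide gives no further detail. The Python sketch below (standard library only; names are illustrative) does the collapsing and computes the Pearson chi-squared for the resulting 2x2 table.

```python
# (concentration %, dead, alive) from the aphid table.
aphids = [(0.65, 55, 22), (1.10, 62, 13), (1.6, 100, 12), (2.1, 72, 5)]

def collapse(rows, groups):
    """Sum dead and alive counts within each group of row indices."""
    return [
        (sum(rows[i][1] for i in idx), sum(rows[i][2] for i in idx))
        for idx in groups
    ]

def pearson_2x2(a, b, c, d):
    """Pearson X^2 for the 2x2 table [[a, b], [c, d]] (shortcut formula)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Assumed grouping: two lowest concentrations vs two highest.
low, high = collapse(aphids, [(0, 1), (2, 3)])
x2_grouped = pearson_2x2(low[0], low[1], high[0], high[1])
pct_dead = [100.0 * dead / (dead + alive) for dead, alive in (low, high)]

print(low, high)
print([round(p, 1) for p in pct_dead], round(x2_grouped, 3))
```

The grouped table has 1 degree of freedom instead of 3, so the comparison focuses on a single low-versus-high contrast at the cost of discarding the differences within each pair of concentrations.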

1.25

Analysis of CHD data
Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period. The data were classified by BP (interval categorical variable in 8 classes) and CHD (CHD or No CHD).

BP           CHD   No CHD   Total   % CHD
<117         3     153      156     1.9
117 - 126    17    235      252     6.7
127 - 136    12    272      284     4.2
137 - 146    16    255      271     5.9
147 - 156    12    127      139     8.6
157 - 166    8     77       85      9.4
167 - 186    16    83       99      16.2
>186         8     35       43      18.6
Total        92    1237     1329

1. Is the incidence of CHD independent of BP?
2. Simple relationship between the probability of CHD and the level of BP?

1.26
