Mark Tranmer, University of Southampton, UK

D. Holt, Office for National Statistics, London, UK

David G. Steel, School of Mathematics and Applied Statistics, University of Wollongong, NSW 2522, Australia (david_steel@uow.edu.au)

This research was supported by the UK Economic and Social Research Council, grant Ft000 23 6135.

Key words: [...] fallacy, multilevel models, random effects, grouping variables.

1. Introduction

An ecological analysis uses aggregate group level data to estimate individual level relationships. The ecological fallacy arises when ecological analysis provides biased estimates of individual level relationships. The groups involved are often small geographic areas such as census Enumeration Districts (EDs). Suppose that we are interested in investigating the relationship between two variables $Y$ and $X$. The aggregate data available for area $g$ usually consist of the group level means $\bar{Y}_g$ and $\bar{X}_g$.

The ecological fallacy arises because the individuals within the groups are not equivalent to randomly formed groups. Progress in understanding aggregation effects requires allowance for the population structure in the model underpinning the analysis. Steel and Holt (1996) proposed a model for cases where the relationship between the two variables of interest, $Y$ and $X$, is linear. This model incorporated auxiliary variables, $Z$, which explain much of the within area homogeneity of the variables of interest. Random group level coefficients were also included to reflect group level effects additional to those due to the auxiliary variables. Based on this model Steel and Holt (1996) developed a method to adjust group level covariance matrices using limited individual level data available for the auxiliary variables. Steel, Holt and Tranmer (1996) evaluated this approach for estimating correlation coefficients using ED level data from the 1991 UK population census. Using [...] variables as auxiliary variables they were able to reduce the aggregation effects in a set of ecological correlations by up to 70 per cent.

In social research the variables are usually categorical at the individual level and the area level means are the corresponding proportions. Various methods of ecological inference in this situation were evaluated by Cleave, Brown and Payne (1995). These included linear ecological regression as originally suggested by Goodman (1959) and a method based on an Aggregated Compound Multinomial (ACM) model. The ACM model assumed that the frequencies in each group have independent multinomial distributions and used a Dirichlet compounding distribution. Their evaluation favoured the ACM based method but they recognized that it is not easy to implement. Recently King (1997) proposed a new method of ecological analysis for categorical data. This method models the conditional proportions of $Y$ given $X$ as random effects with a joint truncated Normal distribution and exploits the constraints implied by the group means.

In this paper we develop and evaluate some simple adjusted ecological analysis procedures based on the idea of incorporating auxiliary variables and using individual level data available for these variables.

2. Adjusted Ecological Analysis for Dichotomous Variables

Let $Y_i$ and $X_i$ be the values of the variables of interest for the $i$th individual in the population. Suppose there is a sample, $s$, of groups and within the sampled group, $g$, a sample of $n_g$ individuals
is used to calculate sample group means:

$$\bar{Y}_g = \frac{1}{n_g} \sum_{i \in g} Y_i \quad \text{and} \quad \bar{X}_g = \frac{1}{n_g} \sum_{i \in g} X_i .$$

The variables are dichotomous and so these means are proportions. An important special case occurs when the means are available for all areas and each mean is based on all individuals within the relevant area.

For the sample in group $g$ let $n_{abg}$ be the number of individuals for which $Y_i = a$ and $X_i = b$. The corresponding population counts are $N_{abg}$. We use "+" to indicate summation over a subscript. Define $p_{ab|g} = n_{abg}/n_{++g}$ and $P_{ab|g} = N_{abg}/N_{++g}$ as the sample and finite population proportions for group $g$. The conditional proportions are $p_{a|bg} = n_{abg}/n_{+bg}$ and $P_{a|bg} = N_{abg}/N_{+bg}$. Define $p_{ab+} = n_{ab+}/n_{+++}$ and $P_{ab+} = N_{ab+}/N_{+++}$ as the overall sample and finite population proportions. The corresponding conditional proportions are $p_{a|b+} = n_{ab+}/n_{+b+}$ and $P_{a|b+} = N_{ab+}/N_{+b+}$.

The basis of the approach proposed by Steel and Holt (1996) is that, for continuous variables, the conditional probability density function of $Y$ given $X$ can be expressed as

$$f(y \mid x) = \frac{\int f(y \mid x, z)\, f(x \mid z)\, f(z)\, dz}{\int f(x \mid z)\, f(z)\, dz} \qquad (1)$$

When the parameters of $f(y \mid x, z)$, $f(x \mid z)$, and $f(z)$ are distinct, analysis can proceed by using individual level data to estimate the parameters of $f(z)$ and aggregate data to estimate the parameters of $f(y \mid x, z)$ and $f(x \mid z)$.

Assume that there is a single auxiliary variable $Z$ and that $Y$, $X$ and $Z$ are all categorical. The target of inference is the conditional probability distribution of $Y$ given $X$, $P(Y \mid X)$. The approach we develop here is based on

$$P(Y \mid X) = \frac{\sum_{Z} P(Y \mid X, Z)\, P(X \mid Z)\, P(Z)}{\sum_{Z} P(X \mid Z)\, P(Z)} \qquad (2)$$

Estimation of $P(Y \mid X)$ can be attempted by using aggregate data to estimate $P(Y \mid X, Z)$ and $P(X \mid Z)$ and then using the individual level data to estimate $P(Z)$. The analysis using aggregate data will be based on linear or logistic regression.

The potential benefit of using group level information about covariates in ecological analysis is discussed by Cleave, Brown and Payne (1995) and King (1997). In both cases the covariates are used at the area level to explain some of the variation in the random coefficients characterizing the relationship between the variables across the areas. Our approach is motivated by the idea that a large part of the variation in the relationships across areas is due to compositional effects and can be removed by inclusion of the auxiliary variables. If this is the case then handling the remaining variation between areas should be easier. We propose attempting to average over the auxiliary variables to estimate the marginal relationship between $Y$ and $X$. This requires information about the relevant parameters of the individual level distribution of the covariates.

For a single categorical auxiliary variable the information needed to calculate the adjusted marginal probabilities consists only of the proportions in each category. These can be calculated from the weighted group level data or could come from some other source, such as a survey. If two or more categorical auxiliary variables are used then the marginal cross tabulations of these variables are required, unless further assumptions can be made concerning the relationship between the auxiliary variables. No individual level data about the variables of direct interest are used.

3. Empirical Evaluation

Individual and group level data from the 1991 UK population census were used to evaluate several methods. The Small Area Statistics (SAS) database provided data in the form of totals for a range of categorical variables for EDs. This was the source of the group means $\bar{Y}_g$, $\bar{X}_g$ and $\bar{Z}_g$. For the variables analysed in this paper these means are all based on 100 per cent of the census records for the relevant ED. Individual level data are available from a 2 per cent Sample of Anonymized Records (SAR) for Local Authority Districts (LADs). The evaluation used data for the LAD of Manchester, which contained 897 EDs and 7613 individuals in the SAR, of which 5802 were aged 16 or more.

Using these data it is possible to calculate adjusted ecological estimates of the marginal probability distribution $P(Y \mid X)$ based on equation (2) using $\bar{Y}_g$, $\bar{X}_g$ and $\bar{Z}_g$ obtained from the SAS database and using individual level data obtained from the SAR concerning variables chosen as auxiliary variables. In this evaluation we considered the
following estimators of the conditional probability $P(Y = 1 \mid X = b) = \pi_{1|b}$, for $b = 0, 1$.

(a) SAR relative frequencies. The relative frequency obtained from the SAR, $n_{1b+}/n_{+b+}$.

(b) SAS relative frequencies. For pairs of variables for which the SAS contains the relevant cross tabulation we can calculate $N_{1b+}/N_{+b+}$.

(c) Ecological linear regression. This is built around the relationship

$$\bar{Y}_g = P_{1|0g}\,(1 - \bar{X}_g) + P_{1|1g}\,\bar{X}_g \qquad (3)$$

If $P_{1|0g}$ and $P_{1|1g}$ are random variables with $E[P_{1|0g} \mid \bar{X}_g] = \pi_{1|0}$ and $E[P_{1|1g} \mid \bar{X}_g] = \pi_{1|1}$ then a linear regression of $\bar{Y}_g$ on $\bar{X}_g$ gives unbiased estimates of $\pi_{1|0}$ and $\pi_{1|1}$. This is the classical Goodman regression approach (Goodman, 1959). This approach can produce estimates outside [0,1] and simple direct use of this model is not usually recommended.

(d) Ecological logistic regression. A common model used in analysing a dichotomous response variable is logistic regression which assumes $Y_i$ is a Binomial variable based on one trial, $B(1, E[Y_i \mid X_i])$, where

$$\log\left(\frac{E[Y_i \mid X_i]}{1 - E[Y_i \mid X_i]}\right) = \alpha + \beta X_i$$

If groups were completely homogeneous with respect to $X$ then each group total $Y_g$ would be a Binomial variable based on $N_g$ trials with probability such that

$$\log\left(\frac{E[\bar{Y}_g \mid \bar{X}_g]}{1 - E[\bar{Y}_g \mid \bar{X}_g]}\right) = \alpha + \beta \bar{X}_g$$

Groups are not homogeneous with respect to $X$ but this model has the advantage of not giving predicted probabilities outside [0,1].

(e) Adjusted ecological linear regression. Here we use linear regression based on aggregate data to estimate $P(Y \mid X, Z)$ and $P(X \mid Z)$. Estimation of $P(Y \mid X)$ is then based on equation (2) with the individual level data being used to estimate $P(Z)$.

(f) Adjusted ecological logistic regression. Here we use logistic regression based on aggregate data to estimate $P(Y \mid X, Z)$ and $P(X \mid Z)$. Estimation of $P(Y \mid X)$ is then based on equation (2) with the individual level data being used to estimate $P(Z)$.

(g) Adjusted correlation approach. For two dichotomous variables the correlation combined with the marginal totals determines the proportions in the cross classification, i.e.

$$P_{11+} = P_{1++} P_{+1+} + R_{YX} \sqrt{P_{1++}(1 - P_{1++})\, P_{+1+}(1 - P_{+1+})} \qquad (4)$$

where $R_{YX}$ is the correlation coefficient based on the table of proportions $P_{ab+}$. The method proposed by Steel and Holt (1996) is used to produce adjusted estimates of the correlation $R_{YX}(Z)$ which can then be substituted into equation (4). This method enables use of information about several auxiliary variables even if only the two way cross tabulations are available.

(h) King's ecological inference method. This method is also built around equation (3). The group level conditional proportions $P_{1|0g}$ and $P_{1|1g}$ are assumed to have a joint Normal distribution which is truncated so that they each lie in the [0,1] interval. The method incorporates the fact that given $\bar{Y}_g$ and $\bar{X}_g$, equation (3) implies bounds and constraints for $P_{1|1g}$ and $P_{1|0g}$. This approach produces estimates of $P_{1|1g}$ and $P_{1|0g}$ which can then be combined to produce estimates of $P_{1|1}$ and $P_{1|0}$. This method can incorporate group level covariates to help model the variation between groups.

The variables used in the analysis were as follows:

Y: employed, unemployed,
X: marital status,
Z: age 45-59, age 60+, living in owner occupied dwelling, renting from local authority.

These auxiliary variables were chosen because of their success in removing aggregation effects in correlations in the evaluation by Steel, Holt and Tranmer (1996). The analysis was confined to those aged 16 or more.

Table 1 gives the estimated probabilities of being employed ($Y = 1$) given marital status ($X$). The estimates are obtained using the methods listed above. The ecological methods and adjusted correlation methods were implemented using the
SAS package and King's method was implemented using EzI software, developed by King and colleagues (King, 1997). The first row of the table gives the "true" values of the probabilities as estimated from census cross tabulations, available for these particular variables from the SAS database, which were the same as the corresponding estimates obtained from the SAR.

When adjustment variables are not included, the ecological linear and ecological logistic estimates are considerably different from the true values. The inclusion of "owner occupied housing" as an adjustment variable leads to ecological linear and logistic estimates that are much closer to the true values. Used as a single adjustment variable, "aged 60 and over" does not improve the estimates. The estimates obtained by the adjusted ecological linear method are fairly close to the true values when several adjustment variables are used, in particular for those combinations of adjustment variables that include housing tenure. The adjusted linear regression method generally works better than the adjusted correlation method suggested by Steel and Holt (1996). In this example, King's method without covariates gives results similar to those for the ecological linear and logistic methods without covariates. When "owner occupied" is used as a covariate, the estimated probabilities are further from the true values than those obtained using the adjusted linear regression method. However, when the covariate "renting from a local authority" is added the estimates from King's method are slightly better than those based on the adjusted linear regression method. Use of King's method with more than two covariates proved difficult in practice and no results for such cases are included.

For the estimates of the proportion unemployed by marital status given in Table 2, the unadjusted ecological linear and logistic methods provide poor estimates. The linear method leads to an out of range estimate. Including "owner occupied" as an adjustment variable in these methods leads to some improvement. The adjusted ecological linear regression method works reasonably well for combinations of adjustment variables that include housing tenure and age. King's method without covariates works somewhat better in this case than in the previous example, because differences in the proportions unemployed and married across the EDs lead to more informative bounds. However, the estimates are still worse than those obtained from the adjusted logistic regression method. When the variable "owner occupied" is used as a covariate, the estimates obtained from King's method are effectively equal to the true values, while those based on the adjusted linear and logistic regression methods are not.

While these results are limited to two relationships, they highlight the importance of using auxiliary variables in ecological analysis to obtain reasonable estimates. The choice of auxiliary variables is important and methods of identifying effective adjustment variables need to be used, as suggested by Steel and Holt (1996). If appropriate auxiliary variables are used, quite simple methods can perform almost as well as more sophisticated, computer intensive ones. We expect that the simple adjusted ecological methods can be extended to also include random effects within the sort of multilevel framework developed, for example, by Goldstein (1995). Methods that solely use random effects to account for the variation in the relationship between groups will, in general, not be very successful. More information needs to be incorporated in order to get useful estimates. This information may be in the form of individual level data on auxiliary variables, group level covariates or the constraints exploited in King's method.

4. References

Cleave, N., Brown, P.J. and Payne, C.D. (1995). Evaluation of Methods for Ecological Inference. Journal of the Royal Statistical Society, A, 158, pp 55-72.

Goldstein, H. (1995). Multilevel Statistical Models, 2nd Edition, Edward Arnold, London.

Goodman, L.A. (1959). Some alternatives to Ecological regression. American Journal of Sociological Review, 18, 663-664.

King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton Univ Press.

Steel, D. and Holt, D. (1996). Analysing and Adjusting Aggregation Effects: The Ecological Fallacy Revisited. International Statistical Review, 64, pp 39-60.

Steel, D., Holt, D. and Tranmer, M. (1996). Making unit level inferences from aggregated data. Survey Methodology, 22, 3-15.
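
The averaging step in equation (2) can be illustrated numerically. The following Python sketch is not part of the original analysis: the function name `adjusted_marginal` and all input values are hypothetical, and it assumes a single dichotomous auxiliary variable Z whose conditional probabilities have already been estimated from aggregate data.

```python
def adjusted_marginal(p_y_given_xz, p_x_given_z, p_z, b):
    """P(Y=1 | X=b) from equation (2), averaging over the categories of Z.

    p_y_given_xz[z][b] : P(Y=1 | X=b, Z=z), e.g. from an aggregate-data regression
    p_x_given_z[z]     : P(X=1 | Z=z), e.g. from an aggregate-data regression
    p_z[z]             : P(Z=z), e.g. from individual level data
    """
    num = 0.0
    den = 0.0
    for z, pz in p_z.items():
        # P(X=b | Z=z) for the requested value of b
        px = p_x_given_z[z] if b == 1 else 1.0 - p_x_given_z[z]
        num += p_y_given_xz[z][b] * px * pz
        den += px * pz
    return num / den


# Hypothetical inputs (assumed for illustration, not taken from the census data).
p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: 0.3, 1: 0.6}
p_y_given_xz = {0: {0: 0.35, 1: 0.40}, 1: {0: 0.50, 1: 0.60}}

p10 = adjusted_marginal(p_y_given_xz, p_x_given_z, p_z, b=0)
p11 = adjusted_marginal(p_y_given_xz, p_x_given_z, p_z, b=1)
print(round(p10, 3), round(p11, 3))
```

Only the marginal proportions of Z enter through `p_z`, which matches the remark that for a single categorical auxiliary variable nothing beyond the category proportions is needed from individual level data.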

Table 1: Estimated probabilities: Y = employed; X = married

                    P1|0  P1|1                        P1|0  P1|1
SAS 'truth'          .41   .50    SAS 'truth'          .41   .50
covariate(s):                     covariate(s):
none                 .26   .67    none                 .26   .67
age2                 .23   .71    age60+               .29   .65
age1, age2           .28   .65    age1, age2           .33   .59
oo                   .48   .42    oo                   .49   .41
oo, rla              .47   .43    oo, rla              .46   .44
oo, rla, age1,2      .45   .45    oo, rla, age1,2      .45   .46

                    P1|0  P1|1                        P1|0  P1|1
SAS 'truth'          .41   .50    SAS 'truth'          .41   .50
covariate(s):                     covariate(s):
none                 .27   .66    none                 .25   .69
age2                 .22   .70    age60+
age1, age2           .20   .73    age4559, age60+
oo                   .55   .35    oo                   .50   .39
oo, rla              .51   .39    oo, rla              .44   .46
oo, rla, age1,2      .47   .43    oo, rla, age1,2

Population: Residents aged 16 or more in households, Manchester LAD.
Y takes the value 1 for 'employed' and 0 for 'not employed'.
X takes the value 1 for 'married' and 0 for 'not married'.
P1|0 means P(Y = 1 | X = 0); P1|1 means P(Y = 1 | X = 1).
Adjustment variables: oo = owner occupied; rla = rented from local authority;
age1 = persons aged 45 - 59; age2 = persons aged 60 and over.
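
The gap between the unadjusted estimates and the 'truth' row can be mimicked with a small deterministic example of the Goodman regression built on equation (3). This is a sketch under stated assumptions: the five areas, their proportions and the function name `goodman_regression` are hypothetical and are not census values.

```python
def goodman_regression(x_bar, y_bar):
    """Ordinary least squares of group means y_bar on x_bar.

    Under equation (3) the intercept estimates P(Y=1 | X=0) and
    intercept + slope estimates P(Y=1 | X=1). Returns that pair.
    """
    n = len(x_bar)
    mx = sum(x_bar) / n
    my = sum(y_bar) / n
    sxx = sum((x - mx) ** 2 for x in x_bar)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_bar, y_bar))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, intercept + slope


# Hypothetical area-level proportions with X = 1 (assumed values).
x_bar = [0.2, 0.35, 0.5, 0.65, 0.8]

# Case 1: the conditional proportions are identical in every area,
# so the regression recovers them.
y_const = [0.41 * (1 - x) + 0.50 * x for x in x_bar]
est_const = goodman_regression(x_bar, y_const)

# Case 2: both conditional proportions rise with the area's X mean.
# Their averages over the five areas are still 0.41 and 0.50, but the
# regression absorbs the between-area drift into its slope.
y_drift = [(0.31 + 0.2 * x) * (1 - x) + (0.40 + 0.2 * x) * x for x in x_bar]
est_drift = goodman_regression(x_bar, y_drift)

print(est_const, est_drift)
```

In the second case the fitted values are about 0.31 and 0.60, so the area-level regression overstates the difference between the two groups even though the average conditional proportions are unchanged.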

Table 2: Estimated probabilities: Y = unemployed; X = married

                    P1|0  P1|1                        P1|0  P1|1
SAS 'truth'          .14   .07    SAS 'truth'          .14   .07
covariate(s):                     covariate(s):
none                 .24  -.06    none                 .17   .02
age2                 .24  -.05    age60+               .17   .02
age1, age2           .22  -.03    age1, age2           .11   .09
oo                   .18   .01    oo                   .16   .04
oo, rla              .19   .01    oo, rla              .15   .04
oo, rla, age1,2      .16   .04    oo, rla, age1,2      .11   .09

                    P1|0  P1|1                        P1|0  P1|1
SAS 'truth'          .14   .07    SAS 'truth'          .14   .07
covariate(s):                     covariate(s):
none                 .27  -.09    none                 .19   .00
age2                 .27  -.09    age2
age1, age2           .28  -.09    age1, age2
oo                   .19  -.00    oo                   .14   .07
oo, rla              .18   .01    oo, rla
oo, rla, age1,2      .16   .03    oo, rla, age1,2

Population: Residents aged 16 or more in households, Manchester LAD.
Y takes the value 1 for 'unemployed' and 0 for 'not unemployed'.
X takes the value 1 for 'married' and 0 for 'not married'.
P1|0 means P(Y = 1 | X = 0); P1|1 means P(Y = 1 | X = 1).
Adjustment variables: oo = owner occupied; rla = rented from local authority;
age1 = persons aged 45 - 59; age2 = persons aged 60 and over.
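
Negative entries such as those in Table 2 can arise whenever an estimated correlation is too strong for the margins. The sketch below illustrates this with equation (4); the function names, the margin for X and the overstated correlation of -0.3 are assumptions chosen for illustration, not values from the paper.

```python
import math


def cell_from_correlation(p_y, p_x, r_yx):
    """Equation (4): P(Y=1, X=1) from the two margins and the correlation."""
    return p_y * p_x + r_yx * math.sqrt(p_y * (1 - p_y) * p_x * (1 - p_x))


def conditionals(p_y, p_x, r_yx):
    """Return (P(Y=1 | X=0), P(Y=1 | X=1)) implied by equation (4)."""
    p11 = cell_from_correlation(p_y, p_x, r_yx)
    return (p_y - p11) / (1 - p_x), p11 / p_x


# Hypothetical 2x2 table: the margin P(X=1) = 0.4 is an assumed value.
p_x = 0.4
true_p10, true_p11 = 0.14, 0.07
p_y = true_p10 * (1 - p_x) + true_p11 * p_x
p_joint = true_p11 * p_x
r_true = (p_joint - p_y * p_x) / math.sqrt(p_y * (1 - p_y) * p_x * (1 - p_x))

# With the correct correlation the conditional probabilities are recovered.
recovered = conditionals(p_y, p_x, r_true)

# With a correlation that is too strongly negative, the implied cell
# proportion drops below zero and so does P(Y=1 | X=1).
overstated = conditionals(p_y, p_x, -0.3)

print(recovered, overstated)
```

The round trip shows that equation (4) is exact when the correlation is right, so any out of range result from this route reflects error in the correlation estimate rather than in the identity itself.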
