Bayesian Analysis Using MCMC On Survey Data

IMPROVING FORECASTING OF POLITICAL POLLING OUTCOMES
By
Lancelot Muwayi and Sanoj Kumar
Supervisor: Dr. Ming-Long Lam
A Capstone Project
Submitted to the University of Chicago

in partial fulfillment of the requirements for the degree of
Master of Science in Analytics
Graham School of Continuing Liberal and Professional Studies
(August, 2015)
The Capstone Project committee for Lancelot Muwayi and Sanoj Kumar
certifies that this is the approved version of the following capstone project report:
APPROVED BY SUPERVISING COMMITTEE:
Dr. Ming-Long Lam
____________________________________
Dr. Sema Barlas
_____________________________________
Abstract
This research explores the extent to which Bayesian estimators improve the
forecasting of political polling outcomes. A Bayesian model is built using a two-step
process. The first step applies decision trees to select significant demographic
variables; the second step uses the Markov Chain Monte Carlo (MCMC) method to
estimate the Bayesian model. The Bayesian model takes into account sampling
variation and ensures the stability of parameter estimates by weighting them with a
prior distribution.
Key Words
Bayesian estimator, forecasting of polling outcomes, MCMC, decision tree
Executive Summary
Incomplete and noisy data from disparate sources call for non-conventional
statistical tools for correct analysis. Bayesian methods are beneficial for obtaining
robust estimates and combining information from disparate sources. This research
examines the suitability of Bayesian methods in estimating the parameters of a model
that predicts election outcomes on the basis of polling data in which parameters of
earlier models could be used as priors. A two-stage process is used to predict approval
of the Obama presidency based on the race, gender, education, and region of respondents:
CART and CHAID decision trees select the variables, and the Markov Chain Monte Carlo
(MCMC) method estimates the model. Results show that black respondents from the U.S.
southern region approve of Obama's presidency at a higher rate than white respondents.
Approval of Obama's presidency is lower among respondents with no college education. The
U.S. southern region stands out among all regions as having the lowest approval of
Obama's presidency. More survey data on the reasons for approval or disapproval
of Obama's presidency would further validate and support the results of this research.
Table of Contents
Introduction
  Problem Statement
  Research Purpose
  Research Question
Background
Methodology
  Exploratory Data Analysis
  Building Classification Model
    Bayesian Logistic Regression Model
    PROC MCMC with Bayesian Logistic Regression Model
Final Results
  Exploratory Analysis Results
  Factors Selection
  Building Classification Model Results
Summary and Conclusions
Recommendations
  Further Model Development
Appendix
Bibliography
List of Tables
Table 1: Aggregated Data for Model
Table 2: Interaction of Race and Region
Table 3: Interaction of Race and Sex
Table 4: Interaction of Race and Education
Table 5: Important Factors
Table 6: Output Parameters of Model
Table 7: Random Effects Parameters
Table 8: Posterior Summaries of Parameter Estimates
List of Diagrams
Diagram 1: Black Only as a factor
Diagram 2: South as a factor
Diagram 3: Diagnostics Plot for Beta1
Introduction
Problem Statement
Every four years, Americans elect their President for a new term. Since this is
an important decision for the nation, political groups have been conducting election
polls since the 1930s. Millions of Americans have been surveyed on their political
opinions, yielding a wealth of information that political scientists can utilize to their
full advantage (Caughey and Warshaw, 2003). State-level pre-election survey data
represent a rich new source of information for both forecasting election outcomes and
tracking the evolution of voter preferences during the campaign (Linzer, 2013).
However, major hurdles still exist that are affecting survey researchers,
including low response rates, rising costs to carry out the surveys, and the demand for
quicker turn-around times. These factors have increased the demand for new ways to
generate accurate and timely polling estimates of public opinion and social behaviors
by using incomplete or noisy data and combining insights from different surveys and
other sources. Researchers are faced with the task of doing more with disparate
datasets.
In the arena of political polling, for example, Linzer (2013) proposed a
Bayesian approach to improve forecasting election outcomes from state polls using
recent advances in Bayesian methodology (see also Jackman, 2005; Carlin and Louis,
2009; and Ntzoufras 2009). Bayesian inference is normally used to update a
previously estimated probability given new information (Gelman et al., 2004).
This study uses polls conducted from January 2012 to March 2012, which
asked roughly 37,000 voters across the states about their approval of the Obama presidency
on a seven-level scale, ranging from strongly disapprove to strongly approve. The
channel through which the data were collected is not known, and the collected data are
sparse, noisy, and come from unrepresentative samples.
The goal of this study is to build a statistically robust model to relate approval
ratings of the Obama presidency to characteristics of the populations using the survey
data provided by Ipsos.
Research Purpose
The primary purpose of this research is to build a model to predict approval
ratings of the Obama presidency and to identify significant predictors. The secondary
purpose is to examine the suitability of Bayesian methods. The last, but not least
important, purpose is to develop custom SAS code that our Capstone sponsor can
reference.
Ipsos Public Affairs, our Capstone sponsor, maintains an active program of
research that integrates Bayesian and other methodologies into its polling and broader
research practices. This research assists and advises the company's researchers on
how to obtain population-valid public opinion outcomes using Bayesian and other
appropriate methods for both static and time-dependent data (including, eventually,
real-time continuous-feed data).
Data from disparate sources are noisy and incomplete. Therefore, Bayesian
methods are beneficial since robust estimates can be obtained from them. They are
also useful in combining information from disparate sources. Parameters of the models
from earlier polls can be used as priors for the parameters of the model for the new
poll.
The particular case study at the center of this project is U.S. state-level mid-term
elections, using historical data sources as well as partial or incomplete polling data and
other data (including simulated data). We worked closely with Ipsos methodologists
and polling experts to advance their forecasting and polling capabilities, as well as to
broaden their program of research on nonprobability sampling.
Research Question
We were tasked with developing a free-standing and flexible SAS program that will
implement the Bayesian model estimation, generate correct standard errors, and allow
for the eventual introduction of additional target/model variables. The model will
allow for integration of other specialized attitudinal measures, in addition to the
parameters used in this project. The project asks this question: to what extent are
Bayesian methods suitable for estimating parameters of a model that predicts election
outcome on the basis of polling data, where parameters of earlier models can be used
as priors?
Background
The focus of this project is to analyze survey data related to the
presidency of Obama and to identify the significant predictors that affected his
approval rating across different states or regions. The polls used in our study were
conducted as random samples of registered or likely voters. The sample data are sparse and
noisy: some states and regions have more respondents than others. In our data, the
response variable is categorical, so we needed a classification technique, and we
began by looking for statistical methods that fit our needs.
Our data is factorial in nature: some factors are observed within other factors.
For instance, different levels of race are observed in different regions. Within each
race we have two genders, and so on. Our goal is to understand the interaction effects
among these factors. Therefore, we started by developing a SAS program for multilevel regression of polling data.
To understand multilevel regression on survey data, we first analyzed case
studies conducted by a number of researchers. Gelman and Ghitza, in "Deep
Interactions with MRP: Election Turnout and Voting Patterns among Small Electoral
Subgroups," showed how multilevel regression derives voting-pattern and
turnout estimates for small subgroups of the population. MRP stands for multilevel
regression and post-stratification (post-stratification is not within the scope of this
project). They analyzed how to fit a multilevel regression model that includes
group-level predictors as well as unexplained variation
at each of the levels of the factors and their interactions.
Ipsos wants to use its prior knowledge on the parameter estimates. Therefore,
an application of Bayesian analysis became inevitable in this project. To understand
Bayesian analysis on polling data, we studied Linzer's "Dynamic Bayesian
Forecasting of Presidential Elections in the States." In this study, Linzer introduced a
dynamic Bayesian forecasting model that unifies the regression-based historical
forecasting approach developed in political science and economics with the
poll-tracking capabilities made feasible by the recent upsurge in state-level opinion polls.
In the case study "Electoral Forecasting and Public Opinion Tracking in Latin
America: An Application to Chile," Bunker et al. argued that Bayesian inference is
well suited to estimating true public opinion. Bayesian inference is normally used to
update a previously estimated probability given new information (Gelman et al.,
2004).
Bayesian methods are derived from the application of Bayes' theorem. For events
A and B, Bayes' theorem is expressed as
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$
in which the degree of belief in proposition A given evidence B is equal to the joint
probability of A and B divided by the probability of B. This theorem has been frequently
applied in electoral studies, including research that has used public opinion polls
to estimate electoral returns (Fernandez-I Marin, 2011; Lock and Gelman, 2010;
Jackman, 2004; Linzer, 2013; Strauss, 2007). As such, it has been found to increase the
overall accuracy of estimates (see Brooks and Gelman, 1998; Gelman and Rubin, 1992;
Jackman, 2000, 2009).
Pr(A) is the prior degree of belief in A.
Pr(A|B) is the posterior degree of belief in A, that is, the degree of belief in A
after looking at the evidence B.
It can also be written as
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B \mid A)\,\Pr(A) + \Pr(B \mid \bar{A})\,\Pr(\bar{A})}$$
in which $\bar{A}$ means not A. If we treat A as the parameter $\theta$ and B as the data $y$, then we have
$$\Pr(\theta \mid y) = \frac{\Pr(y \mid \theta)\,\Pr(\theta)}{\Pr(y)} = \frac{\Pr(y \mid \theta)\,\Pr(\theta)}{\Pr(y \mid \theta)\,\Pr(\theta) + \Pr(y \mid \bar{\theta})\,\Pr(\bar{\theta})}$$
The quantity $\Pr(y)$ is the marginal probability, and it serves as a normalizing constant to
ensure that the probabilities add up to unity. Because $\Pr(y)$ is a constant, we can ignore it
and write
$$\Pr(\theta \mid y) \propto \Pr(y \mid \theta)\,\Pr(\theta)$$
Thus, the prior $\Pr(\theta)$ is being updated with the likelihood $\Pr(y \mid \theta)$ to form the posterior
distribution $\Pr(\theta \mid y)$.
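As a small numerical illustration of the two-hypothesis form of Bayes' theorem, the sketch below computes Pr(A|B) from a prior and two likelihoods. The function name and the numbers are hypothetical and are not taken from the survey data.

```python
def posterior(prior_a, like_b_given_a, like_b_given_not_a):
    """Return Pr(A|B) from Pr(A), Pr(B|A), and Pr(B|not A)."""
    # Pr(B) expanded over A and not A (the normalizing constant).
    evidence = like_b_given_a * prior_a + like_b_given_not_a * (1.0 - prior_a)
    return like_b_given_a * prior_a / evidence


# Hypothetical inputs: prior belief 0.5, evidence more likely under A than under not A.
belief = posterior(prior_a=0.5, like_b_given_a=0.8, like_b_given_not_a=0.3)

# Feeding the result back in as the prior shows how belief is updated again
# when a second, independent piece of the same kind of evidence arrives.
belief_again = posterior(prior_a=belief, like_b_given_a=0.8, like_b_given_not_a=0.3)

print(belief, belief_again)
```

The second call treats the first posterior as the new prior, which is the updating mechanism the text describes.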
In a nutshell, Bayesian analysis updates beliefs about the parameters by
accounting for additional data. We need to weight the likelihood for the data with the
prior distribution to produce the posterior distribution. If we want to estimate a parameter
$\theta$ from data $y = \{y_1, \ldots, y_n\}$ by using a statistical model described by the density
$p(y \mid \theta)$, Bayesian analysis says that we cannot determine $\theta$ exactly but we can describe
the uncertainty by using probability statements and distributions. We can formulate a prior
distribution $\pi(\theta)$ to express our beliefs about $\theta$. We then update these beliefs by
combining the information from the prior distribution and the data, described with the
statistical model $p(y \mid \theta)$, to generate the posterior distribution $p(\theta \mid y)$:
$$p(\theta \mid y) = \frac{p(\theta, y)}{p(y)} = \frac{p(y \mid \theta)\,\pi(\theta)}{p(y)} = \frac{p(y \mid \theta)\,\pi(\theta)}{\int p(y \mid \theta)\,\pi(\theta)\,d\theta}$$
In general, any prior distribution can be used, depending on the available prior
information. The choice can include an informative prior distribution if something is
known about the likely values of the unknown parameters, or a diffuse or non-informative
prior if either little is known about the coefficient values or if one
wishes to see what the data themselves provide as inferences. Non-informative prior
distributions play a minimal role in the posterior distribution.
The Bayesian approach is attractive to researchers
because it provides a mechanism for combining a prior probability distribution for the
states of nature with sample information to provide a revised (posterior) probability
distribution about the states of nature. These posterior probabilities are then used to
make better decisions.
A previous posterior distribution can also be used as a prior when new
observations become available. With sparse and noisy data, inference proceeds in the same
manner as if one had a large sample. The approach provides interpretable answers, such as "the
true parameter has a probability of 0.95 of falling in a 95% credible interval." It also
provides a convenient setting for a wide range of models, such as hierarchical models
and missing data problems.
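The idea of reusing a posterior as the next prior can be sketched with a conjugate Beta-binomial update. This is a simpler stand-in chosen for illustration, not the logistic model fitted in this project, and the poll counts and function names are hypothetical.

```python
def update_beta(a, b, approvals, total):
    """Posterior Beta parameters after observing `approvals` successes in `total` trials."""
    return a + approvals, b + (total - approvals)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical poll counts, for illustration only.
flat_prior = (1.0, 1.0)  # Beta(1, 1) is uniform on (0, 1)
after_poll_1 = update_beta(*flat_prior, approvals=40, total=100)
# The posterior from the first poll serves as the prior for the second poll.
after_poll_2 = update_beta(*after_poll_1, approvals=55, total=150)
# Analyzing both polls at once from the flat prior gives the same posterior.
pooled = update_beta(*flat_prior, approvals=95, total=250)

print(after_poll_2, pooled, beta_mean(*after_poll_2))
```

Updating poll by poll and updating once on the pooled counts agree, which is why earlier results can be carried forward as priors.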
When the posterior distribution in a Bayesian analysis does not have a closed
form, one can apply Markov Chain Monte Carlo (MCMC) simulation methods for
any sample size and obtain accurate estimates of the parameters.
The SAS MCMC procedure uses the Markov Chain Monte Carlo technique to
estimate the model parameters and to produce correct standard errors and credible
limits. Markov chain Monte Carlo is a general computing algorithm that has been
widely used in many scientific disciplines, including statistics. The posterior
distribution often involves a high-dimensional integral. The algorithm has two parts,
as its name suggests: a Markov chain generates successive draws of the parameters, and
Monte Carlo averaging over those draws estimates their posterior distribution, so the
integral never has to be evaluated directly.
The simulations are repeated many times and form a Markov chain. Once the
chain has stabilized, the chain of estimates is used to produce the final results. When
MCMC is used for optimization and Bayesian inference, the objective is to compute
the global optimum of some Bayesian posterior probability by drawing representative
samples from the posterior probability distribution. In practice, a Bayesian posterior
distribution often does not take any well-known form and may have multiple local
optima. Thus, simulating the posterior yields a plausible understanding of the
distribution. Very often MCMC is paired with a general-purpose optimization
technique such as simulated annealing to locate the global
optimum.
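To make the simulation idea concrete, the following is a minimal random-walk Metropolis sketch for a single approval proportion with a uniform prior. It is far simpler than the sampler inside PROC MCMC, and the counts, step size, and function names are assumptions made for illustration.

```python
import math
import random


def log_posterior(theta, approvals, total):
    """Unnormalized log posterior: uniform prior on (0, 1) times binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return approvals * math.log(theta) + (total - approvals) * math.log(1.0 - theta)


def metropolis(approvals, total, n_iter=20000, burn_in=2000, step=0.05, seed=0):
    """Random-walk Metropolis draws of theta, discarding the burn-in draws."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, approvals, total)
    draws = []
    for i in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, approvals, total)
        # Accept with probability min(1, posterior ratio); proposals outside (0, 1) are rejected.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            draws.append(theta)
    return draws


# Hypothetical cell: 30 approvals out of 100 respondents.
draws = metropolis(approvals=30, total=100)
posterior_mean = sum(draws) / len(draws)
print(posterior_mean)
```

With a uniform prior the exact posterior here is Beta(31, 71), so the simulated mean can be compared with 31/102 as a sanity check on the chain.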
Methodology
Exploratory Data Analysis
The dataset included Obama approval ratings and demographic information for over
37,000 respondents from all fifty U.S. states. The demographic information
included each respondent's age, education, sex, race, region, and state of residence. Each
demographic variable had multiple levels. Race consisted of five
levels: black only, white only, Hispanic, DK (Don't Know), and others. Education
consisted of no college, some college, and college or more. Sex was male or female.
There were four regions: Midwest, Northeast, South, and West. Since there were many states
and the data were sparse, we decided to include region in our analysis and forgo state.
Like the demographic variables, the response variable (i.e., Obama approval) had multiple
levels. It consisted of seven levels: lean approve, lean disapprove, mixed feelings, somewhat
approve, somewhat disapprove, strongly approve, and strongly disapprove.
Since our input variables (i.e., the demographic variables and the response variable)
had many levels, we decided to reduce the levels of some variables for simplicity.
We kept all levels of race in our analysis but reduced education to two
categories by combining some college and college or more into one category (i.e.,
college).
Similarly, we recoded the response variable. We combined lean approve,
somewhat approve, and strongly approve into an approve category, and lean disapprove,
somewhat disapprove, and strongly disapprove into a disapprove category. We removed
the mixed feelings responses since we found them not relevant to the analysis.
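The recoding described above can be sketched as two small mapping functions. The level labels follow the text, but the exact spelling of the labels in the Ipsos dataset is an assumption, as are the function names.

```python
APPROVE_LEVELS = {"lean approve", "somewhat approve", "strongly approve"}
DISAPPROVE_LEVELS = {"lean disapprove", "somewhat disapprove", "strongly disapprove"}


def recode_approval(response):
    """Map a seven-level response to 1 (approve), 0 (disapprove), or None (dropped)."""
    label = response.strip().lower()
    if label in APPROVE_LEVELS:
        return 1
    if label in DISAPPROVE_LEVELS:
        return 0
    return None  # "mixed feelings" is removed from the analysis


def recode_education(level):
    """Collapse 'some college' and 'college or more' into a single 'college' category."""
    return "no college" if level.strip().lower() == "no college" else "college"


responses = ["Strongly Approve", "mixed feelings", "lean disapprove"]
kept = [r for r in (recode_approval(x) for x in responses) if r is not None]
print(kept, recode_education("Some College"))
```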
Despite consolidating some levels into fewer categories, a significant number of
levels remained. From our data, it was clear that we had to find how these variables
interact at different levels. To accomplish this, we used the PROC FREQ and
PROC TABULATE procedures of SAS to tabulate the data in different formats.
We explored interactions between education and region, education and
sex, and region and sex. We did not find any significant difference between any levels
in these interactions. However, from the three tables presented in the
Final Results section, race emerged as the dominant factor, and among the race levels
black only and white only were the most prominent. The interactions of black only with
south region and white only with no college education appeared significant.
To verify the above findings, we used CHAID and CART in SAS. The Factors
Selection part of the Final Results section discusses how significant factors were chosen.
Building Classification Model

Bayesian Logistic Regression Model
Our target variable, Obama approval, has two levels, namely Approve and
Disapprove, so it can be represented in binary format. We therefore used a logistic
regression model. As discussed earlier, Bayesian analysis needs to be incorporated in
our model. Therefore, a Bayesian logistic regression model is the best option to use in
this scenario.
From the Factors Selection part of the Final Results section, we chose four factors to build our
model: Black, White, No College among education levels, and South among
regions. First, we prepared our data for a logistic regression random-effects model. We
dummy coded the categorical variables. For example, we chose only two races (Black
and White) during factor selection. Therefore, in the Black column of Table 1,
we have 1 for Black and 0 for all other races. Similarly, in the White column
we have 1 for White and 0 for all other races. For education, we assigned 1 to No College
and 0 to College. Similarly, we assigned 1 to South and 0 to all other regions.
Finally, the Approval value is 1 when the respondent approves of Obama, and 0 otherwise (i.e.,
Disapproval). We tabulated the 24 (= 2 * 3 * 2 * 2) possible combinations of the target and the
above factors using the programming code in the appendix. Then, for modeling purposes, we
calculated for each of the 12 factor combinations the count with Approval equal to 1 and the
total count, which generated this data:
Table 1: Aggregated Data for Model
Group  Black  White  Non-College  Region South  Approval  Total
    1      0      0            0             0       946   2164
    2      0      0            0             1       405   1100
    3      0      0            1             0       280    737
    4      0      0            1             1       124    362
    5      0      1            0             0      5099  14697
    6      0      1            0             1      2106   8337
    7      0      1            1             0      1308   4883
    8      0      1            1             1       517   2888
    9      1      0            0             0       403    544
   10      1      0            0             1       920   1177
   11      1      0            1             0       124    190
   12      1      0            1             1       297    404
From the above table, we deduce that among respondents who are neither Black nor White,
whose education level is other than no college, and whose region is other than South,
946 out of 2164 approved of Obama's presidency. There are 12
groups.
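The dummy coding and aggregation steps can be sketched as follows. The project's actual code is the SAS program in the appendix; this Python version, its record layout, and its function names are hypothetical and only illustrate the logic.

```python
from collections import defaultdict


def dummy_code(race, education, region):
    """Return the (black, white, no_college, south) indicators described in the text."""
    black = 1 if race == "black only" else 0
    white = 1 if race == "white only" else 0
    no_college = 1 if education == "no college" else 0
    south = 1 if region == "south" else 0
    return black, white, no_college, south


def aggregate(records):
    """Collapse respondent-level records into (approvals, total) per factor combination."""
    cells = defaultdict(lambda: [0, 0])
    for race, education, region, approve in records:
        key = dummy_code(race, education, region)
        cells[key][0] += approve  # approvals in the cell
        cells[key][1] += 1        # respondents in the cell
    return {key: tuple(counts) for key, counts in cells.items()}


# Hypothetical respondent-level records: (race, education, region, approve).
records = [
    ("black only", "college", "south", 1),
    ("black only", "college", "south", 0),
    ("white only", "no college", "midwest", 0),
    ("hispanic", "college", "west", 1),
]
cells = aggregate(records)
print(cells)
```

Each key of `cells` corresponds to one row of Table 1, with the approval count and the total as its value.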
PROC MCMC with Bayesian Logistic Regression Model
We built the model on the above data using the factors Black, White, education, and
region as fixed effects and the interactions of Black with region and White with education
as random. Then we fit a Bayesian logistic model in PROC MCMC.
We modeled each respondent's response as a Bernoulli trial with probability of
success $p_i$. This means the approval count in each cell follows a binomial distribution with
parameters $p_i$ and $n_i$, where $n_i$ is known but $p_i$ is to be estimated. We used the logit link
function to link the covariates of each observation, such as edu (for no college education) and
region (for south region), to the probability of success:
$$\eta_i = \beta_0 + \beta_1\,\mathrm{black}_i + \beta_2\,\mathrm{white}_i + \beta_3\,\mathrm{edu}_i + \beta_4\,\mathrm{region}_i + \beta_5\,\mathrm{black}_i\,\mathrm{region}_i + \beta_6\,\mathrm{white}_i\,\mathrm{edu}_i$$
The probability of approval is
$$p_i = \frac{e^{\eta_i + \delta_i}}{1 + e^{\eta_i + \delta_i}}$$
where $\delta_i$ is assumed to be an independent and identically distributed random effect
with a normal prior with mean zero and constant variance $\sigma^2$. The intercept, the six
regression coefficients, and the variance $\sigma^2$ of the random effects are the model
parameters. The betas are given non-informative priors: each beta assumes a flat (uniform)
distribution. We used non-informative
priors because we do not have historical information about the distribution of the betas. This
is also known as the "let the data speak for itself" approach (Gelman et al., 2004, 51).
Ipsos may consider other priors since they have results from past surveys. The
variance parameter takes an inverse-gamma prior (igamma in PROC MCMC) with shape =
0.01 and scale = 0.01.
The same model can be expressed as a random effects model:
$$p_i = \frac{e^{\delta_i}}{1 + e^{\delta_i}}$$
in which $\delta_i$ follows a normal distribution with mean
$\beta_0 + \beta_1\,\mathrm{black}_i + \beta_2\,\mathrm{white}_i + \beta_3\,\mathrm{edu}_i + \beta_4\,\mathrm{region}_i + \beta_5\,\mathrm{black}_i\,\mathrm{region}_i + \beta_6\,\mathrm{white}_i\,\mathrm{edu}_i$
and variance $\sigma^2$.
Both of the above models are equivalent. In the first model, the random effect
$\delta_i$ is centered at 0 in the normal distribution, and in the second model,
$\delta_i$ is centered at the
regression mean. This hierarchical centering improves mixing.
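The equivalence of the two forms follows from a change of variable. In the sketch below, $\eta_i$ denotes the regression mean (beta0 plus the six covariate terms) and $\varepsilon_i$ is a symbol introduced only for this illustration to denote the zero-centered random effect of the first form.

```latex
\begin{aligned}
\text{First form:}\quad  & \operatorname{logit}(p_i) = \eta_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \\
\text{Define:}\quad      & \delta_i = \eta_i + \varepsilon_i \;\Longrightarrow\; \delta_i \sim N(\eta_i, \sigma^2) \\
\text{Second form:}\quad & \operatorname{logit}(p_i) = \delta_i, \qquad \delta_i \sim N(\eta_i, \sigma^2)
\end{aligned}
```

Adding the constant $\eta_i$ to a normal variable shifts its mean and leaves its variance unchanged, so both forms imply the same distribution for $p_i$.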

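Based on the above logic, the PROC MCMC program statements are:
proc mcmc data=last outpost=postout seed=332786 nmc=30000 thin=10;
   /* initial values for the regression coefficients and the variance */
   parms beta0 0 beta1 0 beta2 0 beta3 0 beta4 0 beta5 0 beta6 0 s2 1;
   /* inverse-gamma prior on the random-effects variance */
   prior s2 ~ igamma(0.01, s=0.01);
   /* flat prior on every parameter whose name starts with beta */
   prior beta: ~ general(0);
   /* regression mean on the logit scale */
   w = beta0 + beta1*black + beta2*white + beta3*edu + beta4*region
       + beta5*black*region + beta6*white*edu;
   /* one random effect per group, centered at the regression mean */
   random delta ~ normal(w, var=s2) subject=groups;
   /* inverse logit: approval probability for the group */
   pi = logistic(delta);
   /* approvals (count) out of n respondents in each group */
   model count ~ binomial(n = n, p = pi);
run;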
The PROC MCMC statement specifies the input and output datasets, sets a seed for the
random number generator, requests 30,000 simulation iterations, and thins
the Markov chain by 10, which keeps every tenth draw. The PARMS statement declares the
model parameters (the regression coefficients and the variance s2) and their initial
values. The PRIOR statements specify the prior
distributions for the betas and s2.
The symbol w calculates the regression mean, and the RANDOM statement
specifies the random effects, with a normal prior distribution centered at w with
variance s2. The SUBJECT option indicates the group index for the random effects.
The symbol pi is the inverse logit (logistic) transformation of delta. The MODEL statement
specifies the response variable count as a binomial distribution with parameters n and pi.
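To show what the statements above ask the sampler to explore, the sketch below writes out the unnormalized log posterior of the same model in Python, using the counts from Table 1. It is an illustration under stated assumptions rather than a reproduction of PROC MCMC internals: constants that do not depend on the parameters are dropped, and the function names are hypothetical.

```python
import math

# (black, white, edu, south, approvals, total) for the 12 groups of Table 1.
CELLS = [
    (0, 0, 0, 0, 946, 2164), (0, 0, 0, 1, 405, 1100),
    (0, 0, 1, 0, 280, 737), (0, 0, 1, 1, 124, 362),
    (0, 1, 0, 0, 5099, 14697), (0, 1, 0, 1, 2106, 8337),
    (0, 1, 1, 0, 1308, 4883), (0, 1, 1, 1, 517, 2888),
    (1, 0, 0, 0, 403, 544), (1, 0, 0, 1, 920, 1177),
    (1, 0, 1, 0, 124, 190), (1, 0, 1, 1, 297, 404),
]


def regression_mean(beta, black, white, edu, south):
    """The quantity w in the program: the mean of the group's random effect."""
    return (beta[0] + beta[1] * black + beta[2] * white + beta[3] * edu
            + beta[4] * south + beta[5] * black * south + beta[6] * white * edu)


def log_posterior(beta, s2, delta, shape=0.01, scale=0.01):
    """Unnormalized log posterior for the betas (flat priors), s2, and the 12 deltas."""
    if s2 <= 0.0:
        return float("-inf")
    # Inverse-gamma(shape, scale) log density for s2, up to an additive constant.
    total = -(shape + 1.0) * math.log(s2) - scale / s2
    for d, (black, white, edu, south, count, n) in zip(delta, CELLS):
        w = regression_mean(beta, black, white, edu, south)
        # Random effect: delta ~ normal(w, var=s2), up to an additive constant.
        total += -0.5 * math.log(s2) - (d - w) ** 2 / (2.0 * s2)
        # Binomial log likelihood with p = logistic(delta), binomial coefficient dropped.
        total += count * d - n * math.log1p(math.exp(d))
    return total


start = log_posterior([0.0] * 7, 1.0, [0.0] * 12)  # the initial values in PARMS
print(start)
```

A sampler such as Metropolis only needs differences of this function between the current and proposed values, which is why the dropped constants do not matter.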
Final Results
Exploratory Analysis Results

When we explored the interaction between race and region and presented it in
tabular format, we found that Black only, as a race, heavily approved of Obama's
presidency, followed by Others and Hispanics. White only and DK (Don't Know)
mostly disapproved. However, the percentages of approval and disapproval vary by
region, as illustrated in Table 2.
Table 2: Interaction of Race and Region
                                   Obama Approval
Race         Region       Count   %Approve   %Disapprove
BlackOnly    Midwest        338      74.85         25.15
             Northeast      215      73.95         26.05
             South        1,581      76.98         23.02
             West           181      63.54         36.46
Don't Know   Midwest         73      27.40         72.60
             Northeast       77      31.17         68.83
             South          119      34.45         65.55
             West           100      24.00         76.00
Hispanic     Midwest        225      45.78         54.22
             Northeast      254      48.43         51.57
             South          550      36.55         63.45
             West           638      42.01         57.99
Others       Midwest        400      45.25         54.75
             Northeast      308      42.86         57.14
             South          793      36.19         63.81
             West           826      42.49         57.51
White Only   Midwest      7,618      32.23         67.77
             Northeast    5,270      36.07         63.93
             South       11,225      23.37         76.63
             West         6,692      30.65         69.35
All                      37,483      33.43         66.57
From the above table, it appears that Black only respondents from the South
approved of Obama's presidency to a greater extent than any other race from any
region. White only respondents and Don't Know respondents (coded as DK in the dataset)
disapproved of Obama's presidency the most across all regions.
Similarly, when we explored the interaction between race and sex in Table 3, we
found that for race DK there is a large gap in approval between females and males
(35.24% versus 20.42%). For all other races, approval is similar across sex.
Table 3: Interaction of Race and Sex
                                Obama Approval
Race         Sex       Count   %Approve   %Disapprove
BlackOnly    Female    1,562      75.61         24.39
             Male        753      74.77         25.23
Don't Know   Female      227      35.24         64.76
             Male        142      20.42         79.58
Hispanic     Female    1,045      42.78         57.22
             Male        622      39.87         60.13
Others       Female    1,403      41.84         58.16
             Male        924      39.39         60.61
WhiteOnly    Female   19,955      29.30         70.70
             Male     10,850      29.35         70.65
All                   37,483      33.43         66.57
When we explored interactions between race and education in Table 4, we found
a broadly similar gap between college and no college among respondents of all races
in approval of Obama's presidency. College-educated respondents approved of
Obama's presidency at higher rates than non-college-educated respondents.
Black only respondents, as a whole, approved of Obama's presidency with or without college.
White only respondents with no college largely disapproved of Obama's presidency (76.52%).
Table 4: Interaction of Race and Education
                                   Obama Approval
Race         Education    Count   %Approve   %Disapprove
BlackOnly    College      1,721      76.87         23.13
             Nocollege      594      70.88         29.12
Don't Know   College        295      31.53         68.47
             Nocollege       74      21.62         78.38
Hispanic     College      1,153      42.15         57.85
             Nocollege      514      40.66         59.34
Others       College      1,816      42.51         57.49
             Nocollege      511      35.03         64.97
White Only   College     23,034      31.28         68.72
             Nocollege    7,771      23.48         76.52
All                      37,483      33.43         66.57
It therefore appears that there are some interactions between black and south region
and between white and no college education. To verify these findings,
we drew a decision tree.
The screenshots in Diagram 1 show the results from the SAS decision
tree splits for the demographic variables. As expected, race is the major predictor of
Obama approval ratings compared to all other factors. Of the 1,900 survey
participants, over 75% of blacks approved of Obama's presidency, regardless of
geographic region. On the other hand, the other races as a whole had an approval
rating of only 30%. Therefore Black only, as a race, stands out in this diagram.
Diagram 1: Black Only as a factor
Diagram 2: South as a factor
In Diagram 2, South as a region stands out among all regions. The
South had the lowest approval (24.75%) of Obama. After Black as a race in Diagram 1,
we can see that White only stands out from the other races in Diagram 2. White only has
the maximum disapproval (67.28%) of Obama.
Table 5: Important Factors
Variable     Importance
Race               1.00
Region             0.39
Education          0.06
Sex                0.05
Table 5 above shows the Variable Importance metric, which is a relative metric
with a value of one for the most important variable. Less important variables have
metrics less than one. We see that the most important variable is race, then region,
education and sex, in descending order.
Factors Selection
From the above tables and diagrams, it appears that among the races black
respondents stand out clearly for their maximum approval of Obama's presidency. Of
the remaining four race categories, white respondents differentiate themselves by their
low approval of Obama's presidency. Among regions, the South stands out clearly.
Although no factors other than these three (i.e., black, white, and south region)
appear significant in the decision tree, white respondents with no college education had
among the lowest approval of Obama's presidency, as shown in Table 4. We decided not to
consider the race DK (Don't Know). Therefore, we chose black and white from the
race category, south among regions, and no college among education levels. The interactions of
black with south region and white with no college need to be studied. We decided to
keep a minimum number of factors and interactions to keep our model simple
and easy to interpret.
Building Classification Model Results
Table 6: Output Parameters of Model
Parameters
Block  Parameter  Sampling Method  Initial Value  Prior Distribution
    1  s2         Conjugate               1.0000  igamma(0.01,s=0.01)
    2  beta0      N-Metropolis                 0  general(0)
       beta1                                   0  general(0)
       beta2                                   0  general(0)
       beta3                                   0  general(0)
       beta4                                   0  general(0)
       beta5                                   0  general(0)
       beta6                                   0  general(0)
The Parameters table lists the sampling information, the name of the model
parameters, sampling algorithms used, initial values, and their prior distributions. The
conjugate sampling algorithm is used to draw the posterior samples of s2 and random
walk Metropolis for the regression parameters.
Table 7: Random Effects Parameters
Random Effect Parameters
Parameter  Sampling Method  Subject  Number of Subjects  Subject Values              Prior Distribution
Delta      N-Metropolis     groups                   12  1 2 3 4 5 6 7 8 9 10 11 12  normal(w, var=s2)
The Random Effect Parameters in the table above list the name of the random
effect, the subject variable, and the number of distinct levels in the subject variable.
The total number of random-effects parameters in this model is 12.
Table 8: Posterior Summaries of Parameter Estimates
                                                                     Standard
Parameter  Label                                        N     Mean  Deviation   95% HPD Interval
beta0      Intercept                                 3000  -0.1893     0.1123   -0.4144    0.0324
beta1      Race = Black Only                         3000   1.1767     0.1631    0.8390    1.4947
beta2      Race = White                              3000  -0.4817     0.1389   -0.7502   -0.2074
beta3      Education = No College                    3000  -0.2496     0.1223   -0.4867  -0.00440
beta4      Region = South                            3000  -0.3758     0.1097   -0.5792   -0.1357
beta5      Race = Black Only * Region = South        3000   0.6631     0.2108    0.2834    1.1427
beta6      Race = White * Education = No College     3000  -0.1484     0.1900   -0.4896    0.2541
s2         Variance                                  3000   0.0213     0.0245   0.00143    0.0597
The Posterior Summaries table reports the posterior mean, standard deviation,
and 95% interval of each parameter, each computed from 3000 posterior samples. We
use the posterior mean as the parameter estimate in our logistic model. The posterior
standard deviations are interpreted as standard errors for the parameter estimates.
HPD stands for Highest Posterior Density.
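
As a rough illustration of how these estimates translate into approval probabilities, the Python sketch below plugs the posterior means from Table 8 into the inverse logit of the regression mean beta0 + beta1*black + beta2*white + beta3*edu + beta4*region + beta5*black*region + beta6*white*edu. It assumes the random effect sits at its mean and ignores posterior uncertainty, so the outputs are point illustrations only; the function name `approval_probability` is hypothetical.

```python
import math

# Posterior means copied from Table 8 (log-odds scale).
POSTERIOR_MEANS = {
    "beta0": -0.1893,  # Intercept
    "beta1": 1.1767,   # Race = Black Only
    "beta2": -0.4817,  # Race = White
    "beta3": -0.2496,  # Education = No College
    "beta4": -0.3758,  # Region = South
    "beta5": 0.6631,   # Black Only * South
    "beta6": -0.1484,  # White * No College
}


def approval_probability(black, white, no_college, south, beta=POSTERIOR_MEANS):
    """Inverse logit of the regression mean for 0/1 indicator inputs."""
    linear_predictor = (
        beta["beta0"]
        + beta["beta1"] * black
        + beta["beta2"] * white
        + beta["beta3"] * no_college
        + beta["beta4"] * south
        + beta["beta5"] * black * south
        + beta["beta6"] * white * no_college
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Two illustrative profiles; all indicators are 0 or 1.
p_black_south = approval_probability(black=1, white=0, no_college=0, south=1)
p_white_no_college_south = approval_probability(
    black=0, white=1, no_college=1, south=1
)

print(f"Black, South, college: {p_black_south:.2f}")
print(f"White, South, no college: {p_white_no_college_south:.2f}")
```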
The mean regression coefficient estimate of 1.1767, with a standard deviation of
0.1631, for beta1 (i.e., Race = Black Only) is interpreted as follows: when Race =
Black Only, the odds of approving of Obama are exp(1.1767) = 3.24 times the odds when
Race is otherwise, provided all other predictors remain the same.
We can conclude that approval of Obama is high across the black population.
Similarly, the clearly positive value of beta5 (the interaction between Race = Black
Only and Region = South), with a mean regression coefficient estimate of 0.6631 and a
standard deviation of 0.2108, shows that black respondents in the South approved of
Obama's presidency to a greater extent than black respondents in other regions. In
contrast, the beta2 value of -0.4817 suggests that the odds of approving of Obama
shrink by a multiplicative factor of exp(-0.4817) = 0.62 when Race = White versus
other races, provided all other predictors remain the same. For white respondents
without a college degree, the beta6 value of -0.1484 suggests that the odds of
approval shrink further by a multiplicative factor of exp(-0.1484) = 0.86, provided
all other predictors remain the same.
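
The odds-ratio arithmetic quoted above can be reproduced with a short Python sketch. The coefficients are the posterior means from Table 8; the dictionary and function names are illustrative only.

```python
import math

# Posterior means copied from Table 8 (log-odds scale).
posterior_means = {
    "beta1": 1.1767,   # Race = Black Only
    "beta2": -0.4817,  # Race = White
    "beta5": 0.6631,   # Black Only * South
    "beta6": -0.1484,  # White * No College
}


def odds_ratio(coefficient):
    """Convert a logistic-regression coefficient into an odds ratio."""
    return math.exp(coefficient)


# Round to two decimals, matching the precision quoted in the text.
odds_ratios = {
    name: round(odds_ratio(value), 2) for name, value in posterior_means.items()
}

for name, value in odds_ratios.items():
    print(f"{name}: odds ratio = {value:.2f}")
```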
The 95% HPD interval is interpreted much like a conventional confidence interval.
If the interval does not contain zero, the corresponding parameter is considered
significant at the 5% level. The 95% HPD intervals for beta0 and beta6 contain zero,
which suggests that these two parameters are not significant at the 5% level. All
other parameters (beta1, beta2, beta3, beta4, beta5, and s2) are significantly
different from zero at the 5% level.
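
The zero-inclusion rule described above can be checked mechanically. The following Python sketch applies it to the 95% HPD limits copied from Table 8; the helper name `excludes_zero` is hypothetical.

```python
# 95% HPD limits (lower, upper) copied from Table 8.
hpd_intervals = {
    "beta0": (-0.4144, 0.0324),
    "beta1": (0.8390, 1.4947),
    "beta2": (-0.7502, -0.2074),
    "beta3": (-0.4867, -0.00440),
    "beta4": (-0.5792, -0.1357),
    "beta5": (0.2834, 1.1427),
    "beta6": (-0.4896, 0.2541),
    "s2": (0.00143, 0.0597),
}


def excludes_zero(interval):
    """Return True when both limits lie on the same side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Parameters whose interval covers zero are the ones called not significant.
contains_zero = sorted(
    name for name, interval in hpd_intervals.items() if not excludes_zero(interval)
)
print(contains_zero)
```

Note that s2 is a variance and is therefore positive by construction, so its interval excluding zero carries less information than it does for the regression coefficients.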
The diagnostics plot indicates good mixing for the main-effects parameter beta1
(i.e., Race = Black Only). The autocorrelation curve dies down smoothly in Diagram 3,
which indicates that the chain for the beta1 parameter has converged. The posterior
density has a single mode and is bell-shaped, which supports applying normal-theory
inferences. Similarly, the diagnostics plots for the other parameters provide
evidence that these parameters have converged too.
Diagram 3: Diagnostics Plot for Beta1
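
The autocorrelation panel of such a diagnostics plot shows the sample autocorrelation of the chain at increasing lags. The Python sketch below computes that quantity for a synthetic chain. The AR(1) chain is an assumption used purely for illustration and is not the posterior sample of beta1.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of a chain at a non-negative lag."""
    n = len(chain)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(chain)")
    if lag == 0:
        return 1.0
    mean = sum(chain) / n
    variance_sum = sum((x - mean) ** 2 for x in chain)
    if variance_sum == 0:
        raise ValueError("chain is constant; autocorrelation is undefined")
    covariance_sum = sum(
        (chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


# Synthetic AR(1) chain with coefficient 0.5, standing in for MCMC output.
rng = random.Random(0)
chain = [0.0]
for _ in range(4999):
    chain.append(0.5 * chain[-1] + rng.gauss(0.0, 1.0))

for lag in (1, 5, 20):
    print(f"lag {lag}: {autocorrelation(chain, lag):.3f}")
```

A chain that mixes well shows autocorrelations that fall toward zero within a few lags, which is the pattern the text describes for beta1.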
Summary and Conclusions

Overall, we achieved our research goal and were able to write a SAS program for
Bayesian logistic regression. We used PROC MCMC to carry out the Bayesian analysis
with dynamic prior values, and the coefficients it generates can be assigned weights
for post-stratification. Our primary means of accomplishing this involved identifying
the right procedure in SAS and developing the macros that implement Gelman's
multilevel regression process, which can be further used for post-stratification.
Recommendations
Further Model Development
Though we were able to create a working model, we feel strongly that other
factors that influence respondents' opinions of the Obama presidency should be
explored in further research. Survey participants should have been asked the reasons
for their approval or disapproval, which might be related to foreign policy, handling
of the economy, party identification, and/or social issues.
Other predictors and interactions can be explored to see whether they have an
effect on approval of Obama's presidency. Different priors can also be passed to the
model to check whether the results are sensitive to the choice of prior.
Our vision is that an application could be developed, leveraging the methods
applied in our work, that would be useful for multilevel regression with
poststratification (MRP) using SAS. Although MRP has been implemented in R, we did
not find any example in SAS. We expect this method to be effective for forecasting
elections or for marketing applications at a time when survey data are difficult to
obtain and corporations and governments have scarce survey data.
Appendix
Programming code
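The following code was used to prepare the aggregated data lines passed to PROC MCMC.

/* Step 1: create 0/1 indicators for the selected factors and the response. */
data cstone.Ipsos_August;
   set cstone.Ipsos_August;
   if race = "Black Only" then black=1; else black=0;
   if race = "White Only" then white=1; else white=0;
   if education2 = "No college" then nedu=1; else nedu=0;
   if region = "South" then nregion=1; else nregion=0;
   if SEX = "Male" then nsex=1; else nsex=0;
   if bo_apr2 = "Approve" then y1=1; else y1=0;
   if bo_apr2 = "Disapprove" then y2=1; else y2=0;
run;

/* Step 2: count respondents in every response-by-demographic cell. */
/* The duplicated NOCUM option of the original TABLES statement is dropped. */
proc freq data=cstone.Ipsos_August noprint;
   tables y1*black*white*nsex*nedu*nregion / nopercent nocol out=d;
run;
proc print data=d; run;

/* Step 3: keep the approval counts (y1=1) as COUNT. */
data temp1 (drop=percent);
   set d;
   if y1=1;
run;

/* Step 4: keep the non-approval counts (y1=0) as COUNT2. */
data temp2 (keep=count2);
   set d;
   count2=count;
   if y1=0;
run;

/* Step 5: combine the two sets of counts row by row. There is no BY */
/* statement, so this assumes every demographic cell occurs for both */
/* y1=0 and y1=1 and in the same order in TEMP1 and TEMP2. */
data last (drop = y1 count2);
   merge temp2 temp1;
   n=count+count2;   /* total respondents in the cell */
   ind=_N_;          /* running cell index */
run;
proc print data=last; run;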
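
For readers who do not use SAS, the aggregation performed above can be sketched in Python. The respondent records below are hypothetical and stand in for the Ipsos survey rows; the function name `aggregate` is likewise illustrative. Unlike the positional merge in the SAS code, this sketch keys the counts by demographic cell, so a cell that occurs for only one response value is still handled.

```python
from collections import defaultdict

# Hypothetical respondent records: (black, white, nsex, nedu, nregion, y1).
respondents = [
    (1, 0, 1, 0, 1, 1),
    (1, 0, 1, 0, 1, 1),
    (1, 0, 1, 0, 1, 0),
    (0, 1, 0, 1, 0, 0),
    (0, 1, 0, 1, 0, 1),
]


def aggregate(records):
    """Collapse respondent rows into one row per demographic cell.

    Each output row holds the approval count (count), the cell size (n),
    and a running cell index (ind), mirroring the SAS data set `last`.
    """
    cells = defaultdict(lambda: {"count": 0, "n": 0})
    for *cell, approve in records:
        key = tuple(cell)
        cells[key]["count"] += approve
        cells[key]["n"] += 1
    return [
        {"ind": index, "cell": key, "count": totals["count"], "n": totals["n"]}
        for index, (key, totals) in enumerate(sorted(cells.items()), start=1)
    ]


rows = aggregate(respondents)
for row in rows:
    print(row)
```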
Findings for other factors

When we analyzed the diagnostic plots of beta2, beta3, and beta4, we found a
pattern similar to the one for beta1, except that the posterior density curve falls
in the negative region. This matches the Posterior Summaries table, where the mean
regression coefficient estimates for these parameters are negative. The diagram for
beta2 represents white respondents; the diagrams for beta3 and beta4 represent
education and region, respectively.
Bibliography
Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford.
Carlin, B. P. and Louis, T. A. 2009. Bayesian Methods for Data Analysis. Third Edition.
CRC/Chapman and Hall.
Gelman, A., Carlin, J. B., Stern, H. S., and Dunson, D. B. 2013. Bayesian Data
Analysis. Third Edition. CRC/Chapman and Hall.
Gelman, A. and Hill, J. 2007. Data Analysis Using Regression and
Multilevel/Hierarchical Models. Cambridge.
Ghitza, Y. and Gelman, A. 2013. "Deep Interactions with MRP: Election Turnout and
Voting Patterns Among Small Electoral Subgroups." American Journal of Political
Science.
Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer.
Jackman, S. 2005. "Pooling the Polls Over an Election Campaign." Australian Journal
of Political Science 40: 499-517.
Linzer, D. A. 2013. "Dynamic Bayesian Forecasting of Presidential Elections in the
States." Journal of the American Statistical Association 108: 124-134.
Ntzoufras, I. 2009. Bayesian Modeling Using WinBUGS. Wiley.
Park, D. K., Gelman, A., and Bafumi, J. 2004. "Bayesian Multilevel Estimation with
Poststratification: State Level Estimates from National Polls." Political Analysis 12:
375-385.
Wang, W., Rothschild, D., Goel, S., and Gelman, A. 2014. "Forecasting Elections with
Non-Representative Polls." International Journal of Forecasting. Forthcoming.