Homework 4: Logistic Regression & Overdispersion

Dr. Timothy R. Johnson Spring Semester, 2014


This homework assignment is due no later than 3:00 on Friday, April 18th. Please read the instructions below carefully.

Homework Instructions
This homework assignment is due no later than 3:00 on Friday, April 18th. Late assignments will be accepted only in extreme circumstances and only if arrangements have been made in advance.

Your solutions must be typed and very neatly organized. I will not try to infer your solutions if they are not clearly presented. Equations need not be typeset perfectly, but they should be clear. You may substitute letters for symbols (e.g., b1 for $\beta_1$), and you may write out equations (neatly) by hand if necessary. With your solutions you must include the relevant R output and the R scripts that created it. Include these within the text of your solutions using cut-and-paste, and try to include only the relevant output. Use a monospace font (e.g., Consolas or Courier) for R scripts and output for clarity, but only for R scripts and output.

You may discuss the homework with other students in the course. However, you must still write your own R scripts, produce your own output, and write up your own solutions.

You are welcome to ask me questions concerning the homework. I will be particularly open to helping with any R problems. I want to evaluate your understanding of applied regression, not R, but part of the purpose of the homework assignments is to get you to exercise using R. If you email me with an R question, it may be helpful to include your full R script so that I can replicate your problem. The Statistics Assistance Center (SAC) and Statistical Consulting Center (SCC) are not designed to accommodate this course. Direct all questions to me.

Carriers of Streptococcus pyogenes and Tonsil Size


An observational study classified 1,398 children on two variables: tonsil size (normal, large, and very large), and whether or not they were carriers of Streptococcus pyogenes.[1] The data are summarized in Table 1 and can be found in the data frame Tonsils in the BSDA package. Below you will analyze these data using logistic regression.

1. Report the summary of a logistic regression model for the expected proportion of carriers using tonsil size as an explanatory variable.

2. In an introductory statistics class students often learn how to analyze two-way tables of counts like Table 1 to determine if there is evidence for a statistically significant relationship.[2] This can be done in R as follows.
> library(vcd) # for the assocstats function
> assocstats(as.matrix(Tonsils[,2:3]))
                    X^2 df P(> X^2)
Likelihood Ratio 7.3209  2 0.025721
Pearson          7.8848  2 0.019401
[1] Streptococcus pyogenes is a bacterium that causes infections resulting in streptococcal pharyngitis (i.e., strep throat).

[2] Usually the Pearson test statistic is discussed. The likelihood ratio test is derived using a different approach. Recall that we discussed two kinds of tests: likelihood ratio and Wald. The Pearson test is based on a third approach, sometimes known as a score or Lagrange multiplier test.

Table 1: Number of children classified by tonsil size and carrier status.

                      Size
Carrier    Normal    Large    Very Large
Yes            19       29            24
No            497      560           269

The likelihood ratio test is identical to that obtained using logistic regression if a likelihood ratio test is done for the effect of tonsil size. Use the model from the previous problem and an appropriate null model to replicate the likelihood ratio test statistic and p-value above using the anova function.

3. The contrast function can be used to estimate the log-odds or logit of being a carrier for each tonsil size. Exponentiating these values gives the estimated odds of being a carrier, and applying the function $e^z/(1 + e^z)$ to the log-odds, or the function $z/(1 + z)$ to the odds, will give the estimated probability of being a carrier.[3] Give the estimated log-odds, odds, and probability of being a carrier for each of the three tonsil sizes, and include a confidence interval for each quantity.[4] Be sure to use contrastfix with wald = TRUE to get the proper confidence interval.

4. Use contrast and contrastfix to obtain estimates and confidence intervals for the odds ratios comparing the odds of being a carrier between children with normal and large tonsils, normal and very large tonsils, and large and very large tonsils. Briefly summarize each comparison in a sentence (e.g., "The odds of being a carrier are x times larger for children with very large tonsils than for children with normal-sized tonsils."). Also report a confidence interval for each odds ratio, and a test statistic and p-value for the null hypothesis that the odds ratio is 1.
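As a rough illustration of the model comparison described in problem 2, the sketch below rebuilds the counts from Table 1 by hand instead of loading BSDA. The data frame name tonsils and its columns (size, yes, no) are assumptions made for this example and need not match the columns of the Tonsils data frame.

```r
# Counts copied from Table 1. The column names (size, yes, no) are
# illustrative and need not match BSDA's Tonsils data frame.
tonsils <- data.frame(
  size = factor(c("Normal", "Large", "Very Large"),
                levels = c("Normal", "Large", "Very Large")),
  yes  = c(19, 29, 24),
  no   = c(497, 560, 269)
)

# Grouped-data logistic regression: the response is a two-column matrix of
# (carriers, non-carriers) for each tonsil size.
full_model <- glm(cbind(yes, no) ~ size, family = binomial, data = tonsils)

# Null model: one common carrier probability for all three tonsil sizes.
null_model <- glm(cbind(yes, no) ~ 1, family = binomial, data = tonsils)

# The likelihood ratio statistic is the drop in residual deviance from the
# null model to the full model, referred to a chi-square distribution.
lrt <- anova(null_model, full_model, test = "Chisq")
print(lrt)

lr_stat <- lrt$Deviance[2]        # test statistic
lr_df   <- lrt$Df[2]              # degrees of freedom
lr_pval <- lrt[["Pr(>Chi)"]][2]   # p-value
```

The values in lr_stat and lr_pval can be compared directly with the Likelihood Ratio row of the assocstats output.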

[3] R knows the function $e^z/(1 + e^z)$ as plogis.

[4] Remember that when transforming a parameter estimate, you can obtain the confidence interval for the transformed parameter by applying the same transformation to the end points of the confidence interval.
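The end-point transformation can be sketched with base R alone. This example does not use the course functions contrast and contrastfix. It takes the Wald interval from confint.default for the reference tonsil size only, and the data frame and column names are assumptions made for illustration.

```r
# Counts copied from Table 1; names are illustrative.
tonsils <- data.frame(
  size = factor(c("Normal", "Large", "Very Large"),
                levels = c("Normal", "Large", "Very Large")),
  yes  = c(19, 29, 24),
  no   = c(497, 560, 269)
)
fit <- glm(cbind(yes, no) ~ size, family = binomial, data = tonsils)

# With "Normal" as the reference level, the intercept is the log-odds of
# being a carrier for children with normal tonsils.
logit_ci <- c(estimate = unname(coef(fit)["(Intercept)"]),
              confint.default(fit)["(Intercept)", ])  # Wald interval

# Apply the same monotone transformation to the estimate and both end points.
odds_ci <- exp(logit_ci)     # log-odds -> odds
prob_ci <- plogis(logit_ci)  # log-odds -> probability, e^z / (1 + e^z)

print(rbind(logit = logit_ci, odds = odds_ci, probability = prob_ci))
```

Because the model has one parameter per tonsil size, the estimated probability for the normal group should match the observed proportion 19/516, which is a quick sanity check on the transformation.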

Moth Coloration and Natural Selection


Bishop (1972) conducted a field experiment to investigate the role of moth coloration in predation.[5] Dead preserved moths were placed on random trees at seven locations of varying distance from Liverpool, England. At each location two different color morphs were placed: light and dark. The goal of the study was to examine the susceptibility to predation as a function of morph and distance from Liverpool. Liverpool is an industrial city that produces significant air pollution that darkens trees. It was suspected that the risk of predation would be related to distance from Liverpool but also to color, since darker moths would be less conspicuous than light moths near Liverpool, and vice versa farther from Liverpool where the trees are less darkened. An R script that can be downloaded from the following link plots the observed and predicted proportions for a normal/linear model for the proportion of removed moths with distance and morph as explanatory variables.
https://dl.dropboxusercontent.com/u/10884844/Homework/moths.R
[5] Bishop, J. A. (1972). An experimental study of the cline of industrial melanism of Biston betularia [Lepidoptera] between urban Liverpool and rural North Wales. Journal of Animal Ecology, 41, 209-243.

Modify this script as necessary to answer the following questions.

1. Estimate a logistic regression model using the same linear predictor as the normal/linear model in the script. Report the parameter estimates and standard errors from summary, and plot the predicted values with the observed proportions.

2. Use contrast and contrastfix to estimate the two odds ratios that describe the effect of distance on the odds of removal for each morph. Provide confidence intervals and use these to briefly describe the relationship between distance and odds of removal for each morph.

3. Use contrast and contrastfix to estimate the odds ratio that describes the effect of morph at a distance of 0 km, and also the odds ratio that describes the effect of morph at a distance of 50 km. Provide confidence intervals and use these to briefly describe the effect of morph on the odds of removal at the two distances.

4. Is there any evidence of overdispersion? Why or why not?
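The course functions contrast and contrastfix are not reproduced here. As a hedged sketch of the kind of calculation such odds ratios rest on, the example below computes a Wald interval for a linear combination of coefficients with base R, plus a simple Pearson-based dispersion ratio. The data are simulated stand-ins, not Bishop's data, and every name (moths_sim, placed, removed, wald_or) is hypothetical.

```r
set.seed(1)

# Simulated stand-in data (NOT Bishop's data): 7 locations x 2 morphs.
moths_sim <- expand.grid(
  distance = seq(0, 50, length.out = 7),
  morph    = factor(c("light", "dark"), levels = c("light", "dark"))
)
moths_sim$placed <- 60
eta <- with(moths_sim,
            ifelse(morph == "dark", -1.1 + 0.02 * distance,
                                    -0.7 - 0.01 * distance))
moths_sim$removed <- rbinom(nrow(moths_sim), moths_sim$placed, plogis(eta))

# Logistic regression with a distance-by-morph interaction. Coefficient order:
# (Intercept), distance, morphdark, distance:morphdark.
fit <- glm(cbind(removed, placed - removed) ~ distance * morph,
           family = binomial, data = moths_sim)

# Wald estimate and interval for exp(c' beta), where cvec picks out a
# linear combination of the coefficients.
wald_or <- function(fit, cvec, level = 0.95) {
  est <- sum(cvec * coef(fit))
  se  <- sqrt(drop(t(cvec) %*% vcov(fit) %*% cvec))
  z   <- qnorm(1 - (1 - level) / 2)
  exp(c(estimate = est, lower = est - z * se, upper = est + z * se))
}

# Odds ratio for a 10 km increase in distance, separately for each morph.
or_light <- wald_or(fit, c(0, 10, 0, 0))
or_dark  <- wald_or(fit, c(0, 10, 0, 10))

# Odds ratio for dark versus light at 0 km and at 50 km.
or_morph_0  <- wald_or(fit, c(0, 0, 1, 0))
or_morph_50 <- wald_or(fit, c(0, 0, 1, 50))

# Rough overdispersion check: Pearson chi-square over residual df.
# Values well above 1 would suggest overdispersion.
dispersion_ratio <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
```

The 10 km step is an arbitrary choice for readability. Any other increment only changes the multiplier in cvec.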

Snow Geese Overdispersion


Another potential model for the snow geese data is a Poisson regression model but with an identity link function such that $E(Y_i) = \eta_i$, where $\eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$. Identity link functions do not usually work very effectively for counts, but the relationship between the observer and photo counts does appear to be nearly linear, which may make sense in the context of the study. This model is used in an R script that can be downloaded at the following link.
https://dl.dropboxusercontent.com/u/10884844/Homework/snowgeese-overdispersion.R

The script also plots the data with the predicted values as well as the studentized residuals against the predicted values. The residuals show clear evidence of significant overdispersion, suggesting that the variance structure implied by the Poisson distribution is not sufficient. Below you will investigate a couple of other variance structures.

1. One approach to dealing with overdispersion is to use quasi-likelihood, which assumes the variance structure $\mathrm{Var}(Y_i) = \phi E(Y_i)$, where $\phi > 0$ is an unknown parameter rather than being fixed at $\phi = 1$ as in the case of the Poisson distribution.[6] Report the parameter estimates and standard errors based on a quasi-likelihood Poisson model.[7] Discuss briefly how these compare to those from the original Poisson model. Also provide a plot of the studentized residuals against the predicted values and discuss briefly whether the quasi-likelihood approach has sufficiently dealt with the overdispersion.

2. A negative binomial distribution implies the variance structure $\mathrm{Var}(Y_i) = E(Y_i) + \alpha E(Y_i)^2$, where $\alpha \ge 0$. This variance structure assumes that the variance increases quadratically rather than linearly with the expected value. Repeat what you did in the previous problem but for a negative binomial model.[8]
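To make the quasi-likelihood idea concrete, here is a minimal sketch on simulated counts. It does not use the snow geese data. The names sim, x, and y are hypothetical, and the explicit starting values are a precaution chosen for this example so that the identity link begins from positive fitted means.

```r
set.seed(42)

# Simulated overdispersed counts with a linear mean (NOT the snow geese data).
n   <- 200
x   <- runif(n, 10, 100)
mu  <- 5 + 0.9 * x
sim <- data.frame(x = x, y = rnbinom(n, mu = mu, size = 5))

# Starting values that keep every fitted mean positive under the identity link.
start_vals <- c(1, 1)

fit_pois  <- glm(y ~ x, family = poisson(link = "identity"),
                 data = sim, start = start_vals)
fit_quasi <- glm(y ~ x, family = quasipoisson(link = "identity"),
                 data = sim, start = start_vals)

# The quasi-Poisson fit has the same coefficient estimates as the Poisson fit;
# only the standard errors change, inflated by sqrt(phi_hat).
phi_hat  <- summary(fit_quasi)$dispersion
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
print(cbind(estimate = coef(fit_quasi), se_pois, se_quasi))

# phi_hat is (up to convergence tolerance) the Pearson chi-square statistic
# divided by the residual degrees of freedom.
phi_by_hand <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)
```

A residual plot in the spirit of the assignment could then be drawn with plot(fitted(fit_quasi), rstudent(fit_quasi)).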

[6] The term quasi-likelihood comes from the fact that by not fixing $\phi = 1$, estimation is no longer based on a known distribution with a particular likelihood function, but in many ways inferences are analogous to those based on a likelihood function.

[7] Recall that this is done by specifying family = quasipoisson, but do not forget to also include the identity link function.

[8] There is no family argument for the glm.nb function, so to specify a link function you can simply use the optional argument link = identity.
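A short sketch of the link = identity usage with glm.nb from the MASS package, again on simulated counts, not the snow geese data. The names sim, x, and y are hypothetical, and the starting values are an assumption made here to keep the initial fitted means positive. Note that glm.nb reports a parameter theta, with implied variance mu + mu^2 / theta, so the coefficient on the squared mean is 1 / theta.

```r
library(MASS)  # provides glm.nb

set.seed(42)

# Simulated overdispersed counts with a linear mean (NOT the snow geese data).
n   <- 200
x   <- runif(n, 10, 100)
mu  <- 5 + 0.9 * x
sim <- data.frame(x = x, y = rnbinom(n, mu = mu, size = 5))

# Negative binomial regression with an identity link. The start values are a
# precaution so that the first iteration has positive fitted means.
fit_nb <- glm.nb(y ~ x, data = sim, link = identity, start = c(1, 1))

print(summary(fit_nb)$coefficients)

# Estimated theta and the variance structure it implies at the fitted means.
theta_hat <- fit_nb$theta
mu_hat    <- fitted(fit_nb)
var_hat   <- mu_hat + mu_hat^2 / theta_hat  # quadratic in the mean
```

Plotting rstudent(fit_nb) against fitted(fit_nb) gives the residual plot for comparison with the quasi-likelihood fit.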
