
Logistic regression

Logistic regression is used to obtain odds ratios in the presence of more than one explanatory
variable. The procedure is quite similar to multiple linear regression, with the exception that the
response variable is binomial. The result is the impact of each variable on the odds ratio of the
observed event of interest. The main advantage is that it avoids confounding effects by analyzing the
association of all variables together. In this article, we explain the logistic regression procedure
using examples to make it as simple as possible. After defining the technique, we highlight the basic
interpretation of the results and then discuss some special issues.
Introduction
In statistics, logistic regression, or logit regression, or logit model is a regression model where
the dependent variable (DV) is categorical. This article covers the case of binary dependent
variables, that is, where the variable can take only two values, such as pass/fail, win/lose, alive/dead or
healthy/sick. Cases with more than two categories are referred to as multinomial logistic
regression, or, if the multiple categories are ordered, as ordinal logistic regression.
Logistic regression measures the relationship between the categorical dependent variable and one
or more independent variables by estimating probabilities using a logistic function, which is the
cumulative logistic distribution. Thus, it treats the same set of problems as probit
regression using similar techniques, with the latter using a cumulative normal distribution curve
instead. Equivalently, in the latent variable interpretations of these two methods, the first
assumes a standard logistic distribution of errors and the second a standard normal distribution of
errors.
Fields and example applications
Logistic regression is used in various fields, including machine learning, most medical fields,
and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is
widely used to predict mortality in injured patients, was originally developed by Boyd et al.
using logistic regression. Many other medical scales used to assess severity of a patient have
been developed using logistic regression. Logistic regression may be used to predict whether a
patient has a given disease (e.g. diabetes; coronary heart disease), based on observed
characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).
Another example might be to predict whether an American voter will vote Democratic or
Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc.
The technique can also be used in engineering, especially for predicting the probability of failure
of a given process, system or product. It is also used in marketing applications such as prediction
of a customer's propensity to purchase a product or halt a subscription, etc. In economics it can
be used to predict the likelihood of a person's choosing to be in the labor force, and a business
application would be to predict the likelihood of a homeowner defaulting on
a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are
used in natural language processing.
Example: Probability of passing an exam versus hours of study
The reason for using logistic regression for this problem is that the values of the dependent
variable, pass and fail, while represented by "1" and "0", are not cardinal numbers. If the problem
were changed so that pass/fail was replaced with a grade from 0 to 100 (cardinal numbers), then
simple regression analysis could be used.

A group of 20 students spend between 0 and 6 hours studying for an exam. How does the
number of hours spent studying affect the probability that the student will pass the exam?
The table shows the number of hours each student spent studying, and whether they passed (1) or
failed (0).
Hours  0.50  0.75  1.00  1.25  1.50  1.75  1.75  2.00  2.25  2.50  2.75  3.00  3.25  3.50  4.00  4.25  4.50  4.75  5.00  5.50
Pass   0     0     0     0     0     0     1     0     1     0     1     0     1     0     1     1     1     1     1     1

The graph shows the probability of passing the exam versus the number of hours studying, with
the logistic regression curve fitted to the data.
The logistic regression analysis gives the following output.

            Coefficient   Std. Error   z-value   P-value (Wald)
Intercept    -4.0777       1.7610      -2.316       0.0206
Hours         1.5046       0.6287       2.393       0.0167
The output indicates that hours studying is significantly associated with the probability of
passing the exam (p=0.0167, Wald test). The output also provides the coefficients for Intercept =
-4.0777 and Hours = 1.5046. These coefficients are entered in the logistic regression equation to
estimate the probability of passing the exam:
Probability of passing exam = 1/(1 + exp(-(-4.0777 + 1.5046 * Hours)))
For example, for a student who studies 2 hours, entering the value Hours = 2 in the equation
gives the estimated probability of passing the exam of p = 0.26:
Probability of passing exam = 1/(1 + exp(-(-4.0777 + 1.5046 * 2))) = 0.26.
Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is
p = 0.87:
Probability of passing exam = 1/(1 + exp(-(-4.0777 + 1.5046 * 4))) = 0.87.

This table shows the probability of passing the exam for several values of hours studying.

Hours of study    Probability of passing exam
1                 0.07
2                 0.26
3                 0.61
4                 0.87
5                 0.97
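The table values follow directly from the fitted equation; a minimal Python sketch (the function name is illustrative):

import math

def pass_probability(hours):
    # Inverse logit of the fitted linear predictor -4.0777 + 1.5046 * hours
    return 1.0 / (1.0 + math.exp(-(-4.0777 + 1.5046 * hours)))

for h in range(1, 6):
    print(h, round(pass_probability(h), 2))   # 0.07, 0.26, 0.61, 0.87, 0.97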

The output from the logistic regression analysis gives a p-value of p = 0.0167, which is based on
the Wald z-score. Rather than the Wald method, the recommended method to calculate the
p-value for logistic regression is the likelihood ratio test (LRT), which for these data gives
p = 0.0006.
Logistic regression can be binomial, ordinal or multinomial. Binomial or binary logistic
regression deals with situations in which the observed outcome for a dependent variable can have
only two possible types (for example, "dead" vs. "alive" or "win" vs. "loss"). Multinomial
logistic regression deals with situations where the outcome can have three or more possible types
(e.g., "disease A" vs. "disease B" vs. "disease C") that are not ordered. Ordinal logistic
regression deals with dependent variables that are ordered. In binary logistic regression, the
outcome is usually coded as "0" or "1", as this leads to the most straightforward interpretation. If
a particular observed outcome for the dependent variable is the noteworthy possible outcome
(referred to as a "success" or a "case") it is usually coded as "1" and the contrary outcome
(referred to as a "failure" or a "noncase") as "0". Logistic regression is used to predict the odds of
being a case based on the values of the independent variables (predictors). The odds are defined
as the probability that a particular outcome is a case divided by the probability that it is a
noncase.
Like other forms of regression analysis, logistic regression makes use of one or more predictor
variables that may be either continuous or categorical. Unlike ordinary linear regression,
however, logistic regression is used for predicting binary dependent variables (treating the
dependent variable as the outcome of a Bernoulli trial) rather than a continuous outcome. Given
this difference, the assumptions of linear regression are violated. In particular, the residuals
cannot be normally distributed. In addition, linear regression may make nonsensical predictions
for a binary dependent variable. What is needed is a way to convert a binary variable into a
continuous one that can take on any real value (negative or positive). To do that logistic
regression first takes the odds of the event happening for different levels of each independent
variable, then takes the ratio of those odds (which is continuous but cannot be negative) and then
takes the logarithm of that ratio. This is referred to as the logit, or log-odds, and it creates a
continuous criterion as a transformed version of the dependent variable.

Thus the logit transformation is referred to as the link function in logistic regression: although
the dependent variable in logistic regression is binomial, the logit is the continuous criterion
upon which linear regression is conducted.
The logit of success is then fitted to the predictors using linear regression analysis. The predicted
value of the logit is converted back into predicted odds via the inverse of the natural logarithm,
namely the exponential function. Thus, although the observed dependent variable in logistic
regression is a zero-or-one variable, the logistic regression estimates the odds, as a continuous
variable, that the dependent variable is a success (a case). In some applications the odds are all
that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent
variable is or is not a case; this categorical prediction can be based on the computed odds of a
success, with predicted odds above some chosen cutoff value being translated into a prediction of
a success.
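As a sketch of that last step (the function name and cutoff are illustrative assumptions), turning a predicted probability into odds and then into a yes/no prediction:

def predict_case(probability, odds_cutoff=1.0):
    # Odds of success: P(case) / P(noncase); a probability of 0.5 gives odds of 1
    odds = probability / (1.0 - probability)
    return 1 if odds > odds_cutoff else 0

print(predict_case(0.26))   # 0: below the cutoff, predicted noncase
print(predict_case(0.87))   # 1: above the cutoff, predicted case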

Time series
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time
series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of
discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing
value of the Dow Jones Industrial Average.
Time series are very frequently plotted via line charts. Time series are used in statistics, signal
processing, pattern recognition, econometrics, mathematical finance, weather forecasting, intelligent transport
and trajectory forecasting,[1] earthquake prediction, electroencephalography, control
engineering, astronomy, communications engineering, and largely in any domain of
applied science and engineering which involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics
and other characteristics of the data. Time series forecasting is the use of a model to predict future values
based on previously observed values. While regression analysis is often employed in such a way as to test
theories that the current values of one or more independent time series affect the current value of another time
series, this type of analysis of time series is not called "time series analysis", which focuses on comparing
values of a single time series or multiple dependent time series at different points in time. [2]
Time series data have a natural temporal ordering. This makes time series analysis distinct from cross-sectional
studies, in which there is no natural ordering of the observations (e.g. explaining people's wages by reference
to their respective education levels, where the individuals' data could be entered in any order). Time series
analysis is also distinct from spatial data analysis where the observations typically relate to geographical
locations (e.g. accounting for house prices by the location as well as the intrinsic characteristics of the houses).
A stochastic model for a time series will generally reflect the fact that observations close together in time will
be more closely related than observations further apart. In addition, time series models will often make use of
the natural one-way ordering of time so that values for a given period will be expressed as deriving in some
way from past values, rather than from future values.
Time series analysis can be applied to real-valued, continuous data, discrete numeric data, or discrete symbolic
data.

Methods for time series analysis


Methods for time series analysis may be divided into two classes: frequency-domain methods and
time-domain methods. The former include spectral analysis and wavelet analysis; the latter
include autocorrelation and cross-correlation analysis. In the time domain, correlation analyses can be
made in a filter-like manner using scaled correlation, thereby mitigating the need to operate in the
frequency domain.

Additionally, time series analysis techniques may be divided into parametric and non-parametric methods.
The parametric approaches assume that the underlying stationary stochastic process has a certain structure
which can be described using a small number of parameters (for example, using an autoregressive or moving
average model). In these approaches, the task is to estimate the parameters of the model that describes the
stochastic process. By contrast, non-parametric approaches explicitly estimate the covariance or
the spectrum of the process without assuming that the process has any particular structure.
Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate.

Time Series and Panel Data


A time series is one type of panel data. Panel data is the general class, a multidimensional data set, whereas a
time series data set is a one-dimensional panel (as is a cross-sectional dataset). A data set may exhibit
characteristics of both panel data and time series data. One way to tell is to ask what makes one data record
unique from the other records. If the answer is the time data field, then this is a time series data set candidate.
If determining a unique record requires a time data field and an additional identifier which is unrelated to time
(student ID, stock symbol, country code), then it is a panel data candidate. If the differentiation lies on the
non-time identifier, then the data set is a cross-sectional data set candidate.

Models
Models for time series data can have many forms and represent different stochastic processes. When modeling
variations in the level of a process, three broad classes of practical importance are the autoregressive (AR)
models, the integrated (I) models, and the moving average (MA) models. These three classes depend linearly
on previous data points. [27] Combinations of these ideas produce autoregressive moving average (ARMA)
and autoregressive integrated moving average (ARIMA) models. The autoregressive fractionally integrated
moving average (ARFIMA) model generalizes the former three. Extensions of these classes to deal with
vector-valued data are available under the heading of multivariate time-series models and sometimes the
preceding acronyms are extended by including an initial "V" for "vector", as in VAR for vector autoregression.
An additional set of extensions of these models is available for use where the observed time-series is driven by
some "forcing" time-series (which may not have a causal effect on the observed series): the distinction from
the multivariate case is that the forcing series may be deterministic or under the experimenter's control. For
these models, the acronyms are extended with a final "X" for "exogenous".
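As a concrete illustration, here is a minimal sketch (assuming Python with numpy and statsmodels) that simulates an AR(1) series and fits an ARIMA model to it:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate an AR(1) process: y_t = 0.7 * y_{t-1} + white noise
rng = np.random.default_rng(seed=0)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.7 * y[t - 1] + rng.normal()

fit = ARIMA(y, order=(1, 0, 0)).fit()   # (p, d, q): one AR term, no differencing, no MA term
print(fit.params)                       # estimated AR coefficient should be near 0.7
print(fit.forecast(steps=5))            # forecasts for the next 5 periods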
Non-linear dependence of the level of a series on previous data points is of interest, partly because of the
possibility of producing a chaotic time series. However, more importantly, empirical investigations can
indicate the advantage of using predictions derived from non-linear models, over those from linear models, as
for example in nonlinear autoregressive exogenous models. Further references on nonlinear time series
analysis include Kantz and Schreiber[28] and Abarbanel.[29]
Among other types of non-linear time series models, there are models to represent the changes of variance over
time (heteroskedasticity). These models represent autoregressive conditional heteroskedasticity (ARCH) and
the collection comprises a wide variety of representations (GARCH, TARCH, EGARCH, FIGARCH,
CGARCH, etc.). Here changes in variability are related to, or predicted by, recent past values of the observed
series. This is in contrast to other possible representations of locally varying variability, where the variability
might be modelled as being driven by a separate time-varying process, as in a doubly stochastic model.
In recent work on model-free analyses, wavelet transform based methods (for example locally stationary
wavelets and wavelet decomposed neural networks) have gained favor. Multiscale (often referred to as
multiresolution) techniques decompose a given time series, attempting to illustrate time dependence at multiple
scales. See also Markov switching multifractal (MSMF) techniques for modeling volatility evolution.
A Hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed
to be a Markov process with unobserved (hidden) states. An HMM can be considered as the simplest
dynamic Bayesian network. HMMs are widely used in speech recognition, for translating a time series
of spoken words into text.
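A minimal sketch of fitting an HMM to a univariate series, assuming the third-party hmmlearn package (the toy data stands in for real observations):

import numpy as np
from hmmlearn import hmm

# Toy observations with shape (n_samples, n_features)
observations = np.random.default_rng(seed=1).normal(size=(100, 1))

model = hmm.GaussianHMM(n_components=2, n_iter=100)  # two hidden states
model.fit(observations)
hidden_states = model.predict(observations)  # most likely state sequence (Viterbi)
print(hidden_states[:10])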

Applications
Time series analysis has vast application and is of great importance in the field of Business and
Economics, as well as in the decision making based on them. By calculating the secular trend we can
learn whether unit sales or revenues of a particular business organization are increasing, decreasing or
remaining constant over time. Cyclical variation allows us to track the rise and fall of a time
series over a longer period of time; it helps us recognize the different stages the organization is
going through over time, such as prosperity, depression, recession and recovery. Seasonal
variation helps us identify in which particular period of time within a year the business is
making good sales or declining sales. Using the least squares method we can derive a trend equation
which tells us the rate at which the business is increasing or decreasing its sales over
time. Calculating the growth rate tells us at what percentage the business sales are
increasing or decreasing over time, and the acceleration rate tells us whether the growth rate
will increase or decrease at an increasing or decreasing rate in the successive years.
This section covers the Classical Time Series Model, different methods of measuring trend, the
application of the Least Squares Method, and the procedures for calculating the growth rate and
acceleration rate, together with their effects on the business.
Time Series: The most popular method of business forecasting is time series analysis. A time
series is a collection of observations of well-defined data items obtained through repeated
measurements over a period of time (weekly, monthly, quarterly, or yearly).
For example, measuring the value of retail sales each month of the year would comprise a time
series. This is because sales revenue is well defined and consistently measured at equally spaced
intervals. Data collected irregularly or only once are not time series.
Uses of Time Series in Business Decision-Making: Various uses of time series in Business
Decision-Making are as follows:
1) To understand future behaviour: Time series analysis is helpful in predicting future
behaviour. By observing data over a period of time, one can easily understand what changes
have taken place in the past and what is likely to happen in the future.
2) Planning of future operations: Time series analysis is helpful in planning future operations.
The greatest potential of a time series lies in predicting an unknown value of the series.
Capital investment decisions and decisions regarding production and inventory are examples of
planning future operations.
3) Evaluation of current activities: The actual performance can be compared with the expected
performance and the cause of variation analyzed. For example, if expected sales for 2007 were
15 lacs T-shirts and the actual sales were only 14 lacs, one can investigate the cause of the
shortfall in achievement.
4) It facilitates comparison: Various time series are often compared and important decisions are
taken on that basis. No forecast of future events can be 100 percent accurate, so it is often
necessary to compare different time series on the same topic.
Components of a Time Series: Four basic types of variation account for the changes in a series
over a period of time. These four types of patterns, variations or movements are called the
components or elements of a time series.
The four components are
1. Secular Trend
2. Seasonal Variation
3. Cyclical Variation
4. Irregular Variation

Classification of trends: The various types of trends are divided under two heads.
1) Linear or Straight Line Trends: A linear trend indicates that the time series increases or
decreases by a constant amount. It is the simplest form for describing the trend movement.
2) Non-Linear Trends: A non-linear trend indicates that a time series may have a faster (or
slower) increase at an early stage and a slower (or faster) increase at more recent times.
Classical Time Series Model: The model which states that all the components of a time series
are related multiplicatively is known as the Classical Time Series Model.
Mathematically the model can be written as
Y = T x S x (C x I)
Here,
Y = Value of the variable or result of the four elements
T = Trend
S = Seasonal Variation
C = Cyclical Variation
I = Irregular Variation
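A decomposition of this multiplicative form can be computed directly; below is a minimal sketch assuming Python with pandas and statsmodels, on a toy monthly series:

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Toy monthly series; substitute real observations in practice
index = pd.date_range("2004-01-01", periods=60, freq="MS")
values = [100 + 0.5 * t + 10 * ((t % 12) >= 10) for t in range(60)]  # trend plus a crude seasonal bump
series = pd.Series(values, index=index)

result = seasonal_decompose(series, model="multiplicative", period=12)
print(result.trend.dropna().head())   # estimate of T
print(result.seasonal.head(12))       # estimate of S
print(result.resid.dropna().head())   # remainder, covering C x I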
Methods of trend measurement: Several methods are used in measuring trend. Some
measure the average of the trend and some measure the absolute value. Here we have listed
five main methods for measuring the trend of a time series.
A) Linear trend measurement
1. The freehand or Graphic method
2. The Semi-average method
3. The Least Squares method
B) Non-linear trend measurement
1. The freehand or Graphic method
2. Moving average method
Time series analysis is of great significance in business decision making for the following
reasons:
1) It helps in the understanding of past behavior.
2) It helps in planning future operations.
3) It helps in evaluating current accomplishments.
4) It facilitates comparison.
Classical time series: Y = T x S x (C x I)
Here, T = Trend,
S = Seasonal Variation,
C = Cyclical Variation,
I = Irregular Variation,
and Y is the observed value, the product of the four components.
Another approach is to treat each observation of a time series as the sum of these four
components. Symbolically,
Y = T + S + C + I

Preliminary Adjustments before Analyzing Time Series


Before beginning the actual work of analyzing a time series it is necessary to make certain
adjustments in the raw data. The adjustments are:
1) Adjustments for Calendar Variation,
2) Adjustments for Population changes,
3) Adjustments for Price changes,
4) Adjustments for comparability
STRAIGHT LINE TREND-METHODS OF MEASUREMENT
The following methods are used for measuring trend:
1) The Freehand or Graphic Method,
2) The Semi-average Method,
3) The Method of Least Squares.
The following are the methods of measuring non-linear trends:
1) Freehand or Graphic Method,
2) Moving Average Method,
3) Second Degree Parabola.
CONVERSION OF ANNUAL TREND VALUES TO MONTHLY TREND VALUES
In converting straight line trends from an annual to a monthly basis, two situations must be
clearly distinguished. For series such as sales, production or earnings, the annual figure is the
total of the monthly figures. Here it is necessary to divide both a and b by 12 to reduce them to
a monthly level. The b value must then be divided by 12 once again in order to convert from annual
to monthly increments.
Example: Convert the following annual trend equation for tea production to a monthly trend
equation:
Y = 108 + 1.58X
(Origin 2003, time unit 1 year, Y = tea production in million kg)
Solution: The monthly trend equation is obtained by dividing a by 12 and b by 144.
Thus the monthly trend equation will be:
Y = 108/12 + (1.58/144)X = 9 + 0.011X
(Origin July 1, 2003, time unit 1 month, Y = monthly production in million kg)
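A small sketch of the conversion rule (the function name is illustrative):

def annual_to_monthly(a, b):
    # Divide a by 12 for the monthly level; divide b by 12 twice (i.e. by 144)
    # to convert both the level and the time increment to months
    return a / 12.0, b / 144.0

a_m, b_m = annual_to_monthly(108, 1.58)
print(a_m, b_m)   # 9.0 and roughly 0.011, giving Y = 9.0 + 0.011X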
SHIFTING THE TREND ORIGIN

In computing trends, the middle of the time series is often used as the origin in order to
shorten the computations. But very often we need to change the origin of the trend
equation to some other point in the series, either to facilitate the comparison of trend
values among neighboring years or to convert a trend equation from an annual to a
monthly basis. For example, consider the trend equation:
Y = 110 - 2X
(Origin 2005, time unit 1 year)
If we wish to shift the trend equation to 1998, we note that 1998 precedes the stated origin of
2005 by 7 time units. Consequently we must deduct 7 times the annual increment b from
the trend value at 2005, which amounts to evaluating the equation at X = -7:
Y = 110 - 2(-7) = 110 + 14 = 124
The value 124 becomes the trend value at the new origin 1998, and the trend equation may
now be written as:
Y = 124 - 2X
(Origin 1998, time unit 1 year)
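The shift amounts to evaluating the trend at the new origin; a minimal sketch (names illustrative):

def shift_origin(a, b, k):
    # Shift Y = a + bX by k time units (negative k moves the origin earlier):
    # the new intercept is the trend value at the new origin; the slope is unchanged
    return a + b * k, b

a_new, b_new = shift_origin(110, -2, -7)   # 2005 -> 1998 is 7 units earlier
print(a_new, b_new)                        # 124 and -2: Y = 124 - 2X, origin 1998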
Measurement of Seasonal Variations
When data are expressed annually, there is no seasonal variation. However, monthly or quarterly
data frequently exhibit strong seasonal movement and considerable interest attaches to devising a
pattern of average seasonal variation.
Example: If we observe the sales of a bookseller, we find that sales are at their maximum in the
July-September quarter. If we know by how much the sales of this quarter are usually above or
below those of the previous quarter for seasonal reasons, we shall be able to say whether the
underlying tendency is upward or downward.
Before attempting to measure seasonal variation, certain preliminary decisions must be made.
To obtain a statistical description of a pattern of seasonal variation it is desirable to
first free the data from the effects of trend, cycles and irregular variation. Once these other
components have been eliminated we can calculate, in index form, a measure of seasonal
variation which is usually referred to as the seasonal index. Thus the measures of seasonal variation
are called seasonal indices (%).
There are four methods of measuring seasonal variations.
1) Methods of simple averages (weekly, monthly or quarterly).
2) Ratio to trend method.
3) Ratio to moving average method.
4) Link relatives method.
Example: The consumption of monthly electric power in millions of kWh for street lighting in
one of the states in India during 2004-2008 is given below. Find out the seasonal variation by
the method of monthly averages.

Year   Jan.  Feb.  Mar.  Apr.  May   June  July  Aug.  Sept.  Oct.  Nov.  Dec.
2004   318   281   278   250   231   216   223   245   269    302   325   347
2005   342   309   299   268   249   236   242   262   288    321   342   364
2006   367   328   320   287   269   251   259   284   309    345   367   394
2007   392   349   342   311   290   273   282   305   328    364   389   417
2008   420   378   370   334   314   296   305   330   356    396   422   452
Solution: COMPUTATION OF SEASONAL INDICES BY THE METHOD OF MONTHLY
AVERAGES
Month     2004   2005   2006   2007   2008   Monthly total   Five-yearly   Percentage
                                             for 5 years     average
Jan.      318    342    367    392    420    1839            367.8         116.1
Feb.      281    309    328    349    378    1645            329.0         103.9
March     278    299    320    342    370    1609            321.8         101.6
April     250    268    287    311    334    1450            290.0         91.6
May       231    249    269    290    314    1353            270.6         85.4
June      216    236    251    273    296    1272            254.4         80.3
July      223    242    259    282    305    1311            262.2         82.8
Aug.      245    262    284    305    330    1426            285.2         90.1
Sept.     269    288    309    328    356    1550            310.0         97.9
Oct.      302    321    345    364    396    1728            345.6         109.1
Nov.      325    342    367    389    422    1845            369.0         116.5
Dec.      347    364    394    417    452    1974            394.8         124.7
Total                                        19002           3800.4        1200
Average                                      1583.5          316.7         100

The above calculations are explained below:


1) Column 7 gives the total for each month over the five years.
2) In column 8, each total in column 7 has been divided by five to obtain an average for each
month.
3) The average of the monthly averages is obtained by dividing the total of the monthly averages
by 12: 3800.4/12 = 316.7.
4) In column 9, each monthly average has been expressed as a percentage of the average of the
monthly averages. Thus, the percentage for January is
(367.8/316.7) x 100 = 116.1
and the percentage for February is
(329.0/316.7) x 100 = 103.9

If instead of monthly data we are given weekly or quarterly data, we shall compute weekly
or quarterly averages by following the same procedure as explained above.
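The whole computation can be scripted; a minimal sketch assuming Python with numpy, using the monthly data from the table above:

import numpy as np

# Rows: years 2004-2008; columns: Jan. through Dec. (from the table above)
data = np.array([
    [318, 281, 278, 250, 231, 216, 223, 245, 269, 302, 325, 347],
    [342, 309, 299, 268, 249, 236, 242, 262, 288, 321, 342, 364],
    [367, 328, 320, 287, 269, 251, 259, 284, 309, 345, 367, 394],
    [392, 349, 342, 311, 290, 273, 282, 305, 328, 364, 389, 417],
    [420, 378, 370, 334, 314, 296, 305, 330, 356, 396, 422, 452],
])

monthly_avg = data.mean(axis=0)                 # five-yearly average per month
grand_avg = monthly_avg.mean()                  # average of monthly averages (316.7)
seasonal_index = 100 * monthly_avg / grand_avg  # seasonal indices (%)
print(np.round(seasonal_index, 1))              # 116.1, 103.9, ..., 124.7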
