
# Regression Analysis: An overview

## Professor Bimal Sinha

Department of Mathematics and Statistics
University of Maryland, Baltimore County (UMBC)
January 2018

## Bimal Sinha (UMBC) Regression Analysis December , 2017 1 / 68

Outline of Topics

1 What is Regression Analysis?
2 Why do we use Regression Analysis?
3 What are the types of Regressions?
4 Linear Regression
5 Logistic Regression
6 Polynomial Regression
7 Stepwise Regression
8 Ridge Regression
9 Lasso Regression
10 ElasticNet Regression
11 How to select the right Regression Model?


What is Regression Analysis?

Regression analysis: a form of predictive modeling technique which
investigates the relationship between a dependent (target/response)
variable Y and a set of independent or predictor variables
X = (X1 , X2 , ..., Xp ). This technique is widely used for forecasting,
time series modelling and finding cause-and-effect relationships
between the variables.
A standard procedure is to collect data on (Y , X), plot them on a
scatter plot (one predictor variable) or create a response surface
(multiple predictor variables), and fit a line, curve or surface to the
data points so as to minimize (in some sense) the distances of the data
points from the fitted line or curve, typically measured with the L1 or
L2 norm.


Why do we use Regression Analysis?

Multiple benefits
It identifies the significant relationships between the dependent
variable and the independent variables.
It indicates the strength of the impact of multiple independent
variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables
measured on different scales, such as the effect of price changes and
the number of promotional activities. These benefits help market
researchers / data analysts / data scientists to evaluate candidate
variables and select the best set for building predictive models.


How many types of regression techniques?

Many - mostly driven by three factors (number of independent
variables, type of dependent variable and shape of the regression line).
Regression Types?
Linear Regression: the most widely known modeling technique - the
dependent variable is continuous, the independent variable(s) can be
continuous or discrete, and the regression line is linear.
It is represented by the equation Y = a + bX + e, where a is the
intercept, b is the slope of the line and e is the error term. This
equation can be used to predict the value of the target variable from
the given predictor variable(s).


1-Linear Regression

Multiple linear regression has p (> 1) independent variables, with
Y = a + b1 X1 + ... + bp Xp + e.
Challenge: to efficiently estimate the intercept and slope parameters -
usually accomplished by the Least Squares Method. We can evaluate the
model performance by using the metric $R^2$.
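
As a minimal sketch of the least squares idea for the one-predictor case Y = a + bX + e, the closed-form estimates and $R^2$ could be computed as below. The function names and the small data set are made up for illustration and are not from the slides.

```python
def fit_line(xs, ys):
    """Least squares estimates (a, b) for Y = a + bX + e with one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centered sums of squares and cross-products.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


def r_squared(xs, ys, a, b):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data, roughly following Y = 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, a, b)
print(f"a = {a:.3f}, b = {b:.3f}, R^2 = {r2:.4f}")
```

With several predictors the same idea applies, but the estimates come from solving the normal equations rather than from these two scalar formulas.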


More on Linear Regression

Important Points:
There must be a linear relationship between the independent and
dependent variables.
Multiple regression can suffer from multicollinearity, autocorrelation
and heteroskedasticity.
Linear Regression is very sensitive to outliers. They can badly distort
the regression line and eventually the forecasted values.
Multicollinearity can increase the variance of the coefficient estimates
and make the estimates very sensitive to minor changes in the model.
The result is that the coefficient estimates are unstable.
In case of multiple independent variables, we can use forward
selection, backward elimination or a stepwise approach to select the
most significant independent variables.


2- Logistic Regression

Logistic regression is used when the response is dichotomous
(success/failure, survival/death) to model the probability of response,
irrespective of the nature of the predictor variables.
$$\text{odds} = \frac{p}{1 - p} = \frac{\text{probability of event occurrence}}{\text{probability of event non-occurrence}}$$
$$\log(\text{odds}) = \ln\left(\frac{p}{1 - p}\right)$$
$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_p X_p$$
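
The link between a probability p, the quantity p / (1 - p), and the linear predictor can be sketched as follows. This is only an illustration: the function names are my own, and the coefficients passed to `predict_probability` are hypothetical rather than estimated from data.

```python
import math


def odds(p):
    """p / (1 - p) for a probability 0 < p < 1."""
    return p / (1.0 - p)


def logit(p):
    """Natural log of p / (1 - p)."""
    return math.log(odds(p))


def inverse_logit(z):
    """Map a linear predictor z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(coefs, xs):
    """coefs = [b0, b1, ..., bk]; xs = [X1, ..., Xk]."""
    z = coefs[0] + sum(b * x for b, x in zip(coefs[1:], xs))
    return inverse_logit(z)


print(odds(0.8))                                   # close to 4
print(logit(0.5))                                  # 0.0
print(predict_probability([-1.0, 0.5], [2.0]))     # 0.5, since z = 0
```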


2- Logistic Regression

Important Points:
Widely used for classification problems.
Logistic regression can handle various types of relationships because it
applies a non-linear log transformation to the predicted odds.
To avoid overfitting and underfitting, we should include all
significant variables. A good way to ensure this is to use a stepwise
method to estimate the logistic regression.
It requires large sample sizes because maximum likelihood estimates
are less powerful at low sample sizes than ordinary least squares.
Modified minimum chi-square is an alternative method - long and rich
history!


2-Logistic Regression

The independent variables should not be correlated with each other,
i.e. no multicollinearity - but we can include interaction effects of
categorical variables in the analysis and in the model.
If the values of the dependent variable are ordinal, it is called
Ordinal Logistic Regression.
If the dependent variable is multi-class, it is called Multinomial
Logistic Regression.


3- Polynomial Regression

$$y = a + bx + cx^2 + \cdots$$
The best-fit curve may be quadratic, cubic, quartic, ...


3- Polynomial Regression

Important Point:
There is usually a temptation to fit a higher-degree polynomial to get
a lower error, but this can result in over-fitting. Always plot the
relationships to see the fit, and make sure that the curve fits the
nature of the problem.


4- Stepwise Regression

This form of regression is used when we deal with multiple
independent variables - the selection of independent variables is done
by an automatic process, which involves no human intervention.
This is achieved by observing statistical values like R-square,
t-stats and the AIC metric to discern significant variables. Stepwise
regression fits the regression model by adding/dropping covariates one
at a time based on a specified criterion. Some of the most commonly
used stepwise regression methods are listed below:


4- Stepwise Regression

Standard stepwise regression does two things: it adds and removes
predictors as needed at each step.
Forward selection starts with the most significant predictor in the
model and adds a variable at each step.
Backward elimination starts with all predictors in the model and
removes the least significant variable at each step.
The aim of this modeling technique is to maximize the prediction
power with the minimum number of predictor variables. It is one of the
methods to handle high dimensionality of a data set.


5- Ridge Regression: huge literature

Ridge Regression is a technique used when the data suffer from
multicollinearity (independent variables are highly correlated). Under
multicollinearity, even though the ordinary least squares (OLS)
estimates are unbiased, their variances are large, so the estimates can
lie far from the true values. By adding a degree of bias to the
regression estimates, ridge regression reduces the standard errors.


y = a + b1 x1 + b2 x2 + · · · + bp xp + e for multiple independent
variables.
Ridge regression addresses the multicollinearity problem through a
shrinkage parameter λ, choosing the estimates of the above model
parameters by minimizing the penalized sum of squares
$$\sum_{i=1}^{n} (y_i - a - b_1 x_{1i} - \cdots - b_p x_{pi})^2 + \lambda \left[ \sum_{i=1}^{p} b_i^2 \right]$$
The effect is to shrink the estimates so that they have low variance.
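
To see the shrinkage at work, here is a minimal sketch for the special case of a single predictor, where the penalized criterion has a closed-form minimizer: b = Sxy / (Sxx + λ) and a = ȳ - b x̄, with Sxx and Sxy the centered sums of squares and cross-products. The intercept is left unpenalized, as in the criterion above; the function name and data are illustrative assumptions.

```python
def ridge_fit_one_predictor(xs, ys, lam):
    """Minimize sum (y_i - a - b*x_i)^2 + lam * b^2 for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / (sxx + lam)    # lam = 0 recovers the least squares slope
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
for lam in (0.0, 10.0, 100.0):
    a, b = ridge_fit_one_predictor(xs, ys, lam)
    print(f"lambda = {lam:6.1f}: a = {a:.3f}, b = {b:.3f}")
```

As λ grows the slope moves towards zero but never reaches it exactly, which is the contrast with the lasso.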


6- Lasso Regression: huge literature

Similar to Ridge Regression, Lasso (Least Absolute Shrinkage and
Selection Operator) also penalizes the absolute size of the regression
coefficients. In addition, it is capable of reducing the variability and
improving the accuracy of linear regression models.
Penalty function:
$$\sum_{i=1}^{n} (y_i - a - b_1 x_{1i} - \cdots - b_p x_{pi})^2 + \lambda \left[ \sum_{i=1}^{p} |b_i| \right]$$
Lasso regression differs from ridge regression in that the penalty uses
absolute values of the coefficients rather than their squares. Penalizing
(or equivalently constraining) the sum of the absolute values of the
estimates causes some of the parameter estimates to turn out exactly
zero. The larger the penalty applied, the further the estimates are
shrunk towards zero. This results in variable selection among the given
p variables.
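
The "exactly zero" behaviour can be illustrated in the single-predictor case, where minimizing the lasso criterion reduces to soft-thresholding: b = sign(Sxy) · max(|Sxy| - λ/2, 0) / Sxx, with a = ȳ - b x̄. This is a sketch under that one-predictor assumption, with made-up names and data; with several predictors an iterative method such as coordinate descent is normally used instead.

```python
import math


def lasso_fit_one_predictor(xs, ys, lam):
    """Minimize sum (y_i - a - b*x_i)^2 + lam * |b| for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Soft-thresholding: shrink |Sxy| by lam/2 and clip at zero.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    b = math.copysign(1.0, sxy) * shrunk / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
for lam in (0.0, 10.0, 40.0):
    a, b = lasso_fit_one_predictor(xs, ys, lam)
    print(f"lambda = {lam:5.1f}: a = {a:.3f}, b = {b:.3f}")
```

For a large enough λ the slope is set to exactly zero and the fitted model reduces to the mean of y.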


6- Lasso Regression

Important Points:
The assumptions of this regression are the same as those of least
squares regression, except that normality is not assumed.
It shrinks coefficients to zero (exactly zero), which certainly helps in
feature selection.
If a group of predictors is highly correlated, Lasso picks only one of
them and shrinks the others to zero.


7- ElasticNet Regression

ElasticNet is a hybrid of the Lasso and Ridge Regression techniques. It
is trained with both L1 and L2 penalties as regularizers. Elastic-net is
useful when there are multiple features which are correlated. Lasso is
likely to pick one of these at random, while elastic-net is likely to
pick both.
Important Points:
It encourages a group effect in the case of highly correlated variables.
There are no limitations on the number of selected variables.
It can suffer from double shrinkage.


How to Select the Right Regression Model?

Beyond these 7 most commonly used regression techniques, there are
other models like Bayesian, Ecological and Robust regression.
Life is usually simple when you know only one or two techniques. For
a single response variable: if the outcome is continuous, apply linear
regression; if it is binary, use logistic regression! However, the
higher the number of options at our disposal, the more difficult it
becomes to choose the right one.


How to Select the Right Regression Model?

Among the many types of regression models, it is important to choose
the best suited technique based on the type of independent and
dependent variables, the dimensionality of the data and other essential
characteristics of the data. Below are some key factors that can be
used to select the right regression model:
Data exploration is an essential part of building a predictive model -
we must try to identify the relationships and impact of the variables.


How to Select the Right Regression Model?

To compare the goodness of fit of different models, we can analyse
different metrics like statistical significance of parameters, R-square,
adjusted R-square, AIC, BIC, the error term and Mallows's Cp criterion.
Mallows's Cp essentially checks for possible bias in the selected
model by comparing the model with all possible submodels (or a careful
selection of them).
Cross-validation is the best way to evaluate models used for
prediction. Here you divide the data set into two groups (train and
validate). The mean squared difference between the observed and
predicted values gives a measure of the prediction accuracy.
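
The train/validate idea can be sketched as follows: fit on one part of the data, then compute the mean squared difference between observed and predicted values on the held-out part. The data, the every-third-point split and the function names are assumptions chosen for illustration; in practice the split is usually random and often repeated over several folds.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical data: y = 1 + 2x with a small alternating disturbance.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]

# Hold out every third observation for validation; train on the rest.
validate_idx = [i for i in range(len(xs)) if i % 3 == 2]
train_idx = [i for i in range(len(xs)) if i % 3 != 2]

a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
predictions = [a + b * xs[i] for i in validate_idx]
mse = mean_squared_error([ys[i] for i in validate_idx], predictions)
print(f"validation MSE = {mse:.4f}")
```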


How to Select the Right Regression Model?

If the data set has multiple confounding variables, do not choose an
automatic model selection method, because we do not want to put
these in a model at the same time.
It also depends on our objective. A less powerful model can be easier
to implement than a highly statistically significant model.
Regression regularization methods (Lasso, Ridge and ElasticNet) work
well in the case of high dimensionality and multicollinearity among the
variables in the data set.


Education and Economic Growth: A Meta-Regression
Analysis

Objective: to examine the effect of education on economic growth.
Primary analysis or secondary analysis?
The analysis covers 56 studies with 979 estimates and shows that there
is substantial publication selection bias towards a positive impact of
education on growth. Once we account for this, we find evidence of a
genuine effect of education on economic growth.
The variation in reported estimates can be attributed to differences in
the measurement of education and study characteristics, most
importantly model specification, estimation methodology, type of data
and the research outlet where studies were published,
e.g. academic journals vs. working papers.


Some Regression Topics in Economics
Is consumption truly a "random walk"?
Estimating the male-female wage gap, and what causes it.
Does campaign funding lead to good election results?
The effect of advertising on demand for a good.
The Relationship Between Annual GDP Growth and Income
Inequality: Developed and Undeveloped Countries
GDP versus Manufacturing Output: Proof of Movement of
Standardized Processes
Economic Patterns in Voting
Economic Factors Affecting Homelessness in India
Factors Explaining Life Satisfaction Across Countries
The Economic Impact of Research and Development
Is College Worth the Money? A Look on the Effects of Bachelors
Degrees to the Unemployment Rate
Some Regression Topics in Economics

A Study of How Individual School Characteristics Affect School
Performance
An Examination of the Economic Effects of the Winter Olympics
Factors Affecting Corruption in Developing and Emerging Countries
Modern Day Evaluation of the Preston Curve: The Relationship
Between Life Expectancy and Income
Econometric Analysis: Effect of Barriers on Trade
Income Inequality as a Determinant of Economic Growth: A
Cross-Country Analysis


Some Regression Topics in Economics

Quality of Public Education based on the States Economics
Happiness and Traffic: An Analysis of Long Term Effects
Effect of GDP Per Capita on National Life Expectancy
Impact of Educational Attainment on Crime in the United States: A
Cross-Metropolitan Analysis
Understanding How Unique Attributes Might Affect Poverty
The Effect of Inequality on Satisfaction
Regression Analysis of Electrical Energy Consumption with
Cross-Country data


Key Steps in Economic Regression Analysis (Econometrics)

I The Model
The model and the data are the starting points of an econometric
project.
The first step in formulating a model is to select a topic of interest
and to consider the model’s scope and purpose.
State and understand objectives of the study, what boundaries to
place on the topic, what hypotheses might be tested, what variables
might be predicted, and what policies might be evaluated.
Close attention must be paid, however, to the availability of adequate
data. In particular the model must involve causal relations among
measurable variables.


I. The Model: Choice of topics?

A topic may be a particular market (the market for Pitzer graduates, the
market for economists, the market for ice cream, the markets for private
education), a process (economic development, inflation, unemployment),
demographic phenomena (birth rates, death rates), environmental
phenomena (water quality, air quality), political phenomena (elections,
voting behavior of legislatures), some combination of these, or some
other topic.
"Air pollution and Population"
"Birth Rates, Death Rates, and Economic Growth in Developing
Economies"
"Demand for and Supply of Higher Education"
"Differential Growth in Indian cities"

I. The Model

"Discrimination in the Retail Food Markets"
"Divorce Rates, Birth Rates, and Female Participation in the Labor
Force"
"Economic and Social Determinants of Infant Mortality in India"
"The Effect of Unemployment on Crime"
"Elections and Money"
"Medical School Applications"
"Police Expenditures and the Deterrence of Crime"
"The Relationship between Exports and Growth in Less Developed
Countries"
"Unionization and Strike Activities"

I. The Model

Interest lies in the impact of some independent variable X on a
dependent variable Y. But since many variables X influence the
variable Y, it is important to include all those variables.
To ensure that the model is both interesting and manageable, it
should contain at least three to four independent variables.
The model should be formulated as an algebraic, linear, stochastic
equation along with a corresponding verbal statement of the meaning
of the equation.

II. The Data

Data form an essential ingredient in any econometric study, and
obtaining an adequate and relevant set of data is an important and
often critical part of the econometric project. Data must be available
for all the variables in the model. Huge literature to deal with missing
data!
National Statistical Abstracts, Statistical Yearbooks, or Statistical
Handbooks, published annually by most major countries, provide both
summary statistics and references to primary sources.

II. The Data

For international data, the United Nations Statistical Yearbook provides a
wealth of data on member countries, as do statistical yearbooks of other
international organizations like the OECD. The Federal Reserve Bank of
St. Louis puts out International Economic Conditions which gives
comparative data for Canada, France, Germany, Italy, Japan, Netherlands,
Switzerland, United Kingdom, and the U.S. Various almanacs, sources on
the WWW like www.census.gov, and other reference works also abound in
statistics. Take a look at the course homepage and the economics
department homepage. All of these sources contain data on so many
topics that they may suggest a topic for the econometric project.

II. The Data

Data can be either time-series or cross-section.
It is best to avoid data sets which are too small, say fewer than
thirty observations.
The data should be examined and, if necessary, refined to make them
suitable for the purposes of the model.
For time-series data it may be necessary to use seasonal adjustments
or perhaps to eliminate certain trends. For both time-series and
cross-section data, consideration should be given to whether to divide
the data into separate samples or perhaps exclude certain observations.

II. The Data

Thus in time-series data it may (or may not) be appropriate to
exclude war years or years of a recession. In a cross-section of nations
it may be inappropriate to include all countries that are UN members.
The developed countries might be treated as one group and the
developing countries as another group.
Dividing the data this way into subsamples not only leads to more
homogeneous data sets but also facilitates the study by allowing
comparative analyses.


III. The Estimation

After both the model and data have been developed, the next step is
to utilize econometric techniques to estimate the model.
We can use STATA 14 or any other statistical package for the
statistical analysis. Basic statistics packages include Minitab and
Excel. For careful work in econometrics we will want to use EViews,
STATA, SAS, TSP, LimDep, SPSS or Shazam.
Make sure that we have enough observations for all the variables and
that the dependent and independent variables show some variation
over the observations.


IV. Specification of the Model

Define and discuss the specification of the selected model. What variables
are included in the model? Explain why we chose those variables and the
role they play in the model. Have we included all the important variables
in the model? What are the expected signs of all the coefficients?

V. Data Description
Provide complete description of all the data, their sources, refinements
used, and their possible biases or other possible weaknesses.


VI. Results

Present the estimates of the model and its related statistics such as
standard errors, t statistics and the $R^2$. Discuss which coefficients
are significant at the 5% and 1% levels. If relevant, include a
discussion of possible serial correlation and its correction, of
possible heteroscedasticity and its correction, and of possible
multicollinearity and its correction. Estimate alternative models to
test the robustness of the results.


VII. Discussion

Discuss the signs and magnitudes of the estimated coefficients and their
comparisons to predicted or theoretical signs and magnitudes. What have
we learned? Consider how the model might be reformulated in future
studies, and implications for future econometric research.


VIII. Conclusions

Sum up the major results of your study.

IX. Bibliography
Include complete citations of all items referred to in the paper.

X. Data
If reasonable, provide a table of all the data used. At a minimum, provide
the summary statistics for the data.

Forecasting

There are three commonly used methods of forecasting:
Causal methods
Time series methods
Qualitative methods


Forecasting

Each of these three methods has its own tools and techniques, and each
is appropriate in different kinds of circumstances.
Causal methods typically involve regression analysis, including the
specialized types of regression analysis that are useful in various
circumstances. In causal forecasting we are relying on relationships
between variables.
Time series methods often involve various forms of trend analysis,
such as exponential smoothing and trend prediction.
Qualitative methods involve using surveys and other subjective ad hoc
methods of gathering data in order to make predictions.


Website for books on Regression

Best 25+ Regression analysis ideas on Pinterest:
https://www.pinterest.com/explore/regression-analysis/
Contents: The nature of econometrics and economic data. Part I:
REGRESSION ANALYSIS WITH CROSS-SECTIONAL DATA : The
simple regression model. Multiple regression analysis: Estimation.
Multiple regression analysis: Inference. Multiple regression analysis:
OLS asymptotics. Multiple regression analysis:


Modeling of United States Airline Fares Using the Official Airline
Guide (OAG) and Airline Origin and Destination Survey (DB1B)

A Case Study

Krishna Rama-Murthy
Master’s Thesis, Virginia Polytechnic Institute & State University, 2006


Motivation
Travel cost is one of the major factors that a traveler considers when
choosing the transportation mode for a trip.
The National Aeronautics and Space Administration (NASA) intends to
reduce inter-city travel time in the United States by one-half within
10 years and by two-thirds within 25 years, while keeping costs low
and improving safety.
For inter-city transportation system mode choice analysis, the cost of
travel by each existing transportation mode must be known in order to
assess the impact of introducing a new mode of transportation. The
travel cost will also help to determine future travel trends, such as
whether there is going to be congestion or more demand for a
particular mode.
NASA, in collaboration with the Federal Aviation Administration
(FAA), industry, and several universities, has launched the Small
Aircraft Transportation System (SATS) research program whose
critical task is transportation system demand estimation.
Motivation

Transportation system demand estimation analysis for SATS
utilizes a cost model which is split into two sub-categories:
i. Cost model for the supply side, also referred to as “Transportation
Vehicle Performance Models”
ii. Cost model for the demand side, also referred to as “Generic Fare Model”
Rama-Murthy developed this generic fare model as a demand-side
cost metric. The ratio of average fare to distance (fare per mile) is
used as a measure of this cost of travel.
Compared to fares for other transportation modes, air fare is not easy
to typify since it is affected by many factors. To better understand the
variation in the cost of air travel, Rama-Murthy formulated several
statistical models.


Understanding Airfares

The Airline Deregulation Act of 1978 has brought enormous changes
to the US domestic airline industry. In particular, the total numbers
of enplanements and passenger miles have more than doubled since
then, and the overall airfare has been considerably lower than it would
After the removal of the restrictions imposed on the airline industry
during the regulation years, airfares have taken on an increasingly
complex structure. Airfares are heavily influenced by factors such as
i. scale economies
ii. level of competition
iii. airport congestion, and
iv. airline marketing strategies


Understanding Airfares

Some general patterns on the average cost are:
i. Longer flights tend to have lower average cost because the fixed costs
associated with each flight can be spread over a longer distance.
ii. Markets with larger passenger volume tend to have lower average cost
since airlines in those markets are able to use larger planes and achieve

Methodology

Two sets of regression models were developed to estimate the cost of
travel in the US:
i. A non-linear model which estimates the relationship between average
round-trip fare and yield
ii. Multiple regression models that try to capture the causal
relationship between the average fare for any origin and destination
pair and other defined explanatory variables
A list of 685 commercial airports classified by the FAA was used in
the analysis. These airports were clustered into four separate
categories based on the total number of enplanements: (1)
Large Hub, (2) Medium Hub, (3) Small Hub, and (4) Non-Hub.
Nearly 95% of the enplanements in the National Airspace System go
through the Large and Medium Hubs.


Determination of Fare Class Category

The fare paid depends on the type of ticket purchased. In this
analysis, the fares were grouped into two types: First and Business.
In order to determine the proper fare class category to be used for the
analysis, a set of fare class groups was created using the fare class
categories. They are as follows:
a. Business Class - Unrestricted First Class (F), Restricted First Class (G)
b. Coach Class - Unrestricted Coach Class (Y) and Restricted Coach Class
(X)
c. Restricted Coach Class (X)
d. Unrestricted Coach Class (Y)
Using these categories as a basis for class determination, non-linear
regression models were generated using the distance traveled as the
independent variable.


The results are presented below:

Based on the figure above, the fare model for the Unrestricted
Coach Class (Y) behaves similarly to the Business Class fare model.
Hence, Unrestricted Coach Class fares were combined with Business
Class fares for the analysis.
The final cluster of fare class groups that were used to develop the
models is given below:
a. Business Class - Unrestricted First Class (F), Restricted First Class (G),
Unrestricted Coach Class (Y)
b. Non-Business Class - Restricted Coach Class (X)


Model Variables
1. Round Trip Distance: This used to be a prominent independent
variable for modeling airfare, but after the deregulation period the
relationship between airfares and distance has broken down (Anderson et
al., 2002).

2. Market Concentration: Competition between airlines has an
important impact on the cost of air travel. To understand this
competition, the percentage of the total number of seats offered by each
carrier, denoted by $p_a$, is calculated.
$$t_a = \sum_b f_{ab}\, s_b, \qquad s_a = \sum_a t_a, \qquad p_a = \frac{t_a}{s_a}$$
where
a: total number of airlines at origin (i) airport
b: types of aircraft by each airline from origin (i) airport
$t_a$: total number of seats for each airline a from origin (i) airport
$f_{ab}$: frequency of aircraft type b offered by each airline a
$s_b$: number of seats for aircraft type b
$s_a$: total number of seats offered by an airline a from origin (i) airport
Model Variables

Market concentration or competition for each Origin-Destination airport
pair can be measured by calculating the Herfindahl Index (HI).
$$HI = \sum_a p_a^2$$

Interpretation: The value of HI varies from 0 to 1. A value of 1
corresponds to a monopoly, 0.5 corresponds to an industry with two
equal-sized firms, 0.33 corresponds to an industry with three
equal-sized firms, and so on. As a rule, any market having an HI greater
than 0.4 is considered a highly concentrated market and one with an HI
less than 0.18 a less concentrated market. The higher the concentration,
the more likely the fare will increase in that market segment.
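
The seat-share and Herfindahl Index calculations can be sketched as below. The aircraft types, frequencies and seat counts are hypothetical, the function names are my own, and the label for values between 0.18 and 0.4 is an assumption, since the rule above names only the two extremes.

```python
def total_seats(frequency_by_type, seats_by_type):
    """Seats offered by one airline: sum over aircraft types of frequency * seats."""
    return sum(freq * seats_by_type[kind] for kind, freq in frequency_by_type.items())


def seat_shares(seats_by_airline):
    """Share of all offered seats held by each airline."""
    total = sum(seats_by_airline.values())
    return {airline: seats / total for airline, seats in seats_by_airline.items()}


def herfindahl_index(shares):
    """HI = sum of squared shares."""
    return sum(p ** 2 for p in shares)


def concentration_label(hi):
    if hi > 0.4:
        return "highly concentrated"
    if hi < 0.18:
        return "less concentrated"
    return "in between"  # assumed label; not named in the rule above


# Hypothetical airline flying 4 B737 and 2 A320 departures.
seats = total_seats({"B737": 4, "A320": 2}, {"B737": 150, "A320": 180})
shares = seat_shares({"Airline A": seats, "Airline B": 960, "Airline C": 960})
hi = herfindahl_index(shares.values())
print(seats, round(hi, 4), concentration_label(hi))
```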


Model Variables
3. Passenger Flows: Large passenger flows tend to reduce
the average fare. However, this may not hold in certain cases with
higher HI values, where the average fare increases instead. Hence, the
relationship between the Herfindahl Index and passenger flow was observed.

Model Variables

4. Low Cost carrier presence: As a general trend, the presence of a low
cost carrier tends to reduce the average fare between an
Origin-Destination pair. Low cost carriers usually don’t offer business
service and only offer point-to-point service, thereby reducing their
operating costs.

Model Variables

5. Origin and Destination Type: Airports are classified into types
(Large Hub, Medium Hub, Small Hub, Non-Hub) depending on the number of
enplanements. It is usually believed that traveling between Large Hubs
is less expensive than traveling from other airports. Also, on a macro
level, the overall supply and demand, expenses and revenue would tend to
drive the costs down in large airports.


Fare Models
Using all the variables mentioned previously, a family of “Fare Models”
was created for both Business and Coach Class. They are as follows:
1 Table Function: The Table Function is the weighted average of the mean
fare paid between 685 x 685 airports. The mean fare for a single
Origin-Destination pair is determined using the following formula:
X
µ= xp(x)
x

## where x − Fare(\$) and p(x) - probability of x.

2 Non-Linear Regression Fare Model: A generic fare-per-mile model is used
to predict fare per mile using only the mean round-trip distance traveled as
an independent variable. It is a non-linear regression model, also known as
the Harris Model. The model is given below:

$$y = \frac{1}{a + b x^{c}}$$

where y is the fare per mile (\$/mile), a, b and c are the model parameters,
and x is the round-trip distance in statute miles.
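
To make the shape of the Harris Model concrete, here is a minimal sketch that evaluates y = 1 / (a + b x^c). The parameter values are hypothetical placeholders chosen only for illustration; they are not the estimates from this study.

```python
def harris_fare_per_mile(x, a, b, c):
    """Evaluate the Harris Model y = 1 / (a + b * x**c).

    x is the round-trip distance in statute miles; the result is the
    fare per mile in dollars per mile.
    """
    denominator = a + b * x ** c
    if denominator <= 0:
        # A non-positive denominator would give a meaningless fare.
        raise ValueError("a + b * x**c must be positive")
    return 1.0 / denominator


# Hypothetical parameters, for illustration only.
a, b, c = 1.0, 0.01, 1.0

short_trip = harris_fare_per_mile(100, a, b, c)
long_trip = harris_fare_per_mile(900, a, b, c)

# Total fare is fare per mile times distance.
long_trip_total = long_trip * 900
```

With positive b and c, the fare per mile falls as distance grows, while the total fare can still rise with distance.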
Fare Flow Model
A family of generic fare models was developed as an input for the “Fare
Flow Model”. The “Fare Flow Model” is a combination of Table Function
and the generic fare models.
1. Check for a fare value for the Origin-Destination pair in the Table Function.
2. If no fare value is available in the Table Function, check whether one of the
   Origin-Destination airports is in Alaska or Hawaii.
3. If so, check the distance:
   - If the distance is less than 1500 miles, use the Harris Model within AK & HI.
   - If the distance is greater than 1500 miles and less than 3000 miles, use
     the Harris Model developed for that distance category.
   - If the distance is greater than 3000 miles, use the Harris Model for
     distance greater than 3000 miles.
4. If neither airport is in Alaska or Hawaii, check the distance:
   - If the distance is less than 500 miles, use the Harris Model developed for
     that category of distance and Origin-Destination pairs.
   - If the distance is greater than 500 miles, use the Harris Model for
     distance greater than 500 miles.
5. Finally, if the Origin-Destination pair doesn't fall into any of the above
   categories, use the Generic Fare model developed using the Great Circle
   Distance. The great circle distance is the minimum distance between the
   Origin-Destination airport pair.
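
The selection rules of the Fare Flow Model can be sketched as a small function. This is an illustrative reading of the steps, not the authors' implementation. The assumptions are: the model labels are made-up names; the caller decides whether the pair counts as an Alaska/Hawaii pair; distances exactly equal to 500, 1500 or 3000 miles go to the higher category, since the source does not specify the boundaries; and a pair "not in any category" is taken to mean one with no recorded distance.

```python
def select_fare_model(table_fare, involves_ak_hi, distance_miles):
    """Return a label for the fare model to use for one O-D pair.

    table_fare:      fare from the Table Function, or None if unavailable
    involves_ak_hi:  True if the pair is treated as an Alaska/Hawaii pair
    distance_miles:  recorded distance, or None if unknown (assumption)
    """
    if table_fare is not None:
        return "table_function"
    if distance_miles is None:
        # Assumed meaning of "doesn't fall in any of the above categories".
        return "generic_great_circle"
    if involves_ak_hi:
        if distance_miles < 1500:
            return "harris_ak_hi_under_1500"
        if distance_miles < 3000:
            return "harris_ak_hi_1500_to_3000"
        return "harris_ak_hi_over_3000"
    if distance_miles < 500:
        return "harris_under_500"
    return "harris_over_500"


# Example lookups with made-up inputs.
choice_table = select_fare_model(250.0, False, 400)
choice_ak_hi = select_fare_model(None, True, 2200)
choice_short = select_fare_model(None, False, 350)
choice_fallback = select_fare_model(None, False, None)
```

Returning a label rather than a fare keeps the routing logic separate from the fitted Harris parameters for each category.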
Statistical Validation of “Fare Flow Model”

The Fare Flow Model was then tested using a nonparametric statistical
test for dissimilarity between the generic fare models. The Wilcoxon
Rank Sum Test is a nonparametric alternative to the two-sample t-test
that is based solely on the order in which the observations from the two
samples fall. The results of the Wilcoxon Rank Sum Test performed on the
generic fare models indicate that the models are dissimilar and
independent of each other. The p-values indicate that the differences
between the models are statistically significant.
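
For readers unfamiliar with the Wilcoxon Rank Sum Test, the sketch below shows one common way to compute it: rank the pooled observations, sum the ranks of the first sample, and compare that sum with its expected value under the null hypothesis using a normal approximation. It is a simplified illustration (no continuity or tie correction to the variance), not the procedure used in the study, and the two samples are made-up numbers.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(sample_a, sample_b):
    """Return (W, z, two-sided p-value) using the normal approximation."""
    n1, n2 = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0        # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w, z, p_value


# Hypothetical fares per mile predicted by two generic fare models.
model_a = [1, 2, 3, 4, 5]
model_b = [6, 7, 8, 9, 10]
w, z, p_value = wilcoxon_rank_sum(model_a, model_b)
```

A small p-value suggests that the two samples come from different distributions, which is the sense in which the generic fare models are described as dissimilar.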


Multiple Linear Regression Model
To test the hypotheses about the factors that affect the cost of air travel,
a separate multiple regression equation was estimated for each fare class.

## Coach Class Fare Analysis:

$$fc_{ij} = \beta_0 + \beta_1 d_{ij} + \beta_2 pc_{ij} + \beta_3 h_i + \beta_4 h_j + \beta_5 lc_{ij} + \beta_6 o_i + \beta_7 d_j + e_{ij}$$

where
$fc_{ij}$: annual average round-trip fare for coach class between i and j
$d_{ij}$: round-trip distance in statute miles between i and j
$h_i$: Herfindahl Index at the origin airport i
$h_j$: Herfindahl Index at the destination airport j
$pc_{ij}$: annual coach class passenger flows between i and j
$lc_{ij}$: low-cost carrier presence between i and j (dummy variable, 0 or 1)
$o_i$: origin airport type (i) [1, 2, 3, and 4]
$d_j$: destination airport type (j) [1, 2, 3, and 4]
$\beta_0, \beta_1, \ldots, \beta_7$: model parameters to be estimated
$e_{ij}$: residual
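
The sketch below illustrates how the parameters of such a multiple linear regression can be estimated by ordinary least squares, solving the normal equations (X'X)β = X'y directly. It is a toy example, not the study's estimation: it uses only three of the predictors (distance, Herfindahl Index at the origin, and the low-cost carrier dummy), distance is expressed in thousands of miles to keep the arithmetic well conditioned, and both the data and the "true" coefficients are invented so that the fit can be checked.

```python
def fit_ols(rows, y):
    """Return least-squares coefficients [b0, b1, ...] for y ~ 1 + rows."""
    X = [[1.0] + list(r) for r in rows]          # prepend intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Invented coefficients: intercept, distance (per 1000 miles),
# Herfindahl Index at origin, low-cost carrier dummy.
true_beta = [50.0, 100.0, 200.0, -30.0]

# Invented observations: (distance in 1000 miles, HI at origin, LCC dummy).
rows = [(0.5, 0.2, 1), (1.0, 0.5, 0), (1.5, 0.3, 1),
        (2.0, 0.8, 0), (0.8, 0.4, 0), (1.2, 0.6, 1)]

# Noise-free fares generated from the invented coefficients.
fares = [true_beta[0] + sum(b * v for b, v in zip(true_beta[1:], r))
         for r in rows]

estimates = fit_ols(rows, fares)
```

Because the toy fares contain no noise, the estimates reproduce the invented coefficients; with real data, the signs and magnitudes of the estimates are what the interpretation slides discuss.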
Results

Parameter Estimate for Coach Fare Class Regression, Average Coach Fare

Interpretation

The parameter estimate for average distance has a positive sign,
showing that longer trips have a higher average fare.

Competition is one of the main factors that affect airfares. The higher
the competition, the lower fares tend to be. The positive sign on the
competition parameters, the Herfindahl Index at the origin and
destination airports, indicates that the less the competition, the
higher the average fare between the O-D pair. It also shows that
competition at the destination airport is more critical than
competition at the origin airport.

Annual passenger flows are higher between larger airport pairs.
This flow is one of the main reasons for congestion at these large
airports, leading to higher indirect operating costs. These costs are
passed directly on to passengers, leading to higher fares, as
indicated by the positive sign on annual average passenger flows.

Interpretation

Low-cost carriers have completely changed the landscape of air travel in
the US. These airlines have a successful business model that reduces
indirect operating costs, allowing them to offer cheaper fares. Any presence
of a low-cost carrier at the origin airport tends to reduce the average
fare. This is indicated by the negative sign of the causal variable
low-cost carrier presence.

The origin and destination airport type variables both have positive
effects, suggesting that airfare tends to be higher at smaller airports.
Again, the destination airport type is more critical than the origin
airport type.


Multiple Linear Regression Model

$$fb_{ij} = \beta_0 + \beta_1 d_{ij} + \beta_2 pb_{ij} + \beta_3 h_i + \beta_4 h_j + \beta_5 o_i + \beta_6 d_j + e_{ij}$$

where
$fb_{ij}$: annual average round-trip fare for business class between i and j
$d_{ij}$: round-trip distance in statute miles between i and j
$h_i$: Herfindahl Index at the origin airport i
$h_j$: Herfindahl Index at the destination airport j
$pb_{ij}$: annual business class passenger flows between i and j
$o_i$: origin airport type (i) [1, 2, 3, and 4]
$d_j$: destination airport type (j) [1, 2, 3, and 4]
$\beta_0, \beta_1, \ldots, \beta_6$: model parameters to be estimated
$e_{ij}$: residual
Results

[Table: parameter estimates for the business class fare regression]


Interpretation
The parameter estimate for average distance has a positive sign,
showing that longer business trips have a higher average fare.

The positive sign on the competition parameters, the Herfindahl Index at
the origin and destination airports, indicates that competition also
affects business class fares: the less the competition, the higher the
average fare between the O-D pair. It also shows that competition at the
origin airport is more critical than competition at the destination
airport in the case of business class trips.

The annual passenger flows variable has a contradictory effect on the