Regressions
Understand the hypothesized relationships between variables
Explanatory and predictive
Trying to fit a line to data
Assumptions:
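Gaussian distribution: the data is normal. Strictly, the normality requirement applies to the errors, but roughly normal variables make it easier to meet.
No multicollinearity: the independent variables should not be highly correlated with each other.
Errors are normally distributed, because the model is based on point estimates.
Homoscedasticity: the errors should have the same variance across everything you are observing. Unequal variances of the errors (heteroscedasticity) violate this.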
You can still violate these assumptions and build a model
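The error assumptions above can be eyeballed with a few lines of code. This is a minimal sketch in Python using made-up numbers (the data, the helper name fit_line, and the low/high split are assumptions for illustration, not part of the class material). It fits a line by least squares, computes the residuals, and compares their spread in the lower and upper half of x as a crude homoscedasticity check.

```python
from statistics import mean, pvariance

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Made-up data for illustration only; x is already sorted.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

slope, intercept = fit_line(x, y)

# Residual = actual value minus the value the line predicts.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

# Crude check: is the residual variance similar for low x and high x?
half = len(x) // 2
var_low = pvariance(residuals[:half])
var_high = pvariance(residuals[half:])
ratio = max(var_low, var_high) / min(var_low, var_high)

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("variance ratio (high spread / low spread):", round(ratio, 2))
```

A ratio far above 1 hints at unequal variances. This is only a rough look, not a formal test, and a residual plot is usually the better first step.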
When should it be used?
Limitations
Interpretations
When it should not be used
How to diagnose common data problems
How to use dummy variables
Y variable: also called the criterion variable or dependent variable.
Correlation is a good place to start to figure out something about the data. It shows whether the hypothesized relationship is there at all.
R-squared value and t-test
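On the dummy variable item: a nominal variable cannot go into a regression as a label, so it is usually recoded into 0/1 columns, one per category, leaving one category out as the reference. A minimal sketch in Python, where the helper name make_dummies, the region values, and the choice of reference category are all made up for illustration:

```python
def make_dummies(values, reference):
    """Turn a list of category labels into 0/1 columns.

    The reference category gets no column of its own; a row that is 0
    in every dummy column belongs to the reference category.
    """
    levels = sorted(set(values) - {reference})
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# Hypothetical nominal variable with three categories.
region = ["north", "south", "east", "south", "north"]

dummies = make_dummies(region, reference="north")
for name, column in dummies.items():
    print(name, column)
```

Leaving out the reference category matters: with an intercept in the model, a full set of dummy columns would be perfectly collinear.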
Model summary provides most of the insights
Data that appears to be coded strangely
Know your data
Know the source
Is something missing
Distribution of variables, and whether the samples are representative
80% of the work is data prep, knowing the data really well
Example question: what is the relationship between key economic variables and female life expectancy?
If data is missing there are ways to get around it, for example by excluding records.
Q-Q plots and other normality checks tell us whether the data is normally distributed.
Data transformation makes it harder to draw inferences.
Linear regression assumes the data is numeric and measured on a scale (interval or ratio), not just categories.
Ordinal variable: ordered categories, e.g. a Likert scale from 1 to 5.
Nominal variable: labels with no order, e.g. name, gender.
Interval data: the differences between values are meaningful, so 2 is greater than 1 by the same amount that 3 is greater than 2. Only on a ratio scale is 2 also twice as large as 1.
The scatter plot matrix with all the variables can tell you if a relationship is linear or curvilinear.
Menu path: Graphs > Legacy Dialogs > Scatter > Matrix scatter
If the IVs are correlated with each other, that is bad.
Transform the variables if they aren't normal. The most common type of transformation is a log transformation.
You want to preserve the variable's inherent properties and its inherent outcome, but you want to make the relationship linear.
The Collinearity option tells you about the Variance Inflation Factor (VIF).
Ways to enter variables into the model:
Throw all of them in at once: this is the default.
Forward selection: adds them in one by one.
Backward selection: starts with all of them and removes them one by one.
Stepwise selection: enters variables one at a time and can remove ones that stop contributing.
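To see what a log transformation does to a curvilinear relationship, here is a minimal sketch in Python. The GDP and life expectancy numbers are made up for illustration (they are not the class dataset), and pearson_r is a small helper written for this example. The point is that the ordering of the values is preserved while the relationship becomes a straight line.

```python
import math
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up data: life expectancy rises quickly at low GDP, then flattens.
gdp = [500, 1000, 2000, 4000, 8000, 16000, 32000]
life_exp = [50, 55, 60, 65, 70, 75, 80]

# The log keeps the order of the values but compresses the large ones.
log_gdp = [math.log(g) for g in gdp]

r_raw = pearson_r(gdp, life_exp)
r_log = pearson_r(log_gdp, life_exp)

print("r with raw GDP:", round(r_raw, 3))
print("r with log GDP:", round(r_log, 3))
```

With these constructed numbers the logged variable lines up almost perfectly with life expectancy, while the raw variable does not. The cost is interpretation: the coefficient now refers to a change in log GDP, not in GDP itself.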
Parsimony: build a parsimonious model, one that explains the data with as few variables as possible.
The F value: the higher, the better the model.
The significance number (p-value) should be as small as possible.
The significance value is the probability of seeing a result at least this strong if there were no real relationship. The smaller it is, the more confident you can be that chance is not the explanation.
You want VIF to be less than 10.
Beta coefficients: if the coefficient is positive, then the more phones in a household, the higher the life expectancy.
You have to make sure the direction of the relationship makes sense.
Sign flipping, where a coefficient has the opposite sign from what makes sense, is a big problem.
Unstandardized B: the change in Y for a 1 unit change in the variable.
Interpretation is different when a variable is logged: B is then the change in Y for a 1 unit change in the log of the variable.
Why would you standardize a variable? It becomes important when you have two variables on different scales.
Standardized Beta: the change in Y, in standard deviations, for a 1 standard deviation change in the variable.
You need to find the difference between the predicted value and the actual value: these are the residuals.
Residuals are important. Check them to see whether the error terms are normally distributed.
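The link between VIF, standardized Beta, and unstandardized B can be worked through by hand for a model with two predictors. This is a minimal sketch in Python under stated assumptions: the data is made up and noise-free (y is built as 1 + 2*x1 + 3*x2 so the answer is known), and the formulas used are the standard two-predictor ones based on correlations. It is not the output of any particular statistics package.

```python
import math
from statistics import mean, stdev

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up predictors that are correlated with each other.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
# Outcome constructed with known coefficients: B1 = 2, B2 = 3.
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

r12 = pearson_r(x1, x2)
r_y1 = pearson_r(y, x1)
r_y2 = pearson_r(y, x2)

# With two predictors, each one's VIF is 1 / (1 - r12^2).
vif = 1 / (1 - r12 ** 2)

# Standardized Betas: effect in standard deviations of y
# per 1 standard deviation of the predictor.
beta1 = (r_y1 - r_y2 * r12) / (1 - r12 ** 2)
beta2 = (r_y2 - r_y1 * r12) / (1 - r12 ** 2)

# Unstandardized B: rescale each Beta back to the original units.
b1 = beta1 * stdev(y) / stdev(x1)
b2 = beta2 * stdev(y) / stdev(x2)

print("VIF:", round(vif, 2))
print("standardized Betas:", round(beta1, 3), round(beta2, 3))
print("unstandardized B:", round(b1, 3), round(b2, 3))
```

The unstandardized B values recover the coefficients used to build y, and the standardized Betas put both predictors on the same scale so their sizes can be compared. The VIF here stays under the rule-of-thumb limit of 10 even though the two predictors are clearly correlated.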