
R Tutorial

Capital One Data Mining Cup
UW Statistics Club
Saturday, March 23, 2013

Who Will Benefit From This?

Aimed at students who:

Have the statistical background but lack the (R) modelling expertise
Have never taken a linear regression course (or simply forgot the one they did!)

What We Will Be Doing Today

Walkthrough example of a statistical prediction problem using Kaggle test data (Titanic problem)
The goal is to predict who will survive given different factors such as:

Age
Ticket
Fare
Sex
Cabin
Number of family aboard

R Basics

Opening R (RStudio)
Navigating to the working directory
Running commands
Installing packages
Loading packages
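The basics listed above might look like the following in an R session. This is a minimal sketch: `tempdir()` is only a stand-in for your own project folder, and the package names are taken from the exploratory analysis slide later in the deck. The install line is commented out because it only needs to run once per machine.

```r
# Show the current working directory
getwd()

# Move to another folder; tempdir() is a placeholder for your project path.
# setwd() returns the previous directory, so we can switch back afterwards.
old_wd <- setwd(tempdir())
list.files()      # files visible from the new working directory
setwd(old_wd)     # return to where we started

# Install packages once per machine (uncomment to run)
# install.packages(c("ggplot2", "plyr"))

# Load packages at the start of every session, if they are installed
for (pkg in c("ggplot2", "plyr")) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)
  } else {
    message("Package not installed: ", pkg)
  }
}
```

In RStudio, a single line can be run from the script editor with Ctrl+Enter (Cmd+Enter on a Mac).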

Basic Guideline to Data Analysis


1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
11. Create reproducible code

Cleaning the Data (skipped)


Fix variable names
Merge data sets
Fix missing content
Fix inconsistent data

Exploratory Data Analysis

Make use of:

Aggregation
Tables
Charts

We use two different R packages here: ggplot2, plyr
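As a rough illustration of aggregation, tables, and charts, the sketch below uses a tiny made-up data frame rather than the real Kaggle file. The column names mimic the Titanic data, but the values are invented. Base R does most of the work here; the plyr and ggplot2 calls run only if those packages are installed.

```r
# Toy stand-in for the Titanic training data (values are invented)
titanic <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0, 0, 1),
  Sex      = c("male", "female", "female", "female",
               "male", "male", "male", "female"),
  Pclass   = c(3, 1, 3, 1, 3, 3, 1, 2),
  Age      = c(22, 38, 26, 35, 35, 54, 40, 14)
)

# Aggregation: survival rate by sex (base R)
rate_by_sex <- aggregate(Survived ~ Sex, data = titanic, FUN = mean)
print(rate_by_sex)

# The same aggregation with plyr, if available
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::ddply(titanic, "Sex", plyr::summarise, rate = mean(Survived)))
}

# Table: passenger class against survival
class_table <- table(Class = titanic$Pclass, Survived = titanic$Survived)
print(class_table)

# Chart: survivors by sex, with ggplot2 if available, else base graphics
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(titanic,
                       ggplot2::aes(x = Sex, fill = factor(Survived))) +
    ggplot2::geom_bar() +
    ggplot2::labs(fill = "Survived")
  print(p)
} else {
  barplot(table(titanic$Survived, titanic$Sex), legend.text = TRUE)
}
```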

Testing Your Model


Before we build our model we need to have a methodology on how we will test it.
A naive analyst would use the entire data set to build the model and then test it on the same data set. This rewards overfitting, because the model is scored on data it has already seen!
Instead: partition the training data set into a real training set and a validation set.
To create the validation set use:

Random sub-sampling
K-fold
Leave-one-out
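The three ways of creating a validation set listed above can be sketched with row indices alone. The row count `n`, the 30% hold-out fraction, and `k = 5` are arbitrary choices for illustration, not values from the tutorial.

```r
set.seed(42)
n <- 100  # hypothetical number of rows in the training data

# Random sub-sampling: hold out 30% of the rows for validation
val_idx   <- sample(n, size = round(0.3 * n))
train_idx <- setdiff(seq_len(n), val_idx)

# K-fold: assign every row to exactly one of k folds, in random order
k    <- 5
fold <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- integer(k)
for (i in seq_len(k)) {
  val_rows   <- which(fold == i)  # validate on fold i
  train_rows <- which(fold != i)  # train on the remaining folds
  fold_sizes[i] <- length(val_rows)
  # fit the model on train_rows and score it on val_rows here
}

# Leave-one-out is K-fold with k = n: every row is its own fold
loo_fold <- seq_len(n)
```

Each row is used for validation exactly once in K-fold, so the k validation scores can be averaged into a single estimate.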

What measurement do we use to compare?

Adjusted R², AIC, BIC
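These three measures are all available in base R for a fitted `lm` object. The sketch below compares two candidate models on simulated data, where the variable names and coefficients are made up and `x2` is deliberately pure noise. A higher adjusted R² is better, while lower AIC and BIC are better.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)               # unrelated to the response by construction
y  <- 1 + 2 * x1 + rnorm(n)
d  <- data.frame(y, x1, x2)

fit_small <- lm(y ~ x1, data = d)
fit_big   <- lm(y ~ x1 + x2, data = d)

# Collect the comparison measures side by side
comparison <- data.frame(
  model  = c("y ~ x1", "y ~ x1 + x2"),
  adj_r2 = c(summary(fit_small)$adj.r.squared,
             summary(fit_big)$adj.r.squared),
  AIC    = c(AIC(fit_small), AIC(fit_big)),
  BIC    = c(BIC(fit_small), BIC(fit_big))
)
print(comparison)
```

BIC penalizes extra parameters more heavily than AIC once the sample has more than about 8 rows, so it tends to prefer smaller models.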

Building Our First Model - Simple Linear Regression

Why is this a good starting point?

Easy to implement in R

Black box (i.e. no tuning parameters)


Easy to interpret/explain

Disadvantage: performs poorly in non-linear setting

Building Our First Model - Simple Linear Regression


After we have run our first model we want to:

Examine Residuals plot
Examine Q-Q plot
Use the Model Testing process to pick a proper model

Using the step function in R
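The checks above might be run as follows. This is a sketch on simulated data, not the tutorial's own code: the data frame `d`, its column names, and the coefficients are all hypothetical. Backward selection with `step` is one possible direction; `step` chooses terms by AIC.

```r
set.seed(2)
n <- 150
d <- data.frame(
  age   = runif(n, 1, 70),
  fare  = rexp(n, rate = 1 / 30),
  noise = rnorm(n)               # a predictor with no real effect
)
d$y <- 5 + 0.1 * d$age + 0.05 * d$fare + rnorm(n)

# Fit the linear regression with all candidate predictors
fit <- lm(y ~ age + fare + noise, data = d)
print(summary(fit))

# Diagnostic plots for a fitted lm object
plot(fit, which = 1)  # residuals vs fitted values
plot(fit, which = 2)  # normal Q-Q plot of the residuals

# Stepwise selection by AIC, starting from the full model
chosen <- step(fit, direction = "backward", trace = 0)
print(formula(chosen))
```

A residual plot with no visible pattern and a Q-Q plot close to the diagonal suggest the linear model assumptions are reasonable.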

Understanding Interaction (optional)

Checking for Multicollinearity (optional)


Multiple predictor variables are highly correlated
Can be caused by:

Creating a new predictor variable from existing ones

Having multiple predictors that explain the same thing

Consequence: the standard errors of the coefficient estimates blow up
Use R to compute the correlation between all predictors. If there exist sets of predictors with correlation above roughly 0.90 to 0.95, then either:

Remove all but one
Combine into a new composite variable
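One way to run the correlation check described above is with `cor` on a data frame of numeric predictors. In this sketch the predictors are simulated, and `fare_band` is a hypothetical variable derived from `fare` so that the pair is highly correlated by construction. The 0.90 cutoff is the lower end of the range mentioned above.

```r
set.seed(3)
n    <- 100
fare <- rexp(n, rate = 1 / 30)
preds <- data.frame(
  fare      = fare,
  fare_band = round(fare / 10),  # derived from fare, so nearly redundant
  age       = runif(n, 1, 70)
)

# Pairwise correlations between all predictors
cors <- cor(preds)

# Keep each pair once by blanking the lower triangle and the diagonal
cors[lower.tri(cors, diag = TRUE)] <- NA

# List the pairs whose absolute correlation exceeds the threshold
high <- which(abs(cors) > 0.90, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
print(high_pairs)
```

For each flagged pair, either drop one of the two variables or replace both with a single composite.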

What Next?

Taking our Simple Linear Regression to the next level


Higher order terms
Interaction terms
Data Transformations
Check for multicollinearity
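The extensions listed above map onto R's formula syntax. The sketch below uses simulated data with hypothetical column names. `I(age^2)` adds a squared term, `age * sex` expands to both main effects plus their interaction, and `log(fare)` applies a transformation inside the formula.

```r
set.seed(4)
n <- 200
d <- data.frame(
  age  = runif(n, 1, 70),
  sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  fare = rexp(n, rate = 1 / 30) + 1   # shifted so log(fare) is defined
)
d$y <- 2 + 0.03 * d$age - 0.0005 * d$age^2 +
  ifelse(d$sex == "male", -1, 0) + log(d$fare) + rnorm(n, sd = 0.5)

fit_base <- lm(y ~ age + sex + fare, data = d)             # starting point
fit_poly <- lm(y ~ age + I(age^2) + sex + fare, data = d)  # higher order term
fit_int  <- lm(y ~ age * sex + fare, data = d)             # interaction term
fit_log  <- lm(y ~ age + sex + log(fare), data = d)        # transformation

# Compare the candidates by AIC (lower is better)
aic_values <- sapply(
  list(base = fit_base, poly = fit_poly, int = fit_int, log = fit_log),
  AIC
)
print(aic_values)
```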

Different Types of Models (not covered here but check the R Code!)

Generalized Linear Models
Trees
Random Forest
Ensemble Methods
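As one example from this list, a generalized linear model for a yes/no outcome such as survival is commonly fitted as a logistic regression with `glm`. The sketch below is not the tutorial's own R code. It uses simulated data with hypothetical column names and coefficients, and the 0.5 cutoff is an arbitrary choice.

```r
set.seed(5)
n <- 300
d <- data.frame(
  age = runif(n, 1, 70),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Simulate a binary outcome whose log-odds depend on age and sex
p_true     <- plogis(1 - 0.02 * d$age - 2 * (d$sex == "male"))
d$survived <- rbinom(n, size = 1, prob = p_true)

# Logistic regression: a GLM with a binomial family
fit_glm <- glm(survived ~ age + sex, data = d, family = binomial)
print(summary(fit_glm))

# Predicted survival probabilities, turned into 0/1 predictions
prob <- predict(fit_glm, type = "response")
pred <- as.integer(prob > 0.5)

# Share of correct predictions on the data the model was fitted to
accuracy <- mean(pred == d$survived)
print(accuracy)
```

This accuracy is measured on the fitting data, so it is optimistic. A validation set gives a fairer number.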
