# Data Mining Project

Rossmann Store Sales

2016-2017
Content

1 Problem Introduction

2 Exploratory Data Analysis

3 Models

4 Conclusion

4/29/17 2
Problem Introduction

## Forecast sales using store, promotion, and competitor data

1115 Stores
Historical Data of sales from January 2013 to July 2015

Exploratory Data Analysis
Dataset Details

Train
Variables: store, day of week, date, sales, customers, open, promo, state holiday, school holiday

Store
Variables: store, store type, assortment, competition distance, competition open since month, promo2, promo2 since week, promo2 since year, promo interval

Id - an Id that represents a Store

Customers - the number of customers on a given day

Open - an indicator for whether the store was open: 0 = closed, 1 = open

StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

SchoolHoliday - indicates if the Store was affected by the closure of public schools

StoreType - differentiates between 4 different store models: a, b, c, d

Assortment - describes an assortment level: a = basic, b = extra, c = extended

CompetitionDistance - distance in meters to the nearest competitor store

CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened

Promo - indicates whether a store is running a promo on that day

Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2

PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
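
The PromoInterval description can be turned into a per-day flag. The sketch below is only an illustration on made-up rows: the helper `promo2_starts_this_month` and the column `Promo2Start` are hypothetical names, not part of the dataset. It assumes the month abbreviations match R's `month.abb`; the real file may spell some months differently (for example a four-letter September), so checking `unique(store$PromoInterval)` first is advisable.

```r
# Toy rows; column names follow the dataset description, values are made up.
toy <- data.frame(
  Date          = as.Date(c("2015-02-10", "2015-03-10", "2015-08-01")),
  Promo2        = c(1, 1, 0),
  PromoInterval = c("Feb,May,Aug,Nov", "Feb,May,Aug,Nov", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a Promo2 round starts in the month of `date`.
promo2_starts_this_month <- function(date, promo2, interval) {
  # month.abb avoids locale-dependent output from format(date, "%b")
  month_name <- month.abb[as.integer(format(date, "%m"))]
  # One character vector of start months per row (empty for "")
  starts <- strsplit(interval, ",", fixed = TRUE)
  hit <- mapply(function(m, s) m %in% s, month_name, starts)
  unname(promo2 == 1 & hit)
}

toy$Promo2Start <- promo2_starts_this_month(toy$Date, toy$Promo2, toy$PromoInterval)
```

Only the February row is flagged: March is not a start month, and the third store does not participate in Promo2.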

First description

## As we can see, there are no stores closed in the train data

Proportion of open versus closed stores in the train data

## Sales by whether the store was closed or open

Number of customers for closed and open stores

As we can see, there are some days when some stores were open without having any customers.
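
The same checks can be done numerically instead of reading them off the plots. This is a minimal sketch on a made-up table: `train_toy`, `open_share`, and `open_no_customers` are illustrative names, and the values are not taken from the real data.

```r
# Toy stand-in for the train table; values are made up for illustration.
train_toy <- data.frame(
  Store     = c(1, 1, 2, 2, 3, 3),
  Open      = c(1, 0, 1, 1, 1, 0),
  Sales     = c(5263, 0, 6064, 0, 8314, 0),
  Customers = c(555, 0, 625, 0, 821, 0)
)

# Share of closed (0) and open (1) store-days
open_share <- prop.table(table(train_toy$Open))

# Store-days on which the store was open but had no customers
open_no_customers <- subset(train_toy, Open == 1 & Customers == 0)
```

On the real table, `nrow(open_no_customers)` would give the number of such days.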

## Sales by whether a promo was running

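library(ggplot2)

# Keep only the days with non-zero sales and customers
testing <- train[which(train$Sales != 0 & train$Customers != 0), ]
# Sales distribution without (0) and with (1) a promo
ggplot(testing, aes(x = factor(Promo), y = Sales)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)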

## The plot shows the effect of a promo on Sales


# Keep only the days with non-zero sales
testing <- train[which(train$Sales != 0), ]
# Sales distribution by day of the week
ggplot(testing, aes(x = factor(DayOfWeek), y = Sales)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)

## The plot shows the effect of the day of the week on Sales: for days 2, 3, and 4 the sales are mostly the same
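
A per-day summary makes the comparison between weekdays easier to read than the boxplots alone. The sketch below uses made-up numbers, and `train_toy`, `open_days`, and `median_by_day` are illustrative names only.

```r
# Toy data; values are made up for illustration.
train_toy <- data.frame(
  DayOfWeek = c(1, 1, 2, 2, 3, 3, 7),
  Sales     = c(9000, 7000, 6000, 6200, 6100, 5900, 0)
)

# Same filter as in the plots: drop the days without sales
open_days <- train_toy[train_toy$Sales != 0, ]

# Median sales for each day of the week
median_by_day <- aggregate(Sales ~ DayOfWeek, data = open_days, FUN = median)
```

Days whose medians are close together, like 2 and 3 in this toy table, are the ones that look "mostly the same" in the plot.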

# Keep only the days with non-zero sales
testing <- train[which(train$Sales != 0), ]
# Sales distribution on regular days (0) and school holidays (1)
ggplot(testing, aes(x = factor(SchoolHoliday), y = Sales)) +
  geom_jitter(alpha = 0.1, color = "lightblue") +
  geom_boxplot(color = "red", outlier.colour = NA, fill = NA)

## This plot shows the effect of a SchoolHoliday on the sales amount

# Keep only the days with non-zero sales and customers
testing <- train[which(train$Sales != 0 & train$Customers != 0), ]
# Number of customers without (0) and with (1) a promo
ggplot(testing, aes(x = factor(Promo), y = Customers)) +
  geom_jitter(alpha = 0.1, color = "lightblue") +
  geom_boxplot(color = "hotpink", outlier.colour = NA, fill = NA)

## The effect of a promo on the number of customers

Data Conversion and Preprocessing

Data Conversion

# Parse the date and split it into numeric parts
train$Date <- as.Date(train$Date)
train$month <- as.integer(format(train$Date, "%m"))
train$year <- as.integer(format(train$Date, "%y"))  # two-digit year, e.g. 15
train$day <- as.integer(format(train$Date, "%d"))
# Treat the 0/1 indicators as categorical variables
train$SchoolHoliday <- as.factor(train$SchoolHoliday)
train$Promo <- as.factor(train$Promo)
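
To make the date conversion concrete, here is what the format codes return for a single date. This is a small sketch: the variable names are illustrative, and the `"%Y"` line is shown only for comparison with the two-digit `"%y"` used for the year.

```r
d <- as.Date("2015-07-31")

mm   <- as.integer(format(d, "%m"))  # month as a number
dd   <- as.integer(format(d, "%d"))  # day of the month
yy   <- as.integer(format(d, "%y"))  # two-digit year
yyyy <- as.integer(format(d, "%Y"))  # four-digit year, for comparison
```

Either year encoding separates the years equally well in a model; the four-digit form is mainly easier to read in summaries and plots.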

Data Preprocessing

# Attach the store attributes to every daily record
train_store <- merge(train, store, by = "Store")
# Keep only the days on which the store was open
train_store <- train_store[train_store$Open != 0, ]

set.seed(123)
n <- nrow(train_store)
# 70% of the rows for training, the remaining rows for testing
trainsample <- sample(seq_len(n), floor(0.7 * n))
test <- setdiff(seq_len(n), trainsample)
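
A quick sanity check on a split like this is that the two index sets are disjoint and together cover every row. The sketch below runs on a toy size, with `n` standing in for `nrow(train_store)`; `overlap` and `covered` are illustrative names.

```r
set.seed(123)
n <- 10  # stands in for nrow(train_store)

trainsample <- sample(seq_len(n), floor(0.7 * n))
test <- setdiff(seq_len(n), trainsample)

# The two index sets should not overlap and should cover every row
overlap <- intersect(trainsample, test)
covered <- sort(c(trainsample, test))
```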

Models Used

Linear regression Model

lr.model <- lm(Sales ~ Promo + DayOfWeek + StateHoliday + month + year +
                 day + StoreType + CompetitionDistance,
               data = train_store[trainsample, ])

Model Formula

[Model summary output; annotations: "value of 34941" and "Value being close to 0, indicating that the model …"]


The RMSE given by the above model

We try to improve our model using stepAIC for variable selection
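
Both steps can be sketched on synthetic data. Everything here is an assumption for illustration: the toy data frame, the `noise` column, and the `rmse` helper are not from the project. `stepAIC` comes from the MASS package.

```r
library(MASS)  # provides stepAIC

set.seed(123)
n <- 200
toy <- data.frame(
  Promo     = factor(sample(0:1, n, replace = TRUE)),
  DayOfWeek = sample(1:6, n, replace = TRUE),
  noise     = rnorm(n)  # deliberately unrelated to Sales
)
toy$Sales <- 5000 + 2000 * (toy$Promo == "1") - 150 * toy$DayOfWeek +
  rnorm(n, sd = 300)

# 70/30 split, as in the preprocessing step
idx      <- sample(seq_len(n), floor(0.7 * n))
train_df <- toy[idx, ]
test_df  <- toy[-idx, ]

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

full_model <- lm(Sales ~ Promo + DayOfWeek + noise, data = train_df)
rmse_full  <- rmse(test_df$Sales, predict(full_model, newdata = test_df))

# Stepwise selection by AIC, adding and dropping terms
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
rmse_step  <- rmse(test_df$Sales, predict(step_model, newdata = test_df))
```

Comparing `rmse_full` with `rmse_step` on the held-out rows shows whether the selection actually helped; a lower AIC does not guarantee a lower test RMSE.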

Random Forest Model

library(randomForest)

# Random forest regression on the training rows, with 20 trees
rf <- randomForest(Sales ~ Promo + DayOfWeek + StateHoliday + month + year +
                     day + StoreType + SchoolHoliday,
                   data = train_store[trainsample, ], ntree = 20)


varImpPlot(rf)
This plot shows the importance of the variables used by the model.

plot(rf)
This plot shows how the error evolves with the number of trees.
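
The numbers behind both plots can also be read directly from the fitted object. This is a sketch on synthetic data, where `toy`, `rf_toy`, `imp`, and `oob_mse` are illustrative names; for a regression forest, `importance()` returns the node-purity increase per variable and the `mse` component holds the out-of-bag error after each tree.

```r
library(randomForest)

set.seed(123)
n <- 300
toy <- data.frame(
  Promo     = factor(sample(0:1, n, replace = TRUE)),
  DayOfWeek = sample(1:6, n, replace = TRUE),
  day       = sample(1:28, n, replace = TRUE)
)
toy$Sales <- 5000 + 2000 * (toy$Promo == "1") - 150 * toy$DayOfWeek +
  rnorm(n, sd = 300)

rf_toy <- randomForest(Sales ~ Promo + DayOfWeek + day, data = toy, ntree = 20)

# Values plotted by varImpPlot(): one row per predictor
imp <- importance(rf_toy)

# Values plotted by plot(): out-of-bag MSE after 1, 2, ..., ntree trees
oob_mse <- rf_toy$mse
```

If `oob_mse` is still falling at the last tree, a larger `ntree` is worth trying.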


## Using h2o library

SVM Model

## For the SVM regression model we used only store 1, because processing all the data took a long time


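library(e1071)  # provides svm

# train_store1 holds the rows of store 1; sample holds its training row indices
model.SVM <- svm(Sales ~ Promo + DayOfWeek + StateHoliday + month + year + day +
                   StoreType + SchoolHoliday,
                 data = train_store1[sample, ])
summary(model.SVM)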
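
A minimal sketch of fitting and scoring an SVM regression on a single-store subset, using synthetic data. The names `store1_toy`, `svm_toy`, and `rmse_svm` are illustrative. Within one store, StoreType takes a single value and carries no information, so it is left out here.

```r
library(e1071)

set.seed(123)
n <- 120
store1_toy <- data.frame(
  Promo     = factor(sample(0:1, n, replace = TRUE)),
  DayOfWeek = sample(1:6, n, replace = TRUE)
)
store1_toy$Sales <- 4000 + 1500 * (store1_toy$Promo == "1") -
  100 * store1_toy$DayOfWeek + rnorm(n, sd = 200)

# 70/30 split of the single-store rows
idx <- sample(seq_len(n), floor(0.7 * n))

# With a numeric response, svm() fits a regression model by default
svm_toy <- svm(Sales ~ Promo + DayOfWeek, data = store1_toy[idx, ])

# Score on the held-out rows
pred     <- predict(svm_toy, newdata = store1_toy[-idx, ])
rmse_svm <- sqrt(mean((store1_toy$Sales[-idx] - pred)^2))
```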

Conclusion
