
Data Mining Project

Rossmann Store Sales

Ayari Nadhmi, 4 ERP-BI 3

2016-2017
Content

1 Problem Introduction

2 Exploratory Data Analysis

3 Models

4 Conclusion

4/29/17 2
Problem Introduction

Forecast sales using store, promotion, and competitor data

1,115 stores
Historical sales data from January 2013 to July 2015

Exploratory Data Analysis
Dataset Details

Train
Variables: Store, DayOfWeek, Date, Sales, Customers, Open, Promo, StateHoliday, SchoolHoliday

Store
Variables: Store, StoreType, Assortment, CompetitionDistance, CompetitionOpenSince[Month/Year], Promo2, Promo2Since[Week/Year], PromoInterval

Dataset Details

Id - an Id that represents a store
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the store was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
Promo - indicates whether a store is running a promo on that day
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
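To make the PromoInterval encoding concrete, here is a minimal sketch that checks whether a given date falls in a month in which a Promo2 round starts. The function name promo_round_starts is hypothetical, and the sketch assumes the interval uses the English three-letter abbreviations found in R's month.abb; any other spelling in the data would need to be mapped first.

```r
# Hypothetical helper: does a Promo2 round start in the month of `date`?
# Assumes promo_interval looks like "Feb,May,Aug,Nov" (abbreviations as in month.abb).
promo_round_starts <- function(promo_interval, date) {
  # Stores that do not take part in Promo2 have no interval
  if (is.na(promo_interval) || promo_interval == "") return(FALSE)
  start_months <- strsplit(promo_interval, ",", fixed = TRUE)[[1]]
  # Use the month number to stay independent of the system locale
  current_month <- month.abb[as.integer(format(as.Date(date), "%m"))]
  current_month %in% start_months
}

in_may  <- promo_round_starts("Feb,May,Aug,Nov", "2015-05-10")
in_june <- promo_round_starts("Feb,May,Aug,Nov", "2015-06-10")
```

Applied row by row, a helper like this could turn PromoInterval into a 0/1 feature for the models.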
Dataset Details

First description

Dataset Details

Test whether any stores are closed in the training data

As we can see, there are no stores closed in the train data

Proportion of open versus closed stores in the train data

Proportion of sales by whether the store was closed or open

Dataset Details
Proportion of the number of customers for closed and open stores
As we can see, on some days some stores were open without having any customers

Proportion of sales by whether there was a promo or not
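The open/closed checks described on these slides can be sketched in base R as follows. The toy data frame and its values are made up for illustration; only the column names Open, Sales, and Customers come from the dataset description.

```r
# Toy stand-in for the train data (values are invented)
toy <- data.frame(
  Open      = c(1, 1, 1, 0),
  Sales     = c(100, 200, 0, 0),
  Customers = c(10, 20, 0, 0)
)

# Share of open (1) versus closed (0) store-days
open_share <- prop.table(table(toy$Open))

# Store-days on which the store was open but had no customers
open_no_customers <- subset(toy, Open == 1 & Customers == 0)

# Total sales split by open/closed status
sales_by_open <- tapply(toy$Sales, toy$Open, sum)
```

On the real data, the same three calls would run on train instead of toy.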

Dataset Details
library(ggplot2)
# Keep only days with non-zero sales and customers
testing <- train[which(train$Sales != 0 & train$Customers != 0), ]
# Sales by promo status: jittered points with a boxplot overlay
ggplot(testing, aes(x = factor(Promo), y = Sales)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)

The plot shows the effect of a promo on sales.

Dataset Details

# Keep only days with non-zero sales
testing <- train[which(train$Sales != 0), ]
# Sales by day of week: jittered points with a boxplot overlay
ggplot(testing, aes(x = factor(DayOfWeek), y = Sales)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA)

The plot shows the effect of the day of the week on sales. For days 2, 3, and 4, sales are mostly the same.

Dataset Details
# Keep only days with non-zero sales
testing <- train[which(train$Sales != 0), ]
# Sales by school holiday status; each `+` must end a line so the layers chain
ggplot(testing, aes(x = factor(SchoolHoliday), y = Sales)) +
  geom_jitter(alpha = 0.1, color = "lightblue") +
  geom_boxplot(color = "red", outlier.colour = NA, fill = NA)

This plot shows the effect of a school holiday on the sales amount.

Dataset Details
# Keep only days with non-zero sales and customers
testing <- train[which(train$Sales != 0 & train$Customers != 0), ]
# Number of customers by promo status
ggplot(testing, aes(x = factor(Promo), y = Customers)) +
  geom_jitter(alpha = 0.1, color = "lightblue") +
  geom_boxplot(color = "hotpink", outlier.colour = NA, fill = NA)

This plot shows the effect of a promo on the number of customers.

Data Conversion and Preprocessing
Data Conversion

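# Parse the date and derive calendar features
train$Date <- as.Date(train$Date)
train$month <- as.integer(format(train$Date, "%m"))
train$year <- as.integer(format(train$Date, "%y"))  # two-digit year; "%Y" gives four digits
train$day <- as.integer(format(train$Date, "%d"))
# Treat the binary indicators as categorical variables
train$SchoolHoliday <- as.factor(train$SchoolHoliday)
train$Promo <- as.factor(train$Promo)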

Data Preprocessing

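# Attach the store attributes to each daily record
train_store <- merge(train, store, by = "Store")

# Keep only the days on which the store was open
train_store <- train_store[train_store$Open != 0, ]

# 70/30 split: sample 70% of the row indices for training,
# then draw the test indices from the remaining rows
set.seed(123)
trainsample <- sample(1:nrow(train_store), 0.7 * nrow(train_store))
test <- sample(setdiff(seq_len(nrow(train_store)), trainsample),
               0.3 * nrow(train_store))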

Models Used

Linear regression Model

# Linear model fitted on the training rows (duplicate StoreType term removed)
lr.model <- lm(Sales ~ Promo + DayOfWeek + StateHoliday + month + year + day +
                 StoreType + CompetitionDistance,
               data = train_store[trainsample, ])

Model Formula

Residuals with a max value of 34941

Value close to 0, indicating that the model fits poorly

Linear regression Model

The RMSE given by the model above

We try to improve our model using stepAIC for variable selection
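As a rough illustration of these two steps, the sketch below computes a hold-out RMSE and runs stepAIC from the MASS package on a small synthetic data set. The rmse helper, the toy data, and the deliberately irrelevant noise column are assumptions made for the example; they are not the slides' data or results.

```r
library(MASS)  # provides stepAIC()

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Synthetic stand-in for the merged data (all values are invented)
set.seed(123)
n <- 200
toy <- data.frame(
  Promo     = rbinom(n, 1, 0.4),
  DayOfWeek = sample(1:6, n, replace = TRUE),
  noise     = rnorm(n)  # predictor unrelated to Sales
)
toy$Sales <- 5000 + 2000 * toy$Promo - 150 * toy$DayOfWeek + rnorm(n, sd = 500)

# 70/30 split, in the same spirit as the preprocessing slide
train_idx <- sample(seq_len(n), floor(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# Full model, then stepwise selection by AIC in both directions
full_model <- lm(Sales ~ Promo + DayOfWeek + noise, data = toy[train_idx, ])
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# Hold-out RMSE of both models
rmse_full <- rmse(toy$Sales[test_idx], predict(full_model, toy[test_idx, ]))
rmse_step <- rmse(toy$Sales[test_idx], predict(step_model, toy[test_idx, ]))
```

Comparing rmse_full with rmse_step on held-out rows shows whether the selected model actually predicts better, which AIC alone does not guarantee.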

Random Forest Model

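library(randomForest)
# Random forest with 20 trees, fitted on the training rows
# (duplicate StoreType term removed)
rf <- randomForest(Sales ~ Promo + DayOfWeek + StateHoliday + month + year + day +
                     StoreType + SchoolHoliday,
                   data = train_store[trainsample, ], ntree = 20)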

Random Forest Model

varImpPlot(rf)
This plot shows the importance of the variables used by the model.

plot(rf)
This plot shows how the error evolves with the number of trees.

Random Forest Model

Using h2o library

SVM Model

For the SVM regression model we used only store 1, because processing all the data took too long
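Restricting the data to one store can be sketched as below. The toy rows are invented, and constant_cols is a hypothetical helper variable; it is included because store-level attributes such as StoreType take a single value once only one store is kept, so they carry no information for that store's model.

```r
# Toy stand-in for the merged train/store data (values are invented)
toy <- data.frame(
  Store     = c(1, 1, 1, 2, 2),
  StoreType = c("c", "c", "c", "a", "a"),
  Promo     = c(1, 0, 1, 0, 1),
  Sales     = c(5263, 5020, 4782, 6064, 5567)
)

# Keep only the rows for store 1
train_store1 <- toy[toy$Store == 1, ]

# Columns that are constant within the subset
constant_cols <- names(train_store1)[
  sapply(train_store1, function(col) length(unique(col)) == 1)
]
```

On the real data the same subsetting would apply to train_store, and any constant columns could be dropped from the model formula.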

SVM Model

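library(e1071)  # provides svm()
# train_store1 holds the rows for store 1; `sample` holds its training row indices
# (duplicate StoreType term removed)
model.SVM <- svm(Sales ~ Promo + DayOfWeek + StateHoliday + month + year + day +
                   StoreType + SchoolHoliday,
                 data = train_store1[sample, ])
summary(model.SVM)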

Conclusion
