Wal Mart Sales Forecasting

WAL-MART SALES
FORECASTING
94-832: Business Intelligence & Data Mining SAS
TEAM 7
MITHUN MATHEW
MEAGAN MUSGRAVE
AKASH PATEL
RENU THOMAS
IVY YANG
Report Team 7
Table of Contents
1 Introduction
2 Business Questions
    Question One
    Question Two
3 Description and Preparation of Data
    Data Source
    Data Sets Utilized
    Data Preparation: Merging, Cleaning, and Transforming the Data
4 Exploratory Analysis
    Top 10 Stores by Sales
    Top 5 Departments across the stores
    Sales vs CPI & Sales vs Fuel price
5 Unsupervised Learning: Clustering
    Initial Results
    Insight from Cluster A
    Insight from Cluster B
    Insight from Cluster C
    Overall Insight
6 Supervised Learning: Regression
    Linear Regression with Full Data
    Linear Regression with Imputed and Transformed Full Data
    Linear Regression with Filtered Data
    Linear Regression with Normalized Data
7 Supervised Learning: Decision Tree
    Two-way Split
    Three-way Split
    Two-way Split without DEPT and STORE
    Decision Tree on Sampled Data
8 Time Series Analysis
    Data Exploration
    Hierarchical Clustering [6]
    Sales Forecasting [7]
9 Business Implications
10 References
Appendix A
Appendix B
1 Introduction
This project was completed to fulfil the project requirement of the course 94-832: Business
Intelligence & Data Mining SAS. The core of our analysis is the Walmart data set obtained from Kaggle.
The data contain weekly sales of various departments within different stores over a period of
time. Most of the work in the project revolves around staging and cleaning the data and
modelling it with different parameters and methodologies.
Using three methodologies (clustering, regression and decision trees), different models were generated
and their errors were noted. Variables of importance were identified and clustering insights were drawn.
Time series analysis was used for hierarchical clustering on sales trends, which showed how the clusters
differ from one another. Time series forecasting was then used to predict sales for the 2012 end-of-year
holiday season.
2 Business Questions
Question One
Retailers face many challenges when trying to forecast sales: the scale of the
problem, erratic sales at each individual store, seasonal changes, the constant introduction of new
items, and repeated promotional activity [1]. To address these issues, retailers have turned
to large-scale demand forecasting that can accommodate large amounts of transaction data. By
collecting these data, retailers can mine them and project future customer behavior. The ability to
forecast on such a large scale gives retailers the opportunity to optimize their revenue system,
thus enabling better choices on promotions and pricing. For our project we take on this challenge and
attempt to forecast sales at Walmart accurately. Given Walmart’s reputation for competitive
pricing, the ability to project sales accurately is key to its ability to function. Research
out of the University of Michigan recently affirmed that clustering prior to forecasting sales greatly
increases the accuracy of forecasts [2]. By clustering stores based on sales and on attributes such as
average temperature and fuel prices, stores can eliminate the need to control for seasonal indices and
classes (summer shoes versus winter shirts, etc.). By applying hierarchical clustering to the data we hope
to determine which stores are similar, in terms of both sales and store attributes, so that we can ascertain
which characteristics are key drivers of sales, thus allowing us to generate more accurate forecasts.
Question Two
Recent news reports have underscored the importance of getting an accurate forecast. In January of
2014, Walmart had several chains cut their forecasts due to the holiday season and “profit-eating”
discounts [3]. Toward the end of 2014, Walmart again acknowledged that it needs
to do a “better” job at forecasting in order to ensure that it is keeping appropriate levels of inventory [4].
Given these recent developments, it is clear that forecasting plays an integral role in a retailer’s success.
We will address Walmart’s challenge by leveraging sales data from 45 Walmart stores located in
different regions of the United States. Using these data we will be able to make predictions of
department-wide sales at each of the 45 stores. In addition to attempting to predict
department-wide sales accurately, we will also attempt to understand the impact of markdowns (price
reductions) on holiday weeks. It is important to note, however, that while we have department-wide sales
data for each of the 45 stores, we will be modeling the effect of markdowns without possessing
complete historical data. Overall, we hope to understand which attributes significantly impact sales at
the store level via regression, time series analysis, and decision tree models. These results can then
support an accurate prediction of 2012 sales, thus allowing us to determine the best time to hire new
employees.
3 Description and Preparation of Data

Data Source
The Walmart Store Sales data were published as a Walmart recruiting competition on Kaggle [5]. They cover
historical sales data for 45 Walmart stores in different regions of the United States from 2010-02-05 to
2012-11-01. There are three files in the data set: “stores.csv”, “features.csv” and “train.csv”.
Data Sets Utilized

stores.csv
This file describes three important features of the 45 stores: each store has an ID (1-45), a store
type (A-C) and a store size (numeric).
features.csv
This file describes additional information about each store for the given weeks. Each record contains 5
types of promotional markdowns for the given week. It also includes the average temperature, fuel price,
CPI and unemployment rate for the corresponding geographic region in that week. Each record also
indicates whether the week is a special holiday week.
train.csv
This is the main historical sales data for training. Each record represents weekly sales for a certain
department in a given store in a given week. It also maintains the “isHoliday” field specifying whether the
week is a holiday week.
Based on preliminary analysis, we decided to use all the tables provided. Although we use the official train
data as our dataset, our business goals in this project are not restricted to sales prediction. The next
step focuses on data cleaning, merging and pre-processing.
Data Preparation: Merging, Cleaning, and Transforming the Data

To put together the three .csv files (train.csv, features.csv and stores.csv), the PK-FK relations were
identified. Before denormalizing the data, all the ‘NA’ values in the table features.csv were changed to
NULL. The ‘TRUE’ and ‘FALSE’ values of the ISHOLIDAY attribute were changed to the binary values 1 and 0
respectively.
The following statements generated the denormalized Walmart_Train dataset, which was used for the
remainder of the project.
Combining the Stores and Features tables as Store_Features:

CREATE TABLE Store_Features
AS
SELECT *
FROM Stores JOIN Features USING(Store);
Combining the Store_Features and Train tables as Walmart_Train:

CREATE TABLE Walmart_Train
AS
SELECT *
FROM Train JOIN Store_Features USING(Store, Week, IsHoliday);
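As a minimal sketch of the cleaning and merging steps, the two statements above can be run against an in-memory SQLite database from Python. The reduced column lists, the helper names `clean` and `build_walmart_train`, and the use of SQLite are assumptions made for illustration; the report does not state which database engine was used.

```python
import sqlite3

def clean(value):
    """Map the raw CSV markers to the values described in the text."""
    if value == "NA":
        return None   # 'NA' becomes NULL
    if value == "TRUE":
        return 1      # holiday flag as binary 1
    if value == "FALSE":
        return 0      # non-holiday flag as binary 0
    return value

def build_walmart_train(stores, features, train):
    """Load rows into an in-memory database and run the two joins."""
    con = sqlite3.connect(":memory:")
    # Reduced, illustrative schemas; the real files carry more columns.
    con.execute("CREATE TABLE Stores (Store INTEGER, Type TEXT, Size INTEGER)")
    con.execute("CREATE TABLE Features (Store INTEGER, Week TEXT, "
                "Temperature REAL, MarkDown1 REAL, IsHoliday INTEGER)")
    con.execute("CREATE TABLE Train (Store INTEGER, Dept INTEGER, Week TEXT, "
                "Weekly_Sales REAL, IsHoliday INTEGER)")
    con.executemany("INSERT INTO Stores VALUES (?, ?, ?)", stores)
    con.executemany("INSERT INTO Features VALUES (?, ?, ?, ?, ?)",
                    [tuple(clean(v) for v in row) for row in features])
    con.executemany("INSERT INTO Train VALUES (?, ?, ?, ?, ?)",
                    [tuple(clean(v) for v in row) for row in train])
    # The two statements from the report, unchanged apart from line breaks.
    con.execute("CREATE TABLE Store_Features AS "
                "SELECT * FROM Stores JOIN Features USING(Store)")
    con.execute("CREATE TABLE Walmart_Train AS "
                "SELECT * FROM Train JOIN Store_Features "
                "USING(Store, Week, IsHoliday)")
    cursor = con.execute("SELECT * FROM Walmart_Train")
    names = [d[0] for d in cursor.description]
    rows = [dict(zip(names, r)) for r in cursor.fetchall()]
    con.close()
    return rows
```

Each joined row then carries the sales record together with the store attributes and the weekly features, which is the denormalized shape the later models expect.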
For analytical purposes and visualization, the variables TEMPERATURE, FUEL_PRICE and WEEKLY_SALES
were categorized into the following classes (refer to Appendix A for the SQL queries):
Condition                                        TEMP_CLASS
TEMPERATURE < 32                                 ‘Freezing’
TEMPERATURE >= 32 AND TEMPERATURE < 64           ‘Cold’
TEMPERATURE >= 64 AND TEMPERATURE < 79           ‘Comfortable’
TEMPERATURE >= 79 AND TEMPERATURE < 95           ‘Hot’
TEMPERATURE > 95                                 ‘Extremely Hot’
3-1: TEMP_CLASS

Condition                                        FUEL_CLASS
FUEL_PRICE < 2.75                                ‘Low’
FUEL_PRICE >= 2.75 AND FUEL_PRICE < 3.12         ‘Medium’
FUEL_PRICE > 3.12                                ‘High’
3-2: FUEL_CLASS

Condition                                        SALES_CLASS
WEEKLY_SALES <= 0                                ‘Negative’
WEEKLY_SALES > 0 AND WEEKLY_SALES <= 10000       ‘Low’
WEEKLY_SALES > 10000 AND WEEKLY_SALES <= 25000   ‘Medium’
WEEKLY_SALES > 25000 AND WEEKLY_SALES <= 100000  ‘High’
WEEKLY_SALES > 100000                            ‘Very High’
3-3: SALES_CLASS
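The three binning rules can be sketched as plain functions. This is an illustration rather than the Appendix A queries; the function names are hypothetical. Note that the tables leave TEMPERATURE = 95 and FUEL_PRICE = 3.12 unassigned, so the sketch assumes those boundary values belong to the upper class.

```python
def temp_class(temperature):
    """Bin a temperature value using the thresholds of table 3-1."""
    if temperature < 32:
        return "Freezing"
    if temperature < 64:
        return "Cold"
    if temperature < 79:
        return "Comfortable"
    if temperature < 95:
        return "Hot"
    return "Extremely Hot"  # assumption: exactly 95 falls here

def fuel_class(fuel_price):
    """Bin a fuel price using the thresholds of table 3-2."""
    if fuel_price < 2.75:
        return "Low"
    if fuel_price < 3.12:
        return "Medium"
    return "High"  # assumption: exactly 3.12 falls here

def sales_class(weekly_sales):
    """Bin weekly sales using the thresholds of table 3-3."""
    if weekly_sales <= 0:
        return "Negative"
    if weekly_sales <= 10000:
        return "Low"
    if weekly_sales <= 25000:
        return "Medium"
    if weekly_sales <= 100000:
        return "High"
    return "Very High"
```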
To visualize the data from a better perspective, further categorical attributes were added, including
HOLIDAY (‘Super Bowl’, ‘Labor Day’, ‘Thanksgiving’, ‘Christmas’). The two weeks before each holiday were
labeled (‘Before Super Bowl’, ‘Before Labor Day’, ‘Before Thanksgiving’, ‘Before Christmas’).
Furthermore, unemployment and CPI were categorized into ‘Low’, ‘Medium’ and ‘High’. Store size was
categorized into ‘Small’, ‘Medium’ and ‘Large’. (Refer to Appendix B for the SQL queries.)
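The holiday labeling can be sketched as follows. This is a sketch under stated assumptions: the function name `holiday_label` is hypothetical, the caller supplies the holiday week dates (the report does not list them), and ordinary weeks are given no label.

```python
from datetime import date, timedelta

def holiday_label(week, holiday_weeks):
    """Return the holiday name, a 'Before ...' label, or None.

    week          -- date identifying the sales week
    holiday_weeks -- dict mapping a holiday name to the dates of its weeks
                     (supplied by the caller; not taken from the report)
    """
    for name, dates in holiday_weeks.items():
        for holiday in dates:
            if week == holiday:
                return name
            # The two weeks preceding the holiday week.
            if week in (holiday - timedelta(weeks=1),
                        holiday - timedelta(weeks=2)):
                return "Before " + name
    return None  # assumption: ordinary weeks carry no label
```

For example, with `{"Christmas": [date(2011, 12, 30)]}` as an illustrative input, the weeks dated 2011-12-16 and 2011-12-23 would both be labeled ‘Before Christmas’.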
4 Exploratory Analysis
Top 10 Stores by Sales
4-1: Top 10 Stores
The above chart shows the top 10 stores in terms of sales revenue and each store’s percentage
contribution to the total sales generated by these 10 stores. Store 20 was the highest contributor with a
total of 301 million. The stores are a mix of 7 large and 3 medium-sized stores. Together, these 10 stores
accounted for 39% of the revenue generated by the 45 stores.
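A ranking of this kind can be computed with a simple aggregation. The sketch below is illustrative only; `top_store_share` is a hypothetical helper and the input is assumed to be (store, weekly sales) pairs.

```python
from collections import defaultdict

def top_store_share(rows, n=10):
    """Return the n stores with the highest total sales and their
    combined share of the revenue of all stores.

    rows -- iterable of (store_id, weekly_sales) pairs
    """
    totals = defaultdict(float)
    for store, sales in rows:
        totals[store] += sales
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:n]
    grand_total = sum(totals.values())
    share = sum(total for _, total in top) / grand_total
    return top, share
```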
Top 5 Departments across the stores
4-2: Top 5 Departments
The above figure shows the top 5 departments across the 3 store types, namely A, B and C. Interestingly,
department 72 showed a significant hike in sales across store types A and B. Store type A generated
the most sales, whereas store type C generated the least.
4-3: Top 10 Stores
4-4: Top 5 Departments
The above figure shows the pre-holiday sales registered by the 3 store types. Sales were highest
before Christmas, followed by pre-Thanksgiving, pre-Labor Day and pre-Super Bowl sales. Store type A
registered the highest sales, followed by store type B and store type C.
Sales vs CPI & Sales vs Fuel price
4-5: Sales vs CPI & Sales vs Fuel Price
No strong relationships were clear from visualizing the weekly sales data with respect to the CPI and the
fuel price during that week.
5 Unsupervised Learning: Clustering
5-1: Clustering Nodes
Initial Results
The clustering model uses all of the attributes in the data except weekly sales and the markdown
variables. We assign the store ID the ‘segment’ cluster variable role and indicate that the model should
standardize the data. The centroid clustering method yields three unique clusters.
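To illustrate the idea of standardizing attributes and then grouping records around centroids, here is a minimal sketch using plain k-means. This is a swapped-in technique chosen for illustration; it is not claimed to be the algorithm behind the SAS centroid method, and the function names and the first-k initialization are assumptions.

```python
import math

def standardize(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]

def kmeans(points, k, iterations=50):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distance).
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[j])))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids
```

Standardizing first keeps a large-valued attribute such as store size from dominating the distance calculation.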
5-2: Clustering
Each of these clusters represents a group of stores that share similar values of each distinct attribute that
has been clustered around the store ID. Based on the initial results table, we can see that each cluster has
different averages across each attribute.
Comparing these averages via the input means plot allows us to draw conclusions about each individual
segment (see sections 5.2-5.4).
5-3: Clusters
Insight from Cluster A

Cluster A contains the largest number of stores in this data set. Based on the input means plot
(above), this cluster of stores has experienced lower than average fuel prices and unemployment rates.
This is complemented by a higher than average consumer price index. We can also observe
that stores in this cluster are typically larger than the other stores. Overall, we might infer that
Cluster A is made up of stores in richer, suburban regions, which would explain the high CPI and the low
unemployment rate and gas prices. However, because we do not have geographic information in this
dataset, we are unable to draw further conclusions. As for which variables are important within this
cluster, the chart below provides a visual of the importance of each attribute:
5-4: Cluster A - Variable Importance
Per the Variable Importance graph, CPI, unemployment rate, and store size are the top three important
variables when considering this cluster of stores.
Insight from Cluster B

Cluster B, per the input means plot, has a higher than average unemployment rate and temperature, but
a lower than average fuel price, store size, and consumer price index. Again, because we do not have
geographic data for each of the stores, we are unable to make any further assumptions about
the location of the stores within Cluster B. The variable importance graph (below) shows results similar
to those of the graph for Cluster A.
5-5: Cluster B - Variable Importance
Again, the consumer price index, unemployment rate, and store size are all important variables within this
cluster of stores. It appears that the same variables are important across Clusters A and B, but the
averages of the attributes differ slightly relative to the overall attribute averages.
Insight from Cluster C

Cluster C is completely different from Clusters A and B in that it is the only segment that reflects the
importance of holidays. Overall, Cluster C has lower than average fuel prices and temperature, but
all other attributes are on par with the overall attribute averages. The variable importance graph
below confirms that this cluster’s important variables are in stark contrast to those of Clusters A and B.
5-6: Cluster C - Variable Importance
This cluster of stores is grouped together because holidays have a large impact, with the variable
‘Holiday?’ dwarfing all other attribute values.
Overall Insight
Looking at all of the clusters relative to the overall population averages suggests that clustering prior to
forecasting can help eliminate errors that are often caused by seasonal changes or population disparity.
The impact of store size remains constant across the clusters, but for the other attributes
the correlation between an attribute and weekly sales differs across the
three clusters.
5-7: Correlation with Weekly Sales
In summary, an initial clustering analysis reveals that groups of stores have different relationships
with weekly sales depending on the cluster they belong to. Holidays only appear to have an impact within
Cluster C, while the other attributes of interest are more relevant to Clusters B and C. We now move on to
supervised learning in an effort to test whether the relationships seen above are
statistically significant.
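The per-cluster comparison can be reproduced in outline by computing a correlation coefficient between an attribute and weekly sales within each cluster. The sketch below assumes the Pearson coefficient and hypothetical helper names; the report does not state which correlation measure the figure uses.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(var_x * var_y)

def correlation_by_cluster(rows):
    """rows: iterable of (cluster, attribute_value, weekly_sales)."""
    groups = defaultdict(lambda: ([], []))
    for cluster, value, sales in rows:
        groups[cluster][0].append(value)
        groups[cluster][1].append(sales)
    return {cluster: pearson(xs, ys) for cluster, (xs, ys) in groups.items()}
```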
6 Supervised Learning: Regression

Linear Regression with Full Data
6-1: Regression
In this model, we keep all the variables (CPI, DEPT, FUEL_PRICE, ISHOLIDAY, MARKDOWN1-5,
STORE_SIZE, STORE_TYPE, TEMPERATURE, UNEMPLOYMENT). WEEK serves as the Time ID, STORE as the
ID and WEEKLY_SALES as the target. We first use the Data Partition node to split the data into a 70%
training set and a 30% validation set. We then set the selection model to stepwise, forward and backward
in turn, with validation error as the selection criterion.
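The partitioning and the error measure reported in this section can be sketched as follows. The sketch assumes a simple random split and the usual mean of squared residuals; the function names are hypothetical and the Data Partition node may sample differently.

```python
import random

def partition(rows, train_fraction=0.7, seed=0):
    """Randomly split rows into a training and a validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def average_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Because weekly sales are often in the tens of thousands, squared errors on the order of 1E8 correspond to typical residuals of roughly ten thousand or more per record.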
Effect        DF    Sum of Squares    F Value    Pr > F
DEPT          80    2.46E+13          1456.02    <.0001
STORE_SIZE     1    2.15E+12          10162.5    <.0001
Table 6-1: Regression Error
The results of the stepwise and forward models are very similar, but the backward model gives a worse
result. We therefore take the stepwise result here, which usually gives the best solution. In this model,
we get an average squared training error of 2.0121E8 and a validation error of 2.0451E8. Although the plot
looks good, especially at the beginning, the overall error statistics are poor. As the Type
3 Analysis of Effects above shows, this is because the final model retains only two important variables,
DEPT and STORE_SIZE. This linear regression model contains all the values of DEPT,
which means the nominal department values strongly affect the regression result. The average price
of products may vary a lot between departments. However, it does not make sense to predict
sales by looking only at departments. Also, STORE_SIZE contains large numbers compared with the other
variables, so it may mask the other variables’ effects and reduce the accuracy of the model.
6-2: Linear Regression with Full Data
Linear Regression with Imputed and Transformed Full Data
6-3: Linear Regression with Impute and Transform
To improve the results, we imputed the missing values of MARKDOWN1-5 and took the log of each
interval variable to reduce its skewness. This gave a better model, with an average squared training error
of 1.9549E8 and an average squared validation error of 1.9831E8. This model also seems to make more sense
than the previous one, since more attributes are involved in it.
6-4: Linear Regression with Imputed and Transformed Full Data
From the screenshot of the model below, we can see that DEPT still has a huge influence.
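A minimal sketch of the two preparation steps follows. It assumes mean imputation and a shifted natural log so that zero or negative values remain defined; the report does not state which imputation method or log offset was used, and the function names are hypothetical.

```python
import math

def impute_mean(values):
    """Replace missing entries (None) with the mean of the present ones."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def log_transform(values):
    """Natural log after shifting so that the smallest value is at least 1.

    The shift is an assumption made so that zero or negative inputs
    (for example negative markdowns) do not break the logarithm.
    """
    shift = max(0.0, 1.0 - min(values))
    return [math.log(v + shift) for v in values]
```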
Linear Regression with Filtered Data
6-7: Linear Regression with Filtered Data
To reduce the negative effect of DEPT, we filter out the department variable. We keep similar settings
for all other variables and obtain a new result. However, the result is even worse: we get an average
squared training error of 4.842E8 and a validation error of 4.881E8. This means the department variable
in this dataset is really important. If we want to make our model more accurate, we need to keep the
department in our regression model.
6-8: Linear Regression with Filtered Data
Linear Regression with Normalized Data

To normalize the interval data, we simply modify the data in Microsoft Excel. For each variable, we create
a normalized variable by dividing the original value by the largest value of that feature. This gives
the normalized STORE_SIZE, TEMPERATURE, FUEL_PRICE, CPI, UNEMPLOYMENT and MARKDOWN1-5. As
discussed above, we add DEPT back into our model and build a new model with these interval
variables. However, the average squared training error and validation error are still poor, at
4.83E8 and 4.87E8 respectively.
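The normalization described above amounts to the following one-line rule, shown here as a sketch with a hypothetical function name (the report performed the step in Excel).

```python
def normalize_by_max(values):
    """Divide every value by the largest value of the feature."""
    largest = max(values)
    if largest == 0:
        return list(values)  # nothing to scale when the maximum is zero
    return [v / largest for v in values]
```

For a store size column `[50, 100, 200]` this yields `[0.25, 0.5, 1.0]`, so the largest store maps to 1.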
6-9: Linear Regression with Normalized Data
6-10: Linear Regression with Normalized Data
Hence, among the linear regression models, those built on the full data perform best, with the imputed
and transformed variant giving the lowest errors (1.9549E8 training, 1.9831E8 validation).
7 Supervised Learning: Decision Tree
7-1: Decision Tree
Two-way Split
Variable        Number of Splitting Rules   Importance               Validation Importance    Ratio of Validation to Training Importance
DEPT            72.0                        1.0                      1.0                      1.0
STORE           41.0                        0.4964515440958596       0.4964821958816832       1.0000617417473832
MARKDOWN3       14.0                        0.04084201073998688      0.04144053541125881      1.014654632826046
STORE_SIZE      7.0                         0.34207810450886517      0.3480075010929593       1.0173334583708806
STORE_TYPE      6.0                         0.09548333258131239      0.096222212274534        1.0077383106899038
UNEMPLOYMENT    4.0                         0.03701878109843059      0.031383248828117584     0.8477655907867824
TEMPERATURE     4.0                         0.023291162585169126     0.021652254529342163     0.9296339094352227
MARKDOWN5       3.0                         0.012643930440875074     0.01080177216385246      0.8543049342420205
MARKDOWN1       2.0                         0.0037723260628764943    0.0011206103465749833    0.297060839359281
CPI             1.0                         0.034734597871631134     0.03245507329474942      0.9343730828464989
ISHOLIDAY       1.0                         0.004704913825175917     0.00679092144395149      1.443367869484321
MARKDOWN4       1.0                         0.0017796209442395006    0.0                      0.0
MARKDOWN2       0.0                         0.0                      0.0                      NaN
FUEL_PRICE      0.0                         0.0                      0.0                      NaN
Table 7-1: Two Way Split
A two-way split decision tree was generated on the train dataset. The weekly sales classes that were
generated earlier were used as target classes. The model was heavily dependent on department (DEPT)
and store (STORE); the majority of the splitting rules were based on these two attributes. The two-way
split decision tree generated an average squared error of 0.04222.
WEEKLY_SALES depends less on the attribute ISHOLIDAY than on STORE_SIZE,
STORE_TYPE, UNEMPLOYMENT and TEMPERATURE. Looking at the data from a broader perspective, the
location of the store likely played a major role in the weekly sales. A store located in a densely populated
urban area would have more sales than one in a rural area, regardless of whether the week is a holiday
week. The holiday sales in a store located far from the city might still be lower than the average
sales in a city store in a week without a public holiday. Stores in cities would be larger
and would have a larger volume of sales. To explore this scenario another approach was pursued. (Refer
to section 4.)
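For a class target, an average squared error can be computed from predicted class probabilities. The sketch below assumes the error is averaged over both observations and target classes, which is one common convention; the report does not state the exact definition used by the tool, and the function name is hypothetical.

```python
def class_average_squared_error(actual_classes, predicted_probs, classes):
    """Average squared difference between class indicators and probabilities.

    actual_classes  -- list of true class labels
    predicted_probs -- list of dicts mapping class label to probability
    classes         -- all target classes
    Assumption: the sum is divided by (observations * classes).
    """
    total = 0.0
    for actual, probs in zip(actual_classes, predicted_probs):
        for cls in classes:
            indicator = 1.0 if cls == actual else 0.0
            total += (indicator - probs.get(cls, 0.0)) ** 2
    return total / (len(actual_classes) * len(classes))
```

Under this convention a tree that assigns probability 1 to the correct class for every record scores 0, and less pure leaf nodes push the value up.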
7-2: Two-way Split Decision Tree
Three-way Split
Variable        Number of Splitting Rules   Importance               Validation Importance    Ratio of Validation to Training Importance
DEPT            150.0                       1.0                      1.0                      1.0
STORE           73.0                        0.7197389364808112       0.7172217848271354       0.9965026879524074
STORE_SIZE      39.0                        0.16985483324111295      0.17565253608886602      1.034133281562398
CPI             69.0                        0.13354907518567805      0.11875136590034423      0.8891964675550165
TEMPERATURE     49.0                        0.12228743722593886      0.1227097368514723       1.003453336132584
STORE_TYPE      11.0                        0.10896237311981295      0.1091215150571583       1.0014605219470611
UNEMPLOYMENT    43.0                        0.0870216497060246       0.08140314276386681      0.9354355271229843
MARKDOWN3       42.0                        0.0838513980255293       0.07108265278770555      0.8477217370432371
FUEL_PRICE      29.0                        0.055805194471475056     0.045466506369912014     0.8147360976074067
MARKDOWN4       4.0                         0.02260829885174643      0.01494100486542834      0.660863736957997
MARKDOWN5       4.0                         0.01614486406343703      0.013490146227052534     0.8355688951016575
MARKDOWN2       4.0                         0.014878670018523063     0.013747824483138145     0.9239955228540533
ISHOLIDAY       3.0                         0.014085153665462117     0.008854538152353212     0.6286433476452034
MARKDOWN1       4.0                         0.012841667939644022     0.016799803214149107     1.3082259479927658
Table 7-2: Three way split
In terms of variable importance, DEPT and STORE were again the most important variables. However, the
three-way split gave the model more flexibility in decision making, and hence the errors in
classifying records into the weekly sales classes were lower, as expected. The average squared error was
found to be 0.02765.
7-3: Three-way Split Decision Tree
Two-way Split without DEPT and STORE

To explore how much effect the attributes DEPT and STORE had on the decision tree model, these
attributes were rejected and the model was regenerated with the same parameters as before. The resulting
model had an average squared error of 0.11243, more than double that of the two-way split with DEPT and
STORE (0.04222). Notably, without information on which DEPT and which STORE the sales belong to, the
model misclassified records of every class other than LOW WEEKLY_SALES in almost all cases. This can be
seen from the plots shown below.
7-4: Two-way Split without Dept. and Store
Decision Tree on Sampled Data
7-5: Decision Tree on Sampled Data
To observe how the weekly sales depend on the other features in the dataset, information on the
department and store ID was rejected. The data was filtered such that the classes NEGATIVE and VERY
HIGH WEEKLY_SALES were removed. The data was further sampled such that the remaining classes,
LOW, MEDIUM and HIGH WEEKLY_SALES, had the same number of observations.
The decision trees modeled on this data returned the expected results: STORE_SIZE was one of the
major factors that determined the weekly sales and hence ended up as the most important variable for
splitting nodes.
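The filtering and balancing step can be sketched as below. The sketch assumes random downsampling of every kept class to the size of the smallest one; the record layout (dicts with a SALES_CLASS key) and the function name are hypothetical, and the report does not describe the exact sampling procedure.

```python
import random
from collections import defaultdict

def balanced_sample(rows, keep_classes=("Low", "Medium", "High"), seed=0):
    """Drop unwanted classes and downsample the rest to equal size.

    rows -- list of dicts that carry a 'SALES_CLASS' key (assumed layout)
    """
    by_class = defaultdict(list)
    for row in rows:
        if row["SALES_CLASS"] in keep_classes:
            by_class[row["SALES_CLASS"]].append(row)
    # Size of the smallest kept class (0 if a class is absent).
    size = min(len(by_class[cls]) for cls in keep_classes)
    rng = random.Random(seed)
    sample = []
    for cls in keep_classes:
        sample.extend(rng.sample(by_class[cls], size))
    return sample
```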
Variable        Number of Splitting Rules   Importance               Validation Importance    Ratio of Validation to Training Importance
STORE_SIZE      16.0                        1.0                      1.0                      1.0
CPI             4.0                         0.2651285102856137       0.24717493036369145      0.9322834805559709
UNEMPLOYMENT    5.0                         0.24441028209455445      0.20442631071106554      0.8364063449342921
STORE_TYPE      3.0                         0.15295892375537554      0.15072179548628872      0.9853743200189836
MARKDOWN3       2.0                         0.052709576258346755     0.0315604300793954       0.5987608385373367
FUEL_PRICE      1.0                         0.022610228455857168     0.012600386988009991     0.5572870266485904
TEMPERATURE     1.0                         0.02168392343651502      0.010275651333624921     0.4738833986252257
ISHOLIDAY       0.0                         0.0                      0.0                      NaN
Table 7-3: Sampled Data Decision Tree
However, the average squared error for both trees (two-way split and three-way split) turned out to be
0.211, indicating that the resulting nodes were less pure than those of the earlier trees.
The following table summarizes the decision tree models that were generated for the WALMART_TRAIN
dataset.
Model                                        Average Squared Error
Two-way Split                                0.042
Three-way Split                              0.028
Two-way Split without DEPT and STORE         0.112
Two-way & Three-way Split on Sampled Data    0.211
Table 7-4: Avg. Error
8 Time Series Analysis

Due to the nature of the data, the results generated by the standard algorithms used in the previous
sections provided little insight. To generate better results, the time series analysis tools of SAS were used.
Data Exploration
To analyze the data from a time series perspective, the time dimension was set up in conjunction with the
cross-sectional dimensions, store and department, using the SAS TS Data Preparation node.
8-1: Dimension Cube
Setting up this structure allowed flexibility in visualizing the data aggregated over different
dimensions.
The following plot shows the weekly sales for 100 of the store-department combinations. It is
evident from the plot that sales were highest during the holiday season around Christmas in December
and before summer in May. Other notable peaks in sales occurred around Thanksgiving in November, the
Super Bowl in February and Labor Day in September.
8-2: Weekly Sales for Store-Department
For further analysis, the mean weekly sales for each store as well as for each department were plotted.
Some of the departments had very high average weekly sales compared to the others. These departments,
although not named by Walmart for privacy purposes, might be departments that sell products
people need on a day-to-day basis, such as groceries, or high-grossing departments such as
electronics.
8-3: Mean Weekly Sales by Store
8-4: Mean Weekly Sales by Department
Hierarchical Clustering [6]
8-5: Hierarchical Clustering
Based on the values of the different input variables such as CPI, UNEMPLOYMENT, ISHOLIDAY,
TEMPERATURE and different MARKDOWN values, the time series inputs were used for clustering. The
clustering mechanism used mean squared error between the total weekly sales of the stores as the
similarity measure.
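The idea of grouping stores by the mean squared error between their sales series can be sketched as follows. This is an illustration only: scaling each series by its peak, the average-linkage merging rule, and the function names are assumptions and are not claimed to match the SAS node.

```python
def mse(a, b):
    """Mean squared error between two equally long series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def scale(series):
    """Scale a series by its largest absolute value (an assumption)."""
    peak = max(abs(v) for v in series) or 1.0
    return [v / peak for v in series]

def agglomerate(series_by_store, threshold):
    """Merge the closest clusters until their distance exceeds threshold."""
    scaled = {store: scale(s) for store, s in series_by_store.items()}
    clusters = [[store] for store in series_by_store]

    def distance(c1, c2):
        # Average linkage: mean MSE over all cross-cluster store pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(mse(scaled[a], scaled[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        d, i, j = min((distance(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Cutting the merge process at a distance threshold corresponds to drawing a horizontal line across a dendrogram and reading off the groups below it.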
The following dendrogram shows the distance between the different clusters that were generated.
8-6: Clustering Dendrogram
Based on the minimum distance between clusters, at a distance of 0.1, three main clusters were
generated. Stores 7, 16, 17, 38 and 44 were clustered together as cluster A. Stores 28, 30, 33, 36, 37, 42
and 43 were clustered together as cluster B. And the rest of the stores belonged to cluster C. The features
of these clusters became more evident during the forecasting process.
The following graph shows how the different stores were clustered in terms of their trends on weekly
sales based on the trends of other attributes.
8-7: Clustering Graph
The features of these clusters became more evident when the trends in sales for the stores were analyzed.
Stores from the same cluster showed similar trends in weekly sales.
Sales Forecasting [7]

Using SAS Enterprise Miner’s Time Series Exponential Smoothing tool, the sales for the stores were
forecast for the next six weeks, until December 2012. This sales forecasting methodology is
independent of the input variables mentioned earlier. The forecasting takes into consideration
seasonal effects and trends in sales over the period of February 2010 to October 2012.
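To make the role of level, trend and seasonal smoothing concrete, here is a textbook additive Holt-Winters sketch. It is not the SAS implementation: the initialization from the first two seasons, the fixed smoothing weights passed by the caller, and the function name are assumptions, and a real tool would estimate the weights from the data.

```python
def holt_winters_additive(series, season_length, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    alpha, beta, gamma -- smoothing weights for level, trend and season
    The series must cover at least two full seasons.
    """
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    # Simple initial values taken from the first two seasons (assumption).
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonals = [series[i] - level for i in range(m)]
    for t in range(m, len(series)):
        value = series[t]
        previous_level = level
        season = seasonals[t % m]
        level = alpha * (value - season) + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonals[t % m] = gamma * (value - level) + (1 - gamma) * season
    n = len(series)
    return [level + h * trend + seasonals[(n + h - 1) % m]
            for h in range(1, horizon + 1)]
```

With weekly data and a yearly cycle, `season_length` would be 52 (an assumption about the setup), which requires at least two years of history, a condition the February 2010 to October 2012 range satisfies.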
8-8: Sales Forecasting
For each store, different models were used to forecast the sales. The model with the lowest standard error
was automatically selected as the best model for forecasting sales for that store. The additive Winters
model and seasonal models proved to be the best fit for most stores. The following table shows which
model was used for each store, along with the parameter estimates and the associated standard errors.
Time Series ID  Store  Model       Parameter  Parameter Estimate     Standard Error
1.0             1.0    ADDWINTERS  LEVEL      0.0034631198964860067  0.00441693090198638
1.0             1.0    ADDWINTERS  SEASON     0.6055475095192919     0.040597985445483306
1.0             1.0    ADDWINTERS  TREND      0.001                  0.002625959422274718
2.0             2.0    WINTERS     LEVEL      0.16545582491041058    0.02507185069755535
2.0             2.0    WINTERS     SEASON     0.921247729943568      0.05853998565921424
2.0             2.0    WINTERS     TREND      0.001                  0.009608906293211008
6.0             6.0    SEASONAL    SEASON     0.7157152817395214     0.04219586666398573
6.0             6.0    SEASONAL    LEVEL      0.12046874034834318    0.016873522037056683
Table 8-1: Models
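The per-store model selection can be sketched as follows. The sketch uses the root mean squared one-step-ahead error as a stand-in for the error measure, and two deliberately simple candidate models; both choices, and all names and data, are assumptions for illustration rather than the SAS selection procedure.

```python
from math import sqrt


def ses_one_step(series, weight):
    """Actual values and one-step-ahead predictions, simple smoothing."""
    level = series[0]
    preds = []
    for y in series[1:]:
        preds.append(level)
        level = weight * y + (1 - weight) * level
    return series[1:], preds


def seasonal_naive_one_step(series, period):
    """Predict each value with the value observed one season earlier."""
    return series[period:], series[:-period]


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                / len(actual))


def select_model(series, candidates):
    """Return (name, error) of the candidate with the lowest RMSE."""
    scores = {name: rmse(*fit(series)) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


toy_series = [10, 20, 30, 40] * 3  # hypothetical, strongly seasonal
toy_candidates = {
    "SIMPLE": lambda s: ses_one_step(s, weight=0.3),
    "SEASONAL_NAIVE": lambda s: seasonal_naive_one_step(s, period=4),
}
best_name, best_error = select_model(toy_series, toy_candidates)
print(best_name, best_error)
```

Running the same selection once per store would yield one winning model per store, in the spirit of Table 8-1.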
Based on the models, the weekly sales of each store were forecast for 12 weeks, covering the holiday
season in December (the forecast sales are shown after the vertical line on the graph). The following
graph shows the forecast sales of a store that is doing fairly well. Store 1 is a store from cluster B.
All stores in this cluster show a similar trend: a very high peak in sales during Christmas.
8-9: Store 1 - Cluster B
The following graph shows the sales for Store 7. The store shows a good level of sales from May to
September and from November to January, and could have good growth potential in the future.
This store was selected from cluster A. All stores in this cluster have a similar trend, which brings in
a steady amount of income in addition to higher sales during holidays. These can be considered stores
with steady growth rates.
8-10: Store 7 - Cluster A
The following graph shows the sales for Store 36, which Walmart should focus on. The store has been
losing sales and, if the trend continues, is likely to go out of business over the next couple of years.
The total sales for the store decreased by half over a period of two years. Store 36 was selected from
cluster C. Stores from this cluster generally showed a declining trend.
8-11: Store 36 – Cluster C
9 Business Implications
Based on the analysis, Walmart should hire additional personnel a few weeks before the holiday seasons,
especially Thanksgiving and Christmas. This allows the new staff to perform better as sales rise
gradually in the run-up to the holidays.
The cluster information from section 8.2 can be used in conjunction with sales forecasting to produce
more accurate predictions.
Walmart should keep a close eye on the stores that are at risk of going out of business. It should also
provide incentives for other stores to improve their sales, and hire the right sales representatives.
10 References
[1] M. Gilliland, "Demand Forecasting in Retail," [Online]. Available:
http://www.sas.com/news/feature/retail/aug06forecast.html.
[2] M. K. &. R. R. Nitin Patel, "Clustering Models to Improve Forecasts in Retails Merchandising,"
[Online]. Available: http://www.cytel.com/Papers/INFORMS_Prac_%2004.pdf.
[3] L. C.-L. &. R. Dudley, "Wal-Mart Sees Profit at Low End of Forecast," [Online]. Available:
http://www.bloomberg.com/news/2014-01-31/wal-mart-sees-profit-at-low-end-of-forecast.html.
[4] R. Dudley, "Wal-Mart Cuts Annual Sales Forecast as Supercenters Struggle," [Online]. Available:
http://www.businessweek.com/news/2014-10-16/wal-mart-cuts-annual-sales-forecast-as-its-supercenters-.
[5] "Kaggle - Walmart Recruiting - Stores Sales Forecasting," [Online]. Available:
https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting.
[6] T. L. Sascha Schubert, "Time Series Data Mining with SAS Enterprise Miner," [Online]. Available:
http://support.sas.com/resources/papers/proceedings11/160-2011.pdf.
[7] S. J. Satyajit Dwivedi, "Time-series Data Mining," [Online]. Available:
http://www.iasri.res.in/sscnars/data_mining/10-SAS%20Enterprise%20Miner%207.1%20Time%20Series%20Data%20Mining.pdf.
Appendix A
ALTER TABLE WALMART_TRAIN
ADD TEMP_CLASS VARCHAR2(15);
UPDATE WALMART_TRAIN
SET TEMP_CLASS = (CASE
    WHEN TEMPERATURE < 32 THEN 'Freezing'
    WHEN TEMPERATURE >= 32 AND TEMPERATURE < 64 THEN 'Cold'
    WHEN TEMPERATURE >= 64 AND TEMPERATURE < 79 THEN 'Comfortable'
    WHEN TEMPERATURE >= 79 AND TEMPERATURE < 95 THEN 'Hot'
    -- >= so that a temperature of exactly 95 is not left as NULL
    WHEN TEMPERATURE >= 95 THEN 'Extremely Hot'
    ELSE NULL
END);
-- http://www.gasbuddy.com/gb_gastemperaturemap.aspx

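ALTER TABLE WALMART_TRAIN
ADD FUEL_CLASS VARCHAR2(15);
UPDATE WALMART_TRAIN
SET FUEL_CLASS = (CASE
    WHEN FUEL_PRICE < 2.75 THEN 'Low'
    WHEN FUEL_PRICE >= 2.75 AND FUEL_PRICE < 3.12 THEN 'Medium'
    -- >= so that a price of exactly 3.12 is not left as NULL
    WHEN FUEL_PRICE >= 3.12 THEN 'High'
    ELSE NULL
END);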
-- http://www.statisticbrain.com/wal-mart-company-statistics/

ALTER TABLE WALMART_TRAIN
ADD SALES_CLASS VARCHAR2(15);
UPDATE WALMART_TRAIN
SET SALES_CLASS = (CASE
    WHEN WEEKLY_SALES <= 0 THEN 'Negative'
    WHEN WEEKLY_SALES > 0 AND WEEKLY_SALES <= 10000 THEN 'Low'
    WHEN WEEKLY_SALES > 10000 AND WEEKLY_SALES <= 25000 THEN 'Medium'
    WHEN WEEKLY_SALES > 25000 AND WEEKLY_SALES <= 100000 THEN 'High'
    WHEN WEEKLY_SALES > 100000 THEN 'Very High'
    ELSE NULL
END);
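The threshold-based classes in this appendix can also be expressed as a small lookup over sorted bin edges. The sketch below assumes half-open bins, so that a value equal to a threshold falls into the higher class; the function and constant names are hypothetical. It covers the temperature and fuel classes only, since the sales classes use upper-inclusive bounds.

```python
from bisect import bisect_right


def classify(value, edges, labels):
    """Label `value` using half-open bins [edge_i, edge_{i+1})."""
    if value is None:
        return None  # mirror the ELSE NULL branch for missing values
    return labels[bisect_right(edges, value)]


# Thresholds taken from the CASE expressions in this appendix.
TEMP_EDGES = [32, 64, 79, 95]
TEMP_LABELS = ["Freezing", "Cold", "Comfortable", "Hot", "Extremely Hot"]
FUEL_EDGES = [2.75, 3.12]
FUEL_LABELS = ["Low", "Medium", "High"]

print(classify(70.5, TEMP_EDGES, TEMP_LABELS))
print(classify(3.12, FUEL_EDGES, FUEL_LABELS))
```

Because every number falls into exactly one bin, a value sitting exactly on a threshold can never be left unclassified.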
Appendix B
CREATE TABLE WALMART_TRAIN_HOLIDAY
AS
SELECT *
FROM WALMART_TRAIN;
ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD HOLIDAY VARCHAR2(25);
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Super Bowl'
WHERE WEEK IN (TO_DATE('12-Feb-10', 'DD-Mon-RR'), TO_DATE('11-Feb-11', 'DD-Mon-RR'),
               TO_DATE('10-Feb-12', 'DD-Mon-RR'), TO_DATE('08-Feb-13', 'DD-Mon-RR'));
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Labor Day'
WHERE WEEK IN (TO_DATE('10-Sep-10', 'DD-Mon-RR'), TO_DATE('09-Sep-11', 'DD-Mon-RR'),
               TO_DATE('07-Sep-12', 'DD-Mon-RR'), TO_DATE('06-Sep-13', 'DD-Mon-RR'));
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Thanksgiving'
WHERE WEEK IN (TO_DATE('26-Nov-10', 'DD-Mon-RR'), TO_DATE('25-Nov-11', 'DD-Mon-RR'),
               TO_DATE('23-Nov-12', 'DD-Mon-RR'), TO_DATE('29-Nov-13', 'DD-Mon-RR'));
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Christmas'
WHERE WEEK IN (TO_DATE('31-Dec-10', 'DD-Mon-RR'), TO_DATE('30-Dec-11', 'DD-Mon-RR'),
               TO_DATE('28-Dec-12', 'DD-Mon-RR'), TO_DATE('27-Dec-13', 'DD-Mon-RR'));
-- HOLIDAY IS NULL keeps the holiday weeks labeled earlier from being overwritten,
-- because BETWEEN includes the holiday week itself.
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Before Super Bowl'
WHERE HOLIDAY IS NULL AND (
   (WEEK BETWEEN (TO_DATE('12-Feb-10', 'DD-Mon-RR') - 14) AND TO_DATE('12-Feb-10', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('11-Feb-11', 'DD-Mon-RR') - 14) AND TO_DATE('11-Feb-11', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('10-Feb-12', 'DD-Mon-RR') - 14) AND TO_DATE('10-Feb-12', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('08-Feb-13', 'DD-Mon-RR') - 14) AND TO_DATE('08-Feb-13', 'DD-Mon-RR')));
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Before Labor Day'
WHERE HOLIDAY IS NULL AND (
   (WEEK BETWEEN (TO_DATE('10-Sep-10', 'DD-Mon-RR') - 14) AND TO_DATE('10-Sep-10', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('09-Sep-11', 'DD-Mon-RR') - 14) AND TO_DATE('09-Sep-11', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('07-Sep-12', 'DD-Mon-RR') - 14) AND TO_DATE('07-Sep-12', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('06-Sep-13', 'DD-Mon-RR') - 14) AND TO_DATE('06-Sep-13', 'DD-Mon-RR')));
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Before Thanksgiving'
WHERE HOLIDAY IS NULL AND (
   (WEEK BETWEEN (TO_DATE('26-Nov-10', 'DD-Mon-RR') - 14) AND TO_DATE('26-Nov-10', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('25-Nov-11', 'DD-Mon-RR') - 14) AND TO_DATE('25-Nov-11', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('23-Nov-12', 'DD-Mon-RR') - 14) AND TO_DATE('23-Nov-12', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('29-Nov-13', 'DD-Mon-RR') - 14) AND TO_DATE('29-Nov-13', 'DD-Mon-RR')));
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Before Christmas'
WHERE HOLIDAY IS NULL AND (
   (WEEK BETWEEN (TO_DATE('31-Dec-10', 'DD-Mon-RR') - 14) AND TO_DATE('31-Dec-10', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('30-Dec-11', 'DD-Mon-RR') - 14) AND TO_DATE('30-Dec-11', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('28-Dec-12', 'DD-Mon-RR') - 14) AND TO_DATE('28-Dec-12', 'DD-Mon-RR'))
OR (WEEK BETWEEN (TO_DATE('27-Dec-13', 'DD-Mon-RR') - 14) AND TO_DATE('27-Dec-13', 'DD-Mon-RR')));
UPDATE WALMART_TRAIN_HOLIDAY
SET HOLIDAY = 'Not Holiday'
WHERE HOLIDAY IS NULL;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD STORE_SIZE_CLASS VARCHAR2(10);
UPDATE WALMART_TRAIN_HOLIDAY
SET STORE_SIZE_CLASS = CASE
    WHEN STORE_SIZE < 100000 THEN 'Small'
    WHEN STORE_SIZE >= 100000 AND STORE_SIZE < 200000 THEN 'Medium'
    WHEN STORE_SIZE >= 200000 THEN 'Large'
END;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD UNEMPLOYMENT_CLASS VARCHAR2(10);
UPDATE WALMART_TRAIN_HOLIDAY
SET UNEMPLOYMENT_CLASS = CASE
    WHEN UNEMPLOYMENT < 7 THEN 'Low'
    WHEN UNEMPLOYMENT >= 7 AND UNEMPLOYMENT < 11 THEN 'Medium'
    WHEN UNEMPLOYMENT >= 11 THEN 'High'
END;

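ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD CPI_CLASS VARCHAR2(10);
UPDATE WALMART_TRAIN_HOLIDAY
SET CPI_CLASS = CASE
    WHEN CPI < 159 THEN 'Low'
    -- the upper bound is checked on CPI, not UNEMPLOYMENT
    WHEN CPI >= 159 AND CPI < 192 THEN 'Medium'
    WHEN CPI >= 192 THEN 'High'
END;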

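ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD DEPT_CLASS VARCHAR2(12);
UPDATE WALMART_TRAIN_HOLIDAY OH
SET DEPT_CLASS = 'Low Sales'
WHERE DEPT IN (SELECT DEPT
               FROM (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                     FROM WALMART_TRAIN_HOLIDAY
                     GROUP BY DEPT)
               WHERE MD < 20000);
UPDATE WALMART_TRAIN_HOLIDAY OH
SET DEPT_CLASS = 'Medium Sales'
WHERE DEPT IN (SELECT DEPT
               FROM (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                     FROM WALMART_TRAIN_HOLIDAY
                     GROUP BY DEPT)
               WHERE MD >= 20000 AND MD < 40000);
UPDATE WALMART_TRAIN_HOLIDAY OH
SET DEPT_CLASS = 'High Sales'
WHERE DEPT IN (SELECT DEPT
               FROM (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                     FROM WALMART_TRAIN_HOLIDAY
                     GROUP BY DEPT)
               WHERE MD >= 40000);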
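The holiday labeling at the start of this appendix can be sketched outside the database as well. The dates below are the ones used in the UPDATE statements; the function name is hypothetical, and the sketch assumes that a holiday week keeps its own label rather than being relabeled as a "Before" week.

```python
from datetime import date, timedelta

# Holiday weeks as listed in the UPDATE statements of this appendix.
HOLIDAY_WEEKS = {
    "Super Bowl": [date(2010, 2, 12), date(2011, 2, 11),
                   date(2012, 2, 10), date(2013, 2, 8)],
    "Labor Day": [date(2010, 9, 10), date(2011, 9, 9),
                  date(2012, 9, 7), date(2013, 9, 6)],
    "Thanksgiving": [date(2010, 11, 26), date(2011, 11, 25),
                     date(2012, 11, 23), date(2013, 11, 29)],
    "Christmas": [date(2010, 12, 31), date(2011, 12, 30),
                  date(2012, 12, 28), date(2013, 12, 27)],
}
LEAD_TIME = timedelta(days=14)


def holiday_label(week):
    """Return the HOLIDAY value for a given week date."""
    # Exact holiday weeks take priority over the "Before ..." windows.
    for name, weeks in HOLIDAY_WEEKS.items():
        if week in weeks:
            return name
    # Otherwise check the 14 days leading up to each holiday week.
    for name, weeks in HOLIDAY_WEEKS.items():
        if any(w - LEAD_TIME <= week < w for w in weeks):
            return "Before " + name
    return "Not Holiday"


print(holiday_label(date(2011, 11, 18)))
```

Checking the exact holiday weeks first is what prevents the lead-up window, which ends on the holiday week, from replacing the holiday label.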