A PROJECT REPORT
Submitted by
AMIT KUMAR
(2017272001)
in partial fulfillment
for the award of the degree
of
APRIL 2020
ANNA UNIVERSITY
CHENNAI - 600 025
BONA FIDE CERTIFICATE
PLACE: Chennai DR. D. NARASHIMAN
DATE: 07-04-2020 TEACHING FELLOW
PROJECT GUIDE
DEPARTMENT OF IST, CEG
ANNA UNIVERSITY
CHENNAI 600025
COUNTERSIGNED
ABSTRACT
ACKNOWLEDGEMENT
AMIT KUMAR
TABLE OF CONTENTS
ABSTRACT
ABSTRACT (TAMIL)
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
1 INTRODUCTION
1.1 PREDICTION METHODOLOGY
1.1.1 MOTIVATION
1.1.2 PROPOSED WORK
1.2 ORGANIZATION OF REPORT
2 LITERATURE SURVEY
2.1 Regression Algorithm
2.2 Classifier Algorithm
3 SYSTEM ARCHITECTURE
3.1 MODULE DESCRIPTION
3.1.1 DATASET DESCRIPTION OF BIG MART
3.1.2 DATA EXPLORATION
3.1.3 DATA CLEANING
3.1.4 FEATURE ENGINEERING
3.1.5 MODEL BUILDING
4 PSEUDO CODE/ALGORITHM
4.1 LINEAR REGRESSION
4.2 RANDOM FOREST
4.3 DECISION TREE REGRESSOR
4.4 XGBOOST REGRESSOR
REFERENCES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Decision trees where the target variable can take continuous values (typically
real numbers) are called regression trees. In decision analysis, a decision tree
can be used to visually and explicitly represent decision making. In data mining,
a decision tree describes data (and the resulting classification tree can serve as
an input for decision making).
1.1.1 MOTIVATION
To handle missing values, the proposed ensemble approach uses Decision Trees,
Linear Regression, and Random Forest. Both Mean Absolute Error (MAE) and
Root Mean Square Error (RMSE) are used as accuracy metrics for predicting
the sales in Big Mart.
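As a small illustration of the two metrics named above (synthetic values, not the report's data), MAE and RMSE can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted sales figures (illustrative only)
y_true = np.array([250.0, 300.0, 150.0, 400.0])
y_pred = np.array([240.0, 320.0, 160.0, 390.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of mean squared error

print(mae)   # 12.5
print(rmse)
```

RMSE penalizes large errors more heavily than MAE, which is why both are reported side by side.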
• Chapter 6 concludes the entire work and also outlines future
enhancements of the system.
CHAPTER 2
LITERATURE SURVEY
The data scientists at BigMart have collected 2013 sales data for
1559 products across 10 stores in different cities. Barun Waldron and Sanjeev
Trivedi introduced this method [1]. The main objective is to understand
whether specific properties of products and/or stores play a significant role in
increasing or decreasing sales volume. To achieve this goal, they
built a predictive model and found the sales of each product at a particular
store, which helps BigMart boost its sales by learning optimised product
organization inside stores. They used a Linear Regression approach, whose
methodology is easy to refine and analyse; its disadvantage, however, is that
it is not fully accurate.
The paper in [5] also elaborates the automated process of knowledge acquisition.
Machine learning is the process where a machine learns from data using
statistical or computational methods and acquires knowledge from experience.
Various machine learning techniques, with their applications in different
sectors, have been presented in [6]. Pat Langley and Herbert A. pointed out
the most widely used data mining techniques. In today's era, data analysis is
important for everyone to make better decisions in their field. Analysing big
data and extracting knowledge from it is somewhat difficult, so a powerful and
effective data mining tool is needed to mine complex datasets, extract
information, and support better future decisions. R is an efficient, free,
open-source data mining tool. R has several inbuilt packages (Fikes, Richard E.
et al. [7]) that provide efficiency, such as ggplot2 and VIM. R is an
open-source data analysis environment and programming language. Data analysis,
the process of converting data into knowledge, insight, and understanding, is a
critical part of statistics. For the effective processing and analysis of big
data, R allows users to conduct a number of essential tasks. R consists of
numerous ready-to-use statistical modelling and machine learning algorithms
that allow users to create reproducible research and develop data products. In
this method, Linear Regression, Decision Tree, K-Means, and Naïve Bayes can be
used; the Random Forest algorithm, the Éclat algorithm, etc. can be implemented
as future work. K-Means is also used to cluster the dataset according to
categories, and finally a Naïve Bayes classifier was implemented for the Item
Fat Content variable. The main motive of this paper is to show how to tackle
such a giant dataset with regularly missing values.
This chapter elaborates on the various works that have been carried out in
predicting the sales of a product and the profit based on the product. The next
chapter gives an overview of the proposed system implementation. One surveyed
paper examines the problem of demand forecasting on an e-commerce web site. Its
authors proposed a stacked generalization method consisting of sub-level
regressors, and also tested the results of single classifiers separately,
together with the general model. Experiments showed that their approach
predicts demand at least as well as single classifiers do, and even better
using much less training data (only 20 percent of the dataset), suggesting it
would predict much better when more data is used. Because the difference
between the proposed model and random forest is not statistically significant,
the proposed method can be used to forecast demand, given its accuracy with
fewer data; the authors plan to use its output as part of a price optimization
problem in future work. A method for long-term electric power forecasting using
long-term annual growth factors was also proposed. Prediction and analysis of
aero-material consumption based on a multivariate linear regression model was
proposed by collecting data on basic monitoring indicators of aircraft tire
consumption from 2001 to 2016. A forecast of bicycle rental demand based on
random forests and multiple linear regressions was proposed based on weather
data.
CHAPTER 3
SYSTEM ARCHITECTURE
Text - The system takes the dataset of Big Mart sales records for 1987-2013
and gets the details of the particular sales records year-wise.
Text processing - Cleaning the data, filling the missing values using the mean,
and labelling the chronological values as integers.
Compare with algorithms - The prediction algorithms, namely Linear Regression,
Decision Tree, and Random Forest, are used to get the sales of each product;
they are compared to find which has the best accuracy. It was found that Random
Forest has the best accuracy, at 60.81 percent.
Dataset - The Kaggle dataset, which contains sales records from 1987 to 2013,
was used.
Predict the sales - The chosen algorithm is used for predicting the sales.
Output: Predicted sales. Dataset: Kaggle.
In our work we have used the 2013 sales data of Big Mart as the dataset.
The dataset consists of 12 attributes: Item Fat, Item Type, Item
MRP, Outlet Type, Item Visibility, Item Weight, Outlet Identifier, Outlet Size,
Outlet Establishment Year, Outlet Location Type, Item Identifier, and Item
Outlet Sales. Of these attributes, the response variable is Item Outlet Sales,
and the remaining attributes are used as predictor variables. The dataset
consists of 8523 products across different cities and locations. The dataset
is also based on store-level and product-level hypotheses, where the store
level involves attributes like city, population density, store capacity,
location, etc., and the product-level hypotheses involve attributes like brand,
advertisement, promotional offers, etc. After considering all of these, a
dataset is formed, and finally the dataset was divided into two parts, a
training set and a test set, in the ratio 80:20.
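A minimal sketch of the 80:20 split described above, using scikit-learn's `train_test_split`. The column names and values here are illustrative stand-ins, not the exact Big Mart schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny illustrative frame standing in for the Big Mart data
df = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
    "Item_Weight": [9.3, 5.9, 17.5, 19.2, 8.9],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7],
})

# Response variable vs. predictor variables, as in the report
X = df.drop("Item_Outlet_Sales", axis=1)
y = df["Item_Outlet_Sales"]

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 4 1
```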
In this phase, useful information has been extracted from the dataset,
that is, trying to identify information from the hypotheses versus the
available data. This shows that the attributes Outlet Size and Item Weight face
the problem of missing values; also, the minimum value of Item Visibility is
zero, which is not practically possible. The establishment year of an outlet
varies from 1985 to 2009. These values may not be appropriate in this form, so
we need to convert them into how old a particular outlet is. There are 1559
unique products, as well as 10 unique outlets, present in the dataset. The
attribute Item Type contains 16 unique values, whereas there are two types of
Item Fat Content, but some of the values are misspelled: 'regular' instead of
'Regular', and 'low fat' or 'LF' instead of 'Low Fat'. From Figure 2, it was
found that the response variable, Item Outlet Sales, was positively skewed.
So, to remove the skewness of the response variable, a log operation was
performed on Item Outlet Sales.
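The log operation mentioned above can be sketched as follows. The values are toy numbers, and `np.log1p` is one common choice for this transform; the report does not specify which log variant it used:

```python
import numpy as np
import pandas as pd

# Positively skewed toy sales values (illustrative only)
sales = pd.Series([120.0, 450.0, 980.0, 5200.0, 12000.0])

log_sales = np.log1p(sales)      # log(1 + x), safe even for zero sales
restored = np.expm1(log_sales)   # inverse transform, to report predictions in rupees

print(log_sales.round(2).tolist())
```

Predictions made on the log scale must be passed back through the inverse transform before being compared with raw sales figures.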
It was observed from the previous section that the attributes Outlet
Size and Item Weight have missing values. In our work, the missing Outlet Size
values are replaced by the mode of that attribute, and the missing Item Weight
values are replaced by the mean of that attribute. Replacement by mean and mode
diminishes the correlation among the imputed attributes; for our model we are
assuming that there is no relationship between the measured attributes and the
imputed attributes.
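A minimal pandas sketch of the imputation described above: mode for the categorical Outlet Size, mean for the numeric Item Weight. The column names and values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Outlet_Size": ["Medium", None, "Small", "Medium", None],
    "Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.9],
})

# Categorical attribute: fill with the most frequent value (mode)
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])
# Numeric attribute: fill with the mean of the observed values
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

print(df)
```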
The decision tree handles the classification problem using entropy and
information gain [16] as metrics, defined in Equation 3 and Equation 4
respectively; for classifying an attribute, it picks the attribute with the
highest information gain to split the dataset, where S is the set of training
instances.
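The text cites Equation 3 and Equation 4 without reproducing them; the standard textbook definitions of entropy and information gain that they refer to are, in the usual notation:

Entropy(S) = - \sum_{i=1}^{c} p_i \log_2 p_i    (Equation 3)

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)    (Equation 4)

where S is the set of training samples, p_i is the proportion of S belonging to class i, and S_v is the subset of S for which attribute A takes value v. These are the standard ID3 definitions; the report's exact Equations 3 and 4 may differ in notation.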
CHAPTER 4
PSEUDO CODE/ALGORITHM
1: start
2: from sklearn.linear_model import LinearRegression
3: lr = LinearRegression(normalize=True)
4: X_train = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
5: Y_train = train_df['Item_Outlet_Sales']
6: X_test = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1).copy()
7: lr.fit(X_train, Y_train)
8: lr_pred = lr.predict(X_test)
9: lr_accuracy = round(lr.score(X_train, Y_train) * 100, 2)
10: display lr_accuracy
1: start
2: from sklearn.ensemble import RandomForestRegressor
3: rf = RandomForestRegressor()
4: X_train = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
5: Y_train = train_df['Item_Outlet_Sales']
6: X_test = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1).copy()
7: rf.fit(X_train, Y_train)
8: rf_pred = rf.predict(X_test)
9: rf_accuracy = round(rf.score(X_train, Y_train) * 100, 2)
10: display rf_accuracy
1: start
2: from sklearn.tree import DecisionTreeRegressor
3: tree = DecisionTreeRegressor()
4: X_train = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
5: Y_train = train_df['Item_Outlet_Sales']
6: X_test = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1).copy()
7: tree.fit(X_train, Y_train)
8: tree_pred = tree.predict(X_test)
9: tree_accuracy = round(tree.score(X_train, Y_train) * 100, 2)
10: display tree_accuracy
1: start
2: from xgboost import XGBRegressor
3: XGB = XGBRegressor()
4: X_train = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
5: Y_train = train_df['Item_Outlet_Sales']
6: X_test = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1).copy()
7: XGB.fit(X_train, Y_train)
8: y_pred = XGB.predict(X_test)
9: rmse = sqrt(mean_squared_error(y_test, y_pred))
10: display rmse
11: X_t = test_data[feat_cols]
12: y_result = XGB.predict(X_t)
13: display y_result
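The four listings above all follow the same fit/predict pattern. A self-contained runnable sketch of that comparison is shown below, on synthetic data since the Kaggle files are not bundled here; scikit-learn's GradientBoostingRegressor is swapped in for XGBoost to avoid the extra dependency:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 250, size=(300, 3))          # stand-ins for MRP, weight, visibility
y = 12 * X[:, 0] + rng.normal(0, 50, size=300)  # toy sales roughly driven by "MRP"

# 80:20 split, as in the report
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boost": GradientBoostingRegressor(random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))
    print(name, round(rmse[name], 1))
```

On the real Big Mart data, `X_train`/`X_test` would come from the cleaned and feature-engineered frames built in Chapter 3 instead of this random matrix.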
CHAPTER 5
5.1 IMPLEMENTATION
The proposed model gives better predictions than the other models for future
sales at all locations. For example, how Item MRP is correlated with outlet
sales is shown in Figure 5. Figure 5 also shows that Item Outlet Sales is
strongly correlated with Item MRP, where the correlation is defined in this
equation.
5.2 RESULT
In our work, we take both the train and test datasets and combine
them to check for missing values.
How many values are missing can be checked through the graph.
After this, the missing values have to be filled with the mean of that
particular attribute.
5.3 RESULT
5.3.1.1 CODE
5.3.1.2 OUTPUT
5.3.2.1 CODE
5.3.2.2 OUTPUT
A decision tree is a machine learning technique used for classification
and regression problems. The idea behind this algorithm is a top-down
approach: you start with all the training cases at the root node and then split
the tree into branches until you reach the leaf nodes. A decision tree uses the
Gini Index or entropy to split the nodes. The Gini Index measures the impurity
of the attributes and chooses the attributes that are the purest; the attribute
with a Gini score of 0 is the purest.
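A tiny pure-Python sketch of the Gini impurity computation described above (the labels are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 means pure)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(["low", "low", "low"]))                 # 0.0 -> pure node
print(gini(["low", "regular", "low", "regular"]))  # 0.5 -> maximally mixed for 2 classes
```

A split is chosen to minimize the weighted impurity of the child nodes, which is why pure (score 0) partitions are preferred.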
5.3.3.1 CODE
5.3.3.2 OUTPUT
5.3.4.1 CODE
5.3.4.2 OUTPUT
5.4 ANALYSIS
CHAPTER 6
This chapter includes the conclusion of the completed work and things
that can be enhanced in this project in the future.
6.1 CONCLUSION
The ML algorithm that performed the best was XGBoost, with an RMSE
of 1199.