A PROJECT REPORT
Submitted by
AMIT KUMAR
(2017272001)
in partial fulfillment
for the award of the degree
of
APRIL 2020
ANNA UNIVERSITY
CHENNAI - 600 025
BONA FIDE CERTIFICATE
PLACE: Chennai DR. D. NARASHIMAN
DATE: 07-04-2020 TEACHING FELLOW
PROJECT GUIDE
DEPARTMENT OF IST, CEG
ANNA UNIVERSITY
CHENNAI 600025
COUNTERSIGNED
ABSTRACT
ACKNOWLEDGEMENT
AMIT KUMAR
TABLE OF CONTENTS
ABSTRACT
ABSTRACT (TAMIL)
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
1 INTRODUCTION
1.1 PREDICTION METHODOLOGY
1.1.1 MOTIVATION
1.1.2 PROPOSED WORK
1.2 ORGANIZATION OF REPORT
2 LITERATURE SURVEY
2.1 Regression Algorithm
2.2 Classifier Algorithm
3 SYSTEM ARCHITECTURE
3.1 MODULE DESCRIPTION
3.1.1 DATASET DESCRIPTION OF BIG MART
3.1.2 DATA EXPLORATION
3.1.3 DATA CLEANING
3.1.4 FEATURE ENGINEERING
3.1.5 MODEL BUILDING
4 PSEUDO CODE/ALGORITHM
4.1 LINEAR REGRESSION
4.2 RANDOM FOREST
4.3 DECISION TREE REGRESSOR
4.4 XGBOOST REGRESSOR
REFERENCES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Decision trees where the target variable can take continuous values (typically
real numbers) are called regression trees. In decision analysis, a decision tree
can be used to visually and explicitly represent decision making. In data mining,
a decision tree describes data (and the resulting classification tree can serve as
an input for decision making).
1.1.1 MOTIVATION
To handle missing values, the proposed ensemble approach uses Decision Trees,
Linear Regression, and Random Forest. Both Mean Absolute Error (MAE) and
Root Mean Square Error (RMSE) are used as accuracy metrics for predicting
the sales in Big Mart.
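As a small illustration of the two metrics named above (synthetic values, not the report's data), MAE and RMSE can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted sales figures (illustrative only)
y_true = np.array([250.0, 300.0, 150.0, 400.0])
y_pred = np.array([240.0, 320.0, 160.0, 390.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of mean squared error

print(mae)   # 12.5
print(rmse)
```

RMSE penalizes large errors more heavily than MAE, which is why both are reported side by side.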
• Chapter 6 concludes the entire work and also outlines future
enhancements of the system.
CHAPTER 2
LITERATURE SURVEY
The data scientists at BigMart have collected 2013 sales data for
1559 products across 10 stores in different cities. Barun Waldron and Sanjeev
Trivedi introduced this method [1]. The main objective is to understand
whether specific properties of products and/or stores play a significant role in
increasing or decreasing sales volume. To achieve this goal, they
built a predictive model and found the sales of each product at a particular
store, which helps BigMart boost its sales by learning optimised product
organization inside stores. They used a Linear Regression approach, whose
methodology is easy to refine and analyse; its disadvantage, however, is that
it is not fully accurate.
The paper in [5] also elaborates the automated process of knowledge acquisition.
Machine learning is the process where a machine learns from data using
statistical or computational methods and acquires knowledge from experience.
Various machine learning techniques, with their applications in different
sectors, have been presented in [6]. Pat Langley and Herbert A. pointed out
the most widely used data mining techniques. In today's era, data analysis is
important for everyone to make better decisions in their field. Analysing big
data and extracting knowledge from it is somewhat difficult, so a powerful and
effective data mining tool is needed to mine complex datasets, extract
information, and support better future decisions. R is an efficient, free,
open-source data mining tool. R has several inbuilt packages (Fikes, Richard E.
et al. [7]) that provide efficiency, such as ggplot2 and VIM. R is an
open-source data analysis environment and programming language. Data analysis,
the process of converting data into knowledge, insight, and understanding, is a
critical part of statistics. For the effective processing and analysis of big
data, R allows users to conduct a number of essential tasks. R consists of
numerous ready-to-use statistical modelling and machine learning algorithms
that allow users to create reproducible research and develop data products. In
this method, Linear Regression, Decision Tree, K-Means, and Naïve Bayes can be
used; the Random Forest algorithm, the Éclat algorithm, etc. can be implemented
as future work. K-Means is also used to cluster the dataset according to
categories, and finally a Naïve Bayes classifier was implemented for the Item
Fat Content variable. The main motive of this paper is to show how to tackle
such a giant dataset with regularly missing values.
This chapter elaborates on the various works that have been carried out in
predicting the sales of a product and the profit based on the product. The next
chapter gives an overview of the proposed system implementation. One surveyed
paper examines the problem of demand forecasting on an e-commerce web site. Its
authors proposed a stacked generalization method consisting of sub-level
regressors, and also tested the results of single classifiers separately,
together with the general model. Experiments showed that their approach
predicts demand at least as well as single classifiers do, and even better
using much less training data (only 20 percent of the dataset), suggesting it
would predict much better when more data is used. Because the difference
between the proposed model and random forest is not statistically significant,
the proposed method can be used to forecast demand, given its accuracy with
fewer data; the authors plan to use its output as part of a price optimization
problem in future work. A method for long-term electric power forecasting using
long-term annual growth factors was also proposed. Prediction and analysis of
aero-material consumption based on a multivariate linear regression model was
proposed by collecting data on basic monitoring indicators of aircraft tire
consumption from 2001 to 2016. A forecast of bicycle rental demand based on
random forests and multiple linear regressions was proposed based on weather
data.
CHAPTER 3
SYSTEM ARCHITECTURE
Text - The system takes the dataset of Big Mart sales records for 1987-2013
and gets the details of the particular sales records year-wise.
Text processing - Cleaning the data, filling the missing values using the mean,
and labelling the chronological values as integers.
Compare with algorithms - The prediction algorithms, namely Linear Regression,
Decision Tree, and Random Forest, are used to get the sales of each product;
they are compared to find which has the best accuracy. It was found that Random
Forest has the best accuracy, at 60.81 percent.
Dataset - The Kaggle dataset, which contains sales records from 1987 to 2013,
was used.
Predict the sales - The chosen algorithm is used for predicting the sales.
Output: Predicted sales. Dataset: Kaggle.
In our work we have used the 2013 sales data of Big Mart as the dataset.
The dataset consists of 12 attributes: Item Fat, Item Type, Item
MRP, Outlet Type, Item Visibility, Item Weight, Outlet Identifier, Outlet Size,
Outlet Establishment Year, Outlet Location Type, Item Identifier, and Item
Outlet Sales. Of these attributes, the response variable is Item Outlet Sales,
and the remaining attributes are used as predictor variables. The dataset
consists of 8523 products across different cities and locations. The dataset
is also based on store-level and product-level hypotheses, where the store
level involves attributes like city, population density, store capacity,
location, etc., and the product-level hypotheses involve attributes like brand,
advertisement, promotional offers, etc. After considering all of these, a
dataset is formed, and finally the dataset was divided into two parts, a
training set and a test set, in the ratio 80:20.
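A minimal sketch of the 80:20 split described above, using scikit-learn's `train_test_split`. The column names and values here are illustrative stand-ins, not the exact Big Mart schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny illustrative frame standing in for the Big Mart data
df = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
    "Item_Weight": [9.3, 5.9, 17.5, 19.2, 8.9],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7],
})

# Response variable vs. predictor variables, as in the report
X = df.drop("Item_Outlet_Sales", axis=1)
y = df["Item_Outlet_Sales"]

# 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 4 1
```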
In this phase, useful information has been extracted from the dataset,
that is, trying to identify information from the hypotheses versus the
available data. This shows that the attributes Outlet Size and Item Weight face
the problem of missing values; also, the minimum value of Item Visibility is
zero, which is not practically possible. The establishment year of an outlet
varies from 1985 to 2009. These values may not be appropriate in this form, so
we need to convert them into how old a particular outlet is. There are 1559
unique products, as well as 10 unique outlets, present in the dataset. The
attribute Item Type contains 16 unique values, whereas there are two types of
Item Fat Content, but some of the values are misspelled: 'regular' instead of
'Regular', and 'low fat' or 'LF' instead of 'Low Fat'. From Figure 2, it was
found that the response variable, Item Outlet Sales, was positively skewed.
So, to remove the skewness of the response variable, a log operation was
performed on Item Outlet Sales.
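The log operation mentioned above can be sketched as follows. The values are toy numbers, and `np.log1p` is one common choice for this transform; the report does not specify which log variant it used:

```python
import numpy as np
import pandas as pd

# Positively skewed toy sales values (illustrative only)
sales = pd.Series([120.0, 450.0, 980.0, 5200.0, 12000.0])

log_sales = np.log1p(sales)      # log(1 + x), safe even for zero sales
restored = np.expm1(log_sales)   # inverse transform, to report predictions in rupees

print(log_sales.round(2).tolist())
```

Predictions made on the log scale must be passed back through the inverse transform before being compared with raw sales figures.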
It was observed from the previous section that the attributes Outlet
Size and Item Weight have missing values. In our work, the missing Outlet Size
values are replaced by the mode of that attribute, and the missing Item Weight
values are replaced by the mean of that attribute. Replacement by mean and mode
diminishes the correlation among the imputed attributes; for our model we are
assuming that there is no relationship between the measured attributes and the
imputed attributes.
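A minimal pandas sketch of the imputation described above: mode for the categorical Outlet Size, mean for the numeric Item Weight. The column names and values are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Outlet_Size": ["Medium", None, "Small", "Medium", None],
    "Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.9],
})

# Categorical attribute: fill with the most frequent value (mode)
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])
# Numeric attribute: fill with the mean of the observed values
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

print(df)
```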
The decision tree handles the classification problem using entropy and
information gain [16] as metrics, defined in Equation 3 and Equation 4
respectively; for classifying an attribute, it picks the attribute with the
highest information gain to split the dataset, where S is the set of training
instances.
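The text cites Equation 3 and Equation 4 without reproducing them; the standard textbook definitions of entropy and information gain that they refer to are, in the usual notation:

Entropy(S) = - \sum_{i=1}^{c} p_i \log_2 p_i    (Equation 3)

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)    (Equation 4)

where S is the set of training samples, p_i is the proportion of S belonging to class i, and S_v is the subset of S for which attribute A takes value v. These are the standard ID3 definitions; the report's exact Equations 3 and 4 may differ in notation.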
CHAPTER 4
PSEUDO CODE/ALGORITHM
1: start
2: from sklearn.linear_model import LinearRegression
3: lr = LinearRegression(normalize=True)
4: X_train = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
5: Y_train = train_df['Item_Outlet_Sales']
6: X_test = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1).copy()
7: lr.fit(X_train, Y_train)
8: lr_pred = lr.predict(X_test)
9: lr_accuracy = round(lr.score(X_train, Y_train) * 100, 2)
10: display lr_accuracy
1: start
2: from sklearn.ensemble import RandomForestRegressor
3: rf = RandomForestRegressor()
4: X_train = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
5: Y_train = train_df['Item_Outlet_Sales']
6: X_test = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1).copy()
7: rf.fit(X_train, Y_train)
8: rf_pred = rf.predict(X_test)
9: rf_accuracy = round(rf.score(X_train, Y_train) * 100, 2)
10: display rf_accuracy
1: start
2: from sklearn.tree import DecisionTreeRegressor
3: tree = DecisionTreeRegressor()
4: X_train = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
5: Y_train = train_df['Item_Outlet_Sales']
6: X_test = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1).copy()
7: tree.fit(X_train, Y_train)
8: tree_pred = tree.predict(X_test)
9: tree_accuracy = round(tree.score(X_train, Y_train) * 100, 2)
10: display tree_accuracy
1: start
2: from xgboost import XGBRegressor
3: XGB = XGBRegressor()
4: X_train = train_df.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
5: Y_train = train_df['Item_Outlet_Sales']
6: X_test = test_df.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1).copy()
7: XGB.fit(X_train, Y_train)
8: y_pred = XGB.predict(X_test)
9: rmse = sqrt(mean_squared_error(y_test, y_pred))
10: display rmse
11: X_t = test_data[feat_cols]
12: y_result = XGB.predict(X_t)
13: display y_result
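The four listings above all follow the same fit/predict pattern. A self-contained runnable sketch of that comparison is shown below, on synthetic data since the Kaggle files are not bundled here; scikit-learn's GradientBoostingRegressor is swapped in for XGBoost to avoid the extra dependency:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 250, size=(300, 3))          # stand-ins for MRP, weight, visibility
y = 12 * X[:, 0] + rng.normal(0, 50, size=300)  # toy sales roughly driven by "MRP"

# 80:20 split, as in the report
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boost": GradientBoostingRegressor(random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))
    print(name, round(rmse[name], 1))
```

On the real Big Mart data, `X_train`/`X_test` would come from the cleaned and feature-engineered frames built in Chapter 3 instead of this random matrix.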
CHAPTER 5
5.1 IMPLEMENTATION
The proposed model gives better predictions than the other models for future
sales at all locations. For example, how Item MRP is correlated with outlet
sales is shown in Figure 5. Figure 5 also shows that Item Outlet Sales is
strongly correlated with Item MRP, where the correlation is defined in this
equation.
5.2 RESULT
In our work, we take both the train and test datasets and combine
them to check for missing values.
How many values are missing can be checked through the graph.
After this, the missing values have to be filled with the mean of that
particular attribute.
5.3 RESULT
5.3.1.1 CODE
5.3.1.2 OUTPUT
5.3.2.1 CODE
5.3.2.2 OUTPUT
A decision tree is a machine learning technique used for classification
and regression problems. The idea behind this algorithm is a top-down
approach: you start with all the training cases at the root node and then split
the tree into branches until you reach the leaf nodes. A decision tree uses the
Gini Index or entropy to split the nodes. The Gini Index measures the impurity
of the attributes and chooses the attributes that are the purest; the attribute
with a Gini score of 0 is the purest.
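A tiny pure-Python sketch of the Gini impurity computation described above (the labels are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 means pure)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(["low", "low", "low"]))                 # 0.0 -> pure node
print(gini(["low", "regular", "low", "regular"]))  # 0.5 -> maximally mixed for 2 classes
```

A split is chosen to minimize the weighted impurity of the child nodes, which is why pure (score 0) partitions are preferred.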
5.3.3.1 CODE
5.3.3.2 OUTPUT
5.3.4.1 CODE
5.3.4.2 OUTPUT
5.4 ANALYSIS
CHAPTER 6
This chapter includes the conclusion of the completed work and things
that can be enhanced in this project in the future.
6.1 CONCLUSION
The ML algorithm that performed the best was XGBoost, with an RMSE
of 1199.