Sie sind auf Seite 1von 35

# 5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## Project 4: Final Project: BigMart Sales

Prediction
Submitter: Cathy Dai (Data Science Course)

## Chapter 1: Problem Statement

Come across supply chain, we always want to understand the trends of the sales
for products. Here, we have BigMart dataset which have collected in year 2013 for
1559 products across 10 stores in different cities. Also, certain attributes of each
product and store have been defined. The aim is to build a predictive model and
find out the sales of each product at a particular store.

In this project, I would like to use machine learning to build a predictive model and
find out the sales of each product at a particular store.

## Chapter 2: Load the data

I'm going to use the Kaggle dataset. This dataset can be found:

https://www.kaggle.com/devashish0507/big-mart-sales-prediction
(https://www.kaggle.com/devashish0507/big-mart-sales-prediction)

## The dataset contains 8523 observations in products. There are 12 columns of

measurements of them.

## 2.1 Import Libraries

Let's import all of the modules, functions and objects we are going to use.

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 1/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from pandas.plotting import scatter_matrix

## from sklearn import model_selection

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
plt.style.use("fivethirtyeight")

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 2/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [2]: #load the data, remove the first row

kaggle
Out[2]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_I

## 2 FDN15 17.500 Low Fat 0.016760 Meat 141.6180

Fruits and
3 FDX07 19.200 Regular 0.000000 182.0950
Vegetables

## ... ... ... ... ... ... ...

Snack
8518 FDF22 6.865 Low Fat 0.056783 214.5218
Foods

Baking
8519 FDS36 8.380 Regular 0.046982 108.1570
Goods

Health and
8520 NCJ29 10.600 Low Fat 0.035186 85.1224
Hygiene

Snack
8521 FDN46 7.210 Regular 0.145221 103.1332
Foods

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 3/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [3]: kaggle.dtypes

## Out[3]: Item_Identifier object

Item_Weight float64
Item_Fat_Content object
Item_Visibility float64
Item_Type object
Item_MRP float64
Outlet_Identifier object
Outlet_Establishment_Year int64
Outlet_Size object
Outlet_Location_Type object
Outlet_Type object
Item_Outlet_Sales float64
dtype: object

In [4]: kaggle.index.dtype

Out[4]: dtype('int64')

In [5]: kaggle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 7060 non-null float64
2 Item_Fat_Content 8523 non-null object
3 Item_Visibility 8523 non-null float64
4 Item_Type 8523 non-null object
5 Item_MRP 8523 non-null float64
6 Outlet_Identifier 8523 non-null object
7 Outlet_Establishment_Year 8523 non-null int64
8 Outlet_Size 6113 non-null object
9 Outlet_Location_Type 8523 non-null object
10 Outlet_Type 8523 non-null object
11 Item_Outlet_Sales 8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 566.0+ KB

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 4/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [6]: # count the item identifier number

kaggle.Item_Identifier.value_counts()
Out[6]: FDG33 10
FDW13 10
FDP25 9
NCF42 9
FDU12 9
..
FDK57 1
FDQ60 1
FDE52 1
FDN52 1
FDT35 1
Name: Item_Identifier, Length: 1559, dtype: int64

## Chapter 3: Summarize the dataset

In this chapter, we study the data by looking at the data a few different ways:

## Dimensions of the dataset.

Peek at the data itself.
Statistical summary of all attributes.
Breakdown of the data by the class variable.

In [7]: # shape
kaggle.shape
Out[7]: (8523, 12)

Out[8]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Iden

## 2 FDN15 17.50 Low Fat 0.016760 Meat 141.6180 OU

Fruits and
3 FDX07 19.20 Regular 0.000000 182.0950 OU
Vegetables

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 5/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [9]: # descripe
kaggle.describe(include=['object'])
Out[9]:
Item_Identifier Item_Fat_Content Item_Type Outlet_Identifier Outlet_Size Outlet_Location_

## count 8523 8523 8523 8523 6113

unique 1559 5 16 10 3

Fruits and
top FDG33 Low Fat OUT027 Medium T
Vegetables

## In [10]: # class distribution

kaggle.groupby('Outlet_Identifier').size()
Out[10]: Outlet_Identifier
OUT010 555
OUT013 932
OUT017 926
OUT018 928
OUT019 528
OUT027 935
OUT035 930
OUT045 929
OUT046 930
OUT049 930
dtype: int64

## In [11]: # Locate two measurements

kaggle.loc[:,['Item_MRP','Item_Outlet_Sales']]

Out[11]:
Item_MRP Item_Outlet_Sales

0 249.8092 3735.1380

1 48.2692 443.4228

2 141.6180 2097.2700

3 182.0950 732.3800

4 53.8614 994.7052

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 6/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## Chapter 4: Data Visualization

We now have a basic idea about the data. We need to extend that with some
visualizations.

## We are going to look at different charts to understand the correlation of the

measurements:

Univariate Plot
Multivariate Plot
Chart types
histogram
scatter plot
box plot

## In [12]: # count the number of missing values in each column

kaggle.isnull().sum(axis=0)

Out[12]: Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 7/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [13]: # Correlation matrix (ranges from 1 to -1)

kaggle.corr()
Out[13]:
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_

Conclusion:

## Item_MRP seems to have a good correlation with Item_Outlet_Sales and other

measurements have a very minimum correlation value with Item_Outlet_Sales.
From correlation data we can just drop the two measurement: Item_Visibility and
Item_Weight.

## In [14]: # Next is to see Item_Identifier: classified from 1-10

kaggle.Item_Identifier.value_counts()

Out[14]: FDG33 10
FDW13 10
FDP25 9
NCF42 9
FDU12 9
..
FDK57 1
FDQ60 1
FDE52 1
FDN52 1
FDT35 1
Name: Item_Identifier, Length: 1559, dtype: int64

## In [15]: # And see the Item_Fat_Content:

kaggle.Item_Fat_Content.value_counts()
Out[15]: Low Fat 5089
Regular 2889
LF 316
reg 117
low fat 112
Name: Item_Fat_Content, dtype: int64

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 8/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [16]: # Grouping same category of description = standardazation

kaggle.Item_Fat_Content=kaggle.Item_Fat_Content.replace('Low Fat', 'LF')
kaggle.Item_Fat_Content=kaggle.Item_Fat_Content.replace('reg','Regular')
kaggle.Item_Fat_Content=kaggle.Item_Fat_Content.replace('low fat','LF')

In [17]: kaggle.Item_Fat_Content.value_counts()

Out[17]: LF 5517
Regular 3006
Name: Item_Fat_Content, dtype: int64

In [18]: # Further processing the data: convert to the correct data type
kaggle.Item_Identifier=kaggle.Item_Identifier.astype('category')
kaggle.Item_Fat_Content=kaggle.Item_Fat_Content.astype('category')
kaggle.Item_Type=kaggle.Item_Type.astype('category')

kaggle.Outlet_Identifier=kaggle.Outlet_Identifier.astype('category')
kaggle.Outlet_Establishment_Year=kaggle.Outlet_Establishment_Year.astype('int64')

kaggle.Outlet_Size=kaggle.Outlet_Size.astype('category')
kaggle.Outlet_Location_Type=kaggle.Outlet_Location_Type.astype('category')
kaggle.Outlet_Type=kaggle.Outlet_Type.astype('category')

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 9/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [19]: # Histogram overview: Univariate Plots: start with individual measurement

kaggle.hist(figsize=(8,12))
Out[19]: array([[<matplotlib.axes._subplots.AxesSubplot object at 0x131CE568>,
<matplotlib.axes._subplots.AxesSubplot object at 0x131FDF70>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x1421C988>,
<matplotlib.axes._subplots.AxesSubplot object at 0x14240388>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x14253D60>,
<matplotlib.axes._subplots.AxesSubplot object at 0x14274790>]],
dtype=object)

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 10/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 11/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [20]: # seaborn plot: Multivariate Plots: interaction between variables

# https://seaborn.pydata.org/examples/scatterplot_matrix.html

## import seaborn as sns

sns.set(style="ticks")

sns.pairplot(kaggle, hue="Item_Fat_Content")
Out[20]: <seaborn.axisgrid.PairGrid at 0x1430edd8>

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 12/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [21]: # Pandas scatterplot in highest two measurement in correlation

fig,axes=plt.subplots(1,1,figsize=(12,8))
sns.scatterplot(x='Item_MRP',y='Item_Outlet_Sales',hue='Item_Fat_Content',size= '
Out[21]: <matplotlib.axes._subplots.AxesSubplot at 0x131187f0>

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 13/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [22]: kaggle.describe()

Out[22]:
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_Outlet_Sales

## max 21.350000 0.328391 266.888400 2009.000000 13086.964800

Item_MRP column contain prices which are in clusters so it would be better if we convert this
columnn into bins for further processing.

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 14/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [23]: # useing subplot

# https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.subplot.html
fig,axes=plt.subplots(1,1,figsize=(10,8))
sns.scatterplot(x='Item_MRP',y='Item_Outlet_Sales',hue='Item_Fat_Content',size='I
plt.plot([69,69],[0,5000])
plt.plot([137,137],[0,5000])
plt.plot([203,203],[0,9000])

## Out[23]: [<matplotlib.lines.Line2D at 0x15538cd0>]

We can use these perpendicular lines to divide data into proper bins.

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 15/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [24]: kaggle.Item_MRP=pd.cut(kaggle.Item_MRP,bins=[25,69,137,203,270],labels=['a','b','

Out[24]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Iden

## 2 FDN15 17.50 LF 0.016760 Meat c OU

Fruits and
3 FDX07 19.20 Regular 0.000000 c OU
Vegetables

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 16/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [25]: # Other measurements relationship

fig,axes=plt.subplots(4,1,figsize=(15,12))
sns.scatterplot(x='Item_Visibility',y='Item_Outlet_Sales',hue='Item_MRP',ax=axes[
sns.scatterplot(x='Item_Weight',y='Item_Outlet_Sales',hue='Item_MRP',ax=axes[1],d
sns.boxplot(x='Item_Type',y='Item_Outlet_Sales',ax=axes[2],data=kaggle)
sns.boxplot(x='Outlet_Identifier',y='Item_Outlet_Sales',ax=axes[3],data=kaggle)

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 17/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [26]: fig,axes=plt.subplots(2,2,figsize=(15,12))
sns.boxplot(x='Outlet_Establishment_Year',y='Item_Outlet_Sales',ax=axes[0,0],data
sns.boxplot(x='Outlet_Size',y='Item_Outlet_Sales',ax=axes[0,1],data=kaggle)
sns.boxplot(x='Outlet_Location_Type',y='Item_Outlet_Sales',ax=axes[1,0],data=kagg
sns.boxplot(x='Outlet_Type',y='Item_Outlet_Sales',ax=axes[1,1],data=kaggle)

## Drop: Item_Weight and Item_Visibility as our measurements due to the minimum

correctionship with targeted value(Item_Outlet_Sales).

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 18/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## Study feature matrix will be:

Item_Type,
Item_MRP,
Outlet_Identifier,
Outlet_Establishment_Year,
Outlet_Size,
Outlet_Location_Type,
Outlet_Type,
Item_Outlet_Sales, as further processing

In [27]: feature_col=['Item_MRP','Outlet_Type','Outlet_Location_Type','Outlet_Size','Outle

In [28]: fig,axes=plt.subplots(2,2,figsize=(15,12))
sns.boxplot(x='Outlet_Establishment_Year',y='Item_Outlet_Sales',hue='Outlet_Size'
sns.boxplot(x='Outlet_Size',y='Item_Outlet_Sales',hue='Outlet_Size',ax=axes[0,1],
sns.boxplot(x='Outlet_Location_Type',y='Item_Outlet_Sales',hue='Outlet_Size',ax=a
sns.boxplot(x='Outlet_Type',y='Item_Outlet_Sales',hue='Outlet_Size',ax=axes[1,1],

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 19/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [29]: data=kaggle[feature_col]

In [30]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_MRP 8523 non-null category
1 Outlet_Type 8523 non-null category
2 Outlet_Location_Type 8523 non-null category
3 Outlet_Size 6113 non-null category
4 Outlet_Establishment_Year 8523 non-null int64
5 Outlet_Identifier 8523 non-null category
6 Item_Type 8523 non-null category
7 Item_Outlet_Sales 8523 non-null float64
dtypes: category(6), float64(1), int64(1)
memory usage: 184.2 KB

In [31]: fig,axes=plt.subplots(1,1,figsize=(8,6))
sns.boxplot(x='Outlet_Location_Type', y='Item_Outlet_Sales', hue='Outlet_Type', d
Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x173bdf70>

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 20/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [32]: data[data.Outlet_Size.isnull()]

Out[32]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Outlet_

Grocery
3 c Tier 3 NaN 1998
Store

Supermarket
8 b Tier 2 NaN 2002
Type1

Supermarket
9 c Tier 2 NaN 2007
Type1

Supermarket
25 a Tier 2 NaN 2007
Type1

Grocery
28 a Tier 3 NaN 1998
Store

## ... ... ... ... ... ...

Supermarket
8502 d Tier 2 NaN 2002
Type1

Supermarket
8508 c Tier 2 NaN 2002
Type1

Grocery
8509 d Tier 3 NaN 1998
Store

Supermarket
8514 a Tier 2 NaN 2002
Type1

Supermarket
8519 b Tier 2 NaN 2002
Type1

## Observe that when Outlet_Type =Supermarket Type1 and Outlet_Location_Type is

Tier 2 then outlet size is null.

## Also, when Outlet_Type = Grocery Store and Outlet_Location_Type is Tier 3 then

outlet size is alwys null.

## Out[33]: Tier 3 555

Tier 1 528
Tier 2 0
Name: Outlet_Location_Type, dtype: int64

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 21/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [34]: data.groupby('Outlet_Type').get_group('Grocery Store')

Out[34]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Out

Grocery
3 c Tier 3 NaN 1998
Store

Grocery
23 b Tier 1 Small 1985
Store

Grocery
28 a Tier 3 NaN 1998
Store

Grocery
29 a Tier 1 Small 1985
Store

Grocery
30 a Tier 3 NaN 1998
Store

## ... ... ... ... ... ...

Grocery
8473 c Tier 3 NaN 1998
Store

Grocery
8480 c Tier 1 Small 1985
Store

Grocery
8486 a Tier 3 NaN 1998
Store

Grocery
8490 c Tier 1 Small 1985
Store

Grocery
8509 d Tier 3 NaN 1998
Store

## 1083 rows × 8 columns

In [35]: data.groupby(['Outlet_Location_Type','Outlet_Type'])['Outlet_Size'].value_counts(

## Out[35]: Outlet_Location_Type Outlet_Type Outlet_Size

Tier 1 Grocery Store Small 528
Supermarket Type1 Medium 930
Small 930
Tier 2 Supermarket Type1 Small 930
Tier 3 Supermarket Type1 High 932
Supermarket Type2 Medium 928
Supermarket Type3 Medium 935
Name: Outlet_Size, dtype: int64

In [36]: (data.Outlet_Identifier=='OUT010').value_counts()

## Out[36]: False 7968

True 555
Name: Outlet_Identifier, dtype: int64

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 22/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [37]: data.groupby('Outlet_Size').Outlet_Identifier.value_counts()

## Out[37]: Outlet_Size Outlet_Identifier

High OUT013 932
Medium OUT027 935
OUT049 930
OUT018 928
Small OUT035 930
OUT046 930
OUT019 528
Name: Outlet_Identifier, dtype: int64

Observed that:

Tier 1 have small and medium size shop. Tier 2 have small and (missing 1) type
shop. Tier 3 have 2-medium and 1 high and (missing 2) shop

## In [38]: def func(x):

if x.Outlet_Identifier == 'OUT010' :
x.Outlet_Size == 'High'
elif x.Outlet_Identifier == 'OUT045' :
x.Outlet_Size == 'Medium'
elif x.Outlet_Identifier == 'OUT017' :
x.Outlet_Size == 'Medium'
elif x.Outlet_Identifier == 'OUT013' :
x.Outlet_Size == 'High'
elif x.Outlet_Identifier == 'OUT046' :
x.Outlet_Size == 'Small'
elif x.Outlet_Identifier == 'OUT035' :
x.Outlet_Size == 'Small'
elif x.Outlet_Identifier == 'OUT019' :
x.Outlet_Size == 'Small'
elif x.Outlet_Identifier == 'OUT027' :
x.Outlet_Size == 'Medium'
elif x.Outlet_Identifier == 'OUT049' :
x.Outlet_Size == 'Medium'
elif x.Outlet_Identifier == 'OUT018' :
x.Outlet_Size == 'Medium'
return(x)

In [39]: data.Outlet_Size=data.apply(func,axis=1)

c:\python\python38-32\lib\site-packages\pandas\core\generic.py:5303: SettingWit
hCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/sta

ble/user_guide/indexing.html#returning-a-view-versus-a-copy (https://pandas.pyd
ata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-c
opy)
self[name] = value

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 23/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

Out[40]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Outlet_Ide

Supermarket
0 d Tier 1 d 1999 OU
Type1

Supermarket
1 a Tier 3 a 2009 OU
Type2

Supermarket
2 c Tier 1 c 1999 OU
Type1

Grocery
3 c Tier 3 c 1998 OU
Store

Supermarket
4 a Tier 3 a 1987 OU
Type1

## In [41]: # Visualize Item_MRP and Item_Outlet_Sales

sns.boxplot(x='Item_MRP',y='Item_Outlet_Sales', data=data)
Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x179977c0>

In [42]: data[data.Item_MRP=='b'].Item_Outlet_Sales.max()

Out[42]: 7158.6816

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 24/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [43]: data[data.Item_Outlet_Sales==7158.6816]

Out[43]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Outlet_

Supermarket
7737 d Tier 3 d 1985
Type3

Supermarket
7796 b Tier 3 b 1985
Type3

In [44]: data=data.drop(index=7796)
data.groupby('Item_MRP').get_group('b')['Item_Outlet_Sales'].max()
Out[44]: 5582.733

## In [45]: # Visualize in boxplot

sns.boxplot(x='Outlet_Type', y='Item_Outlet_Sales', data=data)

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 25/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [47]: data[data.Outlet_Location_Type=='Tier 1'].Item_Outlet_Sales.max()

Out[47]: 9779.9362

In [48]: data[data['Item_Outlet_Sales']==9779.9362]
Out[48]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Outlet_

Supermarket
4289 d Tier 1 d 1997
Type1

In [49]: data=data.drop(index=4289)

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 26/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [50]: sns.boxplot(x='Outlet_Size',y='Item_Outlet_Sales', data=data)

Out[50]: <matplotlib.axes._subplots.AxesSubplot at 0x17449d60>

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 27/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## Out[51]: <matplotlib.axes._subplots.AxesSubplot at 0x173a96a0>

In [52]: data.Outlet_Establishment_Year=data.Outlet_Establishment_Year.astype('category')
data_label=data.Item_Outlet_Sales
data_dummy=pd.get_dummies(data.iloc[:,0:6])

In [53]: data_dummy['Item_Outlet_Sales']=data_label

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 28/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [54]: data_dummy.shape

## Chapter 5: Evaluate some algorithm

Here is what we are going to cover in this step:
Machine learning algorithm
Linear Regression Analysis
Create valid dataset
Test harness
Build modelss
Select the best model

## In [55]: from sklearn.model_selection import train_test_split

We will split the loaded dataset into two, 80% of which we will use to train our models and 20%
that we will hold back as a validation dataset.

## In [56]: # train test split:

train,test=train_test_split(data_dummy, test_size=0.2,random_state=2019)

## Out[57]: ((6816, 35), (1705, 35))

In [58]: train_label=train['Item_Outlet_Sales']
test_label=test['Item_Outlet_Sales']
del train['Item_Outlet_Sales']
del test['Item_Outlet_Sales']

## In [59]: from sklearn.linear_model import LinearRegression

In [60]: lr = LinearRegression()

In [61]: lr.fit(train,train_label)

Out[61]: LinearRegression()

## In [62]: from sklearn.metrics import mean_squared_error

In [63]: predict_lr=lr.predict(test)

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 29/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [64]: mse=mean_squared_error(test_label, predict_lr)

In [65]: lr.score=np.sqrt(mse)

In [66]: lr.score
Out[66]: 1169.1077432763884

## Chapter 5.3 Create Valid Dataset

We will split the loaded dataset into two, 80% of which we will use to train our
models and 20% that we will hold back as a validation dataset.

## In [67]: # Split-out validation dataset

array = data_dummy.values
X = array[:,0:6]
Y = array[:,6]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X

## We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all
combinations of train-test splits.

## In [68]: # Test options and evaluation metric

# https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
seed = 9
scoring = 'accuracy'

## Logistic Regression (LR)

Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN).
Classification and Regression Trees (CART).
Gaussian Naive Bayes (NB).

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 30/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## Support Vector Machines (SVM).

Here, we use a good mixture of simple linear (LR and LDA), nonlinear (KNN, CART, NB and SVM)
algorithms. We reset the random number seed before each run to ensure that the evaluation of
each algorithm is performed using exactly the same data splits. It ensures the results are directly
comparable.

## In [69]: # Split-out validation dataset

array = data_dummy.values
X = array[:,0:6]
Y = array[:,6]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 31/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## In [70]: # Spot Check Algorithms

models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=seed)
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfol
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)

c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(

## LR: 0.894956 (0.008420)

LDA: 0.890847 (0.014584)

c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(
c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(

## KNN: 0.890258 (0.009133)

CART: 0.887913 (0.009095)
NB: 0.890847 (0.014584)

c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(
c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(
c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(
localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 32/35
5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

## Chapter 5.6 Selction Best Models

In this case, we can see that it looks like Logistic Regression (LR) has the largest
estimated accuracy score.

## In [71]: # Compare Algorithms

fig = plt.figure()
fig.suptitle('Algorithm Comparison')
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

## Chapter 6: Make prediction

From above, the Logistic Regression ( LR) algorithm is very simple and was an accurate model
based on our tests. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to
keep a validation set just in case you made a slip during training, such as overfitting to the training
set or a data leak. Both will result in an overly optimistic result.

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 33/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

We can run the Logistic Regression model directly on the validation set and summarize the results
as a final accuracy score, a confusion matrix and a classification report.

## In [72]: # Make predictions on validation dataset

LR = LogisticRegression()
LR.fit(X_train, Y_train)
predictions = LR.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.8821114369501466
[[1476 35]
[ 166 28]]
precision recall f1-score support

## 0.0 0.90 0.98 0.94 1511

1.0 0.44 0.14 0.22 194

## accuracy 0.88 1705

macro avg 0.67 0.56 0.58 1705
weighted avg 0.85 0.88 0.85 1705

We can see that the accuracy is 88.2%. Y validation is 1476. Finally, the classification report
provides a breakdown of each class by precision, recall, f1-score and support showing excellent
results.

## Chapter 7: Summary and Insights

In this project excersie, i learn to use a step-by-step on machine learning project using Python.

Based on the study, sales predict problem can be examined by using machine learning
metholodgies across different alogrithms.

Future work is to learn different ways of statistic models on used case and figure out what to use,
how to use and why to use.

References:
https://www.kaggle.com/devashish0507/big-mart-sales-prediction
(https://www.kaggle.com/devashish0507/big-mart-sales-prediction)
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
(https://machinelearningmastery.com/machine-learning-in-python-step-by-step/)
https://seaborn.pydata.org/tutorial/distributions.html
(https://seaborn.pydata.org/tutorial/distributions.html)
https://matplotlib.org/3.1.0/gallery/subplots_axes_and_figures/subplots_demo.html
(https://matplotlib.org/3.1.0/gallery/subplots_axes_and_figures/subplots_demo.html)

## localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 34/35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook