Sie sind auf Seite 1von 35

5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

Project 4: Final Project: BigMart Sales


Prediction
Submitter: Cathy Dai (Data Science Course)

Chapter 1: Problem Statement

Come across supply chain, we always want to understand the trends of the sales
for products. Here, we have BigMart dataset which have collected in year 2013 for
1559 products across 10 stores in different cities. Also, certain attributes of each
product and store have been defined. The aim is to build a predictive model and
find out the sales of each product at a particular store.

In this project, I would like to use machine learning to build a predictive model and
find out the sales of each product at a particular store.

Chapter 2: Load the data

I'm going to use the Kaggle dataset. This dataset can be found:

https://www.kaggle.com/devashish0507/big-mart-sales-prediction
(https://www.kaggle.com/devashish0507/big-mart-sales-prediction)

The dataset contains 8523 observations in products. There are 12 columns of


measurements of them.

2.1 Import Libraries

Let's import all of the modules, functions and objects we are going to use.

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 1/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from pandas.plotting import scatter_matrix

from sklearn import model_selection


from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
plt.style.use("fivethirtyeight")

2.2 Load data

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 2/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [2]: #load the data, remove the first row


kaggle = pd.read_csv('train.csv')
kaggle
Out[2]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_I

0 FDA15 9.300 Low Fat 0.016047 Dairy 249.8092

1 DRC01 5.920 Regular 0.019278 Soft Drinks 48.2692

2 FDN15 17.500 Low Fat 0.016760 Meat 141.6180

Fruits and
3 FDX07 19.200 Regular 0.000000 182.0950
Vegetables

4 NCD19 8.930 Low Fat 0.000000 Household 53.8614

... ... ... ... ... ... ...

Snack
8518 FDF22 6.865 Low Fat 0.056783 214.5218
Foods

Baking
8519 FDS36 8.380 Regular 0.046982 108.1570
Goods

Health and
8520 NCJ29 10.600 Low Fat 0.035186 85.1224
Hygiene

Snack
8521 FDN46 7.210 Regular 0.145221 103.1332
Foods

8522 DRG01 14.800 Low Fat 0.044878 Soft Drinks 75.4670

8523 rows × 12 columns

2.3 Explore Training Data

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 3/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [3]: kaggle.dtypes

Out[3]: Item_Identifier object


Item_Weight float64
Item_Fat_Content object
Item_Visibility float64
Item_Type object
Item_MRP float64
Outlet_Identifier object
Outlet_Establishment_Year int64
Outlet_Size object
Outlet_Location_Type object
Outlet_Type object
Item_Outlet_Sales float64
dtype: object

In [4]: kaggle.index.dtype

Out[4]: dtype('int64')

In [5]: kaggle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 7060 non-null float64
2 Item_Fat_Content 8523 non-null object
3 Item_Visibility 8523 non-null float64
4 Item_Type 8523 non-null object
5 Item_MRP 8523 non-null float64
6 Outlet_Identifier 8523 non-null object
7 Outlet_Establishment_Year 8523 non-null int64
8 Outlet_Size 6113 non-null object
9 Outlet_Location_Type 8523 non-null object
10 Outlet_Type 8523 non-null object
11 Item_Outlet_Sales 8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 566.0+ KB

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 4/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [6]: # count the item identifier number


kaggle.Item_Identifier.value_counts()
Out[6]: FDG33 10
FDW13 10
FDP25 9
NCF42 9
FDU12 9
..
FDK57 1
FDQ60 1
FDE52 1
FDN52 1
FDT35 1
Name: Item_Identifier, Length: 1559, dtype: int64

Chapter 3: Summarize the dataset

In this chapter, we study the data by looking at the data a few different ways:

Dimensions of the dataset.


Peek at the data itself.
Statistical summary of all attributes.
Breakdown of the data by the class variable.

In [7]: # shape
kaggle.shape
Out[7]: (8523, 12)

In [8]: kaggle.head()

Out[8]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Iden

0 FDA15 9.30 Low Fat 0.016047 Dairy 249.8092 OU

1 DRC01 5.92 Regular 0.019278 Soft Drinks 48.2692 OU

2 FDN15 17.50 Low Fat 0.016760 Meat 141.6180 OU

Fruits and
3 FDX07 19.20 Regular 0.000000 182.0950 OU
Vegetables

4 NCD19 8.93 Low Fat 0.000000 Household 53.8614 OU

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 5/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [9]: # descripe
kaggle.describe(include=['object'])
Out[9]:
Item_Identifier Item_Fat_Content Item_Type Outlet_Identifier Outlet_Size Outlet_Location_

count 8523 8523 8523 8523 6113

unique 1559 5 16 10 3

Fruits and
top FDG33 Low Fat OUT027 Medium T
Vegetables

freq 10 5089 1232 935 2793

In [10]: # class distribution


kaggle.groupby('Outlet_Identifier').size()
Out[10]: Outlet_Identifier
OUT010 555
OUT013 932
OUT017 926
OUT018 928
OUT019 528
OUT027 935
OUT035 930
OUT045 929
OUT046 930
OUT049 930
dtype: int64

In [11]: # Locate two measurements


kaggle.loc[:,['Item_MRP','Item_Outlet_Sales']]

Out[11]:
Item_MRP Item_Outlet_Sales

0 249.8092 3735.1380

1 48.2692 443.4228

2 141.6180 2097.2700

3 182.0950 732.3800

4 53.8614 994.7052

... ... ...

8518 214.5218 2778.3834

8519 108.1570 549.2850

8520 85.1224 1193.1136

8521 103.1332 1845.5976

8522 75.4670 765.6700

8523 rows × 2 columns

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 6/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

Chapter 4: Data Visualization

We now have a basic idea about the data. We need to extend that with some
visualizations.

We are going to look at different charts to understand the correlation of the


measurements:

Univariate Plot
Multivariate Plot
Chart types
histogram
scatter plot
box plot

Based on the observation, we could draw out some insights.

Training models using featured columns.

In [12]: # count the number of missing values in each column


kaggle.isnull().sum(axis=0)

Out[12]: Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

We can see that Item_Weight and Outlet_Size contain missing values.

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 7/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [13]: # Correlation matrix (ranges from 1 to -1)


kaggle.corr()
Out[13]:
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_

Item_Weight 1.000000 -0.014048 0.027141 -0.011588

Item_Visibility -0.014048 1.000000 -0.001315 -0.074834

Item_MRP 0.027141 -0.001315 1.000000 0.005020

Outlet_Establishment_Year -0.011588 -0.074834 0.005020 1.000000

Item_Outlet_Sales 0.014123 -0.128625 0.567574 -0.049135

Conclusion:

Item_MRP seems to have a good correlation with Item_Outlet_Sales and other


measurements have a very minimum correlation value with Item_Outlet_Sales.
From correlation data we can just drop the two measurement: Item_Visibility and
Item_Weight.

In [14]: # Next is to see Item_Identifier: classified from 1-10


kaggle.Item_Identifier.value_counts()

Out[14]: FDG33 10
FDW13 10
FDP25 9
NCF42 9
FDU12 9
..
FDK57 1
FDQ60 1
FDE52 1
FDN52 1
FDT35 1
Name: Item_Identifier, Length: 1559, dtype: int64

In [15]: # And see the Item_Fat_Content:


kaggle.Item_Fat_Content.value_counts()
Out[15]: Low Fat 5089
Regular 2889
LF 316
reg 117
low fat 112
Name: Item_Fat_Content, dtype: int64

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 8/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [16]: # Grouping same category of description = standardazation


kaggle.Item_Fat_Content=kaggle.Item_Fat_Content.replace('Low Fat', 'LF')
kaggle.Item_Fat_Content=kaggle.Item_Fat_Content.replace('reg','Regular')
kaggle.Item_Fat_Content=kaggle.Item_Fat_Content.replace('low fat','LF')

In [17]: kaggle.Item_Fat_Content.value_counts()

Out[17]: LF 5517
Regular 3006
Name: Item_Fat_Content, dtype: int64

In [18]: # Further processing the data: convert to the correct data type
kaggle.Item_Identifier=kaggle.Item_Identifier.astype('category')
kaggle.Item_Fat_Content=kaggle.Item_Fat_Content.astype('category')
kaggle.Item_Type=kaggle.Item_Type.astype('category')

kaggle.Outlet_Identifier=kaggle.Outlet_Identifier.astype('category')
kaggle.Outlet_Establishment_Year=kaggle.Outlet_Establishment_Year.astype('int64')

kaggle.Outlet_Size=kaggle.Outlet_Size.astype('category')
kaggle.Outlet_Location_Type=kaggle.Outlet_Location_Type.astype('category')
kaggle.Outlet_Type=kaggle.Outlet_Type.astype('category')

Chapter 4.1: Plots

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 9/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [19]: # Histogram overview: Univariate Plots: start with individual measurement


kaggle.hist(figsize=(8,12))
Out[19]: array([[<matplotlib.axes._subplots.AxesSubplot object at 0x131CE568>,
<matplotlib.axes._subplots.AxesSubplot object at 0x131FDF70>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x1421C988>,
<matplotlib.axes._subplots.AxesSubplot object at 0x14240388>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x14253D60>,
<matplotlib.axes._subplots.AxesSubplot object at 0x14274790>]],
dtype=object)

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 10/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 11/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [20]: # seaborn plot: Multivariate Plots: interaction between variables


# https://seaborn.pydata.org/examples/scatterplot_matrix.html

import seaborn as sns


sns.set(style="ticks")

sns.pairplot(kaggle, hue="Item_Fat_Content")
Out[20]: <seaborn.axisgrid.PairGrid at 0x1430edd8>

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 12/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [21]: # Pandas scatterplot in highest two measurement in correlation


fig,axes=plt.subplots(1,1,figsize=(12,8))
sns.scatterplot(x='Item_MRP',y='Item_Outlet_Sales',hue='Item_Fat_Content',size= '
Out[21]: <matplotlib.axes._subplots.AxesSubplot at 0x131187f0>

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 13/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [22]: kaggle.describe()

Out[22]:
Item_Weight Item_Visibility Item_MRP Outlet_Establishment_Year Item_Outlet_Sales

count 7060.000000 8523.000000 8523.000000 8523.000000 8523.000000

mean 12.857645 0.066132 140.992782 1997.831867 2181.288914

std 4.643456 0.051598 62.275067 8.371760 1706.499616

min 4.555000 0.000000 31.290000 1985.000000 33.290000

25% 8.773750 0.026989 93.826500 1987.000000 834.247400

50% 12.600000 0.053931 143.012800 1999.000000 1794.331000

75% 16.850000 0.094585 185.643700 2004.000000 3101.296400

max 21.350000 0.328391 266.888400 2009.000000 13086.964800

Item_MRP column contain prices which are in clusters so it would be better if we convert this
columnn into bins for further processing.

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 14/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [23]: # useing subplot


# https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.subplot.html
fig,axes=plt.subplots(1,1,figsize=(10,8))
sns.scatterplot(x='Item_MRP',y='Item_Outlet_Sales',hue='Item_Fat_Content',size='I
plt.plot([69,69],[0,5000])
plt.plot([137,137],[0,5000])
plt.plot([203,203],[0,9000])

Out[23]: [<matplotlib.lines.Line2D at 0x15538cd0>]

We can use these perpendicular lines to divide data into proper bins.

So we group the bin value as 25,69,137,203,270.

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 15/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [24]: kaggle.Item_MRP=pd.cut(kaggle.Item_MRP,bins=[25,69,137,203,270],labels=['a','b','
kaggle.head()

Out[24]:
Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP Outlet_Iden

0 FDA15 9.30 LF 0.016047 Dairy d OU

1 DRC01 5.92 Regular 0.019278 Soft Drinks a OU

2 FDN15 17.50 LF 0.016760 Meat c OU

Fruits and
3 FDX07 19.20 Regular 0.000000 c OU
Vegetables

4 NCD19 8.93 LF 0.000000 Household a OU

Chapter 4.2: Observations

Now, let's explore other measurements

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 16/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [25]: # Other measurements relationship


fig,axes=plt.subplots(4,1,figsize=(15,12))
sns.scatterplot(x='Item_Visibility',y='Item_Outlet_Sales',hue='Item_MRP',ax=axes[
sns.scatterplot(x='Item_Weight',y='Item_Outlet_Sales',hue='Item_MRP',ax=axes[1],d
sns.boxplot(x='Item_Type',y='Item_Outlet_Sales',ax=axes[2],data=kaggle)
sns.boxplot(x='Outlet_Identifier',y='Item_Outlet_Sales',ax=axes[3],data=kaggle)

Out[25]: <matplotlib.axes._subplots.AxesSubplot at 0x16c42130>

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 17/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [26]: fig,axes=plt.subplots(2,2,figsize=(15,12))
sns.boxplot(x='Outlet_Establishment_Year',y='Item_Outlet_Sales',ax=axes[0,0],data
sns.boxplot(x='Outlet_Size',y='Item_Outlet_Sales',ax=axes[0,1],data=kaggle)
sns.boxplot(x='Outlet_Location_Type',y='Item_Outlet_Sales',ax=axes[1,0],data=kagg
sns.boxplot(x='Outlet_Type',y='Item_Outlet_Sales',ax=axes[1,1],data=kaggle)

Out[26]: <matplotlib.axes._subplots.AxesSubplot at 0x1711d5f8>

Drop: Item_Weight and Item_Visibility as our measurements due to the minimum


correctionship with targeted value(Item_Outlet_Sales).

Item_Identifier and Item_Fat_Content are two category.

Chapter 4.3: Training the model

After trim the measurement of model training:

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 18/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

Study feature matrix will be:

Item_Type,
Item_MRP,
Outlet_Identifier,
Outlet_Establishment_Year,
Outlet_Size,
Outlet_Location_Type,
Outlet_Type,
Item_Outlet_Sales, as further processing

In [27]: feature_col=['Item_MRP','Outlet_Type','Outlet_Location_Type','Outlet_Size','Outle

In [28]: fig,axes=plt.subplots(2,2,figsize=(15,12))
sns.boxplot(x='Outlet_Establishment_Year',y='Item_Outlet_Sales',hue='Outlet_Size'
sns.boxplot(x='Outlet_Size',y='Item_Outlet_Sales',hue='Outlet_Size',ax=axes[0,1],
sns.boxplot(x='Outlet_Location_Type',y='Item_Outlet_Sales',hue='Outlet_Size',ax=a
sns.boxplot(x='Outlet_Type',y='Item_Outlet_Sales',hue='Outlet_Size',ax=axes[1,1],

Out[28]: <matplotlib.axes._subplots.AxesSubplot at 0x1728b5c8>

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 19/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [29]: data=kaggle[feature_col]

In [30]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_MRP 8523 non-null category
1 Outlet_Type 8523 non-null category
2 Outlet_Location_Type 8523 non-null category
3 Outlet_Size 6113 non-null category
4 Outlet_Establishment_Year 8523 non-null int64
5 Outlet_Identifier 8523 non-null category
6 Item_Type 8523 non-null category
7 Item_Outlet_Sales 8523 non-null float64
dtypes: category(6), float64(1), int64(1)
memory usage: 184.2 KB

In [31]: fig,axes=plt.subplots(1,1,figsize=(8,6))
sns.boxplot(x='Outlet_Location_Type', y='Item_Outlet_Sales', hue='Outlet_Type', d
Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x173bdf70>

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 20/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [32]: data[data.Outlet_Size.isnull()]

Out[32]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Outlet_

Grocery
3 c Tier 3 NaN 1998
Store

Supermarket
8 b Tier 2 NaN 2002
Type1

Supermarket
9 c Tier 2 NaN 2007
Type1

Supermarket
25 a Tier 2 NaN 2007
Type1

Grocery
28 a Tier 3 NaN 1998
Store

... ... ... ... ... ...

Supermarket
8502 d Tier 2 NaN 2002
Type1

Supermarket
8508 c Tier 2 NaN 2002
Type1

Grocery
8509 d Tier 3 NaN 1998
Store

Supermarket
8514 a Tier 2 NaN 2002
Type1

Supermarket
8519 b Tier 2 NaN 2002
Type1

2410 rows × 8 columns

Observe that when Outlet_Type =Supermarket Type1 and Outlet_Location_Type is


Tier 2 then outlet size is null.

Also, when Outlet_Type = Grocery Store and Outlet_Location_Type is Tier 3 then


outlet size is alwys null.

In [33]: data.groupby('Outlet_Type').get_group('Grocery Store')['Outlet_Location_Type'].va

Out[33]: Tier 3 555


Tier 1 528
Tier 2 0
Name: Outlet_Location_Type, dtype: int64

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 21/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [34]: data.groupby('Outlet_Type').get_group('Grocery Store')


Out[34]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Out

Grocery
3 c Tier 3 NaN 1998
Store

Grocery
23 b Tier 1 Small 1985
Store

Grocery
28 a Tier 3 NaN 1998
Store

Grocery
29 a Tier 1 Small 1985
Store

Grocery
30 a Tier 3 NaN 1998
Store

... ... ... ... ... ...

Grocery
8473 c Tier 3 NaN 1998
Store

Grocery
8480 c Tier 1 Small 1985
Store

Grocery
8486 a Tier 3 NaN 1998
Store

Grocery
8490 c Tier 1 Small 1985
Store

Grocery
8509 d Tier 3 NaN 1998
Store

1083 rows × 8 columns

In [35]: data.groupby(['Outlet_Location_Type','Outlet_Type'])['Outlet_Size'].value_counts(

Out[35]: Outlet_Location_Type Outlet_Type Outlet_Size


Tier 1 Grocery Store Small 528
Supermarket Type1 Medium 930
Small 930
Tier 2 Supermarket Type1 Small 930
Tier 3 Supermarket Type1 High 932
Supermarket Type2 Medium 928
Supermarket Type3 Medium 935
Name: Outlet_Size, dtype: int64

In [36]: (data.Outlet_Identifier=='OUT010').value_counts()

Out[36]: False 7968


True 555
Name: Outlet_Identifier, dtype: int64

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 22/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [37]: data.groupby('Outlet_Size').Outlet_Identifier.value_counts()

Out[37]: Outlet_Size Outlet_Identifier


High OUT013 932
Medium OUT027 935
OUT049 930
OUT018 928
Small OUT035 930
OUT046 930
OUT019 528
Name: Outlet_Identifier, dtype: int64

Observed that:

Tier 1 have small and medium size shop. Tier 2 have small and (missing 1) type
shop. Tier 3 have 2-medium and 1 high and (missing 2) shop

In [38]: def func(x):


if x.Outlet_Identifier == 'OUT010' :
x.Outlet_Size == 'High'
elif x.Outlet_Identifier == 'OUT045' :
x.Outlet_Size == 'Medium'
elif x.Outlet_Identifier == 'OUT017' :
x.Outlet_Size == 'Medium'
elif x.Outlet_Identifier == 'OUT013' :
x.Outlet_Size == 'High'
elif x.Outlet_Identifier == 'OUT046' :
x.Outlet_Size == 'Small'
elif x.Outlet_Identifier == 'OUT035' :
x.Outlet_Size == 'Small'
elif x.Outlet_Identifier == 'OUT019' :
x.Outlet_Size == 'Small'
elif x.Outlet_Identifier == 'OUT027' :
x.Outlet_Size == 'Medium'
elif x.Outlet_Identifier == 'OUT049' :
x.Outlet_Size == 'Medium'
elif x.Outlet_Identifier == 'OUT018' :
x.Outlet_Size == 'Medium'
return(x)

In [39]: data.Outlet_Size=data.apply(func,axis=1)

c:\python\python38-32\lib\site-packages\pandas\core\generic.py:5303: SettingWit
hCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/sta


ble/user_guide/indexing.html#returning-a-view-versus-a-copy (https://pandas.pyd
ata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-c
opy)
self[name] = value

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 23/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [40]: data.head()

Out[40]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Outlet_Ide

Supermarket
0 d Tier 1 d 1999 OU
Type1

Supermarket
1 a Tier 3 a 2009 OU
Type2

Supermarket
2 c Tier 1 c 1999 OU
Type1

Grocery
3 c Tier 3 c 1998 OU
Store

Supermarket
4 a Tier 3 a 1987 OU
Type1

In [41]: # Visualize Item_MRP and Item_Outlet_Sales


sns.boxplot(x='Item_MRP',y='Item_Outlet_Sales', data=data)
Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x179977c0>

In [42]: data[data.Item_MRP=='b'].Item_Outlet_Sales.max()

Out[42]: 7158.6816

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 24/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [43]: data[data.Item_Outlet_Sales==7158.6816]

Out[43]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Outlet_

Supermarket
7737 d Tier 3 d 1985
Type3

Supermarket
7796 b Tier 3 b 1985
Type3

In [44]: data=data.drop(index=7796)
data.groupby('Item_MRP').get_group('b')['Item_Outlet_Sales'].max()
Out[44]: 5582.733

In [45]: # Visualize in boxplot


sns.boxplot(x='Outlet_Type', y='Item_Outlet_Sales', data=data)

Out[45]: <matplotlib.axes._subplots.AxesSubplot at 0x17997bc8>

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 25/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [46]: sns.boxplot(x='Outlet_Location_Type', y='Item_Outlet_Sales', data=data)

Out[46]: <matplotlib.axes._subplots.AxesSubplot at 0x173d6f10>

In [47]: data[data.Outlet_Location_Type=='Tier 1'].Item_Outlet_Sales.max()

Out[47]: 9779.9362

In [48]: data[data['Item_Outlet_Sales']==9779.9362]
Out[48]:
Item_MRP Outlet_Type Outlet_Location_Type Outlet_Size Outlet_Establishment_Year Outlet_

Supermarket
4289 d Tier 1 d 1997
Type1

In [49]: data=data.drop(index=4289)

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 26/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [50]: sns.boxplot(x='Outlet_Size',y='Item_Outlet_Sales', data=data)


Out[50]: <matplotlib.axes._subplots.AxesSubplot at 0x17449d60>

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 27/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [51]: sns.boxplot(x='Outlet_Establishment_Year',y='Item_Outlet_Sales', data=data)

Out[51]: <matplotlib.axes._subplots.AxesSubplot at 0x173a96a0>

In [52]: data.Outlet_Establishment_Year=data.Outlet_Establishment_Year.astype('category')
data_label=data.Item_Outlet_Sales
data_dummy=pd.get_dummies(data.iloc[:,0:6])

In [53]: data_dummy['Item_Outlet_Sales']=data_label

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 28/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [54]: data_dummy.shape

Out[54]: (8521, 35)

Chapter 5: Evaluate some algorithm


Here is what we are going to cover in this step:
Machine learning algorithm
Linear Regression Analysis
Create valid dataset
Test harness
Build modelss
Select the best model

Chapter 5.1 Machine Learning Model

In [55]: from sklearn.model_selection import train_test_split

We will split the loaded dataset into two, 80% of which we will use to train our models and 20%
that we will hold back as a validation dataset.

In [56]: # train test split:


train,test=train_test_split(data_dummy, test_size=0.2,random_state=2019)

In [57]: train.shape , test.shape

Out[57]: ((6816, 35), (1705, 35))

In [58]: train_label=train['Item_Outlet_Sales']
test_label=test['Item_Outlet_Sales']
del train['Item_Outlet_Sales']
del test['Item_Outlet_Sales']

Chapter 5.2 Linear Regression Model

In [59]: from sklearn.linear_model import LinearRegression

In [60]: lr = LinearRegression()

In [61]: lr.fit(train,train_label)

Out[61]: LinearRegression()

In [62]: from sklearn.metrics import mean_squared_error

In [63]: predict_lr=lr.predict(test)

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 29/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [64]: mse=mean_squared_error(test_label, predict_lr)

In [65]: lr.score=np.sqrt(mse)

In [66]: lr.score
Out[66]: 1169.1077432763884

Chapter 5.3 Create Valid Dataset

We will split the loaded dataset into two, 80% of which we will use to train our
models and 20% that we will hold back as a validation dataset.

In [67]: # Split-out validation dataset


array = data_dummy.values
X = array[:,0:6]
Y = array[:,6]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X

Chapter 5.4 Test Harness

We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all
combinations of train-test splits.

In [68]: # Test options and evaluation metric


# https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
seed = 9
scoring = 'accuracy'

Chapter 5.5 Build Models

Let’s evaluate 6 different algorithms:

Logistic Regression (LR)


Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN).
Classification and Regression Trees (CART).
Gaussian Naive Bayes (NB).

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 30/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

Support Vector Machines (SVM).

Here, we use a good mixture of simple linear (LR and LDA), nonlinear (KNN, CART, NB and SVM)
algorithms. We reset the random number seed before each run to ensure that the evaluation of
each algorithm is performed using exactly the same data splits. It ensures the results are directly
comparable.

In [69]: # Split-out validation dataset


array = data_dummy.values
X = array[:,0:6]
Y = array[:,6]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 31/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

In [70]: # Spot Check Algorithms


models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=seed)
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfol
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)

c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(

LR: 0.894956 (0.008420)


LDA: 0.890847 (0.014584)

c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(
c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(

KNN: 0.890258 (0.009133)


CART: 0.887913 (0.009095)
NB: 0.890847 (0.014584)

c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(
c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(
c:\python\python38-32\lib\site-packages\sklearn\model_selection\_split.py:293:
FutureWarning: Setting a random_state has no effect since shuffle is False. Thi
s will raise an error in 0.24. You should leave random_state to its default (No
ne), or set shuffle=True.
warnings.warn(
localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 32/35
5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

SVM: 0.889086 (0.010341)

Chapter 5.6 Selction Best Models

In this case, we can see that it looks like Logistic Regression (LR) has the largest
estimated accuracy score.

In [71]: # Compare Algorithms


fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Chapter 6: Make prediction

From above, the Logistic Regression ( LR) algorithm is very simple and was an accurate model
based on our tests. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to
keep a validation set just in case you made a slip during training, such as overfitting to the training
set or a data leak. Both will result in an overly optimistic result.

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 33/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

We can run the Logistic Regression model directly on the validation set and summarize the results
as a final accuracy score, a confusion matrix and a classification report.

In [72]: # Make predictions on validation dataset


LR = LogisticRegression()
LR.fit(X_train, Y_train)
predictions = LR.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

0.8821114369501466
[[1476 35]
[ 166 28]]
precision recall f1-score support

0.0 0.90 0.98 0.94 1511


1.0 0.44 0.14 0.22 194

accuracy 0.88 1705


macro avg 0.67 0.56 0.58 1705
weighted avg 0.85 0.88 0.85 1705

We can see that the accuracy is 88.2%. Y validation is 1476. Finally, the classification report
provides a breakdown of each class by precision, recall, f1-score and support showing excellent
results.

Chapter 7: Summary and Insights

In this project excersie, i learn to use a step-by-step on machine learning project using Python.

Based on the study, sales predict problem can be examined by using machine learning
metholodgies across different alogrithms.

Future work is to learn different ways of statistic models on used case and figure out what to use,
how to use and why to use.

References:
https://www.kaggle.com/devashish0507/big-mart-sales-prediction
(https://www.kaggle.com/devashish0507/big-mart-sales-prediction)
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
(https://machinelearningmastery.com/machine-learning-in-python-step-by-step/)
https://seaborn.pydata.org/tutorial/distributions.html
(https://seaborn.pydata.org/tutorial/distributions.html)
https://matplotlib.org/3.1.0/gallery/subplots_axes_and_figures/subplots_demo.html
(https://matplotlib.org/3.1.0/gallery/subplots_axes_and_figures/subplots_demo.html)

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 34/35


5/25/2020 Project4-Kaggle-Final report - Jupyter Notebook

localhost:8888/notebooks/Project4-Kaggle-Final report.ipynb 35/35