Davide Notaristefano - Microsoft Professional Program Certificate in Data Science - Capstone Project

Analysis of Customer’s Average Month Spend
Davide Notaristefano, April 2017
OVERVIEW
This document presents an analysis of data concerning customers and their characteristics. The analysis is
based upon two datasets: the first contains demographic data, the second includes the customers’ likelihood
to purchase a bike and their average month spend.
After exploring the data by calculating summary and descriptive statistics and by creating visualizations,
several potential relationships between customer characteristics and both the likelihood to purchase a bike
and the average month spend were identified. Two predictive models were then created: the first to classify
whether a customer will purchase a bike, the second to predict the average month spend from the
customer's features.
After performing the analysis, the author presents the following conclusions. While many factors can help
indicate the average month spend of a customer, significant features found in this analysis were:
• Country
• Education
• Occupation
• Gender
• Marital Status
• Being a home owner or not
• The number of children (both total and living at home)
• Age
INITIAL DATA EXPLORATION
The initial exploration of the data began with some summary and descriptive statistics. Before proceeding,
records with a duplicated customerID were removed from each dataset and the two datasets were then joined.
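The cleaning step can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the toy records are invented, the key name CustomerID is assumed, and keeping the first occurrence of each duplicate is an assumption, since the text does not say which copy was retained.

```python
def drop_duplicate_customers(rows, key="CustomerID"):
    """Keep the first row seen for each key value (assumption: first wins)."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


def inner_join(left, right, key="CustomerID"):
    """Join two lists of dicts on a shared key, keeping only matching keys."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# Hypothetical toy records, not taken from the real datasets.
demographics = [
    {"CustomerID": 1, "Occupation": "Clerical"},
    {"CustomerID": 1, "Occupation": "Clerical"},  # duplicated customer
    {"CustomerID": 2, "Occupation": "Manual"},
]
sales = [
    {"CustomerID": 1, "BikeBuyer": 1, "AvgMonthSpend": 52.1},
    {"CustomerID": 2, "BikeBuyer": 0, "AvgMonthSpend": 47.3},
]

customers = inner_join(
    drop_duplicate_customers(demographics),
    drop_duplicate_customers(sales),
)
```

Each customer then appears once, with demographic and purchase fields side by side.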
Individual Feature Statistics
Summary statistics for minimum, maximum, mean, median, standard deviation, and distinct count were
calculated for numeric columns, and the results taken from 18,355 observations are shown here:
Feature               Min      Max       Median   Mean        Std Dev     DCount
HomeOwnerFlag         0        1         1        0.61        0.49        2
NumberCarsOwned       0        5         1        1.27        0.91        6
NumberChildrenAtHome  0        3         0        0.34        0.57        4
TotalChildren         0        3         0        0.85        0.93        4
YearlyIncome          25,435   139,115   61,851   72,758.95   30,687.66   15,355
Age                   17       87        34       35.43       11.25       70
AvgMonthSpend         44.10    65.29     51.42    51.77       3.44        1,803
BikeBuyer             0        1         1        0.55        0.50        2
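As an illustration of how the six statistics in each row can be obtained, here is a small sketch using only the Python standard library. The input values are invented, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or the population formula was applied.

```python
import statistics


def summarize(values):
    """Return the summary statistics reported for each numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std_dev": statistics.stdev(values),
        "distinct": len(set(values)),
    }


# Hypothetical toy column, not the real NumberCarsOwned data.
cars_owned = [0, 1, 1, 2, 5]
summary = summarize(cars_owned)
```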
Since AvgMonthSpend is the target of the analysis, it was noted
that its mean and median are similar and its variance is fairly
limited. The histogram shows that the distribution is slightly
right-skewed: in other words, most customers sit towards the lower
end of the range, with a thinner tail of higher monthly spend.
Another interesting feature emerges when analyzing the YearlyIncome column. The histogram shows that
YearlyIncome is divided into five groups, which were then binned into the following ranges:
25,435-37,374, 50,869-62,806, 76,294-88,226, 101,730-113,674 and 127,166-139,115. Further analysis shows
that the mean gap between the five groups is 13,494.5 and that the distribution of those groups exactly
reflects the distribution of AvgMonthSpend by Occupation.
Finally, the column Age was calculated. The distribution is right-skewed, indicating that most customers are
young. Furthermore, it was decided to group age into variable-width bins, as shown in the second histogram.
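The Age derivation and binning can be sketched as below. This is only an illustration: Age is computed as completed years between BirthDate and LastUpdated, and the bin edges and labels are hypothetical, because the exact bin boundaries used in the analysis are not listed.

```python
from bisect import bisect_left
from datetime import date


def age_at(birth_date, reference_date):
    """Completed years between birth_date and reference_date."""
    years = reference_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


# Hypothetical upper bounds (inclusive) and labels for variable-width bins.
AGE_EDGES = (25, 30, 40, 50, 65)
AGE_LABELS = ("<=25", "26-30", "31-40", "41-50", "51-65", "66+")


def age_group(age):
    """Map an age in years to the label of the bin that contains it."""
    return AGE_LABELS[bisect_left(AGE_EDGES, age)]


# Hypothetical customer: born 15 June 1980, record last updated 1 April 2017.
example_age = age_at(date(1980, 6, 15), date(2017, 4, 1))
example_group = age_group(example_age)
```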
In addition to numeric values, the customer observations include categorical features, including:
• Name: first and last names and, where known, title (Mr, Mrs, Ms, Sr), middle name and suffix (Jr – in
just three cases)
• Address: the customers’ home address
• City: the location of the customer
• State/Province: a subdivision of Country (5 States in Australia, 3 in Canada, 16 in France, 6 in
Germany, 1 in the UK and 23 in the USA)
• Country/Region: Australia, Canada, France, Germany, United Kingdom and the USA
• Postal code for the customer’s address
• Phone number
• BirthDate
• Education: the maximum level of education achieved by the customer (Partial High School, High
School, Partial College, Bachelors, Graduate Degree)
• Occupation: the type of job in which the customer is employed (Manual, Skilled Manual, Clerical,
Management, Professional)
• Gender: Male or Female
• MaritalStatus: Married or Single
• LastUpdated: the date when the customer record was last modified (used to calculate Age)
Bar charts were created to show the frequency of these features and indicate the following:
• The United States is the most represented country, followed by Australia; the three European countries
and Canada are less represented
• There is a significant variation between States in each Country: California accounts for 56.76% of the US
total (followed by Washington at 29% and by Oregon at 13.75%), British Columbia accounts for 99.76%
of Canada, South Australia and Tasmania are less frequent among Australian customers, and France and
Germany are fairly evenly distributed
• Bachelors is the most common education level; Graduate Degree comes second, followed by
Partial College and, very closely, High School
• Skilled Manual is the most frequent occupation, followed by Clerical and Manual
• Customers are evenly distributed between males and females
• There are more married customers than singles
• About six out of ten customers own a house
• The vast majority of customers own at least one car
• Half of the customers have at least one child, and half of those who have one child have that child at
home
• Almost three quarters of those who have three children do not have any of them at home
• Slightly more customers have bought a bike (55%)
It was decided to make the columns NumberCarsOwned, NumberChildrenAtHome and TotalChildren
categorical by grouping their numeric counts into categories with less complexity: “Zero, One, Two-or-more”
for NumberCarsOwned and “Zero, One-or-more” for the other two.
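The grouping described above amounts to two small mappings, sketched here in Python. The function names are hypothetical; the category labels are the ones given in the text.

```python
def cars_category(count):
    """Group NumberCarsOwned into Zero, One or Two-or-more."""
    if count == 0:
        return "Zero"
    if count == 1:
        return "One"
    return "Two-or-more"


def children_category(count):
    """Group NumberChildrenAtHome or TotalChildren into Zero or One-or-more."""
    return "Zero" if count == 0 else "One-or-more"


# NumberCarsOwned ranges from 0 to 5 in the data.
car_labels = [cars_category(n) for n in range(6)]
```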
CORRELATION AND APPARENT RELATIONSHIPS
After exploring the individual features, an attempt was made to identify relationships between features in the
data, in particular between BikeBuyer and the categorical variables and between AvgMonthSpend and the
categorical variables.
Apparent correlation with BikeBuyer
The following bar charts were generated to compare the likelihood of a customer purchasing a bike with
categorical variables. The key features are shown below.
About 55% of customers in each country purchase a bike. The same proportion is reflected among
States/Provinces, except for California and Washington, where the difference is slightly more evident.
Customers with a medium-high level of education are more likely to buy a bike than those with a partial high
school education; High School is the only category where yes and no are almost equal. The difference is more
evident when comparing BikeBuyer and Occupation: “yes” is far higher in every occupation except Manual,
where the proportion is reversed. Finally, customers over 26 years old are more frequent buyers (especially in
the 30-40 class); the opposite holds among those under 25. Note that the classes 20-25 and 26-30 have almost
the same weight, but in opposite directions: 58% “no” and 57% “yes”, respectively.
Males are more likely to buy a bike, while females divide equally between buyers and non-buyers. Almost two
married customers out of three purchase a bike, while 55% of singles do not. A similar proportion is
found among home owners: 64% of them buy a bike, while 59% of non-home-owners do not.
There is a high variation by car ownership: customers who do not own a car are generally bike buyers
(although the proportion is not as high as expected), single-car owners are less frequent bike buyers, and
the proportion is completely different among those who own at least two cars (almost 75% of them buy
a bike). Comparing the likelihood of buying a bike with the number of children, customers without
children (in total or at home) tend not to buy a bike, while those with at least one child are more frequent
bike buyers (especially when children are still at home).
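The proportions quoted above are shares of bike buyers within each level of a categorical feature. A minimal sketch of that calculation follows; the function name and the toy rows are hypothetical and do not reproduce the real figures.

```python
from collections import defaultdict


def buyer_share(rows, feature, target="BikeBuyer"):
    """Share of rows with target == 1 within each level of a categorical feature."""
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for row in rows:
        totals[row[feature]] += 1
        buyers[row[feature]] += row[target]  # BikeBuyer is coded 0 or 1
    return {level: buyers[level] / totals[level] for level in totals}


# Hypothetical toy rows, not the real customer data.
rows = [
    {"Occupation": "Manual", "BikeBuyer": 0},
    {"Occupation": "Manual", "BikeBuyer": 1},
    {"Occupation": "Clerical", "BikeBuyer": 1},
]
shares = buyer_share(rows, "Occupation")
```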
Apparent correlation with AvgMonthSpend
After exploring the relationships between BikeBuyer and the categorical variables, an attempt was made to
discern any apparent relationship between categorical features and AvgMonthSpend. The following box plots
show the categorical columns that seem to exhibit a relationship with AvgMonthSpend.
There are slight differences in terms of median and range of AvgMonthSpend for different categorical
features:
• Distribution of AvgMonthSpend across countries is similar in terms of mean, median and range
• Medium-high educated customers have a similar distribution; lower educated customers have a
narrower range
• Customers aged 30 to 50 have a much wider range than the other age
categories
• Although the range of AvgMonthSpend is wider for males than for females, mean and median are only
slightly different
• There is no clear difference between married and single customers as well as between those who
don’t have any children and those who have at least one
• Home owners tend to spend more as well as those who own at least two cars
• AvgMonthSpend varies a lot across occupation (the same distribution is found across levels of income,
thus this category will not be considered)
Special relationships
Apparent relationships between AvgMonthSpend and individual features are helpful in determining predictive
heuristics, but relationships are often more complex. To take multiple variables into consideration together,
some faceted plots were created.
The following plots show how AvgMonthSpend varies across age and level of income. Customers aged 30-50
have a wider range, while customers over 66 years old are more sparsely spread across the average month
spend. Clustering by income level is evident in the second plot, where it is clear that the majority of the
average month spend values are located in the lower level of each group.
CLASSIFICATION OF CUSTOMERS
Based on the analysis of the customers’ likelihood of purchasing a bike, a predictive model was created to
classify customers into two categories: 1 (customer buys a bike) and 0 (customer does not buy a bike).
Data were sampled and partitioned evenly into folds, hyperparameters were tuned with a random grid sweep,
and the model was created using the Two-Class Boosted Decision Trees algorithm. This led to the following
results:
• True Positives: 8618
• True Negatives: 6300
• False Positives: 1928
• False Negatives: 1509
The Receiver Operating Characteristic (ROC) curve for the model is shown here, with the blue line
indicating the model’s performance at varying classification threshold values, and the diagonal line
showing the expected results of a random guess.
The confusion matrix counts above translate into the following standard performance metrics for
classification:
• Accuracy: 81.3%
• Precision: 81.7%
• Recall: 85.1%
• F1 Score: 83.4%
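The four metrics follow directly from the confusion matrix counts reported above. The short sketch below recomputes them with the standard definitions; the function name is hypothetical, while the counts are the ones given in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted buyers who really buy
    recall = tp / (tp + fn)      # share of real buyers the model finds
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts reported for the bike-buyer classifier.
metrics = classification_metrics(tp=8618, tn=6300, fp=1928, fn=1509)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Rounded to one decimal place, the results agree with the figures listed above (81.3%, 81.7%, 85.1% and 83.4%).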
REGRESSION
After creating a classification model to predict the likelihood of purchasing a bike, a regression model to
predict the average month spend of customers was created. Based on the apparent relationships identified
when analyzing the data, a boosted decision tree regression model was chosen.
The model was trained on an even sample partitioned into folds, which led to an RMSE of 1.876529.
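For reference, RMSE is the square root of the mean squared difference between actual and predicted values, expressed in the same units as AvgMonthSpend. The sketch below shows the calculation on invented values; it does not reproduce the reported figure, which depends on the real predictions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical AvgMonthSpend values and predictions.
actual = [50.0, 52.0, 55.0]
predicted = [51.0, 52.0, 53.0]
error = rmse(actual, predicted)
```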
CONCLUSION
This analysis has shown that the average month spend of a customer can be predicted from their
characteristics. In particular, country of residence, education, occupation, gender, marital status, owning a
home, age and the number of children have an effect on the average month spend of a customer.
