
Summer Sale of a Retail Store

A Statistical View into Summer Sales


Table of Contents
Project Overview
Confidentiality
Data Source
Problem Statement
1. Decision Making Using Statistical Methods
1.1 Sale Summary (first month of summer season)
1.2 Determining probability of event using Poisson Distribution
1.3 Sampling Distribution & Central Limit Theorem
1.3 Using the Central Limit Theorem
2. Exploratory data analysis and data visualization
2.1 Apriori (Association Rule Analysis)
2.2 Association rule
3. Conclusion

Project Overview

A retail store wants to provide personalized offers to customers on different products, so it needs
to understand customer purchase behavior (specifically, the purchase amount) across products of
different categories. The store has shared a purchase summary of various customers for selected
high-volume products from the summer season. We need to provide a statistical view of the
purchase amount across products, which will help the store announce summer sales offers for
customers on different products.

The objective of this project is to create a statistical view of the customer datasets in order to
analyze customer sales, so that marketing strategies can be built to increase the net volume of sales.
Managers often use sales analysis reports to identify market opportunities and areas where they could
increase volume. For instance, a customer may show a history of increased sales during certain
periods, and this data can be used to ask for additional business during these peak periods. A sales
analysis report shows a company's actual sales for a specified period that managers feel is significant.

Confidentiality

To analyze the summer sales of specific products, one needs inventory data for the complete supply
chain. Because of the amount of data required to study a full supply chain, it is practically
impossible to collect and organize the data of the complete background system without access to a
background database. We can therefore spend our time collecting data for the specific foreground
system, while using the store database for the background.

In rare and exceptional situations, an activity dataset that includes confidential information may be
kept inaccessible as a unit process dataset while still being included in calculations of accumulated
system datasets. Confidential datasets are subject to the same data quality guidelines as any other
dataset and are reviewed internally on a unit process level, but the review procedure is performed
under the direct management of the data warehouse team.

Most of the few confidential datasets are from older versions of the database, and in general we
aim to replace such data with transparent unit process data as soon as possible.

Data Source

Data is the foundation for any analysis project. The second stage of project implementation is
complex and involves data collection, selection, preprocessing, and transformation. Each of these
phases can be split into several steps.

Data collection: Since we need to predict the sales and purchase behavior of customers, we need the
sales data of the last summer season together with product and customer details. The sales
transaction file is very large, so we extract only one month of data, because the store wants to make
its decision at the start of the summer season.

We collected sales transactions from different servers, such as the online web server, the POS server
and other front-end servers, and then filtered the columns to reduce the size of the dataset.
We collected the customer and product information tables from the DB2 server, joined all three
tables (customer, product and sale), and selected only the required columns from the final dataset.

Problem Statement

Our approach is to create statistical models based on the customer datasets to predict customer
sales, so that marketing strategies can be built to increase the net volume of sales. Managers often
use sales analysis reports to identify market opportunities and areas where they could increase
volume. For instance, a customer may show a history of increased sales during certain periods. This
data can be used to ask for additional business during these peak periods. A sales analysis report
shows a company's actual sales for a specified period that managers feel is significant.

1. Decision Making Using Statistical Methods


1.1 Sale Summary (first month of summer season)

We obtained the data for the first month of last year's summer season because most discounts are
applied at the start of the season; focusing on that month also reduces the number of sales
transactions to process. Last year, the stores recorded 550068 sales transactions in that one
month.

Straightforward descriptive statistics were computed on the purchase amount. The standard
deviation of $5023 around the mean of $9263 shows a wide dispersion in the data set.

This is further shown by the range of $23949 (min $12 to max $23961). It also indicates that the
purchase amount of one product division does not follow the same pattern as that of another
division, even though both may be in the same store. For example, Women division sales of $13600
and Kids division sales of $11300 come from the same retail store, yet they do not share the same
pattern in terms of spending on acquiring a user.

It is a representation of how valuable an acquired user is to the business.

Statistics of purchase amount in store

Statistic                 Value
Number of transactions    550068
Max                       23961
Min                       12
Median                    8047
Mean                      9263.989
Variance                  25231186
Standard Deviation        5023.058
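As a minimal sketch of how such a summary could be computed, the Python snippet below derives the same statistics from a list of purchase amounts. The input list is hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption, since the report does not state which definition was used.

```python
import statistics

def describe(amounts):
    """Return the summary statistics from the table for a list of purchase amounts."""
    return {
        "count": len(amounts),
        "max": max(amounts),
        "min": min(amounts),
        "median": statistics.median(amounts),
        "mean": statistics.fmean(amounts),
        # Sample variance (n - 1 denominator); an assumption, because the
        # report does not say which variance definition was applied.
        "variance": statistics.variance(amounts),
        "std_dev": statistics.stdev(amounts),
        "range": max(amounts) - min(amounts),
    }

# Hypothetical purchase amounts, for illustration only.
sample = [12, 5000, 8047, 12000, 23961]
for name, value in describe(sample).items():
    print(f"{name:>8}: {value}")
```

The range entry is simply max minus min, which is how the $23949 figure follows from the $12 minimum and the $23961 maximum.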

1.2 Determining probability of event using Poisson Distribution

Use the Poisson formula to evaluate whether it is financially viable to keep a store open 24 hours a
day. Let N ~ P(λ) [Poisson with mean λ] model the number of customers, and let S_i ~ N(μ, σ²)
[normal with parameters μ and σ²] model the amount the i-th customer will spend. Also, assume
that the S_i are independent.

If N = n, then the total sales S ~ N(nμ, nσ²). It follows from the law of total probability that the
pdf of S is given by

f(s; λ, μ, σ) = Σ_{n=0}^{∞} N(s; nμ, nσ²) P(n; λ)

[where N(s; ·) is a normal pdf evaluated at s and P(n; ·) is a Poisson probability evaluated at n;
note also that the point s = 0 has positive probability equal to P(0; λ)].

The expression for f(s; λ, μ, σ) can be used to forecast sales. Also, if λ is large,
f(s; λ, μ, σ) ≈ N(s; λμ, λ[σ² + μ²]).

Calculate the average number of sales made by the store during the extra overnight shift (the period
from midnight to 8 A.M.) using the distribution formula, then calculate the probable lowest number
of sales that might be made during the overnight shift.
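The mixture density above can be evaluated numerically. The following Python sketch is one possible way to do so: it truncates the infinite sum at a chosen n_max and also returns the mean and standard deviation of the large-λ normal approximation. The function names, the truncation point, and the value λ = 40 overnight customers are assumptions made for illustration; only μ and σ are taken from the purchase statistics.

```python
import math

def poisson_pmf(n, lam):
    """P(N = n) for N ~ Poisson(lam), computed in log space to avoid overflow."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def total_sales_pdf(s, lam, mu, sigma, n_max=200):
    """Continuous part of f(s; lam, mu, sigma), truncated at n_max terms.

    The n = 0 term is a point mass at s = 0 with probability exp(-lam),
    so the sum starts at n = 1.
    """
    return sum(
        normal_pdf(s, n * mu, n * sigma ** 2) * poisson_pmf(n, lam)
        for n in range(1, n_max + 1)
    )

def large_lambda_approx(lam, mu, sigma):
    """Mean and standard deviation of the normal approximation for large lam."""
    return lam * mu, math.sqrt(lam * (sigma ** 2 + mu ** 2))

# mu and sigma come from the purchase statistics; lam = 40 overnight
# customers is a hypothetical value chosen only for illustration.
mean_s, sd_s = large_lambda_approx(lam=40, mu=9263.989, sigma=5023.058)
print(f"approximate overnight sales: mean {mean_s:.0f}, sd {sd_s:.0f}")
```

Because the n = 0 term carries probability exp(-λ) at s = 0, the continuous part integrates to 1 - exp(-λ) rather than to 1.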

• Counts of sales transactions occurring randomly in time
• X = number of transactions

An online sale attracts thousands of potential buyers, but each one makes a purchase with only a
very small probability. The ISP (Internet service provider) knows that on average 5 transactions
will be made in one minute.

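Let X = number of transactions made in a minute, so that X follows a Poisson distribution with
mean λ = 5.

 r    Probability P(X = r)    Cumulative Probability P(X ≤ r)
 0    0.0067                  0.0067
 1    0.0337                  0.0404
 2    0.0842                  0.1247
 3    0.1404                  0.2650
 4    0.1755                  0.4405
 5    0.1755                  0.6160
 6    0.1462                  0.7622
 7    0.1044                  0.8666
 8    0.0653                  0.9319
 9    0.0363                  0.9682
10    0.0181                  0.9863

The probabilities P(X = r) sum to 1 over all r ≥ 0; the rows shown (r ≤ 10) account for 0.9863.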
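The table for λ = 5 can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names are our own.

```python
import math

def poisson_pmf(r, lam):
    """P(X = r) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** r / math.factorial(r)

def poisson_table(lam, r_max):
    """Return (r, P(X = r), P(X <= r)) for r = 0..r_max."""
    rows, cumulative = [], 0.0
    for r in range(r_max + 1):
        p = poisson_pmf(r, lam)
        cumulative += p
        rows.append((r, p, cumulative))
    return rows

for r, p, c in poisson_table(lam=5, r_max=10):
    print(f"{r:>2}  {p:.4f}  {c:.4f}")
```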
1.3 Sampling Distribution & Central Limit Theorem

We demonstrate a sampling distribution using the available click-through-rate transaction data.
Each sample had a size of 20. By repeatedly drawing samples over 16,415 repetitions, we obtained
the distribution of the sample mean of the sales transaction value.

Note: Ideally, samples are drawn from the full population. However, our entire study in this
project is based on a sample group. To validate and show our understanding of the transactions,
we drew samples from our existing list. To increase the number of samples, we drew samples of
size 20 over multiple repetitions (16,415) using the INDEX function in Excel.

If the samples had been drawn from a larger population, the normal curve would be smoother.

The sample mean followed a normal curve.
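The resampling procedure can be sketched in Python as follows. The population here is a synthetic, right-skewed stand-in (not the store data), and sampling with replacement is an assumption about how the Excel INDEX approach picked rows.

```python
import random
import statistics

def sample_means(population, sample_size, repetitions, seed=0):
    """Draw `repetitions` samples (with replacement) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(repetitions)
    ]

# Hypothetical right-skewed population standing in for the transaction values.
_rng = random.Random(42)
population = [_rng.expovariate(1 / 9000) for _ in range(20_000)]

means = sample_means(population, sample_size=20, repetitions=16_415)
print("population mean:     ", round(statistics.fmean(population), 1))
print("mean of sample means:", round(statistics.fmean(means), 1))
print("sd of sample means:  ", round(statistics.stdev(means), 1))
print("sigma / sqrt(n):     ", round(statistics.pstdev(population) / 20 ** 0.5, 1))
```

Even though the stand-in population is skewed, the mean of the sample means should sit close to the population mean, and their spread close to σ/√n.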


1.3 Using the Central Limit Theorem

From the histogram above, we can see that when the sample size is 50, the sampling distribution of
the sample mean is fairly close to normal. Based on the Central Limit Theorem, the sampling
distribution of the sample mean is roughly N(μ, σ/√n), that is, normal with mean μ and standard
deviation σ/√n, where μ = 1499.69 is the population mean and σ = 505.5089 is the population SD.
The values of μ and σ were found at the beginning.
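As a worked example with the values quoted above, the standard error σ/√n for n = 50 and the central 95% band of the sample mean can be computed as below. This is only a sketch of the CLT approximation; the 95% level is our own choice for illustration.

```python
import math
from statistics import NormalDist

mu, sigma, n = 1499.69, 505.5089, 50   # values quoted in the text
standard_error = sigma / math.sqrt(n)

# Sampling distribution of the mean under the CLT approximation.
sampling_dist = NormalDist(mu=mu, sigma=standard_error)
low, high = sampling_dist.inv_cdf(0.025), sampling_dist.inv_cdf(0.975)

print(f"standard error: {standard_error:.2f}")
print(f"central 95% of sample means: {low:.2f} to {high:.2f}")
```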

2. Exploratory data analysis and data visualization


The goal is to show that the given data can support these use cases. The relevant columns are
useful for predicting or drawing conclusions only if they are correlated.

Looking at the store's sales transaction counts, it appears that far fewer females attended the
summer sale. However, it could also mean that fewer females paid for the products themselves and
that their spouses may have paid for them.

As we can see, there are considerably more males than females shopping at our store. This gender
split metric is helpful to retailers because some may want to adapt their store layout, product
selection, and other variables to the gender proportion of their shoppers.

A study published in the Clothing and Textiles Research Journal states:

"Involvement, variety seeking, and physical environment of stores were selected as antecedents of
shopping experience satisfaction. The structural model for female subjects confirmed the existence
of the mediating role of hedonic shopping value in shopping satisfaction, whereas the model for male
respondents did not." Chang, E., Burns, L. D., & Francis, S. K. (2004) (Abstract)

Although this does not give direct insight into recommended actions for retail stores, it does display
a difference in the value derived from shopping and its relationship to gender, which should be taken
into account by retailers.

To investigate further, let's compute the average spending amount by gender. For easy
interpretation and traceability, we will create separate tables and then join them together.
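A minimal Python sketch of this "separate tables, then join" step is shown below. The records and the field names User_ID, Gender, and Purchase are hypothetical placeholders for the store's columns.

```python
from collections import defaultdict

# Hypothetical records; the field names are assumptions about the dataset.
transactions = [
    {"User_ID": 1, "Gender": "M", "Purchase": 8000},
    {"User_ID": 2, "Gender": "F", "Purchase": 9500},
    {"User_ID": 1, "Gender": "M", "Purchase": 12000},
    {"User_ID": 3, "Gender": "F", "Purchase": 7000},
    {"User_ID": 4, "Gender": "M", "Purchase": 10000},
]

def totals_and_counts(rows):
    """Build two separate tables: total purchase and transaction count per gender."""
    total_by_gender = defaultdict(float)
    count_by_gender = defaultdict(int)
    for row in rows:
        total_by_gender[row["Gender"]] += row["Purchase"]
        count_by_gender[row["Gender"]] += 1
    return dict(total_by_gender), dict(count_by_gender)

def join_average(total_by_gender, count_by_gender):
    """Join the two tables on gender and add the average purchase amount."""
    return {
        gender: {
            "total": total_by_gender[gender],
            "count": count_by_gender[gender],
            "average": total_by_gender[gender] / count_by_gender[gender],
        }
        for gender in total_by_gender
    }

totals, counts = totals_and_counts(transactions)
for gender, stats in join_average(totals, counts).items():
    print(gender, stats)
```

Keeping the totals and counts as separate tables makes it easy to trace each average back to its inputs.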

The transaction count plot by age suggests that the majority of customers who attended the sale
were in the 26-35 age group.

It seems as though younger people (18-25 and 26-35) account for the highest number of purchases
of the best-selling product. Let's compare this observation to the overall dataset.

Next, we check which gender was the majority within each age group by adding a hue. As seen
below, more males than females spent in the sale.

We could check further: how many of these males were actually married? For this, let's create a
column that represents gender and marital status and then use it as the hue.
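The combined column can be sketched in Python as below. The field names Age, Gender, and Marital_Status, and the 0/1 coding of marital status, are assumptions about the dataset; the records are hypothetical.

```python
from collections import Counter

def gender_marital_label(row):
    """Combine gender and marital status into one label, e.g. 'M_Married'."""
    # Assumes Marital_Status is coded 1 for married and 0 for single.
    status = "Married" if row["Marital_Status"] == 1 else "Single"
    return f"{row['Gender']}_{status}"

# Hypothetical shoppers, for illustration only.
shoppers = [
    {"Age": "26-35", "Gender": "M", "Marital_Status": 1},
    {"Age": "26-35", "Gender": "M", "Marital_Status": 0},
    {"Age": "26-35", "Gender": "F", "Marital_Status": 1},
    {"Age": "46-50", "Gender": "M", "Marital_Status": 1},
    {"Age": "0-17", "Gender": "F", "Marital_Status": 0},
]

# Count shoppers per (age group, combined label); these counts are what a
# grouped bar chart with the combined column as hue would display.
counts = Counter((row["Age"], gender_marital_label(row)) for row in shoppers)
for (age, label), count in sorted(counts.items()):
    print(age, label, count)
```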

As we see above, there are no bars for married customers in the 0-17 range, which makes sense. If
we look at the groups aged 46 and above, there are very few females. On the other hand, married
males paying in the 46-55 range are also comparatively more numerous than married females. So it
could also imply that although women do shop a lot, their spouses are possibly paying for it, and
hence the data reflects that men shopped more. If we had more categorical data defining what kind
of products were purchased by men, we could examine this statement further. However, since in
this dataset we do not know whether any category denotes feminine products or clothes, we cannot
explore this case further.

Even the plots below do not provide any hint of whether some products are particularly being
purchased by either females or married males.

It looks like User ID 1004277 is our top spender. Let's use summary() to see other facets of our
total customer spending data.

We can see a mean total purchase amount of 851752, a maximum of 10536783, a minimum of
44108, and a median of 512612.
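A per-customer summary of this kind could be produced along the following lines. This is a sketch with hypothetical (user, amount) pairs; the report itself used summary(), and the helper names here are our own.

```python
import statistics
from collections import defaultdict

def total_spend_per_user(rows):
    """Sum purchase amounts per user from (user_id, amount) pairs."""
    totals = defaultdict(float)
    for user_id, amount in rows:
        totals[user_id] += amount
    return dict(totals)

def summarize(totals):
    """Min, median, mean, and max of total spend, plus the top spender's ID."""
    values = list(totals.values())
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "max": max(values),
        "top_spender": max(totals, key=totals.get),
    }

# Hypothetical purchases, for illustration only.
purchases = [(101, 500), (102, 300), (101, 2500), (103, 400), (104, 200)]
summary = summarize(total_spend_per_user(purchases))
print(summary)
# A mean well above the median is a first hint of right skew.
print("mean > median:", summary["mean"] > summary["median"])
```

In the figures reported above, the mean (851752) is well above the median (512612), which already hints at a right-skewed distribution of total spend.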

Let's plot a chart showing the distribution of purchase amounts to see whether purchases are
normally distributed or skewed. A density plot shows where the highest concentration of similar
purchase amounts lies relative to the entire customer base. It is important to note that density
charts estimate the probability density of values from the input data and then draw a smooth
curve over those values.

Here we see a strongly right-skewed (positively skewed) density plot with a long tail. This means
that quite a few values sit well above the mean and that the purchase amounts are not normally
distributed. We see that the largest density of purchases is around the 250000 mark.

2.1 Apriori (Association Rule Analysis)


Now let's use an algorithm called Apriori to derive some association rules regarding customer
purchases. We will be using the arules package.

Before we begin, let's elaborate on the idea of Association Rule Learning. In its simplest form,
Association Rule Learning attempts to predict customer transactions. In other words, the algorithm
addresses the problem "People who bought X also bought Y." This can prove extremely useful for
retailers who aim to optimize product placement in stores and promotional campaigns.

In the case of our store, an effective product placement strategy can help optimize sales of
products that are normally bought together. For example, let's say that our store was to have a sale
on TVs. It would be smart to place HDMI cables alongside these TVs because those items are usually
purchased together. On the other hand, it may also prove smart to place them far apart, so that
customers need to walk through the entire store while searching for their desired item and another
product may catch their eye along the way.

The Apriori algorithm finds itemsets that frequently occur together and uses them to estimate the
likelihood that someone performs/purchases/watches something, given knowledge about their prior
actions.

To begin, let's import the libraries we will be using for this section, if not done so already.

The "element (itemset/transaction) length distribution" gives us the distribution of the number of
items in a customer's (User) basket, and underneath it we can see more information, including the
quartiles and the mean. In this case, we see a mean of 92.41, which means that on average each
customer purchased 92.41 items. Since we are aware of a few customers who purchased over 1000
items, it may be more useful to use the median value of 54.00 items instead, because the mean can
be heavily affected by outlier values.

To get a clearer picture of the items, let's create an item frequency plot, which is included in the
arules package.

2.2 Association rule

Our first step will be to set our parameters. The first parameters we will set are the support and
the confidence. The support value is derived from the frequency of a specific item within the
dataset. When we set our support value, we are setting the minimum share of transactions in which
an itemset must appear for rules about it to be considered.

Support: Our support value will be the minimum number of transactions necessary divided by the
total number of transactions.

• As described by summary (customers Products), we have a total of 5892 unique customer
transactions.

• From our dataset, let's assume that we want to choose a product which was purchased by at
least 50 different customers.

• With these two values established, we can compute the support value with simple division:
(50/5892) = .008486083
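The support threshold and the support of an individual itemset can be expressed in a few lines of Python. This is only a sketch of the arithmetic above; the function names and the toy baskets are hypothetical and independent of any association-rule library.

```python
def min_support(min_customers, total_transactions):
    """Minimum support as a fraction of all transactions."""
    return min_customers / total_transactions

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Threshold from the text: at least 50 of the 5892 customer transactions.
threshold = min_support(50, 5892)
print(f"minimum support: {threshold:.9f}")

# Hypothetical baskets, for illustration only.
baskets = [{"TV", "HDMI"}, {"TV"}, {"HDMI", "Soundbar"}, {"TV", "HDMI"}]
print("support of {TV, HDMI}:", support({"TV", "HDMI"}, baskets))
```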

The second parameter we will take into consideration is the confidence. The confidence value
measures how often a rule is found to be true, that is, the share of transactions containing the
antecedent that also contain the consequent. The minimum confidence we set is therefore a lower
limit on the strength of any rule returned.

• The default confidence value in the apriori() function is 0.80 or 80%, so we can begin with
that number and then adjust the parameters until we obtain applicable results.

Confidence: We can determine our confidence value by first starting with the default value and
adjusting accordingly.

• With more domain knowledge, and with Product IDs referencing items with recognizable
names, the confidence value can easily be changed to see different, and more relevant,
results.

• In our case, we will start with a value and then lower the confidence to see different rules.
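To illustrate how lowering the confidence threshold changes the rules that are returned, the Python sketch below enumerates single-item rules (A => B) on a handful of hypothetical baskets. It is a brute-force illustration of support and confidence filtering, not the apriori() implementation used in the analysis, and all names in it are our own.

```python
from itertools import permutations

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def pair_rules(baskets, min_support, min_confidence):
    """All rules A => B between single items that pass both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in permutations(items, 2):
        if support({a, b}, baskets) < min_support:
            continue
        conf = confidence({a}, {b}, baskets)
        if conf >= min_confidence:
            rules.append((a, b, conf))
    return rules

# Hypothetical baskets, for illustration only.
baskets = [
    {"TV", "HDMI"},
    {"TV", "HDMI", "Soundbar"},
    {"TV"},
    {"HDMI", "Soundbar"},
    {"TV", "HDMI"},
]

for threshold in (0.8, 0.7):
    print(f"min confidence {threshold}:")
    for a, b, conf in pair_rules(baskets, min_support=0.3, min_confidence=threshold):
        print(f"  {a} => {b} (confidence {conf:.2f})")
```

With these toy baskets, only one rule passes at a confidence of 0.8, while lowering the threshold to 0.7 also admits the rules between TV and HDMI.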
3. Conclusion

Overall, we have made some insightful discoveries from our analysis of this summer sale dataset.
We saw how customers at our store were distributed across multiple categorical classifications such
as Gender, Age, Occupation, Stay in Current City, etc. We also determined who our top purchasing
customers were and classified products into "best sellers" and "worst sellers." In addition, we
identified various metrics regarding purchases, including the average amount spent by customers
and the total purchase amount across multiple categories.

After our analysis, we dove into the world of Association Rule Learning and identified some
association rules for our store. We discovered multiple situations where customers who purchased a
certain set of items were over 75% likely to purchase another item, given a set of inputs.
