k-means Clustering for Customer Segmentation: A Practical Example

August 13, 2016
Customer segmentation is a deceptively simple-sounding concept. Broadly speaking, the goal is to divide customers into groups that share certain characteristics. There are an almost infinite number of characteristics upon which you could divide customers, however, and the optimal characteristics and analytic approach vary depending upon the business objective. This means that there is no single, correct way to perform customer segmentation.
Customer segmentation is often performed using unsupervised clustering techniques (e.g., k-means, latent class analysis, hierarchical clustering, etc.), but customer segmentation results tend to be most actionable for a business when the segments can be linked to something concrete (e.g., customer lifetime value, product proclivities, channel preference, etc.). This begs the question: if you’re looking to link the segments to some sort of dependent variable, why not use an analytic technique that explicitly estimates the relationship between your possible predictors and the dependent variable?
library(XLConnect)
data <- readWorksheet(loadWorkbook("Online Retail.xlsx"), sheet=1)

range(data$InvoiceDate)
data <- subset(data, InvoiceDate >= "2010-12-09")
range(data$InvoiceDate)

table(data$Country)
data <- subset(data, Country == "United Kingdom")

length(unique(data$InvoiceNo))
length(unique(data$CustomerID))

# Identify returns (credit invoices are prefixed with "C")
data$item.return <- grepl("C", data$InvoiceNo, fixed=TRUE)
data$purchase.invoice <- ifelse(data$item.return, 0, 1)
RFM Variables
customers <- as.data.frame(unique(data$CustomerID))
names(customers) <- "CustomerID"

###########
# Recency #
###########

data$recency <- as.Date("2011-12-10") - as.Date(data$InvoiceDate)

# Days since each customer's most recent purchase (purchases only)
recency <- aggregate(recency ~ CustomerID, data=subset(data, purchase.invoice == 1), FUN=min)
customers <- merge(customers, recency, by="CustomerID", all=TRUE, sort=TRUE)
remove(recency)

customers$recency <- as.numeric(customers$recency)
#############
# Frequency #
#############

customer.invoices <- subset(data, select = c("CustomerID", "InvoiceNo", "purchase.invoice"))
customer.invoices <- customer.invoices[!duplicated(customer.invoices), ]
customer.invoices <- customer.invoices[order(customer.invoices$CustomerID), ]
row.names(customer.invoices) <- NULL

# Number of invoices/year (purchases only)
annual.invoices <- aggregate(purchase.invoice ~ CustomerID, data=customer.invoices, FUN=sum, na.rm=TRUE)
names(annual.invoices)[names(annual.invoices)=="purchase.invoice"] <- "frequency"

# Add # of invoices to customers data
customers <- merge(customers, annual.invoices, by="CustomerID", all=TRUE, sort=TRUE)
remove(customer.invoices, annual.invoices)

range(customers$frequency)
table(customers$frequency)
################################
# Monetary Value of Customers #
################################
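The ordering and 80/20 code below relies on a `customers$monetary` column that has to be built from line-item sales first. A minimal sketch of that step, assuming the dataset's `Quantity` and `UnitPrice` columns (standard in the UCI Online Retail data); the toy `data` and `customers` frames here stand in for the ones built above:

```r
# Toy stand-in rows; in the post, `data` and `customers` come from the steps above
data <- data.frame(CustomerID = c(12346, 12346, 12347),
                   Quantity   = c(2, -1, 3),
                   UnitPrice  = c(5, 5, 10))
customers <- data.frame(CustomerID = c(12346, 12347))

# Total spent per line item
data$Amount <- data$Quantity * data$UnitPrice

# Aggregate to annual sales per customer
annual.sales <- aggregate(Amount ~ CustomerID, data = data, FUN = sum, na.rm = TRUE)
names(annual.sales)[names(annual.sales) == "Amount"] <- "monetary"

# Add monetary value to the customers dataset
customers <- merge(customers, annual.sales, by = "CustomerID", all = TRUE, sort = TRUE)
remove(annual.sales)

# Returns can drive a customer's net value below zero; floor at zero
customers$monetary <- ifelse(customers$monetary < 0, 0, customers$monetary)
```

Applied to the full retail data, the same statements yield the annual monetary value per customer used in the plots and clustering below.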
80/20 Rule
customers <- customers[order(-customers$monetary),]

# Flag the top 20% of customers (Pareto designation, used in the plots below)
customers$pareto <- ifelse(cumsum(customers$monetary) <= 0.8 * sum(customers$monetary), "Top 20%", "Bottom 80%")
customers$pareto <- factor(customers$pareto, levels=c("Top 20%", "Bottom 80%"), ordered=TRUE)

customers <- customers[order(customers$CustomerID),]
Preprocess data
# Log-transform positively-skewed variables
customers$recency.log <- log(customers$recency)
customers$frequency.log <- log(customers$frequency)
customers$monetary.log <- customers$monetary + 0.1 # can't take log(0), so add a small value to remove zeros
customers$monetary.log <- log(customers$monetary.log)

# Z-scores
customers$recency.z <- scale(customers$recency.log, center=TRUE, scale=TRUE)
customers$frequency.z <- scale(customers$frequency.log, center=TRUE, scale=TRUE)
customers$monetary.z <- scale(customers$monetary.log, center=TRUE, scale=TRUE)
Visualize data
library(ggplot2)
library(scales)
# Original scale
scatter.1 <- ggplot(customers, aes(x = frequency, y = monetary))
scatter.1 <- scatter.1 + geom_point(aes(colour = recency, shape = pareto))
scatter.1 <- scatter.1 + scale_shape_manual(name = "80/20 Designation", values=c(17, 16))
scatter.1 <- scatter.1 + scale_colour_gradient(name="Recency\n(Days since Last Purchase)")
scatter.1 <- scatter.1 + scale_y_continuous(label=dollar)
scatter.1 <- scatter.1 + xlab("Frequency (Number of Purchases)")
scatter.1 <- scatter.1 + ylab("Monetary Value of Customer (Annual Sales)")
scatter.1
This first graph uses the variables’ original metrics and is almost completely uninterpretable. There’s a clump of data points in the lower left-hand corner of the plot, and then a few outliers. This is why we log-transformed the input variables.
# Log-transformed
scatter.2 <- ggplot(customers, aes(x = frequency.log, y = monetary.log))
scatter.2 <- scatter.2 + geom_point(aes(colour = recency.log, shape = pareto))
scatter.2 <- scatter.2 + scale_shape_manual(name = "80/20 Designation", values=c(17, 16))
scatter.2 <- scatter.2 + scale_colour_gradient(name="Log-transformed Recency")
scatter.2 <- scatter.2 + xlab("Log-transformed Frequency")
scatter.2 <- scatter.2 + ylab("Log-transformed Monetary Value of Customer")
scatter.2
Ah, this is better. Now we can see a scattering of high-value, high-frequency customers in the top, right-hand corner of the graph. These data points are dark, indicating that they’ve purchased something recently. In the bottom, left-hand corner of the plot, we can see a couple of low-value, low-frequency customers who haven’t purchased anything recently, with a range of values in between.
# Scaled variables
scatter.3 <- ggplot(customers, aes(x = frequency.z, y = monetary.z))
scatter.3 <- scatter.3 + geom_point(aes(colour = recency.z, shape = pareto))
scatter.3 <- scatter.3 + scale_shape_manual(name = "80/20 Designation", values=c(17, 16))
scatter.3 <- scatter.3 + scale_colour_gradient(name="Z-scored Recency")
scatter.3 <- scatter.3 + xlab("Z-scored Frequency")
scatter.3 <- scatter.3 + ylab("Z-scored Monetary Value of Customer")
scatter.3

remove(scatter.1, scatter.2, scatter.3)
# Select the z-scored RFM variables by name (more robust than positional indexing)
preprocessed <- customers[, c("recency.z", "frequency.z", "monetary.z")]

j <- 10 # specify the maximum number of clusters you want to try out

models <- data.frame(k=integer(),
                     tot.withinss=numeric(),
                     betweenss=numeric(),
                     totss=numeric(),
                     rsquared=numeric())
for (k in 1:j) {

  print(k)

  # Run kmeans
  # nstart = number of initial configurations; the best one is used
  # $iter will return the iteration used for the final model
  output <- kmeans(preprocessed, centers = k, nstart = 20)

  # Store cluster membership as a factor so it can be graphed and summarized
  var.name <- paste("cluster", k, sep="_")
  customers[,(var.name)] <- factor(output$cluster, levels = 1:k)

  # Graph clusters
  cluster_graph <- ggplot(customers, aes(x = frequency.log, y = monetary.log))
  cluster_graph <- cluster_graph + geom_point(aes(colour = customers[,(var.name)]))
  colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')
  cluster_graph <- cluster_graph + scale_colour_manual(name = "Cluster Group", values=colors)
  cluster_graph <- cluster_graph + xlab("Log-transformed Frequency")
  cluster_graph <- cluster_graph + ylab("Log-transformed Monetary Value of Customer")
  title <- paste("k-means Solution with", k, sep=" ")
  title <- paste(title, "Clusters", sep=" ")
  cluster_graph <- cluster_graph + ggtitle(title)
  print(cluster_graph)

  # Cluster centers in original metrics
  library(plyr)
  print(title)
  cluster_centers <- ddply(customers, .(customers[,(var.name)]), summarize,
                           monetary=round(median(monetary), 2), # use median b/c this is the raw, heavily-skewed data
                           frequency=round(median(frequency), 1),
                           recency=round(median(recency), 0))
  names(cluster_centers)[names(cluster_centers)=="customers[, (var.name)]"] <- "Cluster"
  print(cluster_centers)
  cat("\n")
  cat("\n")

  # Collect model fit statistics for this value of k
  models[k,] <- c(k, output$tot.withinss, output$betweenss, output$totss,
                  round(output$betweenss / output$totss, 3))

  remove(output, var.name, cluster_graph, cluster_centers, title, colors)

}

remove(k)
A 2-cluster solution produces one group of high-value (median = $1,797.78), high-frequency (median = 5 purchases) customers who have purchased recently (median = 17 days since their most recent purchase), and one group of lower-value (median = $327.50), low-frequency (median = 1 purchase) customers for whom it's been a median of 96 days since their last purchase. Although these two clusters are clear and interpretable, this may be simplifying customer behavior too much.
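The fit statistics collected in `models` can be graphed to compare solutions; the `r2_graph` and `ss_graph` names match the script's `remove()` cleanup call. A minimal sketch, using toy `models` values in place of the ones filled in by the loop:

```r
library(ggplot2)

# Toy stand-in for the `models` data frame filled in by the kmeans loop
models <- data.frame(k = 1:10,
                     tot.withinss = 3000 * exp(-0.4 * (1:10)),
                     rsquared = 1 - exp(-0.5 * (1:10)))

# Scree/elbow plot: total within-cluster sum of squares by number of clusters
ss_graph <- ggplot(models, aes(x = k, y = tot.withinss)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 1:10) +
  xlab("k (Number of Clusters)") +
  ylab("Total Within-Cluster Sum of Squares")

# Variance explained (between-SS / total-SS) by number of clusters
r2_graph <- ggplot(models, aes(x = k, y = rsquared)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = scales::percent) +
  xlab("k (Number of Clusters)") +
  ylab("Variance Explained")

print(ss_graph)
print(r2_graph)
```

With the real `models` frame, the "elbow" in `ss_graph` and the point of diminishing returns in `r2_graph` suggest candidate values of k.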
remove(j, r2_graph, ss_graph)
#########################################################
# Using NbClust metrics to determine number of clusters #
#########################################################

library(NbClust)
set.seed(1)

nc <- NbClust(preprocessed, min.nc=2, max.nc=7, method="kmeans")
table(nc$Best.n[1,])
nc$All.index # estimates for each number of clusters on 26 different metrics of model fit

barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters",
        ylab="Number of Criteria",
        main="Number of Clusters Chosen by Criteria")

remove(preprocessed)
The greatest number of indices recommends the 2-cluster solution.
Three-Dimensional Representation of Clusters
This code will give you a 3D rendering of your clusters, which is helpful if you’re working with three input variables, as an RFM analysis often is. The first graph features planes through each cluster that help the user understand the monetary value associated with the cluster. The second features ellipses that highlight the location of each cluster in three-dimensional space. In both graphs, the data points are color-coded to correspond to their associated cluster.
colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')

library(car)
library(rgl)

scatter3d(x = customers$frequency.log,
          y = customers$monetary.log,
          z = customers$recency.log,
          groups = customers$cluster_5,
          xlab = "Frequency (Log-transformed)",
          ylab = "Monetary Value (log-transformed)",
          zlab = "Recency (Log-transformed)",
          surface.col = colors,
          axis.scales = FALSE,
          surface = TRUE, # produces the horizontal planes through the graph at each level of monetary value
          fit = "smooth",
          # ellipsoid = TRUE, # to graph ellipses, use this argument and comment out "surface = TRUE"
          grid = TRUE,
          axis.col = c("black", "black", "black"))

remove(colors)
Comments (13)
Claude Uwimana · 2 years ago

Hi Kimberly Coffey, I ran into this problem when following along; I installed XLConnectJars but it still shows the error below. I need your help. Best regards

> source('~/.active-rstudio-document')
Loading required package: XLConnectJars
Error: package or namespace load failed for ‘XLConnectJars’:
 .onLoad failed in loadNamespace() for 'rJava', details:
  call: fun(libname, pkgname)
  error: JAVA_HOME cannot be determined from the Registry
Error: package ‘XLConnectJars’ could not be loaded
dn · 2 years ago

Great post! Curious to know why you preferred clustering vs. giving the RFM score of 555, 111, etc.
Apoorva · 2 years ago

This is awesome. Thanks for sharing!
Prinu Dickson · 3 years ago

Kimberly Coffey · 3 years ago

Thanks for letting me know, Prinu! I'm so glad you enjoyed the post. :-)
Kimberly Coffey · 3 years ago

Hi Courtland - I'm so glad you enjoyed the post, and thank you for your comment!
Jeanna · 4 years ago

Kim,

This is a wonderful analysis, and I enjoyed reading it. You've done an excellent job producing actionable results that any business would be blessed to have available as they do their strategic planning for the future. This is the real meat of the project:

"As we add clusters, we gain insight into increasingly subtle distinctions between customers. I really like the 5-cluster solution. It gives us: a high-value, high-frequency, recent purchase group (cluster 5), a medium-value, medium-frequency, relatively-recent purchase group (cluster 2), two clusters of low-value, low-frequency customers broken down by whether their last purchase was recent or much earlier in the year (clusters 3 and 1, respectively), and lastly a no-value cluster whose median value to the business is $0.00 (cluster 4)."
Kimberly Coffey · 3 years ago

Hi Jeanna - I don't think I ever thanked you for your lovely comment. I'm so glad you enjoyed the post!