
Segmenting consumers on bath soap

BIDM Assignment 2

Section B, Group 7
Kuldeep Das, PGP26282
Nitul Das, PGP26105
Amit Roykaran, PGP26196

Contents
Introduction
Understanding the business problem & objectives
  Business Objectives
  Data mining objectives
Data preparation (done in Excel file)
Clustering Analysis Using SAS
  Clustering based on Demographics
  Clustering based on purchase behaviour
  Clustering based on Purchase Basis
  Clustering based on Purchase behaviour + Purchase basis
Question 2

Introduction
CRISA is an Asian market research agency that specializes in tracking consumer purchase behaviour in consumer goods. CRISA has recorded the consumption patterns of households, which were selected using stratified sampling techniques. The data captured by CRISA contains the following information:
- Demographics of the households (updated annually)
- Possession of durable goods (this data is used to calculate the affluence index)
- Purchase data of product categories and brands (updated monthly)

In this project, we have used k-means clustering to identify clusters based on the following parameters:
- Purchase behaviour (volume, frequency, susceptibility to discounts, and brand loyalty)
- Basis of purchase (price, selling proposition)

We then combined the two sets of variables to obtain a segmentation based on both purchase behaviour and basis of purchase.

Understanding the business problem & objectives


Business Objectives
The data needs to be analyzed by segmenting the households into clusters based on criteria other than demographics. Customers display different levels of brand loyalty depending on price, choice criteria, promotions, affluence, social and economic status, etc. If we can segment the customers on the important variables given in the data set, we can target each segment more specifically with customized branding and promotional campaigns. Hence, the business objective is to form segments of customers that show similar purchase behaviour and respond similarly to a given selling proposition or promotional campaign, so that each segment can be targeted individually for branding and promotional activities.

Data mining objectives


To divide the households into clusters or segments based on:
- Purchase behaviour (volume, frequency, susceptibility to discounts, and brand loyalty)
- Basis of purchase (price, selling proposition)
- Variables that describe both purchase behaviour and basis of purchase

To find the best segmentation of these clusters, also using the demographic variables in combination with the above variables. The number of promotional campaigns that can be run is capped at 5. Hence, an ideal clustering should have no more than 5 clusters.

Data preparation (done in Excel file)


Note: The transformed columns have been highlighted in red in the Excel file.

The given data has many missing values, or values that do not represent any particular category. Hence, these values were imputed.

a) Imputing missing values
- Many values of the sex variable are 0; these were converted to female
- FEH = 1, assuming the majority of the population is vegetarian
- MT = 5 (Hindi speaking)
- Number of people in the household = 5
- EDU = 5 (12th standard, as it is the majority value in the data)
- CS = 1 (the majority of the data points have this value)

b) Derivation of the Brand Loyalty Index
The brand loyalty index is a measure of 3 criteria (ceteris paribus, the volume of transactions):
- No. of brands
- Brand runs
- Volume of purchases attributed to each brand

Each of these criteria is normalized (to between 0 and 1) so as to remove the bias of higher numeric values for a given criterion.

(i) No. of brands
As the number of brands increases, the probability of switching between brands increases; hence the lower the number of brands, the better. We therefore assign a lower score to rows that have a low number of brands, indicating better brand loyalty.

(ii) Brand runs
The lower the number of brand runs, the better. A higher number of brand runs increases the probability of having brand runs for multiple brands, and therefore indicates more switching behaviour. Hence we assign a lower score to rows that have a lower number of brand runs.

(iii) Volume of purchases attributed to each brand
The higher the share of purchases going to a given brand, the better, and hence we attribute a lower score to a higher share. The score for this criterion is worked out as follows:
- We find the maximum % volume attributed to any one of the given brands
- We assign the score to this variable from the table below (note that the score increases as the % volume decreases; this ensures that a lower brand loyalty index is better, consistent with the other 2 criteria)

% volume of purchase to a given brand    Score
100%                                     0.0
90%                                      0.1
80%                                      0.2
70%                                      0.3
60%                                      0.4
50%                                      0.5
40%                                      0.6
30%                                      0.7
20%                                      0.8
10%                                      0.9
0%                                       1.0

The final score for the brand loyalty index is therefore a linear combination of the three criteria mentioned above, with different weights assigned to indicate relative importance. Volume of purchase is given low importance, as the following example illustrates: a customer who buys a brand on fewer occasions, but in bulk quantities each time, is less loyal than a customer who buys the brand on more occasions, but in smaller quantities.

Brand Loyalty Index = 0.4 * No. of brands_score + 0.4 * Brand runs_score + 0.2 * Volume of purchase attributed to a given brand_score

The lower the brand loyalty index, the better it is, that is, the more brand loyal the household.
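The weighting above can be sketched in a few lines of Python. This is only an illustration, not the Excel workbook used in the assignment: the three households are made-up values, min-max scaling is assumed as the normalization method (the report only says the criteria are normalized to between 0 and 1), and shares that fall between the 10% steps of the score table are assumed to be scored linearly as 1 minus the share.

```python
def min_max(values):
    """Scale a list of numbers to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def volume_score(max_share):
    """Score for the largest volume share held by one brand.

    Matches the score table at its listed points (100% -> 0.0, 0% -> 1.0);
    the linear rule in between is an assumption.
    """
    return 1.0 - max_share


def brand_loyalty_index(brands_score, runs_score, vol_score):
    """Weighted sum from the report: lower means more brand loyal."""
    return 0.4 * brands_score + 0.4 * runs_score + 0.2 * vol_score


# Hypothetical households: (no. of brands, brand runs, max volume share)
households = [(1, 2, 0.9), (3, 10, 0.5), (5, 18, 0.3)]

brands = min_max([h[0] for h in households])
runs = min_max([h[1] for h in households])
index = [
    round(brand_loyalty_index(b, r, volume_score(h[2])), 3)
    for b, r, h in zip(brands, runs, households)
]
print(index)  # [0.02, 0.5, 0.94]
```

The first household buys almost everything from a single brand and gets the lowest (best) index, while the third spreads its purchases over five brands and gets the highest.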

Clustering Analysis Using SAS


The process flow diagram is shown below

Figure 1. SAS Flow diagram

Clustering based on Demographics


All the demographic variables were used in the clustering; in all there are 10 variables. The diagrams below show the cluster and segment plots.

Figure 2: Segment and Cluster Plot, Demographic clustering

Figure 3: Variable Importance, Demographic clustering

As we can see, Affluence_Index is the most important variable among the demographic variables.

Figure 4: Mean statistics of the generated clusters, Demographic clustering

Figure 5: Segment Plot of the generated clusters, Demographic clustering

Figure 6: Segment Profile of the generated clusters, Demographic clustering

Segment    Comment on Affluence Index
1          Little less than average
2          Very low
3          Very high
4          Little higher than average
5          Average

Clustering based on purchase behaviour

Figure 7: Segment size, clustering based on purchase behaviour

Figure 8: Variable importance, clustering based on purchase behaviour

As we can see, Total Volume, No_of_trans, and Brand Loyal are the important variables.

Figure 9: Segment plot, clustering based on purchase behaviour

Figure 10: Mean statistics of generated clusters, Purchase behaviour

Figure 11: Segment profile, Purchase Behaviour

Clustering based on Purchase Basis


Since there are many variables included in the price categories (Pr_Cat_1 to Pr_Cat_4) and selling propositions (PropCat_5 to PropCat_15), the clustering tool produces a large number of clusters (> 15) when we run it. Hence we manually limited the number of clusters to 4, 5 and 6 in turn, and concluded that 5 clusters is the best choice. (The diagrams below illustrate that cluster size 5 gives the best distribution.)

Figure 12: Cluster proximities, Cluster size 5

Figure 13: Cluster proximities, Cluster size 4

Figure 14: Cluster proximities, Cluster size 6

We proceed with Cluster size 5
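The comparison of cluster sizes 4, 5 and 6 was done in SAS. As a rough, tool-independent illustration of the same idea, the sketch below runs a plain k-means (Lloyd's algorithm) for each candidate number of clusters and prints the resulting segment sizes. The data points are hypothetical two-dimensional values generated for the example; in the report the inputs would be the Pr_Cat and PropCat variables, and the function names here are ours, not SAS names.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def k_means(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centroids[c]
            )
        if updated == centroids:  # assignments have stabilized
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical data: 250 points scattered around five made-up centres.
rng = random.Random(1)
centres = [(0, 0), (5, 0), (0, 5), (5, 5), (10, 2)]
points = [
    (rng.gauss(cx, 1), rng.gauss(cy, 1))
    for cx, cy in centres
    for _ in range(50)
]

for k in (4, 5, 6):
    centroids, labels = k_means(points, k)
    sizes = [labels.count(c) for c in range(k)]
    print(k, sizes)
```

Very uneven segment sizes for a given k suggest that the clustering is splitting or merging natural groups, which is the kind of evidence the cluster proximity and segment size plots provide.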

Figure 15: Segment size plot, Purchase Basis

Figure 16: Variable importance, Purchase Basis

As we can see, Pr_Cat_2 is the most important variable.

Figure 17: Mean statistics, Purchase Basis

Figure 18: Segment plot, Purchase Basis

Figure 19: Segment profile, Purchase Basis

Segment    Comment on Pr_Cat_2 variable distribution in the cluster
1          Lowest of all
2          Less than average
3          Significantly higher than average
4          Less than average
5          Higher than average

Clustering based on Purchase behaviour + Purchase basis

Figure 20: Segment size, both purchase basis + purchase behaviour

Figure 21: Cluster proximities, both purchase basis + purchase behaviour

Figure 22: Variable importance, both purchase basis + purchase behaviour

Figure 23: Segment plot, both purchase basis + purchase behaviour

Figure 24: Mean Statistics, both purchase basis + purchase behaviour

Figure 25: Segment profile, both purchase basis + purchase behaviour

As can be seen from the variable importance table, the purchase behaviour variables dominate the purchase basis variables.

Question 2
To identify the best segmentation basis out of the 3 profiles (purchase behaviour, basis for purchase, and both basis for purchase and purchase behaviour), we look at the distance between the clusters. The following plots show the distance between the clusters in the 3 profiles used:

Basis of Purchase

Demographic

Both basis of purchase and purchase behaviour

Hence, we can see that the combination of basis for purchase and purchase behaviour gives the highest degree of separation between the clusters, and is therefore the best segmentation criterion. Based on the segment profile of this segmentation basis, the segments have the following characteristics:

Segment    Key characteristics
1          Less than average volume purchase, least brand loyalty, less than average price category 1
2          More than average volume purchase, less than average brand loyalty, more than average price category 1
3          Average volume purchase, average brand loyalty, average price category 1
4          Lowest volume purchase, highest brand loyalty, highest price category 1
5          Highest volume purchase, more than average brand loyalty, lowest price category 1
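The separation comparison used above to choose the segmentation basis was read off the SAS cluster distance plots. One simple way to put a number on it, sketched here under our own assumptions, is to compute the pairwise distances between cluster centroids for each basis and compare the minimum and the mean. The centroid coordinates below are hypothetical, and such a comparison is only meaningful if all bases use variables on a comparable (for example, standardized) scale.

```python
from itertools import combinations
from math import dist


def centroid_separation(centroids):
    """Return (minimum, mean) pairwise Euclidean distance between centroids."""
    distances = [dist(a, b) for a, b in combinations(centroids, 2)]
    return min(distances), sum(distances) / len(distances)


# Hypothetical centroids for two segmentation bases (illustrative values only).
bases = {
    "basis of purchase": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "behaviour + basis": [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
}

for name, centroids in bases.items():
    smallest, mean = centroid_separation(centroids)
    print(f"{name}: min={smallest:.2f}, mean={mean:.2f}")

# Pick the basis whose centroids are, on average, furthest apart.
best = max(bases, key=lambda name: centroid_separation(bases[name])[1])
print("best separated:", best)
```

Using the mean distance as the selection rule is a design choice for this sketch; the minimum distance is a stricter alternative, since it flags any pair of segments that sit close together.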