
BI Mini Project

Report Should contain:


1. Problem Definition (it should be 1 to 2 paragraphs; describe the dataset that
you have considered):
➢ Suppose you own a supermarket mall and, through membership cards, you
have some basic data about your customers: Customer ID, age, gender
(stored in the dataset column 'Genre'), annual income and spending score.
The spending score is a value you assign to each customer based on
parameters you define, such as customer behavior and purchasing data.
Problem Statement: As the mall owner, you want to understand which
customers can be converted most easily (the target customers), so that this
insight can be passed to the marketing team, which can then plan its
strategy accordingly.

2. Identifying which data mining task is needed & Why?


➢ The task is clustering: the customers have no predefined group labels,
so the groups must be discovered from the data. We implemented
agglomerative hierarchical clustering because it outputs a hierarchy,
i.e., a structure that is more informative than the unstructured set of
flat clusters returned by k-means. This makes it easier to decide on the
number of clusters by looking at the dendrogram.
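
As a minimal sketch of how a dendrogram can suggest a cluster count, the example below builds a Ward linkage and cuts the tree at the largest gap between successive merge heights. The data are synthetic points around three made-up centers (an assumption for illustration, not the mall dataset), and the "largest gap" rule is only one common heuristic, not necessarily the rule used in this report.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in data (assumption): three well-separated 2-D groups
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Ward linkage; column 2 of the linkage matrix holds the merge heights
Z = linkage(points, method="ward")
merge_heights = Z[:, 2]

# Heuristic: cut the tree in the middle of the largest jump in merge height
gaps = np.diff(merge_heights)
idx = int(np.argmax(gaps))
threshold = (merge_heights[idx] + merge_heights[idx + 1]) / 2

# Flat cluster labels obtained by cutting the dendrogram at that height
labels = fcluster(Z, t=threshold, criterion="distance")
n_clusters = len(np.unique(labels))
print("suggested number of clusters:", n_clusters)
```

On real data the gaps are rarely this clear, so the suggested count is best treated as a starting point and compared against the plotted dendrogram.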

3. Implement the data mining algorithm of your choice (Python). Describe it
(you can show a flowchart of the process, attach a screenshot of the code):

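➢ Python code:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the data; columns 3 and 4 are annual income and spending score
dataset = pd.read_csv('Mall_Customers.csv')
x = dataset.iloc[:, [3,4]].values

# Number of customers per gender (dataset column 'Genre')
plt.figure(1 , figsize = (15 , 5))
sns.countplot(y = 'Genre' , data = dataset)
plt.show()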
# Pairwise regression plots of age, annual income and spending score
plt.figure(1 , figsize = (15 , 7))
n = 0
# Loop variables are named x_col / y_col so they do not overwrite the feature array x
for x_col in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    for y_col in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
        n += 1
        plt.subplot(3 , 3 , n)
        plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
        sns.regplot(x = x_col , y = y_col , data = dataset)
        plt.ylabel(y_col.split()[0]+' '+y_col.split()[1] if len(y_col.split()) > 1 else y_col)
plt.show()

# Distribution of each numeric column, split by gender
plt.figure(1 , figsize = (15 , 7))
n = 0
for cols in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1 , 3 , n)
    plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
    sns.violinplot(x = cols , y = 'Genre' , data = dataset , palette = 'vlag')
    # sns.swarmplot(x = cols , y = 'Genre' , data = dataset)
    plt.ylabel('Gender' if n == 1 else '')
    plt.title('Violin plots' if n == 2 else '')
plt.show()

# Using a dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(x, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show()

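# Fitting hierarchical clustering to the dataset
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the 'affinity' parameter was renamed to 'metric' and removed in scikit-learn 1.4)
hc = AgglomerativeClustering(n_clusters = 5, linkage = 'ward')
y_hc = hc.fit_predict(x)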

# Visualizing the clusters
plt.scatter(x[y_hc == 0, 0], x[y_hc == 0, 1], s = 100, c = 'red', label = 'Careful')
plt.scatter(x[y_hc == 1, 0], x[y_hc == 1, 1], s = 100, c = 'blue', label = 'Standard')
plt.scatter(x[y_hc == 2, 0], x[y_hc == 2, 1], s = 100, c = 'green', label = 'Targets')
plt.scatter(x[y_hc == 3, 0], x[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Careless')
plt.scatter(x[y_hc == 4, 0], x[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Sensible')
plt.title('Clusters of Clients')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
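
The legend labels in the plot above ('Careful', 'Targets', and so on) attach a meaning to each cluster index, but the indices returned by fit_predict are arbitrary. One hedged way to check such a mapping is to tabulate the mean income and mean spending score per cluster. The sketch below does this on synthetic data: the five centers, the noise level, and the rule "highest mean income plus mean score" for picking a target cluster are all assumptions for illustration, not results from Mall_Customers.csv.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

INCOME = "Annual Income (k$)"
SCORE = "Spending Score (1-100)"

# Synthetic stand-in for the mall data (assumption): five groups of 20 customers
rng = np.random.default_rng(42)
centers = [(25, 20), (25, 80), (55, 50), (90, 15), (90, 85)]
rows = [
    (inc + rng.normal(scale=3), score + rng.normal(scale=3))
    for inc, score in centers
    for _ in range(20)
]
df = pd.DataFrame(rows, columns=[INCOME, SCORE])

# Same model family as in the report: Ward agglomerative clustering, 5 clusters
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
df["cluster"] = model.fit_predict(df[[INCOME, SCORE]].to_numpy())

# Per-cluster profile: size, mean income, mean spending score
profile = df.groupby("cluster").agg(
    size=(INCOME, "size"),
    mean_income=(INCOME, "mean"),
    mean_score=(SCORE, "mean"),
)
print(profile.round(1))

# Illustrative rule (assumption): the cluster with the highest income + score
target_cluster = (profile["mean_income"] + profile["mean_score"]).idxmax()
print("candidate target cluster index:", target_cluster)
```

With the real dataset, the same table can be built from dataset and y_hc, and the label assigned to each color in the scatter plot can then be compared against the cluster means.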

4. Interpret & visualize the result (different graphs you can show /
RapidMiner output you can attach; it is mandatory):


5. Provide clearly the BI decision that is to be taken as a result of mining:
➢ We implemented agglomerative hierarchical clustering because it outputs
a hierarchy, i.e., a structure that is more informative than the unstructured set
of flat clusters returned by k-means. This makes it easier to decide on the
number of clusters by looking at the dendrogram.
