
Unsupervised learning

The goal of unsupervised learning is to model patterns that are hidden in the data. For

example, in our retail dataset there may be groups of customers with particular behaviours,

e.g. customers that use the shop for expensive items, customers that use the shop only with a

small budget, customers that use the website only in some periods of the year, and so on. With

unsupervised learning we can discover these kinds of patterns and summarise them.

The analysis that allows us to discover and consolidate patterns is called unsupervised because

we do not know what groups there are in the data or the group membership of any individual

observation. In this case, we say that the data is unlabelled. The most common unsupervised

learning method is clustering, where patterns are discovered by grouping samples.

K-means clustering is a method for finding clusters and cluster centres in a set of unlabelled

data. Intuitively, we might think of a cluster as comprising a group of data points whose inter-

point distances are small compared with the distances to points outside of the cluster. Given

an initial set of K centres, the K-means algorithm alternates between two steps:

1. for each centre we identify the subset of training points (its cluster) that is closer to it than

any other centre;

2. the mean of each feature for the data points in each cluster is computed, and the

corresponding vector of means becomes the new centre for that cluster.

These two steps are iterated until the centres no longer move or the assignments no longer

change. Then, a new point x can be assigned to the cluster of the closest prototype.
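The two alternating steps can be sketched directly in NumPy (a minimal illustration of the algorithm, not the sklearn implementation; the function name `kmeans_sketch` is ours, and empty-cluster handling is omitted for brevity):

```python
import numpy as np

def kmeans_sketch(X, K, n_iter=100, seed=0):
    """Minimal K-Means: alternate the assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # initialise the K centres with K distinct training points
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # step 1: assign each point to its closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each centre to the mean of the points in its cluster
        new_centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centres, centres):
            break  # centres no longer move
        centres = new_centres
    return centres, labels
```

A new point can then be assigned to the cluster whose returned centre is closest to it.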

Isolate the features mean_spent and max_spent , then run the K-Means algorithm on the

resulting dataset using K=2 (in sklearn, it is n_clusters = 2 ) and visualise the result.

http://beta.cambridgespark.com/courses/jpm/03module.html (2016. 11. 27.)

PYTHON

# Apply k-means with 2 clusters using a subset of the features
# (mean_spent and max_spent)
from sklearn.cluster import KMeans

Xsub = X[:, 1:3]

n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(Xsub) (1)

# use the fitted model to predict what the cluster of each customer should be
cluster_assignment = kmeans.predict(Xsub) (2)

cluster_assignment

1. The method fit runs the K-Means algorithm on the data that we pass to it.

2. The method predict returns a cluster label for each sample in the data.

PYTHON

# Visualise the clusters using a scatter plot or scatterplot matrix if you wish
from plotly.graph_objs import Scatter, Layout, Figure
from plotly.offline import iplot

data = [
    Scatter(
        x = Xsub[cluster_assignment == i, 0],
        y = Xsub[cluster_assignment == i, 1],
        mode = 'markers',
        name = 'cluster ' + str(i)
    ) for i in range(n_clusters)
]

layout = Layout(
    xaxis = dict(title = 'mean_spent'),
    yaxis = dict(title = 'max_spent'),
    height = 600,
)

fig = Figure(data = data, layout = layout)
iplot(fig)


The separation between the two clusters is clean (they can be separated with a straight line):

one cluster contains customers with low spending and the other customers with high spending.

Run K-Means using all the available features and visualise the result in the subspace created

by mean_spent and max_spent .

PYTHON

# Apply k-means with 2 clusters using all the features

PYTHON

# Adapt the visualisation code accordingly
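One possible sketch of a solution (assuming, as above, that `X` holds all the available features; the helper name `cluster_all_features` is ours, not part of the course code):

```python
from sklearn.cluster import KMeans

def cluster_all_features(X, n_clusters=2, seed=0):
    """Fit K-Means on every available feature and return the model
    together with the cluster assignment of each sample."""
    km = KMeans(n_clusters=n_clusters, random_state=seed)
    labels = km.fit_predict(X)
    return km, labels
```

To visualise, reuse the scatter-plot code above, taking x and y from columns 1 and 2 of `X` (`mean_spent` and `max_spent`) but colouring the points by these new labels.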


The result is now different. The first cluster contains customers with a maximum spending

close to the minimum mean spending and the second contains customers with a maximum

spending far from the minimum mean spending. This way we can tell apart customers that could

be willing to buy objects that cost more than their average spending.

Select the feature 'mean_spent' (or any feature of your choice) and compare the two clusters

obtained. Can you interpret the output of these commands?

PYTHON

# Compare expenditure between clusters
# (build a per-cluster summary table of the selected feature)
import pandas as pd

feat = 1
cluster0_stats = pd.DataFrame(X[cluster_assignment == 0, feat],
                              columns=['cluster0']).describe()
cluster1_stats = pd.DataFrame(X[cluster_assignment == 1, feat],
                              columns=['cluster1']).describe()
compare_df = pd.concat([cluster0_stats, cluster1_stats], axis=1)

compare_df


Compare the distribution of the feature mean_spent in the two clusters using a box plot.

PYTHON

# Create a boxplot of the two clusters for 'mean_spent'
from plotly.graph_objs import Box, Layout, Figure
from plotly.offline import iplot

data = [
    Box(
        y = X[cluster_assignment == i, feat],
        name = 'cluster' + str(i),
    ) for i in range(n_clusters)
]

layout = Layout(
    xaxis = dict(title = "Clusters"),
    yaxis = dict(title = "Value"),
    showlegend = False
)

fig = Figure(data = data, layout = layout)
iplot(fig)


Use the function create_distplot from FigureFactory to show the distribution of the

mean expenditure in both clusters.

PYTHON

# Compare mean expenditure with a histogram
from plotly.tools import FigureFactory as FF
from plotly.offline import iplot

x1 = X[cluster_assignment == 0, feat]
x2 = X[cluster_assignment == 1, feat]

hist_data = [x1, x2]
group_labels = ['Cluster 1', 'Cluster 2']

fig = FF.create_distplot(hist_data, group_labels)
iplot(fig)



Look at the centroids of the clusters ( kmeans.cluster_centers_ ) and check the values of the

centres for the features 'mean_spent' and 'max_spent'.

PYTHON

# Compare the centroids

We can see that the centres coincide with the means of each cluster in the table above.
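A possible sketch of the comparison (the helper name `centres_table` and the use of pandas here are our assumptions, not the course's code):

```python
import pandas as pd

def centres_table(kmeans_model, feature_names):
    """Tabulate the fitted centroids: one row per cluster,
    one column per feature."""
    centres = kmeans_model.cluster_centers_
    return pd.DataFrame(centres,
                        columns=feature_names,
                        index=['cluster%d' % i for i in range(len(centres))])
```

For the model above this would be called as `centres_table(kmeans, ['mean_spent', 'max_spent'])`.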

Compute the silhouette score of the clusters resulting from the application of K-Means. The

Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean

nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a)

/ max(a, b) . It represents how similar a sample is to the samples in its own cluster compared

to samples in other clusters. The best value is 1 and the worst value is -1. Values near 0

indicate overlapping clusters. Negative values generally indicate that a sample has been

assigned to the wrong cluster, as a different cluster is more similar.


PYTHON

# Compute the silhouette score

> ('silhouette_score', 0.451526633737)
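A sketch of how the score above can be computed with `sklearn.metrics.silhouette_score` (the helper name `kmeans_silhouette` is ours; the exact value printed depends on your data and the fitted clusters):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_silhouette(X, n_clusters=2, seed=0):
    """Fit K-Means and return the mean silhouette coefficient,
    i.e. the average of (b - a) / max(a, b) over all samples."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)
```

For the dataset above this would be called as `print('silhouette_score', kmeans_silhouette(Xsub))`.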

Pros:

easy to understand

any unseen point can be assigned to the cluster with the closest centre

Cons:

all points are assigned to a cluster, so the clusters are affected by noise and outliers

Comparison of algorithms

The chart below shows the characteristics of different clustering algorithms implemented in

sklearn on simple 2D datasets.


Here we note that K-Means works well in the case of globular clusters but it doesn't

produce good results on clusters that have circular and half-moon shapes. Instead, linkage-based

clustering and DBSCAN are able to deal with these kinds of cluster shapes.

learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html.

Wrap up of Module 3

Clustering is an unsupervised way to generate groups out of your data

Some clustering algorithms, like DBSCAN, have an embedded outlier detection mechanism

The silhouette score can be used to measure how compact the clusters are

