Statistical tools for high-throughput data analysis
Visual Enhancement of Clustering Analysis: Unsupervised Machine Learning
1 Required package
2 Data preparation
3 Enhanced distance matrix computation and visualization
4 Enhanced clustering analysis
4.1 eclust() function
4.2 Examples
5 Infos
Clustering analysis is used to find groups of similar objects in a dataset. There are two main categories of clustering:
Hierarchical clustering: agglomerative (hclust and agnes) and divisive (diana) methods, which construct a hierarchy of clusters.
Partitioning clustering: kmeans, pam, clara and fanny, which require the user to specify the number of clusters to be generated.
These clustering methods can be computed using the R packages stats (for kmeans and hclust) and cluster (for pam, clara, fanny, agnes and diana), but the workflow requires multiple steps and multiple lines of R code.
In this chapter, we provide some easy-to-use functions that streamline the workflow of clustering analyses, together with ggplot2-based methods for visualizing the results.
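To make the two categories concrete, here is a minimal sketch using only the stats and cluster packages. The choice of k = 4 and the object names are illustrative assumptions, not values prescribed by the text.

```r
library(cluster)  # agnes, diana, pam

df <- scale(USArrests)

# Hierarchical methods build a full tree; no number of clusters is needed up front
hc.agnes <- agnes(df, method = "ward")  # agglomerative
hc.diana <- diana(df)                   # divisive

# Partitioning methods need the number of clusters (k) as input
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)
pm <- pam(df, k = 4)

# A tree only yields groups once it is cut at a chosen k
grp.agnes <- cutree(as.hclust(hc.agnes), k = 4)
grp.diana <- cutree(as.hclust(hc.diana), k = 4)

# Compare cluster sizes across methods
table(km$cluster)
table(pm$clustering)
table(grp.agnes)
table(grp.diana)
```

Each method needs its own call, its own way of extracting the cluster assignment (cluster, clustering, or cutree()), and separate plotting code, which is the multi-step workflow referred to above.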
1 Required package
The following R packages are required in this chapter:
factoextra for enhanced clustering analyses and data visualization
cluster for computing the standard PAM, CLARA, FANNY, AGNES and DIANA clustering
1. factoextra can be installed as follows:
if (!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
2. Install cluster:
install.packages("cluster")
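Before running the rest of the chapter, it can help to check which of the required packages are already available. The sketch below only reports missing packages and installs nothing; the variable names are illustrative. factoextra is also published on CRAN, so install.packages("factoextra") should be an alternative to the GitHub install.

```r
# Packages used in this chapter
needed <- c("factoextra", "cluster")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing.pkgs <- needed[!installed]

if (length(missing.pkgs) > 0) {
  message("Missing packages: ", paste(missing.pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```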
2 Data preparation
# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
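As a quick sanity check on the scaling step, the sketch below confirms that each column of the scaled data has mean 0 and standard deviation 1, and that scale() matches a manual z-score. The helper object manual is illustrative.

```r
df <- scale(USArrests)

# After scaling, every variable should have mean ~0 and standard deviation 1
col.means <- colMeans(df)
col.sds <- apply(df, 2, sd)
round(col.means, 10)
col.sds

# Manual z-score: subtract the column mean, divide by the column sd
x <- as.matrix(USArrests)
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# Should be TRUE: scale() and the manual computation agree
all.equal(manual, df, check.attributes = FALSE)
```

Scaling matters here because the variables are measured in different units (arrests per 100,000 versus percent urban population), and distance-based methods would otherwise be dominated by the variable with the largest variance.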
3 Enhanced distance matrix computation and visualization
library("factoextra")
# Correlation-based distance method
res.dist <- get_dist(df, method = "pearson")
head(round(as.matrix(res.dist), 2))[, 1:6]
##            Alabama Alaska Arizona Arkansas California Colorado
## Alabama       0.00   0.71    1.45     0.09       1.87     1.69
## Alaska        0.71   0.00    0.83     0.37       0.81     0.52
## Arizona       1.45   0.83    0.00     1.18       0.29     0.60
## Arkansas      0.09   0.37    1.18     0.00       1.59     1.37
## California    1.87   0.81    0.29     1.59       0.00     0.11
## Colorado      1.69   0.52    0.60     1.37       0.11     0.00
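The correlation-based distance above can be sketched in base R as one minus the Pearson correlation between observations (rows). This is an assumption about what get_dist(method = "pearson") computes, not a copy of its implementation. The values range from 0 (perfectly correlated profiles) to 2 (perfectly anti-correlated profiles).

```r
df <- scale(USArrests)

# cor() works on columns, so transpose to correlate observations (states)
cor.obs <- cor(t(df), method = "pearson")

# Turn the similarity into a dissimilarity
d.cor <- as.dist(1 - cor.obs)

# Inspect the first few pairwise distances
d.mat <- as.matrix(d.cor)
round(d.mat[1:3, 1:3], 2)
```

Under that assumption, the Alabama/Alaska entry should match the 0.71 printed above.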
# Visualize the dissimilarity matrix
fviz_dist(res.dist, lab_size = 8)
The ordered dissimilarity matrix image (ODI) displays the clustering tendency of the dataset: similar objects are placed close to one another. Red corresponds to small distances and blue to large distances between observations.
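The idea behind an ordered dissimilarity image can be sketched with base graphics. The ordering used here (hierarchical clustering with Ward linkage) and the red-white-blue palette are assumptions chosen to mimic the description above. They are not taken from the fviz_dist() source.

```r
df <- scale(USArrests)
d <- dist(df, method = "euclidean")

# Reorder observations so that similar ones sit next to each other
ord <- hclust(d, method = "ward.D2")$order
m <- as.matrix(d)[ord, ord]

# Red = small distance, blue = large distance
pal <- colorRampPalette(c("red", "white", "blue"))(50)
n <- nrow(m)
image(seq_len(n), seq_len(n), m, col = pal, axes = FALSE,
      xlab = "", ylab = "", main = "Ordered dissimilarity image")
axis(1, at = seq_len(n), labels = rownames(m), las = 2, cex.axis = 0.4)
axis(2, at = seq_len(n), labels = colnames(m), las = 2, cex.axis = 0.4)
```

Red blocks along the diagonal suggest groups of mutually similar observations, which is what "clustering tendency" refers to.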
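4 Enhanced clustering analysis
The standard R code for computing hierarchical clustering takes several steps:
# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
# Compute dissimilarity matrix
res.dist <- dist(df, method = "euclidean")
# Compute hierarchical clustering
res.hc <- hclust(res.dist, method = "ward.D2")
# Visualize
plot(res.hc, cex = 0.5)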
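The standard workflow needs one more step to obtain actual cluster assignments: cutting the tree. A minimal sketch follows, where k = 4 is an illustrative choice.

```r
df <- scale(USArrests)
res.dist <- dist(df, method = "euclidean")
res.hc <- hclust(res.dist, method = "ward.D2")

# Cut the dendrogram into 4 groups
grp <- cutree(res.hc, k = 4)

# Number of states per cluster
table(grp)

# Members of the first cluster
rownames(df)[grp == 1]

# Redraw the dendrogram with the 4 clusters boxed
plot(res.hc, cex = 0.5)
rect.hclust(res.hc, k = 4, border = 2:5)
```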
4.1 eclust() function
In this chapter, we provide the function eclust() [in factoextra], which offers several advantages:
It simplifies the workflow of clustering analysis
It can be used to compute hierarchical clustering and partitioning clustering in a single function call
Unlike the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the number of clusters, eclust() automatically computes the gap statistic to estimate the number of clusters
For hierarchical clustering, correlation-based metrics are allowed
It provides silhouette information for all partitioning methods and hierarchical clustering
It draws beautiful graphs using ggplot2
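For reference, the gap statistic mentioned in the list above can be computed by hand with clusGap() and maxSE() from the cluster package. This is a sketch of the general technique. The settings (K.max = 10, B = 50 bootstrap samples, the "firstSEmax" rule) are assumptions and may differ from what eclust() uses internally.

```r
library(cluster)

df <- scale(USArrests)

# Gap statistic for k = 1..10 with k-means as the clustering function
set.seed(123)
gap <- clusGap(df, FUNcluster = kmeans, nstart = 25,
               K.max = 10, B = 50, verbose = FALSE)

# Pick k with the "first SE max" rule
k.hat <- maxSE(f = gap$Tab[, "gap"], SE.f = gap$Tab[, "SE.sim"],
               method = "firstSEmax", SE.factor = 1)
k.hat

# Gap curve with standard-error bars
plot(gap, main = "Gap statistic for k-means on USArrests")
```

The estimated k depends on the bootstrap samples and on the selection rule, so it can vary between runs and settings.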
eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)
The function eclust() returns an object of class eclust containing the result of the standard function used (e.g., kmeans, pam, hclust, agnes, diana, etc.).
It also includes:
cluster: the cluster assignment of observations (for hierarchical methods, after cutting the tree)
nbclust: the number of clusters
silinfo: the silhouette information of observations
size: the size of clusters
data: a matrix containing the original or the standardized data (if stand = TRUE)
gap_stat: the gap statistics
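To show the kind of information stored in components such as cluster, size and silinfo, the sketch below rebuilds comparable quantities with kmeans() and cluster::silhouette(). It illustrates the underlying quantities and does not reproduce the eclust object. The object names and k = 4 are assumptions.

```r
library(cluster)

df <- scale(USArrests)
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)

# Cluster assignment and cluster sizes
head(km$cluster)
km$size

# Silhouette width of each observation, given the Euclidean distances
sil <- silhouette(km$cluster, dist(df))
widths <- sil[, "sil_width"]

# Average silhouette width per cluster and overall
clus.avg <- tapply(widths, sil[, "cluster"], mean)
round(clus.avg, 2)
round(mean(widths), 2)
```

A silhouette width close to 1 means an observation sits well inside its cluster; values near 0 or below suggest it lies between clusters or may be misassigned.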
4.2 Examples
In this section, we'll show some examples of enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be done for PAM, CLARA, FANNY, AGNES and DIANA.
library("factoextra")
# Enhanced k-means clustering
res.km <- eclust(df, "kmeans", nstart = 25)
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Silhouette plot
fviz_silhouette(res.km)
##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   16          0.34
## 3       3   13          0.37
## 4       4   13          0.27
# Optimal number of clusters using gap statistics
res.km$nbclust
## [1] 4
# Print result
res.km
## K-means clustering with 4 clusters of sizes 8, 16, 13, 13
##
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 4  0.6950701  1.0394414  0.7226370  1.27693964
##
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California
##              1              4              4              1              4
##       Colorado    Connecticut       Delaware        Florida        Georgia
##              4              2              2              4              1
##         Hawaii          Idaho       Illinois        Indiana           Iowa
##              2              3              4              2              3
##         Kansas       Kentucky      Louisiana          Maine       Maryland
##              2              3              1              3              4
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri
##              2              4              3              1              4
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey
##              3              3              4              3              2
##     New Mexico       New York North Carolina   North Dakota           Ohio
##              4              4              1              3              2
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
##              2              2              2              2              1
##   South Dakota      Tennessee          Texas           Utah        Vermont
##              3              1              4              2              3
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming
##              2              2              3              3              2
##
## Within cluster sum of squares by cluster:
## [1]  8.316061 16.212213 11.952463 19.922437
##  (between_SS / total_SS =  71.2 %)
##
## Available components:
##
##  [1] "cluster"      "centers"      "totss"        "withinss"
##  [5] "tot.withinss" "betweenss"    "size"         "iter"
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"
## [13] "data"         "gap_stat"
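The between_SS / total_SS ratio in the printed result is the share of total variance explained by the partition. The sketch below recomputes it from a plain kmeans() fit. The seed and nstart value are assumptions; with enough random starts the ratio should be close to the 71.2 % shown above.

```r
df <- scale(USArrests)
set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)

# Total sum of squares splits into within-cluster and between-cluster parts
total.check <- km$tot.withinss + km$betweenss
all.equal(km$totss, total.check)

# For scaled data, totss = (n - 1) * p
(nrow(df) - 1) * ncol(df)

# Share of variance explained by the clustering, in percent
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)
```

A higher ratio means more compact, better separated clusters, but it always increases with k, so it should not be used on its own to choose the number of clusters.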
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust")  # compute hclust
fviz_dend(res.hc, rect = TRUE)  # dendrogram
fviz_silhouette(res.hc)  # silhouette plot
##   cluster size ave.sil.width
## 1       1    7          0.40
## 2       2   12          0.26
## 3       3   18          0.38
## 4       4   13          0.35
fviz_cluster(res.hc)  # scatter plot
It is also possible to specify the number of clusters directly:
eclust(df, "kmeans", k = 4)
5 Infos