Beautiful dendrogram visualizations in R: 5+ must known methods - Unsupervised Machine Learning
A variety of functions exist in R for visualizing and customizing dendrograms. This article describes 5+ methods for drawing a
beautiful dendrogram using R software.
We start by computing hierarchical clustering using the data set USArrests:
# Load data
data(USArrests)
# Compute distances and hierarchical clustering
dd <- dist(scale(USArrests), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")
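Before plotting, it can help to look at what the hclust object holds, since every plotting method below reads from it. The following is a minimal sketch using only base R; the choice of components to print is ours and is not part of the original article.

```r
# Recompute the clustering so this sketch is self-contained
data(USArrests)
dd <- dist(scale(USArrests), method = "euclidean")
hc <- hclust(dd, method = "ward.D2")

# merge: one row per agglomeration step (n - 1 steps for n observations)
dim(hc$merge)

# height: the dissimilarity at which each merge happened
range(hc$height)

# order: the left-to-right ordering of observations in the plotted tree
head(hc$labels[hc$order])
```

The order component explains why labels in a dendrogram do not follow the row order of the data set.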
1 plot.hclust() function
The standard R function plot.hclust() draws a dendrogram from the results of a hierarchical clustering analysis
(computed using the hclust() function).
A simplified format is:
plot(x, labels = NULL, hang = 0.1,
     main = "Cluster dendrogram", sub = NULL,
     xlab = NULL, ylab = "Height", ...)
# Default plot
plot(hc)
# Put the labels at the same height: hang = -1
plot(hc, hang = -1, cex = 0.6)
2 plot.dendrogram() function
To visualize the result of a hierarchical clustering analysis using the function plot.dendrogram(), we must first convert it to a
dendrogram object with as.dendrogram().
The format of the function plot.dendrogram() is:
plot(x, type = c("rectangle", "triangle"), horiz = FALSE)
# Convert hclust into a dendrogram and plot
hcd <- as.dendrogram(hc)
# Default plot
plot(hcd, type = "rectangle", ylab = "Height")
# Triangle plot
plot(hcd, type = "triangle", ylab = "Height")
# Zoom in to the first dendrogram
plot(hcd, xlim = c(1, 20), ylim = c(1, 8))
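Zooming with xlim and ylim only crops the view. An alternative, sketched below, is to split the dendrogram with the base R cut() method for dendrograms and plot one branch on its own. The cut height of 6 is an arbitrary value picked for illustration and does not come from the original article.

```r
data(USArrests)
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")
hcd <- as.dendrogram(hc)

# Cut the tree at height 6 (arbitrary value chosen for illustration)
parts <- cut(hcd, h = 6)

# parts$upper is the tree above the cut; parts$lower is a list of subtrees
length(parts$lower)

plot(parts$upper, main = "Upper tree (cut at h = 6)")
plot(parts$lower[[1]], main = "First branch below the cut")
```

Each element of parts$lower is itself a dendrogram, so all the customization arguments described in this section apply to it as well.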
The above dendrogram can be customized using the arguments:
nodePar: a list of plotting parameters to use for the nodes (see ?points). The default value is NULL. The list may contain components named pch, cex, col, xpd, and/or bg, each of which can have length two to specify separate attributes for inner nodes and leaves.
edgePar: a list of plotting parameters to use for the edge segments (see ?segments). The list may contain components named col, lty and lwd. As with nodePar, each can have length two to differentiate leaves and inner nodes.
leaflab: a string specifying how leaves are labeled. The default "perpendicular" writes text vertically; "textlike" writes text horizontally (in a rectangle); and "none" suppresses leaf labels.
# Define nodePar
nodePar <- list(lab.cex = 0.6, pch = c(NA, 19),
                cex = 0.7, col = "blue")
# Customized plot; remove labels
plot(hcd, ylab = "Height", nodePar = nodePar, leaflab = "none")
# Horizontal plot
plot(hcd, xlab = "Height",
     nodePar = nodePar, horiz = TRUE)
# Change edge color
plot(hcd, xlab = "Height", nodePar = nodePar,
     edgePar = list(col = 2:3, lwd = 2:1))
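The nodePar argument applies the same settings to every leaf. To style leaves individually, one option is to set the nodePar attribute node by node with the base R function dendrapply(). The sketch below colors leaf labels by cluster membership; the number of clusters (4), the palette and the helper name color_leaf are illustrative assumptions, not part of the original article.

```r
data(USArrests)
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")
hcd <- as.dendrogram(hc)

# Cluster membership per state (k = 4 is an arbitrary choice)
clusters <- cutree(hc, k = 4)
leaf_palette <- c("red", "blue", "darkgreen", "black")

# Hypothetical helper: attach a label color to each leaf, leave inner nodes alone
color_leaf <- function(node) {
  if (is.leaf(node)) {
    state <- attr(node, "label")
    attr(node, "nodePar") <- list(lab.col = leaf_palette[clusters[state]],
                                  lab.cex = 0.6, pch = NA)
  }
  node
}

# Apply the helper to every node of the tree and plot the result
hcd_colored <- dendrapply(hcd, color_leaf)
plot(hcd_colored, ylab = "Height")
```

Because the colors are stored inside the dendrogram object, they persist across later calls to plot().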
3 Phylogenetic trees
The package ape (Analyses of Phylogenetics and Evolution) can be used to produce a more sophisticated dendrogram.
The function plot.phylo() can be used for plotting a dendrogram. A simplified format is:
plot(x, type = "phylogram", show.tip.label = TRUE,
     edge.color = "black", edge.width = 1, edge.lty = 1,
     tip.color = "black")
# install.packages("ape")
library("ape")
# Default plot
plot(as.phylo(hc), cex = 0.6, label.offset = 0.5)
# Cladogram
plot(as.phylo(hc), type = "cladogram", cex = 0.6,
     label.offset = 0.5)
# Unrooted
plot(as.phylo(hc), type = "unrooted", cex = 0.6,
     no.margin = TRUE)
# Fan
plot(as.phylo(hc), type = "fan")
# Radial
plot(as.phylo(hc), type = "radial")
# Cut the dendrogram into 4 clusters
colors <- c("red", "blue", "green", "black")
clus4 <- cutree(hc, 4)
plot(as.phylo(hc), type = "fan", tip.color = colors[clus4],
     label.offset = 1, cex = 0.7)
# Change the appearance
# change edge and label (tip)
plot(as.phylo(hc), type = "cladogram", cex = 0.6,
     edge.color = "steelblue", edge.width = 2, edge.lty = 2,
     tip.color = "steelblue")
4 ggdendro package
The package ggdendro draws dendrograms with ggplot2. Install it as follows:
install.packages("ggdendro")
ggdendro requires the package ggplot2. Make sure that ggplot2 is installed and loaded before using ggdendro.
Load ggdendro as follows:
library("ggplot2")
library("ggdendro")
# Visualization using the default theme named theme_dendro()
ggdendrogram(hc)
# Rotate the plot and remove default theme
ggdendrogram(hc, rotate = TRUE, theme_dendro = FALSE)
4.3 Extract dendrogram plot data
The function dendro_data() can be used for extracting the data. It returns a list of data frames which can be extracted using the functions below:
segment(): To extract the data for dendrogram line segments
label(): To extract the labels
# Build dendrogram object from hclust results
dend <- as.dendrogram(hc)
# Extract the data (for rectangular lines)
# Type can be "rectangle" or "triangle"
dend_data <- dendro_data(dend, type = "rectangle")
# What dend_data contains
names(dend_data)
## [1] "segments"    "labels"      "leaf_labels" "class"
# Extract data for line segments
head(dend_data$segments)
##           x         y     xend      yend
## 1 19.771484 13.516242 8.867188 13.516242
## 2  8.867188 13.516242 8.867188  6.461866
## 3  8.867188  6.461866 4.125000  6.461866
## 4  4.125000  6.461866 4.125000  2.714554
## 5  4.125000  2.714554 2.500000  2.714554
## 6  2.500000  2.714554 2.500000  1.091092
# Extract data for labels
head(dend_data$labels)
##   x y          label
## 1 1 0        Alabama
## 2 2 0      Louisiana
## 3 3 0        Georgia
## 4 4 0      Tennessee
## 5 5 0 North Carolina
## 6 6 0    Mississippi
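# Plot line segments and add labels
# The lower y limit is negative to leave room for the labels drawn below y = 0
p <- ggplot(dend_data$segments) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(data = dend_data$labels, aes(x, y, label = label),
            hjust = 1, angle = 90, size = 3) +
  ylim(-3, 15)
print(p)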
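Since the labels come back as an ordinary data frame, extra columns can be attached to it and mapped to aesthetics. The sketch below adds cluster membership from cutree() and uses it to color the labels. The column name cluster, the object name labs and the choice of k = 4 are assumptions made for illustration.

```r
library(ggplot2)
library(ggdendro)

data(USArrests)
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")
dend_data <- dendro_data(as.dendrogram(hc), type = "rectangle")

# Attach cluster membership to the label data (k = 4 is an arbitrary choice)
labs <- dend_data$labels
labs$cluster <- factor(cutree(hc, k = 4)[as.character(labs$label)])

# Same layers as the manual plot, with label color mapped to the cluster
p <- ggplot(dend_data$segments) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(data = labs,
            aes(x = x, y = y, label = label, colour = cluster),
            hjust = 1, angle = 90, size = 3) +
  ylim(-3, 15)
print(p)
```

The lookup goes through the state name rather than the row position because the rows of the label data follow the dendrogram order, not the order of USArrests.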
5 dendextend package
The package dendextend contains many functions for changing the appearance of a dendrogram and for comparing dendrograms.
In this section we'll use the chaining operator (%>%) to simplify our code.
5.1 Chaining
The chaining operator (%>%) turns x %>% f(y) into f(x, y) so you can use it to rewrite multiple operations such that they can be read from left-to-right,
top-to-bottom. For instance, the results of the two R codes below are equivalent.
Standard R code for creating a dendrogram:
data <- scale(USArrests)
dist.res <- dist(data)
hc <- hclust(dist.res, method = "ward.D2")
dend <- as.dendrogram(hc)
plot(dend)
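The chained counterpart of the standard code above can be sketched as follows. It assumes dendextend is installed, since loading it makes the %>% operator available; the object name dend_chained is ours.

```r
library(dendextend)  # makes the %>% operator available

# Standard, step-by-step version
data <- scale(USArrests)
dist.res <- dist(data)
hc <- hclust(dist.res, method = "ward.D2")
dend <- as.dendrogram(hc)

# Chained version: each result becomes the first argument of the next call
dend_chained <- USArrests %>%
  scale() %>%
  dist() %>%
  hclust(method = "ward.D2") %>%
  as.dendrogram()

# Both trees have the same leaves and the same root height
identical(labels(dend), labels(dend_chained))
all.equal(attr(dend, "height"), attr(dend_chained, "height"))

plot(dend_chained)
```

The chained form avoids the intermediate objects data, dist.res and hc when only the final dendrogram is needed.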
install.packages('dendextend')
Loading:
library(dendextend)
The function set() can be used to change the parameters with dendextend.
The format is:
set(object, what, value)
# Create a dendrogram and plot it
dend <- USArrests[1:5, ] %>% scale %>%
  dist %>% hclust %>% as.dendrogram
dend %>% plot
# Get the labels of the tree
labels(dend)
## [1] "Alaska"     "Arizona"    "California" "Alabama"    "Arkansas"
This section describes how to change label names as well as the color and the size for labels.
# Change the labels, and then plot:
dend %>% set("labels", c("a", "b", "c", "d", "e")) %>% plot
# Change color and size for labels
dend %>% set("labels_col", c("green", "blue")) %>% # change color
  set("labels_cex", 2) %>% # Change size
  plot(main = "Change the color \nand size") # plot
# Color labels by specifying the number of cluster (k)
dend %>% set("labels_col", value = c("green", "blue"), k = 2) %>%
  plot(main = "Color labels \nper cluster")
abline(h = 2, lty = 2)
In the R code above, the color vector is shorter than the number of labels, so it is recycled.
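To see the recycling explicitly, the colors can be read back from the dendrogram with the dendextend function labels_colors(). This is a minimal sketch; the object name dend_col is ours, and suppressWarnings() is used on the assumption that dendextend may warn when it recycles a short vector.

```r
library(dendextend)

dend <- USArrests[1:5, ] %>% scale() %>% dist() %>% hclust() %>% as.dendrogram()

# Two colors for five labels: the vector is recycled (possibly with a warning)
dend_col <- suppressWarnings(set(dend, "labels_col", c("green", "blue")))

# Read the colors back, in dendrogram (left-to-right) order
labels_colors(dend_col)

# The same recycling written out explicitly with base R
rep_len(c("green", "blue"), length(labels(dend)))
```

Passing a vector with one color per label avoids the recycling altogether.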
# Change the type, the color and the size of node points
# +++++++++++++++++++++++++++++
dend %>% set("nodes_pch", 19) %>% # node point type
  set("nodes_cex", 2) %>% # node point size
  set("nodes_col", "blue") %>% # node point color
  plot(main = "Node points")
# Change the type, the color and the size of leave points
# +++++++++++++++++++++++++++++
dend %>% set("leaves_pch", 19) %>% # node point type
  set("leaves_cex", 2) %>% # node point size
  set("leaves_col", "blue") %>% # node point color
  plot(main = "Leaves points")
# Specify different point types and colors for each leave
dend %>% set("leaves_pch", c(17, 18, 19)) %>% # node point type
  set("leaves_cex", 2) %>% # node point size
  set("leaves_col", c("blue", "red", "green")) %>% # node point color
  plot(main = "Leaves points")
# Default colors
dend %>% set("branches_k_color", k = 2) %>%
  plot(main = "Default colors")
# Customized colors
dend %>% set("branches_k_color",
             value = c("red", "blue"), k = 2) %>%
  plot(main = "Customized colors")
It's also possible to color the branches with the function color_branches().
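A minimal sketch of the color_branches() route, assuming the same five-state dendrogram as above. The object name dend_k2 and the two colors are chosen for illustration.

```r
library(dendextend)

dend <- USArrests[1:5, ] %>% scale() %>% dist() %>% hclust() %>% as.dendrogram()

# Color the branches of the two clusters obtained by cutting the tree at k = 2
dend_k2 <- color_branches(dend, k = 2, col = c("red", "blue"))
plot(dend_k2, main = "color_branches(k = 2)")

# Inspect the branch color stored for each leaf
get_leaves_branches_col(dend_k2)
```

Unlike the chained set() calls, this returns a modified dendrogram that can be stored and plotted later.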
Clusters can be highlighted by adding colored rectangles. This is done using the rect.dendrogram() function (modeled on the rect.hclust()
function). One advantage of rect.dendrogram() over rect.hclust() is that it also works on horizontally plotted trees:
# Vertical plot
dend %>% set("branches_k_color", k = 3) %>% plot
dend %>% rect.dendrogram(k = 3, border = 8, lty = 5, lwd = 2)
# Horizontal plot
dend %>% set("branches_k_color", k = 3) %>% plot(horiz = TRUE)
dend %>% rect.dendrogram(k = 3, horiz = TRUE, border = 8, lty = 5, lwd = 2)
# Add colored bars below the dendrogram, one row per grouping variable
grp <- c(1, 1, 1, 2, 2)
k_3 <- cutree(dend, k = 3, order_clusters_as_data = FALSE)
# The FALSE above makes sure we get the clusters in the order of the
# dendrogram, and not in that of the original data. It is like:
# cutree(dend, k = 3)[order.dendrogram(dend)]
the_bars <- cbind(grp, k_3)
dend %>% set("labels", "") %>% plot
colored_bars(colors = the_bars, dend = dend)
5.10 ggplot2 integration
# Drop the Species column (column 5) so that only numeric columns are scaled
dend <- iris[1:30, -5] %>% scale %>% dist %>%
  hclust %>% as.dendrogram %>%
  set("branches_k_color", k = 3) %>% set("branches_lwd", 1.2) %>%
  set("labels_colors") %>% set("labels_cex", c(.9, 1.2)) %>%
  set("leaves_pch", 19) %>% set("leaves_col", c("blue", "red"))
# plot the dend in usual "base" plotting engine:
plot(dend)
library(ggplot2)
# Rectangle dendrogram using ggplot2
ggd1 <- as.ggdend(dend)
ggplot(ggd1)
# Change the theme to the default ggplot2 theme
ggplot(ggd1, horiz = TRUE, theme = NULL)
# Theme minimal
ggplot(ggd1, theme = theme_minimal())
# Create a radial plot and remove labels
ggplot(ggd1, labels = FALSE) +
  scale_y_reverse(expand = c(0.2, 0)) +
  coord_polar(theta = "x")
The package dendextend can be used to enhance many packages, including pvclust. Recall that pvclust calculates p-values for hierarchical
clustering.
pvclust can be used as follows:
library(pvclust)
data(lung) # 916 genes for 73 subjects
set.seed(1234)
result <- pvclust(lung[1:100, 1:10], method.dist = "cor",
                  method.hclust = "average", nboot = 10)
## Bootstrap (r = 0.5)... Done.
## Bootstrap (r = 0.6)... Done.
## Bootstrap (r = 0.7)... Done.
## Bootstrap (r = 0.8)... Done.
## Bootstrap (r = 0.9)... Done.
## Bootstrap (r = 1.0)... Done.
## Bootstrap (r = 1.1)... Done.
## Bootstrap (r = 1.2)... Done.
## Bootstrap (r = 1.3)... Done.
## Bootstrap (r = 1.4)... Done.
# Default plot of the result
plot(result)
pvrect(result)
# pvclust and dendextend
result %>% as.dendrogram %>%
  set("branches_k_color", k = 2, value = c("purple", "orange")) %>%
  plot
result %>% text
result %>% pvrect