
Package clustext

April 2, 2016
Title Consistent Clustering for Text Data
Version 0.1.1
Maintainer Tyler Rinker <tyler.rinker@gmail.com>
Description Optimized, consistent tools for clustering text data.
Depends R (>= 3.2.3)
Imports dplyr, fastcluster, gofastr, graphics, Matrix, mclust,
methods, rNMF, skmeans, slam, stats, termco, textshape, tm
Suggests testthat
Date 2016-04-02
License GPL-2
LazyData TRUE
Roxygen list(wrap = FALSE)
RoxygenNote 5.0.1
Author Tyler Rinker [aut, cre]
RemoteType local
RemoteUrl C:\Users\Tyler\GitHub\clustext
RemoteUsername trinker
RemoteRepo clustext

R topics documented:
approx_k
assignments
assign_cluster
as_topic
categorize
clustext
compare
cosine_distance
data_store
get_documents
get_dtm
get_removed
get_terms
get_text
hierarchical_cluster
jaccard_distance
kmeans_cluster
nmf_cluster
plot.hierarchical_cluster
presidential_debates_2012
print.assign_cluster
print.as_topic
print.compare
print.data_store
print.get_documents
print.get_terms
skmeans_cluster
summary.assign_cluster
write_cluster_text

approx_k

Approximate Number of Clusters for a Text Matrix

Description
Can & Ozkarahan (1990) formula for approximating the number of clusters for a text matrix:
(m * n)/t, where m and n are the dimensions of the matrix and t is the number of non-zero elements
in matrix A.
Usage
approx_k(x, verbose = TRUE)
## S3 method for class 'TermDocumentMatrix'
approx_k(x, verbose = TRUE)
## S3 method for class 'DocumentTermMatrix'
approx_k(x, verbose = TRUE)
Arguments
x

A matrix.

verbose

logical. If TRUE the k determination is printed.

Value
Returns an integer.
References
Can, F., Ozkarahan, E. A. (1990). Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems 15 (4): 483.
doi:10.1145/99935.99938.

Examples
library(gofastr)
library(dplyr)
presidential_debates_2012 %>%
with(q_dtm(dialogue)) %>%
approx_k()
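
The (m * n)/t approximation can be tried on a small dense matrix without any text packages. The following is a minimal base-R sketch, not the package's implementation: approx_k_sketch is a hypothetical helper, and rounding the result to a whole number is an assumption.

```r
# Hypothetical helper (not part of clustext): applies (m * n) / t to a plain matrix.
approx_k_sketch <- function(mat) {
  m <- nrow(mat)                 # number of rows
  n <- ncol(mat)                 # number of columns
  t_nonzero <- sum(mat != 0)     # count of non-zero cells
  round((m * n) / t_nonzero)     # rounding is an assumption of this sketch
}

# Made-up 3 x 4 matrix with 4 non-zero cells
toy <- matrix(
  c(1, 0, 0, 2,
    0, 3, 0, 0,
    0, 0, 4, 0),
  nrow = 3, byrow = TRUE
)

k <- approx_k_sketch(toy)
print(k)  # (3 * 4) / 4 = 3
```

Sparser matrices (fewer non-zero cells for the same dimensions) yield a larger approximate k.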

assignments

Topic Assignments

Description
A dataset containing a list of topic assignments by various clustering algorithms. Assignments
correspond to the rows (minus empty rows) of the presidential_debates_2012 data set.
Usage
data(assignments)
Format
A list with 3 elements

assign_cluster

Assign Clusters to Documents/Text Elements

Description
Assign clusters to documents/text elements.
Usage
assign_cluster(x, k = approx_k(get_dtm(x)), h = NULL, ...)
## S3 method for class 'hierarchical_cluster'
assign_cluster(x, k = approx_k(get_dtm(x)),
h = NULL, ...)
## S3 method for class 'kmeans_cluster'
assign_cluster(x, ...)
## S3 method for class 'skmeans_cluster'
assign_cluster(x, ...)
## S3 method for class 'nmf_cluster'
assign_cluster(x, ...)

Arguments
x

a xxx_cluster object.

k

The number of clusters (can supply h instead). Defaults to use approx_k of the
DocumentTermMatrix produced by data_store.

h

The height at which to cut the dendrogram (determines number of clusters). If
this argument is supplied k is ignored.

...

ignored.

Value
Returns an assign_cluster object; a named vector of cluster assignments with documents as
names. The object also contains the original data_store object and a join function. join is a
function (a closure) that captures information about the assign_cluster object, which makes rejoining to
the original data set simple. The user simply supplies the original data set as an argument to join
(attributes(FROM_ASSIGN_CLUSTER)$join(ORIGINAL_DATA)).
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
hierarchical_cluster(x) %>%
plot(h=.7, lwd=2)
hierarchical_cluster(x) %>%
assign_cluster(h=.7)
hierarchical_cluster(x, method="complete") %>%
plot(k=6)
hierarchical_cluster(x) %>%
assign_cluster(k=6)
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster()
ca2 <- assign_cluster(x2, k = 55)
summary(ca2)
## add to original data
attributes(ca2)$join(presidential_debates_2012)
## split text into clusters
get_text(ca2)
## Kmeans Algorithm
kmeans_cluster(x, k=6) %>%
assign_cluster()

x3 <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
kmeans_cluster(55)
ca3 <- assign_cluster(x3)
summary(ca3)
## split text into clusters
get_text(ca3)
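
The difference between cutting a tree by a number of clusters (k) and by a height (h) can be seen with base R alone. This is a hedged sketch using stats::hclust and stats::cutree on made-up points; whether assign_cluster calls cutree internally is an assumption, not something stated in this manual.

```r
# Made-up points forming two well-separated groups
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
rownames(pts) <- paste0("doc", 1:6)

fit <- stats::hclust(stats::dist(pts), method = "complete")

# Ask for a fixed number of clusters ...
by_k <- stats::cutree(fit, k = 2)

# ... or cut at a height; the number of clusters then follows from the tree
by_h <- stats::cutree(fit, h = 5)

print(by_k)
print(by_h)
```

Both calls return a named vector with documents as names, which mirrors the shape of the assignments described under Value.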

as_topic

Convert get_terms to Topics

Description
View important terms as a comma separated string (a topic).
Usage
as_topic(x, max.n = 8, sort = TRUE, ...)
## S3 method for class 'get_terms'
as_topic(x, max.n = 8, sort = TRUE, ...)
Arguments
x

A get_terms object.

max.n

The max number of words to show before truncation.

sort

logical. If TRUE the cluster topics are sorted by size (number of documents)
otherwise the topics are sorted by cluster number.

...

ignored.

Value
Returns a data.frame of "cluster", "count", and "terms". Pretty prints as clusters, number of
documents, and associated important terms.
Examples
library(dplyr)
myfit5 <- presidential_debates_2012 %>%
mutate(tot = gsub("\\..+$", "", tot)) %>%
textshape::combine() %>%
filter(person %in% c("ROMNEY", "OBAMA")) %>%
with(data_store(dialogue, stopwords = tm::stopwords("english"), min.char = 3)) %>%
hierarchical_cluster()
ca5 <- assign_cluster(myfit5, k = 50)
get_terms(ca5, .4) %>%
as_topic()

get_terms(ca5, .4) %>%
as_topic(sort=FALSE)
get_terms(ca5, .95) %>%
as_topic()
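
The idea of a topic as a comma separated string of a cluster's leading terms can be illustrated in base R. This sketch uses a made-up list of terms (terms_by_cluster is hypothetical) and simply keeps the first max.n terms; how as_topic marks truncation is not shown here.

```r
# Made-up terms per cluster, already ordered by importance
terms_by_cluster <- list(
  `1` = c("tax", "jobs", "economy", "deficit"),
  `2` = c("iran", "israel", "nuclear")
)

max.n <- 3  # keep at most this many terms per topic

topics <- vapply(
  terms_by_cluster,
  function(terms) paste(utils::head(terms, max.n), collapse = ", "),
  character(1)
)

print(topics)
```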

categorize

Merge Clusters & Cluster Categories Back to Original Data

Description
Merge clusters, categories, and the original data back together.
Usage
categorize(data, assign.cluster, cluster.key)
Arguments
data

A data set that was fit with a cluster model.

assign.cluster

An assign_cluster object.

cluster.key

A key of clusters and categories (e.g., the output of read_cluster_text).

Value
Returns a data.frame key of clusters and categories.
See Also
write_cluster_text, read_cluster_text
Examples
library(dplyr)
## Assign Clusters
ca <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
assign_cluster(k = 7)
## Write Cluster Text for Human Categorization
write_cluster_text(ca)
write_cluster_text(ca, n.sample=10)
write_cluster_text(ca, lead=" -", n.sample=10)
## Read Human Coded Categories Back In
categories_file <- system.file("additional/foo_turk.txt", package = "clustext")
readLines(categories_file)
(categories_key <- read_cluster_text(categories_file))
## Add Categories Back to Original Data Set

categorize(
data = presidential_debates_2012,
assign.cluster = ca,
cluster.key = categories_key
)

clustext

Consistent Clustering for Text Data

Description
Optimized, consistent tools for clustering text data.

compare

Adjusted Rand Index Comparison Between Algorithms

Description
An Adjusted Rand Index comparison of the assignments between different clustering algorithms.
Usage
compare(...)
Arguments
...

A series of outputs from assign_cluster for various cluster algorithms.

Value
Returns a pair-wise comparison matrix of Adjusted Rand Indices for the algorithms. Higher Adjusted
Rand Index scores indicate higher cluster assignment agreement.
References
http://faculty.washington.edu/kayee/pca/supp.pdf
Examples
compare(
assignments$hierarchical_assignment,
assignments$kmeans_assignment,
assignments$skmeans_assignment,
assignments$nmf_assignment
)
## Understanding the ARI
set.seed(10)
w <- sample(1:10, 40, TRUE)
x <- 11-w
set.seed(20)

y <- sample(1:10, 40, TRUE)
set.seed(50)
z <- sample(1:10, 40, TRUE)
data.frame(w, x, y, z)
library(mclust)
mclust::adjustedRandIndex(w, x)
mclust::adjustedRandIndex(x, y)
mclust::adjustedRandIndex(x, z)
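
For readers without mclust installed, the Adjusted Rand Index can also be computed from a contingency table in base R. The function below is a sketch of the standard pair-counting formula; ari_sketch is a hypothetical name and is not part of clustext or mclust.

```r
# Hypothetical helper: Adjusted Rand Index from two label vectors
ari_sketch <- function(a, b) {
  tab <- table(a, b)
  pair_count <- function(v) sum(choose(as.vector(v), 2))

  index    <- pair_count(tab)                 # pairs placed together by both
  row_part <- pair_count(rowSums(tab))        # pairs together in a
  col_part <- pair_count(colSums(tab))        # pairs together in b
  expected <- row_part * col_part / choose(length(a), 2)
  maximum  <- (row_part + col_part) / 2

  (index - expected) / (maximum - expected)
}

w <- c(1, 1, 2, 2, 3, 3)
x <- 4 - w                 # same partition, labels reversed
y <- c(1, 2, 3, 1, 2, 3)   # splits every pair that w keeps together

ari_same <- ari_sketch(w, x)
ari_diff <- ari_sketch(w, y)
print(c(ari_same, ari_diff))
```

The index depends only on which items are grouped together, so relabeling the clusters (w versus x) leaves it at its maximum of 1.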

cosine_distance

Optimized Computation of Cosine Distance

Description
Utilizes the slam package to efficiently calculate cosine distance on large sparse matrices.
Usage
cosine_distance(x, ...)
## S3 method for class 'DocumentTermMatrix'
cosine_distance(x, ...)
## S3 method for class 'TermDocumentMatrix'
cosine_distance(x, ...)
Arguments
x

A data type (e.g., DocumentTermMatrix or TermDocumentMatrix).

...

ignored.

Value
Returns a cosine distance object of class "dist".
Author(s)
Michael Andrec and Tyler Rinker <tyler.rinker@gmail.com>.
References
http://stackoverflow.com/a/29755756/1000343
Examples
library(gofastr)
library(dplyr)
out <- presidential_debates_2012 %>%
with(q_dtm(dialogue)) %>%
cosine_distance()
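
To make the quantity concrete, cosine distance (1 minus cosine similarity) between the rows of a small dense matrix can be sketched in base R. This is not the slam-based implementation; cosine_distance_sketch is a hypothetical helper, and treating rows as documents is an assumption.

```r
# Hypothetical helper: row-wise cosine distance on a dense matrix
cosine_distance_sketch <- function(mat) {
  norms <- sqrt(rowSums(mat^2))                  # Euclidean length of each row
  sim <- tcrossprod(mat) / outer(norms, norms)   # pairwise cosine similarity
  stats::as.dist(1 - sim)                        # convert to a "dist" object
}

# Made-up term counts: a and b point in the same direction, c is orthogonal
docs <- rbind(a = c(1, 0, 1), b = c(2, 0, 2), c = c(0, 3, 0))

d <- cosine_distance_sketch(docs)
print(round(as.matrix(d), 3))
```

Because only direction matters, documents a and b are at distance 0 even though b has twice the counts.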

data_store

Data Structure for clustext

Description
A data structure which stores the text, DocumentTermMatrix, and information regarding removed
text elements which cannot be handled by the hierarchical_cluster function. This structure is
required because it documents important meta information, including removed elements, required
by other clustext functions. If the user wishes to combine documents (say by a common grouping
variable) it is recommended this be handled by combine prior to using data_store.
Usage
data_store(text, doc.names, min.term.freq = 1, min.doc.len = 1,
stopwords = tm::stopwords("english"), min.char = 3, max.char = NULL,
stem = FALSE, denumber = TRUE)
Arguments
text

A character vector.

doc.names

An optional vector of document names corresponding to the length of text.

min.term.freq

The minimum times a term must appear to be included in the DocumentTermMatrix.

min.doc.len

The minimum number of words a document must contain to be included in the data structure (otherwise it is stored as a removed element).

stopwords

A vector of stopwords to remove.

min.char

The minimum number of characters for retained words.

max.char

The maximum number of characters for retained words.

stem

Logical. If TRUE the stopwords will be stemmed.

denumber

Logical. If TRUE numbers will be excluded.

Value
Returns a list containing:
dtm A tf-idf weighted DocumentTermMatrix
text The text vector with unanalyzable elements removed
removed The indices of the removed text elements, i.e., documents not meeting min.doc.len
n.nonsparse The number of non-zero elements
Examples
data_store(presidential_debates_2012[["dialogue"]])
## Use `combine` to merge text prior to `data_store`
library(textshape)
library(dplyr)
dat <- presidential_debates_2012 %>%
dplyr::select(person, time, dialogue) %>%
textshape::combine()
## Elements in `ds` correspond to `dat` grouping vars
(ds <- with(dat, data_store(dialogue)))
dplyr::select(dat, -3)
## Add row names
(ds2 <- with(dat, data_store(dialogue, paste(person, time, sep = "_"))))
rownames(ds2[["dtm"]])
## Get a DocumentTermMatrix
get_dtm(ds2)

get_documents

Get Documents Based on Cluster Assignment in assign_cluster

Description
Get the documents associated with each of the k clusters.
Usage
get_documents(x, ...)
## S3 method for class 'assign_cluster'
get_documents(x, ...)
Arguments
x

An assign_cluster object.

...

ignored.

Value
Returns a list of vectors of document names.
Examples
library(dplyr)
mydocuments1 <- presidential_debates_2012 %>%
with(data_store(dialogue, paste(person, time, sep="-"))) %>%
hierarchical_cluster() %>%
assign_cluster(k = 6) %>%
get_documents()
mydocuments1
mydocuments2 <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
assign_cluster(k = 55) %>%
get_documents()
mydocuments2

get_dtm

Get a DocumentTermMatrix Stored in a hierarchical_cluster Object

Description
Extract the DocumentTermMatrix supplied to/produced by a hierarchical_cluster object.
Usage
get_dtm(x, ...)
## S3 method for class 'data_store'
get_dtm(x, ...)
## S3 method for class 'hierarchical_cluster'
get_dtm(x, ...)
## S3 method for class 'kmeans_cluster'
get_dtm(x, ...)
## S3 method for class 'skmeans_cluster'
get_dtm(x, ...)
## S3 method for class 'nmf_cluster'
get_dtm(x, ...)
Arguments
x

A hierarchical_cluster object.

...

ignored.

Value
Returns a DocumentTermMatrix.
Examples
library(dplyr)
presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
get_dtm()

get_removed

Get a Text Stored in a hierarchical_cluster Object

Description
Extract the text supplied to the hierarchical_cluster object.
Usage
get_removed(x, ...)
## S3 method for class 'hierarchical_cluster'
get_removed(x, ...)
## S3 method for class 'kmeans_cluster'
get_removed(x, ...)
## S3 method for class 'skmeans_cluster'
get_removed(x, ...)
## S3 method for class 'nmf_cluster'
get_removed(x, ...)
## S3 method for class 'data_store'
get_removed(x, ...)
Arguments
x

A hierarchical_cluster object.

...

ignored.

Value
Returns a vector of text strings.

Examples
library(dplyr)
presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
get_removed()

get_terms

Get Terms Based on Cluster Assignment in assign_cluster

Description
Get the terms, weighted (either by tf-idf or as returned from the model) and min/max scaled, that are
associated with each of the k clusters.
Usage
get_terms(x, min.weight = 0.6, nrow = NULL, ...)
## S3 method for class 'assign_cluster_hierarchical'
get_terms(x, min.weight = 0.6,
nrow = NULL, ...)
## S3 method for class 'assign_cluster_kmeans'
get_terms(x, min.weight = 0.6, nrow = NULL,
...)
## S3 method for class 'assign_cluster_skmeans'
get_terms(x, min.weight = 0.6, nrow = NULL,
...)
## S3 method for class 'assign_cluster_nmf'
get_terms(x, min.weight = 0.6, nrow = NULL,
...)
Arguments
x

An assign_cluster object.

min.weight

The lowest min/max scaled tf-idf weighting to consider as a document's salient
term.

nrow

The max number of rows to display in the returned data.frames.

...

ignored.

Value
Returns a list of data.frames of top weighted terms.
Examples
library(dplyr)
library(textshape)
myterms <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
assign_cluster(k = 55) %>%
get_terms()

myterms
textshape::bind_list(myterms[!sapply(myterms, is.null)], "Topic")
## Not run:
library(ggplot2)
library(gridExtra)
library(dplyr)
library(textshape)
library(wordcloud)
max.n <- max(textshape::bind_list(myterms)[["n"]])
myplots <- Map(function(x, y){
x %>%
mutate(term = factor(term, levels = rev(term))) %>%
ggplot(aes(term, weight=n)) +
geom_bar() +
scale_y_continuous(expand = c(0, 0),limits=c(0, max.n)) +
ggtitle(sprintf("Topic: %s", y)) +
coord_flip()
}, myterms, names(myterms))
myplots[["ncol"]] <- 10
do.call(gridExtra::grid.arrange, myplots[!sapply(myplots, is.null)])
##wordclouds
par(mfrow=c(5, 11), mar=c(0, 4, 0, 0))
Map(function(x, y){
wordcloud::wordcloud(x[[1]], x[[2]], scale=c(1,.25),min.freq=1)
mtext(sprintf("Topic: %s", y), col = "blue", cex=.55, padj = 1.5)
}, myterms, names(myterms))
## End(Not run)
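
The role of min.weight is easier to see with numbers. The following base-R sketch applies min/max scaling to a made-up vector of term weights and keeps terms at or above a threshold; minmax is a hypothetical helper, and scaling within a single cluster's weights is an assumption about how the package proceeds.

```r
# Hypothetical helper: rescale weights to the [0, 1] range
minmax <- function(w) (w - min(w)) / (max(w) - min(w))

# Made-up term weights for one cluster
weights <- c(debt = 0.10, jobs = 0.40, taxes = 0.70, energy = 0.25)

scaled <- minmax(weights)

min.weight <- 0.6
salient <- sort(scaled[scaled >= min.weight], decreasing = TRUE)

print(round(scaled, 2))
print(salient)
```

Lowering min.weight (for example to .4, as in the as_topic examples) admits more terms per cluster.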

get_text

Get a Text Stored in Various Objects

Description
Extract the text supplied to the hierarchical_cluster object.
Usage
get_text(x, ...)
## S3 method for class 'hierarchical_cluster'
get_text(x, ...)
## S3 method for class 'kmeans_cluster'
get_text(x, ...)
## S3 method for class 'nmf_cluster'
get_text(x, ...)

## S3 method for class 'skmeans_cluster'
get_text(x, ...)
## S3 method for class 'data_store'
get_text(x, ...)
## S3 method for class 'assign_cluster'
get_text(x, ...)
Arguments
x

A hierarchical_cluster object.

...

ignored.

Value
Returns a vector or list of text strings.
Examples
library(dplyr)
presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
get_text() %>%
head()

hierarchical_cluster

Fit a Hierarchical Cluster

Description
Fit a hierarchical cluster to text data. Prior to distance measures being calculated the tf-idf (see
weightTfIdf) is applied to the DocumentTermMatrix. Cosine dissimilarity is used to generate the
distance matrix supplied to hclust. method defaults to "ward.D2". A faster cosine dissimilarity
calculation is used under the hood (see cosine_distance). Additionally, hclust is used to quickly
calculate the fit. Essentially, this is a wrapper function optimized for clustering text data.
Usage
hierarchical_cluster(x, distance = "cosine", method = "ward.D2", ...)
## S3 method for class 'data_store'
hierarchical_cluster(x, distance = "cosine",
method = "ward.D", ...)

Arguments
x

A data store object (see data_store).

distance

A distance measure ("cosine" or "jaccard").

method

The agglomeration method to be used. This must be (an unambiguous abbreviation of) one of "single", "complete", "average", "mcquitty", "ward.D",
"ward.D2", "centroid", or "median".

...

ignored.

Value
Returns an object of class "hclust".
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
hierarchical_cluster(x) %>%
plot(k=4)
hierarchical_cluster(x) %>%
plot(h=.7, lwd=2)
hierarchical_cluster(x) %>%
assign_cluster(h=.7)
hierarchical_cluster(x, method="complete") %>%
plot(k=6)
hierarchical_cluster(x) %>%
assign_cluster(k=6)
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue))
myfit2 <- hierarchical_cluster(x2)
plot(myfit2)
plot(myfit2, 55)
assign_cluster(myfit2, k = 55)
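
The pipeline described above (cosine dissimilarity fed to hclust with Ward's method) can be approximated in base R on a tiny matrix. This is only a sketch under stated assumptions: it uses stats::hclust, skips the tf-idf weighting step, and uses made-up counts in place of a real DocumentTermMatrix.

```r
# Made-up document-by-term counts; d1/d2 share terms, d3/d4 share other terms
dtm <- rbind(
  d1 = c(3, 1, 0, 0),
  d2 = c(2, 2, 0, 0),
  d3 = c(0, 0, 4, 1),
  d4 = c(0, 0, 1, 3)
)

# Cosine dissimilarity between rows
norms <- sqrt(rowSums(dtm^2))
cos_dist <- stats::as.dist(1 - tcrossprod(dtm) / outer(norms, norms))

# Hierarchical fit with the "ward.D2" agglomeration method
fit <- stats::hclust(cos_dist, method = "ward.D2")

groups <- stats::cutree(fit, k = 2)
print(groups)
```

Documents with no terms in common sit at cosine distance 1, so the two pairs are merged only at the top of the tree.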

jaccard_distance

Optimized Computation of Jaccard Distance

Description
Utilizes the slam package to efficiently calculate jaccard distance on large sparse matrices.

Usage
jaccard_distance(x, ...)
## S3 method for class 'DocumentTermMatrix'
jaccard_distance(x, ...)
## S3 method for class 'TermDocumentMatrix'
jaccard_distance(x, ...)
Arguments
x

A data type (e.g., DocumentTermMatrix or TermDocumentMatrix).

...

ignored.

Value
Returns a jaccard distance object of class "dist".
Author(s)
user41844 of StackOverflow, Dmitriy Selivanov, and Tyler Rinker <tyler.rinker@gmail.com>.
References
http://stackoverflow.com/a/36373333/1000343
http://stats.stackexchange.com/a/89947/7482
Examples
library(gofastr)
library(dplyr)
out <- presidential_debates_2012 %>%
with(q_dtm(dialogue)) %>%
jaccard_distance()
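
As an illustration of the measure itself, Jaccard distance between rows can be sketched in base R by comparing which terms are present in each document. jaccard_distance_sketch is a hypothetical helper, not the slam-based implementation, and reducing counts to presence/absence is an assumption of this sketch.

```r
# Hypothetical helper: row-wise Jaccard distance on term presence
jaccard_distance_sketch <- function(mat) {
  present <- (mat > 0) * 1                        # 1 if the term occurs, else 0
  shared <- tcrossprod(present)                   # terms in both documents
  sizes <- rowSums(present)                       # terms in each document
  union <- outer(sizes, sizes, "+") - shared      # terms in either document
  stats::as.dist(1 - shared / union)
}

# Made-up counts: a and b share one of three distinct terms; c shares none
docs <- rbind(a = c(1, 1, 0, 0), b = c(0, 2, 5, 0), c = c(0, 0, 0, 1))

d <- jaccard_distance_sketch(docs)
print(round(as.matrix(d), 3))
```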

kmeans_cluster

Fit a Kmeans Cluster

Description
Fit a kmeans cluster to text data. Prior to distance measures being calculated the tf-idf (see weightTfIdf)
is applied to the DocumentTermMatrix.
Usage
kmeans_cluster(x, k, ...)
## S3 method for class 'data_store'
kmeans_cluster(x, k, ...)

Arguments
x

A data store object (see data_store).

k

The number of clusters.

...

Other arguments passed to kmeans.

Value
Returns an object of class "kmeans".
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
## 6 topic model
kmeans_cluster(x, k=6)
kmeans_cluster(x, k=6) %>%
assign_cluster()
kmeans_cluster(x, k=6) %>%
assign_cluster() %>%
summary()
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue))
myfit2 <- kmeans_cluster(x2, 55)
assign_cluster(myfit2)
assign_cluster(myfit2) %>%
summary()
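
Since additional arguments are passed on to kmeans, it can help to see the underlying call on a small matrix. This is a base-R sketch with made-up counts standing in for a weighted DocumentTermMatrix; the nstart value is an arbitrary choice for illustration.

```r
set.seed(1)  # k-means starts from random centers

# Made-up document-by-term counts
dtm <- rbind(
  d1 = c(3, 1, 0, 0),
  d2 = c(2, 2, 0, 0),
  d3 = c(0, 0, 4, 1),
  d4 = c(0, 0, 1, 3)
)

# Two clusters; nstart repeats the fit from several random starts
fit <- stats::kmeans(dtm, centers = 2, nstart = 5)

print(fit$cluster)  # named vector: one cluster id per document
print(fit$size)     # number of documents per cluster
```

Unlike the hierarchical fit, the number of clusters must be chosen before fitting, which is why k has no default here.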

nmf_cluster

Fit a Non-Negative Matrix Factorization Cluster

Description
Fit a robust non-negative matrix factorization cluster to text data via rnmf. Prior to distance measures being calculated the tf-idf (see weightTfIdf) is applied to the DocumentTermMatrix.
Usage
nmf_cluster(x, k = k, ...)
## S3 method for class 'data_store'
nmf_cluster(x, k, ...)

Arguments
x

A data store object (see data_store).

k

The number of clusters.

...

Other arguments passed to rnmf.

Value
Returns an object of class "hclust".
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
## 6 topic model
model6 <- nmf_cluster(x, k=6)
model6 %>%
assign_cluster()
model6 %>%
assign_cluster() %>%
summary()
## Not run:
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue))
myfit2 <- nmf_cluster(x2, 55)
assign_cluster(myfit2)
assign_cluster(myfit2) %>%
summary()
## End(Not run)

plot.hierarchical_cluster

Plots a hierarchical_cluster Object

Description
Plots a hierarchical_cluster object
Usage
## S3 method for class 'hierarchical_cluster'
plot(x, k = approx_k(get_dtm(x)), h = NULL,
color = "red", ...)

Arguments
x

A hierarchical_cluster object.

k

The number of clusters (can supply h instead). Defaults to use approx_k of the
DocumentTermMatrix produced by data_store. Boxes are drawn around
the clusters.

h

The height at which to cut the dendrogram (determines number of clusters). If
this argument is supplied k is ignored. A line is drawn showing the cut point on
the dendrogram.

color

The color to make the cluster boxes (k) or line (h).

...

Other arguments passed to rect.hclust or abline.

presidential_debates_2012

2012 U.S. Presidential Debates

Description
A dataset containing a cleaned version of all three presidential debates for the 2012 election.
Usage
data(presidential_debates_2012)
Format
A data frame with 2912 rows and 4 variables
Details

person. The speaker
tot. Turn of talk
dialogue. The words spoken
time. Variable indicating which of the three debates the dialogue is from

print.assign_cluster

Prints an assign_cluster Object

Description
Prints an assign_cluster object
Usage
## S3 method for class 'assign_cluster'
print(x, ...)
Arguments
x

An assign_cluster object.

...

ignored.

print.as_topic

Prints an as_topic Object

Description
Prints an as_topic object
Usage
## S3 method for class 'as_topic'
print(x, ...)
Arguments
x

An as_topic object.

...

ignored.

print.compare

Prints a compare Object.

Description
Prints a compare object.
Usage
## S3 method for class 'compare'
print(x, digits = 3, ...)
Arguments
x

The compare object.

digits

Number of decimal places to print.

...

ignored.

print.data_store

Prints a data_store Object

Description
Prints a data_store object
Usage
## S3 method for class 'data_store'
print(x, ...)
Arguments
x

A data_store object.

...

ignored.

print.get_documents

Prints a get_documents Object

Description
Prints a get_documents object
Usage
## S3 method for class 'get_documents'
print(x, ...)
Arguments
x

A get_documents object.

...

ignored.

print.get_terms

Prints a get_terms Object

Description
Prints a get_terms object
Usage
## S3 method for class 'get_terms'
print(x, ...)
Arguments
x

A get_terms object.

...

ignored.

skmeans_cluster

Fit a skmean Cluster

Description
Fit a skmean cluster to text data. Prior to distance measures being calculated the tf-idf (see weightTfIdf)
is applied to the DocumentTermMatrix. Cosine dissimilarity is used to generate the distance matrix
supplied to skmeans.
Usage
skmeans_cluster(x, k, ...)
## S3 method for class 'data_store'
skmeans_cluster(x, k, ...)

Arguments
x

A data store object (see data_store).

k

The number of clusters.

...

Other arguments passed to skmeans.

Value
Returns an object of class "skmean".
Examples
library(dplyr)
x <- with(
presidential_debates_2012,
data_store(dialogue, paste(person, time, sep = "_"))
)
## 6 topic model
myfit1 <- skmeans_cluster(x, k=6)
myfit1 %>%
assign_cluster()
myfit1 %>%
assign_cluster() %>%
summary()
## Not run:
x2 <- presidential_debates_2012 %>%
with(data_store(dialogue))
myfit2 <- skmeans_cluster(x2, 55)
assign_cluster(myfit2)
assign_cluster(myfit2) %>%
summary()
## End(Not run)

summary.assign_cluster

Summary of an assign_cluster Object

Description
Summary of an assign_cluster object
Usage
## S3 method for class 'assign_cluster'
summary(object, plot = TRUE, print = TRUE, ...)

Arguments
object

An assign_cluster object.

plot

logical. If TRUE an accompanying bar plot is produced as well.

print

logical. If TRUE data.frame counts are printed.

...

ignored.

write_cluster_text

Write/Read Cluster Text for Human Categorization

Description

Write cluster text from get_text(assign_cluster(myfit)) to an external file for categorization. After the file has been written with write_cluster_text a human coder can assign categories
to each cluster. Simply write the category after the Cluster #:. To set a cluster category equal to
another, write an equal sign followed by the other cluster to set as the same category (e.g.,
Cluster 10: =5 to set cluster #10 the same as cluster #5). See readLines(system.file("additional/foo_turk.txt"
for an example.
Usage
write_cluster_text(x, path, n.sample = NULL, lead = " * ", ...)
read_cluster_text(path, ...)
Arguments
x

An assign_cluster object.

path

A path to the file (.txt is recommended).

n.sample

The length to limit the sample to (default gives all text in the cluster). Setting
this to an integer uses this as the number to randomly sample from.

lead

A leading character string prefix to give the cluster text.

...

ignored.

See Also
categorize
Examples
library(dplyr)
## Assign Clusters
ca <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
hierarchical_cluster() %>%
assign_cluster(k = 7)
## Write Cluster Text for Human Categorization
write_cluster_text(ca)
write_cluster_text(ca, n.sample=10)

write_cluster_text(ca, lead=" -", n.sample=10)
## Read Human Coded Categories Back In
categories_file <- system.file("additional/foo_turk.txt", package = "clustext")
readLines(categories_file)
(categories_key <- read_cluster_text(categories_file))
## Add Categories Back to Original Data Set
categorize(
data = presidential_debates_2012,
assign.cluster = ca,
cluster.key = categories_key
)
