
Accelerating Flow Cytometry Data Clustering Workflows with

Graphics Processing Units


by
Andrew D. Pangborn
A Thesis Proposal Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Engineering
Supervised by
Dr. Muhammad Shaaban
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
September 2009
Approved By:
Dr. Muhammad Shaaban
Associate Professor - Department of Computer Engineering
Dr. Gregor von Laszewski
Assistant Director of Cloud Computing - Pervasive Technology Institute, Indiana University
Dr. Roy Melton
Lecturer - Department of Computer Engineering, RIT
Dr. James Cavenaugh
Postdoctoral Fellow - Center for Vaccine Biology and Immunology, University of Rochester
Abstract
Flow cytometry is a mainstay technology used by biologists and immunologists for counting, sorting, and analyzing cells suspended in a fluid. The results of flow cytometry are used in a variety of important clinical and research applications such as phenotyping, DNA analysis, and cell function analysis. Like many modern scientific applications, flow cytometry produces massive amounts of data which must be clustered in order to be useful. Conventional analysis of flow cytometry data uses manual sequential bivariate gating. However, this technique is limited in both the quantity and quality of analyses produced. Unsupervised multivariate clustering techniques have been investigated and show promise for producing sound statistical analyses of flow cytometry data.
The computational demands of multivariate clustering grow rapidly, and therefore large data sets containing hundreds of natural clusters, like those found in flow cytometry data, are impractical to analyze on a single CPU. Fortunately, these techniques lend themselves naturally to large-scale parallel processing. To address these computational demands, graphics processing units, specifically Nvidia's CUDA framework and Tesla architecture, will be investigated as a low-cost, high-performance solution for a number of clustering algorithms.
This thesis will implement a number of clustering algorithms and information criteria using the CUDA framework. The algorithm implementations will be scalable such that they can use multiple GPUs on either a single node or multiple nodes, using shared memory or message passing. The CUDA algorithm implementations are envisioned as part of a larger grid-based workflow service where biologists can apply multiple algorithms and parameter sweeps to their data sets and quickly receive a thorough set of results that can be further analyzed by experts. The CUDA-accelerated implementations will be compared against sequential versions. A performance evaluation will be completed that assesses accuracy, speedup, scalability, and utilization of the parallel architecture.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Thesis Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1 Data Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1.1 Types of Data Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Flow Cytometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2.1 Bivariate Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2.2 Multivariate Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2.3 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 GPGPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3.1 CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Supporting Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.1 C Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.2 Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.1.3 Exhaustive Bivariate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Data Clustering on GPGPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5 Project Deliverables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7 Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
8 Required Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1. Introduction
Science and business applications often produce massive data sets. This immense amount of data must be classified into meaningful subsets for data analysts and scientists to draw meaningful conclusions. Data clustering is the broad field of statistical analysis that groups (classifies) similar objects into relatively homogeneous sets called clusters. Data clustering has a history in a wide variety of fields such as data mining, machine learning, geology, astronomy, and bioinformatics, to name a few [1] [2]. The nature of the data similarity varies significantly from one application and data set to another; therefore, there is no single data clustering algorithm that is superior to all others in every instance. As such, there has been extensive research, and a myriad of clustering techniques have been developed in the past 50 to 60 years [2].
Flow cytometry is a mainstay technology used in immunology and other clinical and biological research areas such as DNA analysis, genotyping, phenotyping, and cell function analysis. It is used to gather information about the physical and chemical characteristics of a population of cells. Flow cytometers produce a d-dimensional data vector of floating-point values for every event (usually a cell) in a sample, where d indicates the number of photo-sensitive sensors installed. Typical samples have on the order of 10^6 events with upwards of 20 to 24 dimensions (and this number continues to increase as flow cytometer technology evolves). This massive amount of data must then be clustered in order for biologists to draw meaningful conclusions about the characteristics of the sample.
Sequential bivariate gating is the approach traditionally followed by biologists for clustering and analyzing flow cytometry data. Two dimensions of the data are analyzed at a time with a scatter plot. Clusters are then manually drawn around populations of cells by a technique called gating. Data is typically diffuse and clusters are not well-defined or distinct; therefore, gating requires experience and expert knowledge about the data and the dimensions involved. Unfortunately this process is time consuming, cumbersome, and inexact. Unpublished research by the University of Rochester Center for Vaccine Biology and Immunology suggests that results can vary by as much as an order of magnitude between experienced immunologists on a difficult data set. Therefore both the number and quality of the analyses produced by sequential bivariate gating are limited.
Multivariate data clustering techniques have been around for decades; however, their application to the field of flow cytometry has been limited. Multivariate techniques have the potential to use the full multidimensional nature of the data, to find cell populations of interest (that are difficult to isolate with sequential bivariate gating), and to allow analysts to make more sound statistical inferences from the results. Flow cytometry data sets are complex, containing millions of events, dozens of dimensions, and potentially hundreds of natural clusters. Unsupervised multivariate clustering techniques are computationally intensive, and the computational demands grow rapidly as the number of clusters, events, and dimensions increases. This makes it very time consuming to thoroughly analyze a flow cytometry data set using a single general purpose processor. Fortunately, many
clustering techniques lend themselves nicely to large scale parallel processing.
In this thesis, Nvidia's CUDA framework for general purpose computing on graphics processing units (GPGPU) is investigated as a low-cost, high-performance solution to address the computational demands of unsupervised multivariate data clustering for flow cytometry [3]. Existing work on data clustering algorithms using GPGPU has been limited in the algorithms implemented and in the scalability of those algorithms (such as using multiple GPUs), and has not been optimized specifically for the flow cytometry problem. Two unsupervised multivariate clustering algorithms, c-means and expectation maximization with Gaussian mixture models, will be implemented using CUDA and the Tesla architecture. A novel subspace clustering algorithm called exhaustive bivariate will be implemented as well. In the exhaustive bivariate approach, all (or a large subset of all) two-dimensional subspaces of the same data set are clustered. The results of all two-dimensional clusterings are then combined to determine unique groups of events that belong to the same clusters in each of the subspaces. Clustering results for all methods will be assessed by the accuracy and quality of the flow cytometry analyses produced. The performance of sequential, single-GPU, and multiple-GPU implementations will be compared.
2. Thesis Objectives
This thesis will leverage the parallel computing architecture of Graphics Processing Units (GPUs) and exploit the parallelism inherent to many data clustering algorithms to make thorough analyses of flow cytometry data possible in a timely manner. Specifically, Nvidia's CUDA framework [3] and Tesla architecture [4] for GPGPU will be used as a high-performance, low-cost solution. Multi-GPU scalable implementations of fuzzy c-means and expectation maximization with Gaussian mixture models will be developed. A novel subspace clustering algorithm will also be implemented that performs an exhaustive bivariate clustering of a data set.
A software workflow framework will be developed that allows immunologists to input FCS files and configure a suite of clustering parameters for analyses of flow data sets. Finally, the algorithms will be thoroughly analyzed for performance, including accuracy of clustering results, speedup versus a single CPU, speedup versus a single GPU, and efficiency (i.e., utilization of the parallel architecture's resources). This effort will be a step towards revolutionizing the way flow cytometry data is analyzed and will allow immunologists and clinical technicians to conduct better, more thorough work in a timely fashion.
3. Background
3.1 Data Clustering
Data clustering is a statistical method for grouping similar objects into related or homogeneous sets, called clusters. There are a myriad of scientific fields and commercial applications that generate immense amounts of data, ranging from high energy particle physics at CERN to the buying habits of consumers at grocery stores. In any case, the objective is to group related data together so that analysts can draw meaningful conclusions from the data. The reasons for data mining, as well as the size and nature of the data, vary tremendously from one field to another, and even from one data set to another in a given field. As such, a wide variety of data clustering techniques have been developed over the past 60+ years. No single data clustering algorithm is sufficient for all applications.
3.1.1 Types of Data Clustering
A thorough discussion of data clustering techniques is well beyond the scope of this thesis; for a more thorough review of the field, please consult [1] and [2]. However, this section provides an overview of the different types of data clustering techniques that will be explored in this thesis for GPU acceleration and application to the flow cytometry problem.
Data clustering can be dichotomized into two broad efforts: exploratory or confirmatory [1]. The former attempts to gain insight into the contents of the data and form hypotheses, whereas the latter attempts to match the data to known patterns or models. Flow cytometry has a wide variety of applications that fall into both categories. Regardless of the technique involved, data clustering algorithms face the same questions. How many clusters should be used? How well does the data fit the cluster configuration or model? Are all of the natural clusters being exposed? While philosophically simple, in practice these are difficult problems, and such questions have plagued researchers for decades; they are the reason why data clustering remains such a challenging and computationally intensive problem. Many popular data clustering techniques exhibit a combinatorial increase in computational complexity as the number of dimensions, the number of vectors, and the number of clusters increase. The nature of the data in flow cytometry, with millions of events, over 20 dimensions, and hundreds of natural clusters, makes unsupervised data clustering both difficult and very computationally demanding.
Among the unsupervised (exploratory) clustering techniques are many subcategories. In this thesis four different types of data clustering are investigated. The first is a center-based method [2], although some literature groups these into least-squares methods [1]. In center-based techniques, a configuration with k clusters has k d-dimensional cluster centers. The most common center-based clustering method is the k-means algorithm. Events in the data set are grouped into the cluster whose
center is closest. Centers are then recomputed based upon their members, and this iterates until the center locations converge. Fuzzy k-means [5] (typically called c-means in the data clustering literature) and k-medoids [6] also fall into the category of center-based clustering. The next method is model-based, which assumes that the data is composed of a collection of different distributions. A model-based algorithm investigated in this thesis is expectation maximization (EM) with Gaussian mixture models. This is a two-step iterative clustering algorithm: the first step of each iteration maps the probability of each individual data point to the current models, and the second step updates the statistical parameters of the models based upon the new probabilities. Agglomerative hierarchical clustering works by iteratively combining the most similar clusters. The similarity measure varies significantly from one implementation to another. This thesis combines hierarchical clustering with the Gaussian mixture model approach, so the similarity measure is based upon the covariances and means of the Gaussian distributions. The final type explored in this thesis is subspace clustering. Subspace clustering finds clusters in all subspaces of the data. The general principle is that if a cluster exists in a high-dimensional space, that cluster must also exist in lower-dimensional subspaces. It is useful for finding clusters that are too difficult to find in the full high-dimensional space (the so-called curse of dimensionality). The exhaustive bivariate technique explored in this thesis is a form of subspace clustering.
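The center-based iteration described above, assigning each event to its nearest center and then recomputing each center from its members, can be sketched in a few lines. The following is an illustrative CPU-side Python sketch, not the CUDA implementation this thesis proposes; the naive first-k initialization is an assumption made for brevity (practical implementations typically seed the centers randomly).

```python
def kmeans(events, k, max_iters=100):
    """Center-based clustering: events is a list of d-dimensional
    lists; returns the final centers and per-event cluster labels."""
    def dist2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Naive initialization for illustration: the first k events.
    centers = [list(e) for e in events[:k]]
    labels = [0] * len(events)
    for _ in range(max_iters):
        # Assignment step: each event joins its closest center.
        labels = [min(range(k), key=lambda c: dist2(e, centers[c]))
                  for e in events]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [e for e, lab in zip(events, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:  # an emptied cluster keeps its old center
                new_centers.append(centers[c])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, labels
```

Fuzzy c-means replaces the hard assignment step with fractional memberships, and EM with Gaussian mixtures replaces both steps with probability-weighted updates of means and covariances; in all three, the per-event assignment step is where the bulk of the parallelism exploitable on a GPU lies.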
3.2 Flow Cytometry
Flow cytometry is a process for studying the physical and chemical characteristics of a group of particles (usually cells) suspended in a fluid. The system contains three major components: fluidics, optics, and electronics. The fluidics system is responsible for creating a narrow stream of particles that pass through the optics by a technique known as hydrodynamic focusing (see Figure 3.1) [7]. The optics consist of one or more lasers of different wavelengths. Lenses, mirrors, and filters direct the scattered light to different photo-sensitive electronic detectors, as seen in Figure 3.1. The particles (typically cells) pass through a focused laser beam at a rate of thousands per second. Light hits the cell and is refracted (scattered) in all directions. Forward scatter information can describe physical characteristics of the particle, such as cell size. Other light gets scattered to the side, and these rays are channelled through a series of filters for different wavelengths and reflected onto other light detectors. A typical way of studying different cell characteristics is to use fluorescently labeled antibodies. The cell samples are doped with various fluorescent reagents, which bind to certain types of cells, to act as indicators. When light hits the fluorescent molecules, they become excited and emit light at a particular wavelength [8]. Different colors of light will be scattered depending upon which fluorescent markers are attached to the cell, and the corresponding light emission is picked up by a series of detectors [7].
The information from all of the different detectors forms a data vector for each event, a particle or cell passing through the laser. State-of-the-art flow cytometers can have as many as 20 to 24 detectors. This results in large data sets on the order of 10^6 events with up to 24 dimensions. In addition to the raw data, there are headers to indicate which color/sensor the dimensions correspond to, and compensation information for the different channels. Annotation information for the cell
Figure 3.1: Hydrodynamic focusing of particles in a flow cytometer [8]
sample can also be present. The industry uses the flow cytometry standard (FCS) file format, which contains the header, compensation data, annotation information, and raw data [7].
3.2.1 Bivariate Analysis
Bivariate analysis is the traditional, and most mature, approach used in the flow cytometry field for analyzing data. Histograms plot a single dimension of the data versus the frequency of that value in the cell population. This does not allow the analyst to observe all of the overlapping, but naturally distinct, cell populations in the data. Instead, two dimensions at a time are analyzed in a scatter plot to expose more distinct cell populations. Figure 3.2 shows a simple scatter plot of the forward and side refraction data for a flow cytometry sample and their corresponding histograms. The different colors in the plot denote the natural clusters (in this case, different types of cells such as lymphocytes, monocytes, and neutrophils) [8]. Naturally, the cell type for each event would not be known a priori with real data, and therefore the analyst would group the blobs of cells in the scatter plot into distinct clusters. Expert knowledge must then be used by the immunologist studying the data to determine what each cluster corresponds to, based upon the dimensions and locations of the clusters in the plot. For non-trivial analyses it is necessary to iteratively cluster the data rather than using single scatter plots. Bins are drawn around cell populations, and then cells within a particular bin are further analyzed using different dimensions. This technique is known as sequential bivariate analysis.
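Programmatically, a single gate is just a membership test on two chosen dimensions. The sketch below uses a rectangular gate for simplicity; actual gates are usually hand-drawn polygons or ellipses, and the function name and rectangle shape are illustrative assumptions rather than part of any flow cytometry package.

```python
def rectangular_gate(events, dim_x, dim_y, x_range, y_range):
    """Keep the events whose values in the two chosen dimensions fall
    inside a rectangular region (a hand-drawn gate would be a polygon,
    but the membership test is the same idea)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [e for e in events
            if x_lo <= e[dim_x] <= x_hi and y_lo <= e[dim_y] <= y_hi]

# Sequential bivariate analysis chains such gates: the events that
# pass one gate are re-plotted on two different dimensions and gated
# again, narrowing in on a sub-population at each step.
```

For example, gating on dimensions 0 and 1 and then re-gating the survivors on dimensions 0 and 2 mirrors the iterative clustering described above.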
There are many software packages available for flow cytometry, such as FlowJo [9] and WebFlow [10]. However, most of this software focuses on bivariate analysis techniques. Services provided
Figure 3.2: Bivariate Scatter Plot [8]
include data management and organization, statistical analysis, gating, and visualization.
3.2.2 Multivariate Analysis
The previous section discussed how moving from one dimension to two dimensions for analysis exposes many more natural clusters in the data. However, with data containing upwards of 24 dimensions, there is still a lot of potential information in the data that may not be exposed by bivariate techniques. As the number of colors increases, the parameter space for bivariate analysis grows rapidly. While a simple experiment with 6 colors has only 64 possible boolean parameters, a 20-color experiment is on the order of 10^6. Using an iterative sequential bivariate technique with multiple scatter plots and different combinations of dimensions allows the analyst to see many different characteristics in the cell population. However, the number of different ways of analyzing the dimensions two at a time grows combinatorially, and consistent, reliable data analysis is difficult for manual operators. Very slight changes in the binning in one iteration of the sequential bivariate analysis can also have a large impact on the final result. Instead, this thesis investigates the use of multivariate data clustering techniques for analyzing flow cytometry data. Despite automated multivariate clustering having been in the flow cytometry literature for nearly 25 years [11] [12] [13], it has had little effect on the flow cytometry community. However, some recent research efforts such as FLAME [14] and Lo et al. [15] have shown promise for the use of automated multivariate
clustering for flow cytometry with biological case studies. Only recently have any multivariate FC data analysis tools begun to emerge for widespread use, such as Immport [16] and FlowCore [17].
3.2.3 Workflow
So far, Section 3 has discussed the characteristics of flow cytometry data and given a brief overview of an individual analysis of the data. However, the workflow for a thorough flow cytometry analysis is far more involved than a single sequential bivariate gating or multivariate clustering.
Prior to doing any clustering, there are some desirable preprocessing steps. First, the raw data needs to be extracted from the FCS file itself. Values that max out the light detectors in the flow cytometer are filtered out because they are not reliable data. The FCS file also contains compensation information, which is used to filter out the effects of fluorescent spillover that occurs due to overlapping channel wavelengths. The compensation information is applied to clean up the raw data and transform it from an instrument space to one that more accurately represents the biology. While light scatter data tends to have a linear response, the fluorescent channels tend to be log-like. Therefore it is common to apply log or biexponential transformations to these channels before doing any gating or clustering (FlowJo, a popular tool for bivariate gating [9], does this automatically).
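As a concrete illustration of the steps just described, the sketch below filters saturated events and log-transforms the fluorescence channels. The 10-bit saturation value and the log10(v + 1) form are assumptions made for the example; compensation (a matrix correction using the spillover information in the FCS header) and biexponential transforms are omitted for brevity.

```python
import math

def preprocess(events, fluor_dims, detector_max=1023.0):
    """Drop events that saturate any detector, then log-transform the
    fluorescence channels; scatter channels are left linear.
    detector_max is instrument-specific; 1023 assumes a 10-bit ADC."""
    cleaned = []
    for e in events:
        # Saturated readings maxed out a detector and are unreliable.
        if any(v >= detector_max for v in e):
            continue
        # log10(v + 1) keeps zero-valued channels defined; production
        # tools often use a biexponential transform here instead.
        cleaned.append([math.log10(v + 1.0) if d in fluor_dims else v
                        for d, v in enumerate(e)])
    return cleaned
```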
After the preprocessing of the data is complete, clustering begins. A one-to-many relationship exists between a particular data set and the clustering analyses that can be performed on it. There are two main sources of different clustering analyses. First, no single clustering algorithm has been shown to be universally superior to all others for all purposes, and different algorithms can be expected to give different clusters and different numbers of clusters. Depending on the specific objectives of the analysis, different algorithms may be chosen. For example, one algorithm may be good at exploratory analysis of the data, able to locate very small clusters out of the entire data set, while another may be good at consistently profiling larger clusters, which is useful for inference across data sets. While this thesis only focuses on three clustering algorithms, many more may be used. Secondly, for any particular clustering algorithm it may be desirable to do parameter sweeps (such as varying the number of starting clusters, or the parameters used by information criteria such as MDL) as well as sensitivity analyses. Using many different algorithms, each with many different starting parameters, results in an analysis with dozens or even hundreds of clusterings. Finally, intuitive visualization of the results is necessary in order for experts to gauge the quality of the results and to determine their biological implications (i.e., what cells the clusters correspond to).
A very important, yet under-developed, aspect of flow cytometry research is good statistical inference. In short, biologists are interested in the change in various cell populations (clusters) across samples. Here the one-to-many relationship between a particular data set and its clustering analyses is expanded to a many-to-many relationship with multiple data sets, each with various clusterings. As previously mentioned, even a single clustering algorithm can be slow on a single CPU. Clearly, there is a great need for accelerated algorithms in order for the entire workflow to be feasible in a reasonable time.
3.3 GPGPU
Traditional von Neumann general purpose processor architectures support only a single active thread of execution. Sequencing of operations is quite easy with this architecture; however, concurrency is difficult. While the capabilities of general purpose processors to exploit parallelism have improved significantly throughout the years due to techniques such as the Tomasulo algorithm, superscalar dispatching, symmetric multi-threading, and multiple-core technologies, they still remain limited to a few threads of concurrent execution. In addition, a significant portion of on-chip resources is dedicated to decoding instruction sets for general purpose computing, handling branching and synchronization, pipelining, and caching. The resources available for raw floating-point computations (used heavily in flow cytometry) are limited. Effectively meeting the computing requirements of scientific applications using general purpose processors often requires hundreds if not thousands of computing nodes.
Graphics processing units (GPUs) have evolved from simple fixed-function co-processors in the graphics pipeline to programmable computation engines suitable for certain general purpose computing applications. The introduction of programmable shaders into GPUs made the field of general purpose computing on graphics processing units (GPGPU) possible. Older efforts at GPGPU required researchers to cast the general purpose computations into streaming graphical applications, with the instructions written as shaders, such as in the OpenGL Shading Language (GLSL), and the data stored as textures. With the GeForce 8800 graphics card series, Nvidia introduced a new architecture with a unified shader model [4]. This was a major shift from a fixed-function pipeline (with separate processing elements dedicated to particular tasks, such as vertex shading and pixel shading) to a more general purpose architecture. In graphics applications, these processing elements still execute either vertex or pixel shading procedures. However, they are actually multiprocessors capable of executing general purpose threads. This is a massively parallel architecture with many benefits over a general purpose desktop processor for data-intensive computations such as data clustering. Newer generations of the Nvidia GPU architecture, such as the Tesla T10, contain 240 concurrent processing elements [4]. Compared to general purpose processors, a much larger portion of on-chip resources in a GPU is dedicated to data and floating-point calculations rather than control and sequencing, which is ideal for flow cytometry data clustering, whose computation is composed almost entirely of floating-point operations. Nvidia's CUDA Zone website boasts a wide variety of conference papers, articles, and project websites with application speedups of 100 to 300x on graphics cards costing much less than a typical desktop computer [3].
There are a variety of different approaches to GPGPU. As previously mentioned, the OpenGL Shading Language can be used to write general purpose procedures. Another significant effort was the Brook GPU language and toolset by Ian Buck et al. at Stanford University [18]. Brook GPU uses a modified C language with extensions for stream processing. One benefit of the Brook GPU project is that the compiler can produce target code for a wide variety of different graphics cards from ATI, Nvidia, and Intel. Unfortunately, the Brook GPU project is no longer under active development. Another emerging technology for GPGPU is OpenCL. Nvidia's CUDA framework and hardware based on the Tesla architecture were chosen for this thesis because they are currently the most mature tool chain and best hardware for GPGPU. There is a significant amount of both commercial and development community support, and the majority of GPGPU research in recent years has used CUDA technology.
3.3.1 CUDA
The Compute Unified Device Architecture (CUDA) is a framework for scientific general purpose computing on Nvidia GPUs. CUDA provides a set of APIs, a compiler (nvcc), supporting libraries, and hardware drivers to enable running applications on the GPU. Programs utilize the CPU on the workstation, referred to as the host in the CUDA documentation, as well as the GPU, which is referred to as the device. Host code can be written in standard ANSI C and C++ and compiled with a standard C compiler such as the GNU C Compiler (gcc). CUDA programs are compiled with Nvidia's C compiler and use a slightly extended C language. It adds identifiers that define whether functions run on the host only, on the device only, or globally, as well as identifiers for data location. Finally, there is additional syntax for invoking kernel functions.
While instruction set architectures for general purpose CPUs seek to abstract the details of the hardware from the software, knowledge of the underlying hardware is still important for programmers of performance-critical applications (and especially important for compiler designers). Programming for CUDA is very close to the metal, and as such requires a thorough understanding of the underlying parallel architecture, thread model, and memory model to create effective implementations on the GPU.
Hardware Architecture
At the heart of the Tesla architecture [4] are streaming multiprocessors (SMs). Each SM is comparable to a CPU, with an interface for fetching and decoding instructions, functional units, and registers. Within each SM are many functional units, called cores, which all execute different threads with the same instruction but on different data elements (SIMD). SMs also have a small amount (16 KB) of high-speed memory for sharing data between threads on a single SM, and an interface to the rest of the onboard DRAM. The Nvidia GTX 260, for example, has 24 multiprocessors with 8 cores each, for a total of 192 simultaneous execution cores.
Thread Model
The programming model and use of threads are best explained by the CUDA Programming Guide [19]. Since understanding the thread model is essential to effective programming with CUDA and to understanding CUDA program implementations, this section provides a brief overview. The massively parallel architecture with many cores requires a robust thread organization structure. At the top level, threads are organized into a grid, which composes the entirety of the application running on the GPU at any given time (i.e., a kernel launched by the host). The grid contains a 2-dimensional set of blocks. A block runs on a single multiprocessor; blocks cannot be globally synchronized with one another and are not even guaranteed to be running physically at the same time as other blocks in the grid (the number of blocks can, and typically does, exceed the number of multiprocessors on the GPU). Blocks can be compared to independent tasks (processes) running on an operating system such as Windows or Linux; however, all block resources are statically defined rather than being swapped out when other blocks are running, which essentially eliminates task-switching overhead. Inside a block are threads with 3-dimensional indices, which are allocated to the different cores within a multiprocessor. Threads are organized into warps, which are sets of threads executing the same instruction in a SIMD fashion. Whenever a branch or uncoalesced memory access is performed, the SIMD warp of threads is split, resulting in lower performance. All threads within a block have their own registers and can access the shared memory on the SM. A low-overhead thread synchronization function is available for all threads within a block, similar to a barrier in MPI applications.
Memory Model
Just as understanding the thread model is important for parallelizing algorithms and implementing
them with CUDA, understanding the memory model is essential to a high-performance CUDA
application. Again, this section provides only a brief overview; for more details consult the CUDA
Programming Guide [19]. Memory is divided into three main categories: registers, shared memory,
and global memory. Registers are high-speed memory. There are 8192 32-bit registers per
multiprocessor on the 8800-series graphics cards (16384 in newer cards, like the GTX 200 series),
and they are divided evenly amongst the threads in a block. Thus for a block with 512 threads,
each thread receives only 16 registers. If the program requires more registers, then the compiler
assigns global memory as local memory for the thread. Threads cannot access registers reserved
for other threads, regardless of whether or not the threads are physically executing at the same time
(remember, there are only 8 physical cores per multiprocessor). Shared memory is high-speed
on-chip memory organized into 16 banks, totaling 16 KB per multiprocessor. All threads
within a block can access the shared memory; however, threads must be synchronized to avoid race
conditions and to guarantee that the expected value has actually been written to shared memory.
Additionally, multiple threads attempting to access the same bank cause a bank conflict, and the
accesses are serialized. The global memory space is large compared to shared memory (almost 1
GB on the GTX 260) but significantly slower, with latencies of up to hundreds of cycles. All threads
in the grid can access global memory, and it persists throughout the lifetime of the application.
Additionally, there are two read-only memory spaces, constant and texture. Both reside
in the global DRAM but are cached in higher-speed memory on the multiprocessor, and thus can
provide higher performance than global memory if access patterns exhibit spatial locality.
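The register and shared-memory budgeting described above amounts to simple integer division; the following sketch (using the G80-era per-SM limits quoted in this section, and assuming a single block occupies the SM) illustrates the arithmetic:

```python
# Back-of-the-envelope resource budgeting for a CUDA thread block, using the
# G80-era limits quoted above: 8192 registers and 16 KB shared memory per SM.
def per_thread_registers(threads_per_block, registers_per_sm=8192):
    """Registers available to each thread when one block occupies the SM."""
    return registers_per_sm // threads_per_block

def shared_mem_per_thread(threads_per_block, shared_bytes_per_sm=16 * 1024):
    """Bytes of shared memory per thread under the same assumption."""
    return shared_bytes_per_sm // threads_per_block

print(per_thread_registers(512))   # 16 registers, matching the example above
print(shared_mem_per_thread(512))  # 32 bytes
```

Exceeding either budget forces the compiler to spill to slow (local) global memory or reduces the number of blocks that can be resident on the SM, so this arithmetic directly shapes kernel design.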
4. Supporting Work
This chapter discusses the specific algorithms being targeted for implementation using CUDA. Prior
work with data clustering algorithms using GPUs is also discussed.
4.1 Algorithms
The work for this thesis focuses on three algorithms. The first algorithm is c-means, a soft or fuzzy
version of the popular k-means algorithm. The second is expectation maximization
(EM) with Gaussian mixture models. The final algorithm is a novel exploratory algorithm that performs
an exhaustive clustering of all bivariate subspaces in the data.
4.1.1 C Means
C-means is a soft, or fuzzy, version of the k-means least-squares clustering algorithm. Rather
than each data point being associated with only the nearest cluster center (where nearest in this
case means the smallest Euclidean distance), data points have a membership ranging from 0 to 1 in
every cluster.
The algorithm is based on the minimization of the following function, which defines the error associated
with a solution [2]: the sum, over all data points, of the squared distance from each data point to each cluster
center, weighted by the membership of the data point in that cluster.
E_p = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^p \, \| x_i - c_j \|^2, \quad 1 \le p < \infty \qquad (4.1)
In Equation 4.1, p is any real number greater than one that defines the degree of fuzziness,
u_{ij} is the membership level of event x_i in cluster j, and c_j is the center of cluster j. The fuzzy
clustering is done through an iterative optimization of Equation 4.1. In each iteration, the memberships
u_{ij} are updated using Equation 4.2 and the cluster centers c_j are updated using Equation 4.3.
u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{2/(p-1)}} \qquad (4.2)
c_j = \frac{\sum_{i=1}^{N} u_{ij}^p \, x_i}{\sum_{i=1}^{N} u_{ij}^p} \qquad (4.3)
The following is an outline of the fuzzy c-means algorithm.
1. Given the number of clusters, c, randomly choose c data points as initial cluster centers.
2. Compute the membership of every data point in every cluster using Equation 4.2.
3. Recompute each cluster center as the membership-weighted mean of the data points, using Equation 4.3.
4. Stop if there is minimal change in the cluster centers; otherwise return to step 2.
5. Report the cluster centers.
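The steps above can be sketched as a short sequential NumPy reference version. This is an illustrative sketch, not the CUDA implementation proposed in this thesis; the random initialization, tolerance, and small distance floor are arbitrary choices of mine:

```python
import numpy as np

def fuzzy_cmeans(x, c, p=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means following Equations 4.1-4.3.
    x: (N, d) data; c: number of clusters; p: fuzziness exponent (> 1)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=c, replace=False)]   # step 1
    for _ in range(max_iter):
        # Distance of every event to every center, shape (N, c)
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)                       # avoid divide-by-zero
        # Membership update (Equation 4.2): ratio[i, j, k] = d_ij / d_ik
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (p - 1.0))
        u = 1.0 / ratio.sum(axis=2)                          # step 2
        # Center update (Equation 4.3), weighted by u^p
        w = u ** p
        new_centers = (w.T @ x) / w.sum(axis=0)[:, None]     # step 3
        if np.linalg.norm(new_centers - centers) < tol:      # step 4
            centers = new_centers
            break
        centers = new_centers
    return centers, u                                        # step 5
```

Note that each membership row sums to one by construction, which is the property the GPU implementation must preserve when the membership update is distributed across thread blocks.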
4.1.2 Gaussian Mixture Models
Data in flow cytometry is composed of many distinct subclasses or clusters. The data for each vector
(or event) is an aggregate of a mixture of multiple distinct behaviors. Mixture distributions form
probabilistic models composed of a number of component subclasses [20]. Given an M-dimensional
data set, each subclass k is characterized by [20]:
- \pi_k, the probability that a sample in the data set belongs to the subclass
- \mu_k, the spectral mean
- R_k, an M x M spectral covariance matrix.
Assuming there are N flow cytometry events y_1, y_2, \ldots, y_N, the likelihood that an event
y_n belongs to a Gaussian distribution subclass x_n = k is given by [20]:
p_{y_n | x_n}(y_n \mid k, \theta) = \frac{1}{(2\pi)^{M/2} \, |R_k|^{1/2}} \exp\left( -\frac{1}{2} (y_n - \mu_k)^T R_k^{-1} (y_n - \mu_k) \right) \qquad (4.4)
It is not known what subclass each event belongs to, therefore it is necessary to calculate the
likelihood for each subclass and apply conditional probability [20].
p_{y_n}(y_n \mid \theta) = \sum_{k=1}^{K} p_{y_n | x_n}(y_n \mid k, \theta) \, \pi_k
Neither the statistical parameters of the Gaussian mixture model, \theta = (\pi, \mu, R), nor the membership
of events to subclasses are known a priori. An algorithm must be employed to deal with this
lack of information.
EM
Expectation maximization is a statistical method for performing likelihood estimation with incomplete
data [2]. The objective of the algorithm is to estimate K, the number of subclasses, and \theta,
the parameters for each subclass. First each event y_n is classified based on the likelihood criteria
above. However, instead of a hard classification, it is desirable to compute a soft classification for
each event [20]:
p_{x_n | y_n}(k \mid y_n, \theta^{(i)}) = \frac{p_{y_n | x_n}(y_n \mid k, \theta^{(i)}) \, \pi_k}{\sum_{l=1}^{K} p_{y_n | x_n}(y_n \mid l, \theta^{(i)}) \, \pi_l} \qquad (4.5)
Then the subclass parameters, \theta, are re-estimated [20].
\bar{N}_k = \sum_{n=1}^{N} p_{x_n | y_n}(k \mid y_n, \theta^{(i)}) \qquad (4.6)

\bar{\pi}_k = \frac{\bar{N}_k}{N} \qquad (4.7)

\bar{\mu}_k = \frac{1}{\bar{N}_k} \sum_{n=1}^{N} y_n \, p_{x_n | y_n}(k \mid y_n, \theta^{(i)}) \qquad (4.8)

\bar{R}_k = \frac{1}{\bar{N}_k} \sum_{n=1}^{N} (y_n - \bar{\mu}_k)(y_n - \bar{\mu}_k)^T \, p_{x_n | y_n}(k \mid y_n, \theta^{(i)}) \qquad (4.9)
The event classification (E-step) and re-estimation of the subclass parameters (M-step) repeat until
the change in likelihoods for the events is less than some threshold \epsilon.
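A hypothetical sequential reference version of the E-step and M-step, following Equations 4.4 through 4.9, is sketched below in NumPy. The variable names, the fixed iteration count, and the small covariance regularization term are my own additions; a production version would also evaluate the likelihoods in log space for numerical stability:

```python
import numpy as np

def em_gmm(y, K, iters=50, seed=0):
    """Sketch of EM for a full-covariance Gaussian mixture (Eqs. 4.4-4.9).
    y: (N, M) events; K: number of subclasses."""
    rng = np.random.default_rng(seed)
    N, M = y.shape
    pi = np.full(K, 1.0 / K)                        # subclass priors
    mu = y[rng.choice(N, size=K, replace=False)]    # initial means
    R = np.array([np.cov(y.T) + 1e-6 * np.eye(M) for _ in range(K)])
    for _ in range(iters):
        # E-step: per-subclass Gaussian likelihoods (Equation 4.4)
        lik = np.empty((N, K))
        for k in range(K):
            diff = y - mu[k]
            inv = np.linalg.inv(R[k])
            norm = 1.0 / ((2 * np.pi) ** (M / 2) * np.sqrt(np.linalg.det(R[k])))
            lik[:, k] = norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
        # Soft classification (Equation 4.5)
        resp = lik * pi
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters (Equations 4.6-4.9)
        Nk = resp.sum(axis=0)
        pi = Nk / N
        mu = (resp.T @ y) / Nk[:, None]
        for k in range(K):
            diff = y - mu[k]
            R[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(M)
    return pi, mu, R, resp
```

The doubly-nested sums over events and subclasses in the E-step and M-step are exactly the loops the proposed CUDA implementation maps onto blocks and threads.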
Hierarchical Clustering Stage
The clustering begins with a user-specified number of clusters. The algorithm then performs EM on
the clusters and determines the Gaussian model parameters for each cluster. This involves computing
Equation 4.5 for every event for every subclass and then Equations 4.6, 4.7, 4.8, and 4.9 for each
subclass.
An MDL (Rissanen) score [21] is then calculated using Equation 4.10. The Minimum Description
Length (MDL) principle extends the classical maximum likelihood principle by attempting to describe
the data with the minimum number of binary digits required to represent the data with some
precision [21]. The score serves as an information criterion and helps the unsupervised algorithm
determine the optimal solution (i.e., how many clusters).
MDL(K, \theta) = -\sum_{n=1}^{N} \log\left( \sum_{k=1}^{K} p_{y_n | x_n}(y_n \mid k, \theta) \, \pi_k \right) + \frac{1}{2} L \log(NM) \qquad (4.10)
The algorithm then attempts to combine the two most similar clusters. In this case, similarity
is based upon the Gaussian model parameters. A distance function is computed between all possible
combinations of clusters. The two clusters with the minimum distance (i.e., the most similar) are
combined into a new cluster. The distance function is defined by [20] as:
d(l, m) = \frac{\bar{N}_l}{2} \log\left( \frac{|\bar{R}_{(l,m)}|}{|\bar{R}_l|} \right) + \frac{\bar{N}_m}{2} \log\left( \frac{|\bar{R}_{(l,m)}|}{|\bar{R}_m|} \right)
This process repeats until the data has all been combined into a single cluster. Finally, the
configuration with the minimum Rissanen score is output as the optimal solution. The results are
two-fold. First, there are the statistical parameters, \theta = (\pi, \mu, R), for each Gaussian cluster. Second,
all events have membership values for every cluster. Figure 4.1 summarizes the basic steps of the
clustering procedure.
[Figure 4.1 depicts the clustering procedure as a flowchart: initialize the number of clusters and the cluster parameters; perform a soft classification of each pixel using the cluster parameters; re-estimate the cluster parameters (i.e., mean vectors and covariance matrices); measure goodness of fit using the Rissanen criterion and, if it is the best fit so far, save the result; if only one cluster remains, exit; otherwise reduce the number of clusters by combining the two nearest clusters and repeat.]
Figure 4.1: High-Level Clustering Procedure [20]
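The merge step can be sketched as follows. The pooled estimate used here for R_(l,m), the covariance of the hypothetical merged cluster, is an assumption for illustration:

```python
import numpy as np

def merge_distance(Nl, Nm, Rl, Rm, mul, mum):
    """Cluster-merge distance in the form given by [20]: the cost of
    describing clusters l and m with a single Gaussian. R_(l,m) is taken
    as a standard pooled covariance estimate (an assumption here)."""
    pl, pm = Nl / (Nl + Nm), Nm / (Nl + Nm)
    mu = pl * mul + pm * mum                      # merged mean
    dl, dm = (mul - mu)[:, None], (mum - mu)[:, None]
    # Pooled covariance: within-cluster spread plus between-mean spread
    Rlm = pl * (Rl + dl @ dl.T) + pm * (Rm + dm @ dm.T)
    return (Nl / 2) * np.log(np.linalg.det(Rlm) / np.linalg.det(Rl)) \
         + (Nm / 2) * np.log(np.linalg.det(Rlm) / np.linalg.det(Rm))
```

Two clusters with identical parameters give a distance of zero, and the distance grows as their means or covariances diverge, which is why minimizing it selects the most redundant pair to merge.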
4.1.3 Exhaustive Bivariate
While FLAME [14] and Lo et al. [15] show promise for multivariate gating of flow cytometry data,
multivariate techniques often fall victim to Bellman's so-called curse of dimensionality [22]. The
clustering performance of many algorithms degrades as the dimensionality of the data increases
[23]. Expectation maximization, for example, is well known to fall victim to local minima, and
the likelihood of getting stuck in a local minimum increases with higher dimensionality. The relative
scaling of the dimensions becomes increasingly important as well, since the effect of an important
dimension can be dwarfed by other dimensions.
The following is a novel clustering approach proposed by James S. Cavenaugh that will be explored
as part of this thesis. The idea is a subspace clustering algorithm that performs an exhaustive
bivariate analysis. Rather than clustering the d-dimensional data all at once with a multivariate clustering
technique, it analyzes every bivariate subspace of the data, that is, every
combination of two dimensions of the data. This results in
\binom{d}{2} = \frac{d(d-1)}{2} \qquad (4.11)
bivariate combinations. Every subspace is then clustered. The choice of the bivariate clustering
algorithm is flexible, provided it can produce hard cluster associations for every event (soft
clustering results can also be converted into hard results). For simplicity, the other clustering
algorithms being implemented for this thesis, c-means and EM
with Gaussians, will be used for bivariate clustering. The individual clusterings produce hard-clustering results for every event in every
subspace s_i, as seen in Table 4.1. The numbers in the subspace columns indicate which cluster the
event belongs to in that subspace clustering. The number of clusters found for each subspace
need not be identical.
Table 4.1: Intermediate exhaustive bivariate results

Event | s_1 | s_2 | ... | s_(d choose 2)
------+-----+-----+-----+---------------
  0   |  3  |  4  | ... |       1
  1   |  1  |  3  | ... |       2
 ...  | ... | ... | ... |      ...
  n   |  5  |  2  | ... |       4
After performing all of the individual clusterings, it is necessary to combine the results into
final clusters. As seen in Table 4.1, every event has a vector of cluster memberships. Final clusters
are formed by grouping events that have identical vectors, that is, events that belong to all of the
same clusters in every subspace. If two events exhibit similar responses in all combinations of
the fluorescent dimensions, then they should be grouped into the same clusters in each bivariate
clustering and are likely very similar in a biological sense. The potential also exists for more
domain-expert-oriented grouping. For example, a biologist may be interested in cells that have a similar
response in only a few of the dimensions, in which case the number of subspaces used for grouping
similar cells could be greatly reduced.
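The subspace enumeration and the grouping of events by identical membership vectors can be sketched as follows (the function names and data layout are hypothetical):

```python
from itertools import combinations
from collections import defaultdict

def bivariate_subspaces(d):
    """All C(d, 2) dimension pairs, each to be clustered independently."""
    return list(combinations(range(d), 2))

def exhaustive_bivariate_groups(labels_per_subspace):
    """Group events whose hard cluster labels agree in every subspace.
    labels_per_subspace: dict mapping a dimension pair (a, b) to the list of
    per-event cluster labels obtained by clustering that 2-D projection."""
    subspaces = sorted(labels_per_subspace)
    n = len(labels_per_subspace[subspaces[0]])
    groups = defaultdict(list)
    for event in range(n):
        # The event's row of Table 4.1: its label in every subspace
        signature = tuple(labels_per_subspace[s][event] for s in subspaces)
        groups[signature].append(event)
    return dict(groups)
```

Since each subspace clustering depends only on its own two dimensions, the loop producing `labels_per_subspace` is the embarrassingly parallel portion that can be farmed out to separate GPU-equipped nodes.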
This algorithm has a number of perceived benefits. First, it allows the individual clustering algorithms
to partially escape the curse of dimensionality, since they only cluster 2-dimensional
data. Algorithms can be chosen that are known to work well at clustering bivariate data and that have
lower computational complexity. Many clustering algorithms, like EM, have computational
costs that grow rapidly as the dimension increases, whereas the workload for this algorithm
grows at a rate of d choose 2. In essence, it is a form of dynamic programming [22], breaking a
large problem into many smaller ones. Second, the individual subspace clusterings are embarrassingly
parallel and can be distributed to different computational nodes in a computer cluster or grid.
The only overheads are the initial distribution of the subspace data and the aggregation of results
from all nodes. Unfortunately, the amount of result data generated for this technique is much larger
than for the other multivariate techniques discussed: n x (d choose 2) x 4 bytes instead of n x k x 4 bytes,
where k is the number of clusters. However, it is possible to form a hierarchy of the nodes in the
computing cluster to perform a parallel reduction of the results rather than burdening a master node
with the job of aggregating all of the data by itself. Finally, since it uses bivariate clustering, it is
similar to current-practice techniques in flow cytometry (which, although providing incomplete analyses of
the entire data set, are known to work well), but it makes use of all combinations of dimensions rather
than only a few.
4.2 Data Clustering on GPGPU
The abundant parallelism makes data clustering algorithms a natural choice for implementation
using GPGPU. In 2004, Hall et al. implemented k-means using Cg and achieved a speedup of 3x
versus a contemporary CPU. In 2006, Takizawa et al. implemented
k-means using fragment shaders and Nvidia 6800 GPUs [24]. Although the implementation in [24]
showed a speedup of only 4x relative to a cluster of CPUs without GPUs, it
divided the task among a cluster of GPU-equipped PCs using MPI. These efforts showed
it was possible to implement a data clustering algorithm using a graphics pipeline and achieve
speedup.
The introduction of more advanced GPU architectures and coding frameworks for general-purpose
computing on GPUs allowed for much more significant speedups of data clustering algorithms
on GPUs. In 2008, Che et al. implemented k-means using CUDA and an Nvidia 8800 GTX GPU
with an impressive speedup of 72x [25]; compared against a multi-threaded version running on
a quad-core processor, it still maintained a speedup of 30x.
While the performance results of recent k-means implementations on GPUs and other parallel
architectures are impressive, k-means is an embarrassingly parallel algorithm, and its spherical bias
is poorly suited to analyzing flow cytometry data, where clusters often have very diverse non-spherical
shapes. Outliers can also have a significant impact on the cluster centers produced by
k-means. Despite these shortcomings, k-means is still a de facto standard clustering algorithm used in a
variety of applications that has been implemented on many platforms and parallel architectures, and
thus is a good basis for comparison.
Using a fuzzy version of k-means, where data points have a membership value in every cluster
rather than belonging to only one, can lessen the effect of outliers. It also produces
better results when the specified number of clusters does not match the number of natural clusters
in the data. A hard clustering may attempt to create multiple adjacent, but not overlapping, clusters
inside one natural cluster, whereas a soft clustering is more likely to produce multiple overlapping
clusters with approximately the same center, which more accurately reflects the underlying data.
Therefore this thesis will implement and examine a c-means algorithm (the literature uses c for soft
clustering and k for hard clustering).
Shalom et al. implemented c-means (a fuzzy version of k-means) on an Nvidia 8800 GTX using
the OpenGL Shading Language [26]. Results were impressive, with over 70x speedup on high-dimensional
data. Their implementation also scales well to large numbers of clusters and dimensions.
Our preliminary single-GPU fuzzy c-means implementation achieves over 100x speedup on flow cytometry
data. There are some additional areas for performance improvement, and the implementation
will be extended to use multiple GPUs.
In addition, this thesis will implement an expectation maximization (EM) algorithm with Gaussian
mixture models (GMMs). A recent publication from June 2009 implemented EM with GMMs
using CUDA [27]. Using hardware similar to the aforementioned CUDA implementations of c-means,
they achieved a speedup of 120x for particular data sizes. One limitation of their implementation
is that it uses only diagonal covariance matrices, rather than full covariance matrices,
for the Gaussian mixture models. This simplifies the EM equations (in particular, finding the
determinant and inverse of the covariance matrix becomes trivial) and the data structures required for
the algorithm; however, it does not allow any dimensions to be statistically dependent upon one
another, which may occur in real data. It also does not make use of multiple GPUs, nor does it include any
information criterion for unsupervised assessment of clustering results. Other CUDA applications
have been developed using GMMs, such as anomaly detection in hyperspectral image processing
[28], achieving overall speedup factors of 20x, and over 100x for specific portions of the algorithm.
5. Project Deliverables
Essential Deliverables
The thesis document
Conference or Journal paper
CUDA fuzzy c-means clustering algorithm implementation using multiple GPUs (single-GPU
version implemented by a previous student; can be improved)
Sequential fuzzy c-means clustering algorithm implementation using a single CPU (implemented
by a previous student; requires some modification for performance analysis)
CUDA Gaussian mixture model clustering algorithm implementation using multiple GPUs
Sequential Gaussian mixture model clustering algorithm implementation using a single CPU
(already implemented by Bouman [20], but requires some modifications)
Workflow software for extracting data from FCS files, applying compensation information
to the data (if available), and running various clustering algorithms on said data (started by a
previous student; needs additional work)
Testing suite for profiling the performance of the algorithms and generating results.
Exhaustive bivariate technique making use of the CUDA-enhanced clustering algorithms
Clustering results using synthetic data to assess basic functionality and accuracy of the algorithm
implementations.
Clustering results using real flow cytometry data for both algorithms with a variety of different
parameters.
Wishlist / Reach Deliverables
Multi-core CPU implementations of C-means and Gaussian Mixture Models
Support for additional models in the EM implementation, such as a t mixture or skewed t mixture
(as used in the Pyne et al. [14] and Lo et al. [15] papers)
Integration of results with a database for conducting biological inference and querying data
Visualization
5.1 Performance Analysis
Performance Metrics
Accuracy (of clustering results). Compare the algorithms and how well they cluster both
known statistical data and real flow cytometry data.
Raw performance (FLOPS). Utilization of the peak performance of the GPU architecture.
GPU occupancy and kernel profiling.
Speedup of the computations. Examine how speedup is affected by varying event number,
dimensionality, and the number of clusters.
Speedup of the whole program/workflow (including I/O and extra overhead associated with
the GPU, such as host-to-device memory copying and synchronization). This includes a
detailed profiling of each portion of the application as a percentage of the total computation.
Scalability. How does the number of GPU cores, the number of GPUs, and the number of
CPU+GPU nodes affect speedup?
6. Thesis Outline
1. Abstract
2. Introduction
3. Background
(a) Data Clustering
i. K-means
ii. C-means
iii. Expectation Maximization
(b) Flow Cytometry
(c) GPGPU
4. Supporting Work
(a) Data Clustering for Flow Cytometry
(b) Data Clustering with GPGPU
5. Parallel Algorithms
(a) C-means
(b) EM
(c) Exhaustive Bivariate
6. Flow Cytometry Workflow
(a) FCS data extraction
(b) Data compensation, filtering, transformations
(c) Clustering
(d) Aggregating and storing results
(e) Visualization
(f) Expert Analysis
7. Results
(a) Testing Environment
(b) Performance
i. Speedup
ii. Scalability
A. # Data Points
B. # Dimensions
C. # Clusters
iii. Resource Utilization
A. System / OS
B. GPUs
(c) Clustering - Synthetic Data
(d) Clustering - Real Flow Cytometry Data
8. Conclusions
7. Schedule
October
Write multi-GPU versions of c-means and GMM
Verify functionality with synthetic data
Scripting for thorough performance study
Enhance GUI / Workow software
November
CUDA kernel profiling; tweak algorithms for additional performance gains if possible
Collect data
Analyze data
Clustering results on real FCS data
December
Finish writing thesis document
Prepare conference/journal paper for performance study
Prepare for defense
Defend
8. Required Resources
Multiple CUDA cards with 1GB or more of onboard memory
Multi-core desktop machines
Flow cytometry data
Cluster with CUDA cards (such as Lincoln on the TeraGrid) for the MPI version of the exhaustive
bivariate technique; however, a simulation should be possible on a single node if necessary
Access to ImmPort and FlowJo, competing FC data analysis portals/tools (should be provided
by URMC)
Bibliography
[1] A. K. Jain, M. N. Murty, and P. J. Flynn, Data clustering: a review, ACM Comput. Surv.,
vol. 31, no. 3, pp. 264323, 1999.
[2] G. Gan, C. Ma, and J. Wu, Data Clustering Theory, Algorithms, and Applications, M. T. Wells,
Ed. Society for Industrial and Applied Mathematics, 2007.
[3] NVIDIA. CUDA Zone. [Online]. Available: www.nvidia.com/cuda
[4] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, Nvidia Tesla: A unified graphics and
computing architecture, IEEE Micro, vol. 28, no. 2, pp. 39-55, March-April 2008.
[5] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press,
New York, 1981.
[6] A. P. Reynolds, G. Richards, and V. J. Rayward-Smith, The application of k-medoids and PAM
to the clustering of rules, in Intelligent Data Engineering and Automated Learning. Springer
Berlin, 2004, pp. 173-178.
[7] H. Shapiro, J. Wiley, and W. InterScience, Practical Flow Cytometry. Wiley-Liss, New York,
2003.
[8] Invitrogen, Fluorescence tutorials: Intro to flow cytometry, 2009. [Online]. Available:
http://www.invitrogen.com/site/us/en/home/support/Tutorials.html
[9] T. S. Inc., FlowJo, 2009. [Online]. Available: http://www.flowjo.com/
[10] M. M. Hammer, N. Kotecha, J. M. Irish, G. P. Nolan, and P. O. Krutzik, WebFlow: A software
package for high-throughput analysis of flow cytometry data, ASSAY and Drug Development
Technologies, vol. 7, pp. 44-55, 2009.
[11] M. P. Conrad, A rapid, non-parametric clustering scheme for flow cytometric data, Pattern
Recogn., vol. 20, no. 2, pp. 229-235, 1987.
[12] S. Demers, J. Kim, P. Legendre, and L. Legendre, Analyzing multivariate flow cytometric
data in aquatic sciences, Cytometry, vol. 13, no. 3, 1992.
[13] R. Murphy, Automated identification of subpopulations in flow cytometric list mode data
using cluster analysis, Cytometry, vol. 6, no. 4, pp. 302-309, 1985.
[14] S. Pyne, X. Hu, K. Wang, E. Rossin, T. Lin, L. Maier, C. Baecher-Allan, G. McLachlan,
P. Tamayo, D. Hafler et al., Automated high-dimensional flow cytometric data analysis, Proceedings
of the National Academy of Sciences, vol. 106, no. 21, pp. 8519-8524, 2009.
[15] K. Lo, R. Brinkman, and R. Gottardo, Automated gating of flow cytometry data via robust
model-based clustering, Cytometry Part A: the journal of the International Society for Analytical
Cytology, vol. 73, no. 4, p. 321, 2008.
[16] ImmPort, Immunology database and analysis portal. [Online]. Available: https:
//www.immport.org
[17] F. Hahne, N. Le Meur, R. Brinkman, B. Ellis, P. Haaland, D. Sarkar, J. Spidlen, E. Strain, and
R. Gentleman, flowCore: a Bioconductor package for high throughput flow cytometry, BMC
Bioinformatics, vol. 10, no. 1, p. 106, 2009.
[18] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, Brook
for GPUs: stream computing on graphics hardware, in SIGGRAPH '04: ACM SIGGRAPH
2004 Papers. New York, NY, USA: ACM, 2004, pp. 777-786.
[19] NVIDIA, NVIDIA CUDA Programming Guide, 2nd ed. [Online]. Available: http:
//developer.nvidia.com/object/cuda_2_3_downloads.html
[20] C. A. Bouman, Cluster: An unsupervised algorithm for modeling Gaussian mixtures, April
1997, available from http://www.ece.purdue.edu/~bouman.
[21] J. Rissanen, A universal prior for integers and estimation by minimum description length,
The Annals of Statistics, vol. 11, no. 2, pp. 416-431, 1983.
[22] R. Bellman and S. Dreyfus, Applied dynamic programming. Princeton University Press,
1962.
[23] A. Hinneburg and D. A. Keim, Optimal grid-clustering: Towards breaking the curse of dimensionality
in high-dimensional clustering, in VLDB '99: Proceedings of the 25th International
Conference on Very Large Data Bases. San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 1999, pp. 506-517.
[24] H. Takizawa and H. Kobayashi, Hierarchical parallel processing of large scale data clustering
on a PC cluster with GPU co-processing, J. Supercomput., vol. 36, no. 3, pp. 219-234, 2006.
[25] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, A performance
study of general-purpose applications on graphics processors using CUDA, Journal
of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1370-1380, 2008.
[Online]. Available: http://www.sciencedirect.com/science/article/B6WKJ-4SVV8GS-2/2/
f7a1dccceb63cbbfd25774c6628d8412
[26] S. A. A. Shalom, M. Dash, and M. Tue, Graphics hardware based efficient and scalable fuzzy
c-means clustering, in AusDM, ser. CRPIT, J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy,
Eds., vol. 87. Australian Computer Society, 2008, pp. 179-186.
[27] N. Kumar, S. Satoor, and I. Buck, Fast parallel expectation maximization for Gaussian mixture
models on GPUs using CUDA, in 11th IEEE International Conference on High Performance
Computing and Communications (HPCC '09), 2009, pp. 103-109.
[28] Y. Tarabalka, T. Haavardsholm, I. Kåsen, and T. Skauli, Real-time anomaly detection in
hyperspectral images using multivariate normal mixture models and GPU processing, Journal
of Real-Time Image Processing, vol. 4, no. 3, pp. 287-300, 2009.