
Unifying Clinical Trials and Publications via a Dependent Clustering Approach


Lauren Waldrop
The University of Texas at El Paso
500 West University Avenue
El Paso, Texas 79902
lewaldrop@miners.utep.edu

Wenqing Sun
The University of Texas at El Paso
500 West University Avenue
El Paso, Texas 79902
wsun2@miners.utep.edu

Jessica Rebollosa
The University of Texas at El Paso
500 West University Avenue
El Paso, Texas 79902
rebollosa.jr@gmail.com

Marcus Gutierrez
The University of Texas at El Paso
500 West University Avenue
El Paso, Texas 79902
mgutierrez22@miners.utep.edu

ABSTRACT
Clinicaltrials.gov [1] houses information regarding clinical trials
that are currently underway. In addition to information about
background, purpose, and design of a specific clinical trial, the
webpages also provide links to affiliated papers that can be found
in PubMed [2] (a warehouse for citations in biomedical research).
These links are explicit, but implicit links between clinical trials
and publications more than likely exist. For example, a researcher
may like to know if a given clinical trial is related to more
publications than just the ones listed on the clinical trial webpage.
This relation could be the result of similar key terms embedded
within the clinical trial webpages and PubMed abstracts. By
using a dependent clustering algorithm [3] and a novel approach
using Naive Bayes for heterogeneous datasets, we aim to give
scientists in the biological community insight not only into related
terms, but also clinical trials and/or other publications that may
not have explicit links.

Categories and Subject Descriptors


General Terms
Algorithms, Design, Experimentation

Keywords
Clustering

1. INTRODUCTION
Clinical trials introduce novel techniques that prevent, diagnose, or
treat disease. Publications are often generated as a result of a
clinical trial and reveal key information regarding the approach,
methodology, and results. Clinicaltrials.gov, a database that
maintains public information about ongoing clinical research
studies, lists publications that are affiliated with each clinical
trial. These same publications can be found in PubMed. Relations
between a given clinical trial and its associated publications
already exist; however, there may be undiscovered links between a
given clinical trial and other publications, or even other clinical
trials.

With this study, we hope to use the pre-existing explicit links to
give members of the scientific community additional insight into
implicit links between clinical trials and biological publications.
In addition to implicit links, we hope to reveal similar terminology
shared by linked publications and clinical trials.

2. DATA DESCRIPTION
The data for this project consists of terms found in webpages from
Clinicaltrials.gov and abstracts in PubMed. The data is organized
into two weighted term-document matrices, one for data gathered
from Clinicaltrials.gov and the other for PubMed abstracts. Each
matrix row is associated with a single document, while each column
is associated with a term found within the documents.

2.1 Current Data Preprocessing
For this phase of the project, the frequency of each term is
weighted using term frequency-inverse document frequency, more
commonly known as tf-idf [4]. A term's tf-idf value increases in
proportion to the number of times the term appears within a single
document, but is offset by how often the term appears across all of
the documents examined. Many tf-idf thresholds were explored in the
generation of our datasets. We found that lower thresholds (e.g.,
0.01) tended to produce better results for both k-means and the
dependent clustering algorithm.
In addition to the two weighted term-document matrices, we use a
relations file that records the explicit relationships between
clinical trials and PubMed publications. All three of these files
are used in the dependent clustering algorithm described below.

2.2 Future Data Preprocessing
The results reported in Section 4 make it clear that additional
preprocessing will be needed going forward. More specifically, we
need to clean up the data for both the PubMed abstracts and the
Clinical Trials excerpts. To do this, we propose a two-step
algorithm. The first step concatenates all of the words of each
text file into two files for each dataset: one that contains the
names of the files, which will be used to create the tf-idf
weighted term matrices, and a second that contains just the words,
which will be used to create the list of terms.

In the second step, after the files are generated, punctuation,
numbers, and any unwanted symbols such as #, %, and : will be
removed from the list, and all letters will be converted to lower
case. Next, repeated words will be removed from the list. Our first
approach to lumping together similar terms and getting rid of
unwanted terms involved both ignore lists and WordNet. However, our
datasets still contain quite a bit of noise. Since we are primarily
interested in biological terms, we propose to use the LingPipe
toolkit going forward.
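The cleaning and weighting steps described in this section can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function names, the relative-frequency form of tf, the natural-log idf, and the rule that a term is kept when its largest tf-idf weight reaches the threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def clean_tokens(text):
    """Lowercase the text and keep only alphabetic tokens, which drops
    punctuation, numbers, and symbols such as #, %, and :."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_matrix(docs, threshold=0.01):
    """Return (terms, rows), where rows[i][j] is the tf-idf weight of
    terms[j] in docs[i]. Assumption: a term is kept only if its largest
    weight in any document is at least `threshold`."""
    tokenized = [clean_tokens(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        weights.append({t: (c / total) * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    # Sorting the key set also removes repeated words from the term list.
    terms = sorted(t for t in doc_freq
                   if max(w.get(t, 0.0) for w in weights) >= threshold)
    rows = [[w.get(t, 0.0) for t in terms] for w in weights]
    return terms, rows

terms, rows = tfidf_matrix(["cancer trial", "cancer drug", "cancer model"])
```

In this toy corpus the word "cancer" appears in every document, so its idf is zero and it falls below any positive threshold, while the three document-specific terms are retained.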

3. METHODOLOGY
In this project, we are taking two different but complementary
approaches. The first involves a dependent clustering algorithm,
developed by Dr. M. Shahriar Hossain, which we use to find implicit
relationships and similar terms between the clinical trials dataset
and the PubMed abstracts dataset. The second approach involves the
development of a novel technique for classifying documents from
heterogeneous datasets via Naive Bayes. A short description of both
approaches follows.
The first step of the dependent clustering algorithm is to
separately assign vectors in each of the two datasets to clusters via
k-means. The second step involves preparing contingency tables
based on the clustering results and the pre-existing relationships
between the two datasets. Finally, each contingency table is
evaluated by minimizing a cost function such that
relationships in one cluster of the clinical trials dataset are
exclusive to only a single cluster in the PubMed dataset. These
individual steps are repeated until convergence.
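The contingency-table step can be sketched as follows, assuming the k-means labels for both datasets are already available. The entropy-based cost shown here is only a stand-in chosen for illustration; it is not the cost function of the dependent clustering algorithm in [3]. It shares one property with the description above: it is zero exactly when every clinical trials cluster relates to a single PubMed cluster.

```python
import math

def contingency_table(labels_a, labels_b, relations, k_a, k_b):
    """Count explicit relations between every pair of clusters.
    labels_a[i] and labels_b[j] are cluster ids; relations is a list of
    (i, j) document index pairs taken from the relations file."""
    table = [[0] * k_b for _ in range(k_a)]
    for i, j in relations:
        table[labels_a[i]][labels_b[j]] += 1
    return table

def exclusivity_cost(table):
    """Illustrative cost (an assumption, not the cost from [3]): the
    weighted entropy of each row's distribution over columns."""
    total = sum(sum(row) for row in table)
    cost = 0.0
    for row in table:
        n = sum(row)
        for count in row:
            if count:
                p = count / n
                cost -= (n / total) * p * math.log(p)
    return cost

labels_a = [0, 0, 1, 1]   # hypothetical clinical trials cluster ids
labels_b = [0, 0, 1, 1]   # hypothetical PubMed cluster ids
aligned = contingency_table(labels_a, labels_b,
                            [(0, 0), (1, 1), (2, 2), (3, 3)], 2, 2)
mixed = contingency_table(labels_a, labels_b,
                          [(0, 0), (1, 2), (2, 1), (3, 3)], 2, 2)
```

The `aligned` table has all of its mass on the diagonal and a cost of zero, whereas the `mixed` table spreads each row evenly across both columns and receives the maximum cost for two clusters.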
In the second approach, Naive Bayes classification will be applied
to the heterogeneous data set. This algorithm uses training data to
predict the classes of new data entries. In this context, Naive
Bayes classification will either reinforce existing links or suggest
new links that may better cluster the data and provide insight into
the architecture of the data set. The result may also further
advance the data preprocessing phase by adding previously implied
links that further connect the two data sets.

4. RESULTS
4.1 Term Elimination via Term Variance
In the initial phases of our work, we wanted to see how our data
would cluster using the implementation of dependent clustering
authored by Dr. M. Shahriar Hossain. Using 2, 3, 4, and 5 clusters,
we found that a majority of the documents (over 98%) were
clustering into one group, both before and after dependent
clustering. It is important to note that during this phase, the
datasets used were generated with a TF-IDF cutoff value of 0.3. Only
much later in the data exploration process did it prove beneficial
to decrease this threshold for both datasets, as discussed at the
end of this section.
K-means tends to place roughly the same number of instances in each
cluster. Because our preliminary results departed significantly from
this behavior, we reduced the number of terms in each dataset. To do
this, terms in both the PubMed and Clinical Trials datasets were
eliminated based on the variance of each term. Variance is a good
way to measure the discriminative power of a term. A term with low
variance is either present in almost none of the documents or
present in most or all of them. Thus, terms with low variance have
little discriminatory power.

To rule out terms with low variance and terms with exceptionally
high variance from both datasets, document-term matrices were
generated, the variance of each term was calculated, the terms and
their associated variance values were ranked in decreasing order,
and the results were plotted. Figure 1 shows the resulting plots for
both datasets.

Figure 1. Term Variance Plots for both the PubMed dataset (left)
and the Clinical Trials dataset (right).

Using these plots, a set of minimum and a set of maximum variance
thresholds were established for each dataset. Once the variance
thresholds were determined, k-means was run with different
combinations of minimum and maximum thresholds. The best combination
of thresholds was determined by examining the percentage of
documents in the largest cluster: the combination that reduced this
percentage the most was considered the best. Figure 2 shows the
results of these variance threshold experiments. Figures 2a and 2b
show the results of the experiments using a TF-IDF cutoff of 0.3,
while Figures 2c and 2d show the results using a TF-IDF cutoff of
0.01, for both the PubMed and Clinical Trials datasets. Each
combination of thresholds is represented by an index along the
x-axis (maximum threshold) and the color of a line (minimum
threshold).
From the graphs in Figure 2, it was determined that for the PubMed
dataset, the ideal maximum threshold is 1.5 x 10^-3 and the ideal
minimum threshold is 1.2 x 10^-4. These thresholds reduce the number
of terms in the PubMed dataset from 6,640 to 2,057. For the Clinical
Trials dataset, the ideal maximum threshold is 1.2 x 10^-3 and the
ideal minimum threshold is 5.0 x 10^-5. These thresholds reduce the
number of terms in the Clinical Trials dataset from 7,599 to 4,393.
Figure 3 shows where these thresholds lie on the term variance plots
of Figure 1. It is important to point out that in Figure 2 the
percentage of documents in the largest cluster is lower overall for
the datasets with a TF-IDF threshold of 0.01 than for the datasets
with a TF-IDF threshold of 0.3. Moving forward, datasets with a
TF-IDF threshold of 0.01 will be used for further testing.
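The variance-based term filter used in this section can be sketched in a few lines. This is a minimal illustration rather than the code used for the experiments: the function names are hypothetical, and the use of population variance (dividing by the number of documents) is an assumption. The helper `largest_cluster_fraction` computes the quantity used to compare threshold combinations.

```python
from collections import Counter

def term_variances(matrix):
    """Population variance of each column (term) of a document-term matrix."""
    n = len(matrix)
    variances = []
    for column in zip(*matrix):
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    return variances

def filter_by_variance(matrix, terms, min_var, max_var):
    """Keep only the terms whose variance lies in [min_var, max_var]."""
    variances = term_variances(matrix)
    keep = [j for j, v in enumerate(variances) if min_var <= v <= max_var]
    reduced = [[row[j] for j in keep] for row in matrix]
    return reduced, [terms[j] for j in keep]

def largest_cluster_fraction(labels):
    """Fraction of documents that fall in the most populated cluster."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy matrix: column 0 never varies, column 2 varies far more than column 1.
matrix = [[0.0, 0.5, 0.0],
          [0.0, 0.5, 1.0],
          [0.0, 0.5, 0.0],
          [0.0, 0.4, 1.0]]
reduced, kept = filter_by_variance(matrix, ["a", "b", "c"], 1e-4, 1e-2)
```

Only term "b" survives here: term "a" has zero variance and term "c" exceeds the maximum threshold.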

Figure 2. Experiment to find the optimal minimum and maximum
thresholds for term variance.

Figure 3. Term Variance plots with minimum and maximum thresholds
illustrated with red asterisks.

4.2 Dependent Clustering
Determining the optimal maximum and minimum thresholds allowed us
to reduce the number of terms in both the PubMed and the Clinical
Trials datasets. Dependent clustering was run again to determine
whether feature reduction by variance helped to distribute
documents more equally among clusters.
Histograms of the clustering results were generated before and
after applying dependent clustering to each of the datasets (Fig.
4-7). From the results of dependent clustering, we noticed two
issues. First, the clustering results before dependent clustering
was applied differ from the k-means results we obtained when
running our experiments to find the minimum and maximum thresholds.
For example, with the optimal min/max thresholds for the PubMed
dataset, the largest cluster contained just over 71% of the
documents, while the largest cluster in Figure 4 contains over 93%
of the documents. The same discrepancy exists for the Clinical
Trials clustering (Fig. 5); however, the difference is not as
extreme. We anticipated some difference in percentages, since
k-means selects different starting centroids each time; however, we
do not think that a difference this large should exist. Over the
course of the next several weeks, we will investigate why this
discrepancy exists and attempt several ways to troubleshoot it.

Figure 4. PubMed frequency of documents in each cluster before
dependent clustering was applied (k=5).

Figure 5. Clinical Trials frequency of documents in each cluster
before dependent clustering was applied (k=5).

The second issue concerns the results after dependent clustering
(Fig. 6-7). All of the documents in both datasets were classified
into one large cluster. There are two things we need to investigate
in this case. First, the removal of low-variance terms may have
affected the way the relations file is processed, since we were
changing index values. Second, the large clusters that exist before
dependent clustering is applied may play a role in the result after
clustering. Once we troubleshoot the discrepancy between our
k-means results and the results before dependent clustering, we may
be able to obtain a more reasonable distribution of documents in
each cluster. From there, we can run dependent clustering again to
see how that affects the results after dependent clustering.

Figure 6. PubMed frequency of documents in each cluster after
dependent clustering was applied (k=5).

Figure 7. Clinical Trials frequency of documents in each cluster
after dependent clustering was applied (k=5).

4.3 Additional Data Findings

To better understand why the data does not cluster well, we aimed
to obtain a better visualization of it. Since our data is
high-dimensional, meaning that it has thousands of variables, we
cannot simply plot the original data. In this project, we used the
principal component analysis (PCA) algorithm to reduce the
dimensionality of the original data [6]. PCA projects the data onto
the directions that retain the most variance. Running PCA with two
principal components yields PC1 and PC2, which capture the largest
and second-largest shares of the variance in the original data, so
the dimension of the data is reduced to two. Figure 8 shows the PCA
plots for both the PubMed and Clinical Trials data. In the figures,
we can see no obvious separation of clusters, and most of the data
falls into one large group. We also observed several outliers, which
might explain why dependent clustering gives us one big cluster and
several very small clusters.
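A two-component PCA projection can be sketched without external libraries by using power iteration with deflation. This is an illustrative stand-in, not the implementation used to produce Figure 8: the function names, the fixed starting vector, and the iteration count are assumptions, and power iteration may converge slowly when the two leading variances are close.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(matrix):
    """Subtract the column means so every term has zero mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def leading_component(rows, iterations=200):
    """Approximate the direction of largest variance by power iteration
    on X^T X, without forming the covariance matrix explicitly."""
    dim = len(rows[0])
    vec = [1.0 / (j + 1) for j in range(dim)]  # arbitrary start (assumption)
    for _ in range(iterations):
        proj = [dot(row, vec) for row in rows]
        new = [sum(p * row[j] for p, row in zip(proj, rows))
               for j in range(dim)]
        norm = math.sqrt(dot(new, new))
        if norm == 0.0:
            break
        vec = [x / norm for x in new]
    return vec

def pca_2d(matrix):
    """Return (PC1, PC2) scores for every row of the matrix."""
    rows = center(matrix)
    pc1 = leading_component(rows)
    scores1 = [dot(row, pc1) for row in rows]
    # Deflation: remove the PC1 part of each row before finding PC2.
    deflated = [[x - s * v for x, v in zip(row, pc1)]
                for row, s in zip(rows, scores1)]
    pc2 = leading_component(deflated)
    scores2 = [dot(row, pc2) for row in deflated]
    return list(zip(scores1, scores2))

points = pca_2d([[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

Each returned pair can be plotted directly as a point in the PC1-PC2 plane. The sign of each component is arbitrary, so only the relative positions of the points are meaningful.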

Figure 8. Results of principal component analysis (PCA) applied to
both the PubMed and Clinical Trials datasets.

In addition to using PCA, we plotted our data as heat maps to get a
better picture of what it looks like. In each of the heat maps
(Fig. 9), each row represents a document and each column represents
a term. In both cases, the heat map shows a line, which indicates
that most documents have a unique term that is not present in any
other document. This is another indication of why our data does not
cluster well, and it suggests that we may need to increase the size
of our corpus to obtain better results.

Figure 9. Heat map visualization for both the PubMed and Clinical
Trials datasets.

5. NAIVE BAYES CLASSIFICATION
5.1 Data Input and Naive Bayes
To provide a point of comparison for the results of the dependent
clustering algorithm, Naive Bayes classification will be implemented
to provide insight into the architecture of the heterogeneous data
set. However, a new algorithm based on Naive Bayes classification
had to be developed to accomplish this. Traditional Naive Bayes
classification uses training data to help classify test data, but
it was developed with a single table in mind and cannot support
heterogeneous data. Because of this, a new algorithm had to be
designed that accounts for heterogeneous data and predicts links
between the datasets based on their contents.
The input for the proposed algorithm is two tables that list
documents as rows and term frequencies as columns, and a third table
that states the explicit links between documents in the two tables.
The term frequency tables represent every valuable term in the
corpora, where the value of a term is decided in the preprocessing
phase. Combining the three input tables into one extended table
allows Naive Bayes classification to be applied to the data.

5.2 Document-based Classification


The initial attempt to combine the three input tables involved
creating a data entry for each document link expressed in the third
table. If document C from corpus A links to document 5 in corpus B,
then a single entry is created that lists all of the term
frequencies from document C and gives document 5 as the
classification. This approach would allow new links from documents
in corpus A to documents in corpus B to be predicted. Note that this
algorithm is unidirectional: to predict links starting from
documents in corpus B, the two tables would need to be swapped.
Two major criticisms apply to this approach. First, the terms in
corpus B go unused, which wastes a large amount of information that
could help predict links more accurately. Second, links can be
predicted only to documents in corpus B that are already linked.
Perhaps document C from corpus A would link strongly, based on
content, with document 17 in corpus B, which currently has no links.
In this case, the link could not be predicted, because document 17
was never present in the training phase of the algorithm and so can
never be assigned as a class in the testing phase. A new algorithm
that keeps some of the same principles but addresses both criticisms
had to be considered.
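The document-based extended table, and its second limitation, can be illustrated with a minimal sketch. The data structures and function names are hypothetical: `tf_a` maps each corpus A document to its term frequencies, and `links` lists the explicit (corpus A document, corpus B document) pairs.

```python
def document_based_entries(tf_a, links):
    """One training entry per explicit link: the term frequencies of the
    corpus A document, classed by the linked corpus B document."""
    return [(dict(tf_a[a_id]), b_id) for a_id, b_id in links]

def trainable_classes(entries):
    """The only corpus B documents a classifier trained on these entries
    could ever predict."""
    return {label for _, label in entries}

tf_a = {"C": {"tumor": 2, "therapy": 1}, "D": {"mouse": 1}}
links = [("C", "5"), ("D", "5"), ("D", "6")]
entries = document_based_entries(tf_a, links)
classes = trainable_classes(entries)
```

Because document 17 never appears in `links`, it is absent from `classes`, so no link to it can be predicted regardless of its content.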

5.3 Term-based Classification


The second attempt to combine the three input tables proved more
fruitful. The spirit of the first algorithm remains present in the
new algorithm, minus the flaws. The basic principle is the same:
combine the three input tables into one extended table, but without
discarding the information in corpus B. To do this, for every link,
the new extended table contains one data entry per term occurrence
in the linked document from corpus B, with that term standing as
the class. For instance, if document C from corpus A links to
document 5 of corpus B, and document 5 contains only two occurrences
of the word "cancer" and one occurrence of the word "animal", then
the algorithm creates three data entries, all containing the same
term frequencies from document C; two of the entries are classed as
"cancer" and one is classed as "animal".
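The construction of the term-based extended table can be sketched as follows, reusing the example from the text. The function name and the dictionary-based table layout are assumptions made for illustration.

```python
def term_based_entries(tf_a, tf_b, links):
    """For every explicit link, emit one entry per term occurrence in the
    linked corpus B document. Each entry carries the term frequencies of
    the corpus A document and is classed by the corpus B term."""
    entries = []
    for a_id, b_id in links:
        for term, count in tf_b[b_id].items():
            for _ in range(count):
                entries.append((dict(tf_a[a_id]), term))
    return entries

tf_a = {"C": {"tumor": 3, "therapy": 1}}      # hypothetical contents of document C
tf_b = {"5": {"cancer": 2, "animal": 1}}      # document 5 from the example above
entries = term_based_entries(tf_a, tf_b, [("C", "5")])
labels = [label for _, label in entries]
```

The single link produces three entries with identical features, two classed as "cancer" and one as "animal", matching the example in the text.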

The extended tables generated by this algorithm are far larger than
the extended table from the first algorithm. However, classing by
corpus B terms does not allow one to simply apply Naive Bayes
classification, because the goal is to link documents from corpus A
to documents in corpus B, not to link documents from corpus A to
terms. To manage this, the probabilities of terms in documents from
corpus B, given terms from documents in corpus A, are calculated
from the explicit links. The probability of document 17, given
document C, can then be expressed as a sum, over the terms in
document 17, of the products of the probabilities of each term in
document 17 given each term in document C:

$$P(d_B \mid d_A) = \sum_{i=1}^{n} \Big( P(b_i) \prod_{j=1}^{m} P(b_i \mid a_j) \Big)$$

where $b_1, \dots, b_n$ are the terms in the corpus B document $d_B$
and $a_1, \dots, a_m$ are the terms in the corpus A document $d_A$.

5.4 Term-based Classification Analysis


docs     doc 1    doc 2    doc 3    doc 4    doc 5    doc 6
A        0.0016   0.0016   0.0020   0.0027   0.0018   0.0020
B        0.0113   0.0046   0.0075   0.0136   0.0132   0.0161
C        0.0100   0.0083   0.0060   0.0039   0.0111   0.0035
?        0.0001   0.0001   0.0004   0.0008   0.0001   0.0005
Z        0.0005   0.0005   0.0010   0.0016   0.0005   0.0012

Figure 10. Document conditional probabilities. Documents A-Z are
from corpus A and Documents 1-6 are from corpus B. Entry (doc A,
doc 1) represents P(doc 1 | doc A).

A toy data set was created to test this new algorithm. The toy data
set included terms that appeared in every document, documents
without explicit links, terms that appeared in only one document,
and other variations. This variety was added in the hope that it
would lead to interesting results, and it did: implicit links were
discovered for documents that had no explicit links. Most notably,
document C was most strongly correlated with document 5, even though
no explicit link between them was present. The results can be seen
in Figure 10, with the red squares indicating explicit links in the
training data and higher numbers representing higher correlation.
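One way to turn term-classed training entries into document-to-document scores is sketched below. It substitutes a standard multinomial Naive Bayes with Laplace smoothing for the term probabilities and averages the resulting posteriors over the terms of the corpus B document, so it should be read as an assumption-laden illustration and not as the exact scoring rule behind Figure 10. The smoothing constant `alpha`, the normalization by document length, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(entries):
    """entries: list of (corpus A term counts, corpus B term label)."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for features, label in entries:
        class_counts[label] += 1
        term_counts[label].update(features)
        vocab.update(features)
    return class_counts, term_counts, vocab

def term_posteriors(model, features, alpha=1.0):
    """P(corpus B term | corpus A document) under multinomial Naive Bayes
    with Laplace smoothing (alpha is an assumed default)."""
    class_counts, term_counts, vocab = model
    n_entries = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(term_counts[label].values()) + alpha * len(vocab)
        score = math.log(n / n_entries)
        for term, count in features.items():
            if term in vocab:
                score += count * math.log(
                    (term_counts[label][term] + alpha) / denom)
        log_scores[label] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return {label: math.exp(s - top) / norm
            for label, s in log_scores.items()}

def document_score(model, features_a, counts_b):
    """Average posterior of the corpus B document's terms; terms never
    seen as a class contribute zero."""
    post = term_posteriors(model, features_a)
    total = sum(counts_b.values()) or 1
    return sum(c * post.get(t, 0.0) for t, c in counts_b.items()) / total

entries = [({"tumor": 2}, "cancer"), ({"tumor": 2}, "cancer"),
           ({"tumor": 2}, "animal"), ({"mouse": 1}, "animal")]
model = train(entries)
post = term_posteriors(model, {"tumor": 1})
score_unlinked = document_score(model, {"tumor": 1}, {"cancer": 3})
```

Here `score_unlinked` is positive even though the scored corpus B document took no part in training, which illustrates how classing by term lets documents without explicit links receive a score.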

5.5 Future Additions
Given the results of term-based Naive Bayes classification, the
algorithm appears to work well enough to be applied to the PubMed
and Clinical Trials heterogeneous data set. However, the algorithm
may need minor changes or further preprocessing before it can be
applied appropriately to the data set in question. It is currently
unclear whether weights or coefficients need to be applied to
documents in corpus A or corpus B that have high term frequencies.
For instance, if the word "malignant" appears two times in a
document, it may need to be weighted by more or less than a factor
of two.

Furthermore, if this algorithm is to be used in preprocessing to
establish more links before applying the dependent clustering
algorithm, then the accuracy of the algorithm needs to be ensured.
To do this, the algorithm will be run multiple times with varying
coefficients, normalization, etc., and performance will be measured
after adding suggested links and removing uninformative or
potentially harmful links. We hope that this term-based Naive Bayes
classification algorithm will be a fruitful endeavor and provide
further insight into the data set as a whole.

6. DATA ANALYSIS
To better serve our users in the scientific community, we propose
to reveal some important information inherent to each cluster, such
as key terms, prototypes, and lists of documents.

Displaying key terms for each cluster may reveal significant links
that have not previously been explored. For example, if key words
related to a biological pathway and a drug target show up in the
same cluster but their relationship has never been investigated,
then our users may have found something worth exploring. By
presenting a prototype associated with each cluster, we hope to
give our users a quick and stress-free way to find the cluster that
is most related to their current work. In this way, a researcher
could quickly home in on their specific cluster of interest and
begin investigating the individual components of that cluster.
Finally, once a researcher has found their cluster of interest,
they would be able to explore the documents within that cluster.

7. ACKNOWLEDGMENTS
This work is supported in part by The University of Texas at El
Paso.

8. REFERENCES
[1] ClinicalTrials.gov, A service of the U.S. National Institutes
of Health. Retrieved February 9, 2015 from:
https://clinicaltrials.gov/
[2] PubMed.gov, US National Library of Medicine, National
Institutes of Health. Retrieved February 9, 2015 from:
http://www.ncbi.nlm.nih.gov/pubmed
[3] Hossain, M.S., Tadepalli, S., Watson, L.T., Davidson, I.,
Helm, R.F., & Ramakrishnan, N. (2010, July). Unifying dependent
clustering and disparate clustering for non-homogeneous data. In
Proceedings of the 16th ACM SIGKDD international conference on
knowledge discovery and data mining (pp. 593-602). ACM.
[4] Tf-idf. Wikipedia, the free encyclopedia. Retrieved February
10, 2015 from: http://en.wikipedia.org/wiki/Tf-idf
[5] Cluster Analysis: Basic Concepts and Algorithms. Retrieved
February 11, 2015 from:
http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
[6] Wold, S., Esbensen, K., & Geladi, P. (1987). Principal
component analysis. Chemometrics and intelligent laboratory
systems, 2(1), 37-52.