A. Specific Aims
Many initiatives encourage research data sharing in hopes of increasing research efficiency and quality, but
the effectiveness of these early initiatives is not well understood. Sharing and reusing scientific datasets have
many potential benefits: in addition to providing detail for original analyses, raw data can be used to explore
related or new hypotheses, particularly when combined with other publicly available data sets. Real data is
indispensable when investigating and developing study methods, analysis techniques, and software
implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives,
helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of
funding and patient population resources by avoiding duplicate data collection.
Eager to encourage the realization of such benefits, funders, publishers, societies, and individual research
groups have developed tools, resources, and policies to encourage investigators to make their data publicly
available. Despite these investments of time and money, we do not yet understand the rewards, prevalence or
patterns of data sharing and reuse, the effectiveness of initiatives, or the costs, benefits, and impact of
repurposing biomedical research data.
Studies examining current data sharing behavior would be useful in three ways. First, an estimate of the
prevalence with which data is shared, either voluntarily or under mandate, would provide a valuable baseline
for assessing future adoption and continued intervention. Second, analyses of current behavior will likely
identify subfields (perhaps research areas with a particular disease or organism focus, or those in well funded
research groups) with relatively high prevalence of data sharing; digging into these can illuminate valuable best
practices. Third, the same analyses will likely reveal subareas in which researchers rarely share their research
datasets. Future research could focus on these challenging areas, to understand their unique obstacles for
data sharing and refine future initiatives accordingly. You cannot manage what you do not measure:
understanding the rewards, prevalence, and patterns of data sharing and withholding will facilitate effective
refinement of data sharing initiatives to better address real-world needs.
The long-term goal of this research is to accelerate research progress by increasing effective data reuse
through informed improvement of data sharing and reuse tools and policies. The objective of this proposal is to
examine the feasibility of evaluating data sharing behavior based on examination of the biomedical literature.
The central hypothesis of this proposal is:
Analysis of the impact, prevalence, and patterns with which investigators share and withhold gene
expression microarray research data can uncover rewards, best practices, and opportunities for increased
adoption of data sharing.
To evaluate the central hypothesis, I will perform the specific aims described below.
Expected contributions, taking the form of papers and associated datasets, include:
1. an assessment of the observed and measured rewards, prevalence, and patterns of gene expression
microarray dataset sharing
2. a publicly available dataset associating microarray study publications with data sharing status
3. a generalizable approach for developing practical, real-world natural language tools for information
retrieval and extraction within a wide selection of biomedical literature
4. preliminary models of data sharing behavior
Although limiting this study to one datatype allows an in-depth analysis of many specific facets of data
sharing and reuse, I believe the approach and many of the results will be generalizable across domains.
As further support for the significance of this work, our preliminary work has been enthusiastically welcomed by
peer reviewers; reviews have frequently declared the research to be “relevant and timely.” I believe this
developmental work will provide a strong foundation for refining initiatives to efficiently and effectively
encourage data sharing.
C. Preliminary Studies
In this section, I describe my preliminary results leading to this proposal. They include pilot work for Aims 2
(Section C.1) and 3 (Sections C.2, C.3, C.6), and a few publications that illustrate future application of the
results (Sections C.4, C.5).
C.1 Identifying data sharing in the biomedical literature
In anticipation of Aim 2, I performed preliminary work to assess the feasibility of identifying statements of
data sharing in full-text research articles:
A pilot NLP system has been developed and validated for identifying data sharing from statements within
article full text. Using regular expression patterns and machine learning algorithms on open access
biomedical literature published in 2006, our system was able to identify 61% of articles with shared
datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower
precision (49%). These results demonstrate the feasibility of using an NLP approach to automatically
identify instances of data sharing from biomedical full text research articles.[114]
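As an illustration of the regular-expression stage of such a classifier, here is a minimal Python sketch; the patterns below are hypothetical examples for exposition, not the patterns actually used in [114]:

```python
import re

# Hypothetical patterns for statements of data sharing in article full text;
# the actual patterns and the machine-learning stage in [114] differ.
SHARING_PATTERNS = [
    # deposit statements naming a major microarray database
    re.compile(r"deposited (in|into|at) (the )?(GEO|Gene Expression Omnibus|ArrayExpress)", re.I),
    # database accession numbers (GEO series/dataset, ArrayExpress)
    re.compile(r"accession (number|no\.?)\s*(GSE|GDS|E-[A-Z]+-)\d+", re.I),
    # generic availability statements
    re.compile(r"(data|datasets?) (is|are|were) (publicly |freely )?available", re.I),
]

def mentions_data_sharing(full_text: str) -> bool:
    """Flag an article if any sharing pattern matches its full text."""
    return any(p.search(full_text) for p in SHARING_PATTERNS)

print(mentions_data_sharing(
    "Raw data were deposited in the Gene Expression Omnibus "
    "under accession number GSE1234."))  # True
```

In the published pilot, pattern matches of this kind were combined with machine-learning classifiers to reach the precision and recall figures quoted above.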
I extended this work to investigate the feasibility of retrieval through established query interfaces:
In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby
enabling the querying of all reports in PubMed Central. We trained machine learning tree and rule-based
classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from
NCBI’s Gene Expression Omnibus (GEO) database submission records as the binary output class. We
manually combined and simplified the classifier trees and rules to create a query compatible with the
interface for PubMed Central. The query identified 40% of non-OA articles with dataset submission links
from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged
to include statements of dataset deposit despite having no link from the database (applicable precision).
[115]
I conclude that such approaches allow identification of articles with shared data sets with promising levels of
precision and recall. However, I suspect that for the goal of this proposal, identifying data sharing through
database links may be sufficient and preferable. Nonetheless, the experience gained in developing these
classifiers will be valuable in developing NLP classifiers to identify dataset creation as part of Aim 2.
C.2 Preliminary analysis of prevalence and patterns of microarray data sharing
I have conducted preliminary work in assessing the prevalence and patterns of data sharing. This work
confirms the feasibility of our approach and suggests some interesting findings. However, as preliminary work
it has a major limitation: the article cohort was not filtered to only include articles that create data, and thus the
results may be biased by reuse studies. Our current proposal addresses this limitation in Aim 2, and proposes
a wider and deeper analysis in Aim 3.
We assessed the prevalence and patterns of dataset sharing, using only links from within the GEO or
ArrayExpress database[116]. Of 2503 articles about gene expression microarrays, we found that 440
(18%) had primary-citation data source links from a major microarray database, suggesting that the authors
of these papers shared their microarray data. Interestingly, studies with free full text at PubMed were twice
(OR=2.1) as likely to be linked as a data source as those without free full text, as illustrated in Figure 3.
Studies with human data were less likely to have a link (OR=0.8) than studies with only non-human data.
The proportion of articles identified as a data source has increased over time: the odds of a data-source
link were 2.5 times greater for studies published in 2006 than for those published in 2002. As might be expected,
studies with the fewest funding sources had the fewest data-sharing links: only 28 (6%) of the 433 studies
with no funding source were listed within the databases. In contrast, studies funded by the NIH, the US
government, or a non-US government source had data-sharing links in 282 of 1556 cases (18%), while
studies funded by two or more of these mechanisms were listed in the databases in 130 out of 514 cases
(25%).
Aim 1
The dataset used for Aim 1 consists of all cancer gene expression microarray articles identified in the 2003
systematic review by Ntzani and Ioannidis[119]. Data sharing status was found through manual investigation
of the research articles, predominant gene expression databases, and Google.
Aim 2 and 3
I would ideally measure prevalence and patterns within a comprehensive set of articles that generated
microarray data, manually annotated with data sharing status. Unfortunately, the method for Aim 1 is not
feasible: no systematic review covers all published microarray articles, and the manual approach for
identifying data sharing that was used in Aim 1 is too time consuming. Instead, I propose to develop, evaluate,
and use automated methods to create a large annotated corpus on data sharing behavior.
Reference standard for annotations
Fortuitously, a recently published letter to the editor provides useful independent reference-standard
annotations of microarray dataset creation and sharing[8]. The authors, Ochsner et al, manually reviewed all
eligible articles published in 20 journals in 2007, and annotated each article for whether it produced original
gene expression microarray data, and whether there was evidence that they shared this data in a database or
on a website. Ochsner et al found almost 400 eligible studies, of which almost 200 had evidence of shared
microarray data. Ochsner et al made their own review dataset available: their initial query, plus the PubMed
IDs for the 400 articles that they considered to have generated microarray data, and links to all identified
microarray datasets. I propose to use this dataset as a reference standard for evaluating the performance of
automated annotation.
Specifically, I propose to assemble a large set of gene expression microarray articles by querying the full text
of research articles for indications that the study produced gene expression microarray data, and verify the
precision and recall of this automatic identification using the Ochsner annotations. I will then use a
combination of database links and full-text queries to automatically identify the data sharing status of each
article, and again use the Ochsner study to ensure that the automatic identification of data sharing status is of
sufficient accuracy.
Full text
Access to the full text of a research article is needed for both of our annotations: whether a study has performed a
particular wet lab experiment, and also whether the authors declare that they shared their research data.
Queries of abstracts or MeSH terms, for example, have inadequate recall, retrieving only about 30% and 60%,
respectively, of all articles known to have gene expression data deposited in GEO (Table 1). In contrast, a full
text filter retrieves 96% of all articles known to have data deposited in GEO. Admittedly, the precision is likely
extremely low for the simple query presented in Table 1, but a more refined query should be able to
maintain relatively high recall while improving precision.
Table 1: Poor recall of abstract and MeSH filters for identifying papers with shared microarray data

Literature subset: PMC articles published in 2007 linked from GEO datasets (reference set)
  PubMed Central query: pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
  Number of articles: 550

Literature subset: a filter of abstracts and titles
  PubMed Central query: (gene[Title] OR gene[Abstract]) AND (expression[title] OR expression[abstract]) AND ((microarrays[title] OR microarrays[abstract]) OR (microarray[title] OR microarray[abstract])) AND pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
  Number of articles: 175; Recall: 175/550 = 32%

Literature subset: a MeSH filter
  PubMed Central query: ("microarray analysis"[mesh] OR "gene expression profiling"[mesh]) AND pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
  Number of articles: 335; Recall: 335/550 = 61%
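The recall figures in Table 1 are simple ratios against the 550-article reference set; for concreteness:

```python
# Recall of each restricted filter in Table 1, measured against the
# 550 PMC articles from 2007 with GEO dataset links (the reference set).
reference_count = 550
retrieved = {
    "abstract/title filter": 175,
    "MeSH filter": 335,
}
recalls = {name: n / reference_count for name, n in retrieved.items()}
for name, r in recalls.items():
    print(f"{name}: recall = {r:.0%}")
```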
The relationship between all published articles, those I will use for the study of prevalence and model building,
and those I will use for query development is shown in Figure 4. Estimates for the relative sizes of the subsets
are given in Table 5 in the Appendix.
Figure 4: Relationship between all PubMed articles and those included in study. Legend for proposed
corpora is given in Table 2.
D.1 Aim 1 – Does sharing have benefit for those who share?
Goal: Measure the association between an article’s publication citation rate and whether its authors made their
gene expression datasets publicly available.
Importance: While the general research community benefits from shared data, much of the burden for sharing
the data falls to the study investigator. Demonstrating a boost in citation rate would be a potentially important
motivator for publication authors. To my knowledge, this is the first study to investigate a relationship between
citation rate and biomedical data availability. This work also serves as preliminary work for measuring sharing
prevalence and patterns.
Dataset and Methods: We examined the citation history of 85 cancer microarray clinical trial publications with
respect to the availability of their data.
Findings: The 48% of trials with publicly available microarray data received 85% of the aggregate citations.
Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations in a linear
regression, independent of journal impact factor, date of publication, and author country of origin.
Limitations: An important limitation of this proposal: associations do not imply causation. The research here
will not be sufficient to conclude that data sharing causes increased citations. It would be possible that both
factors stem from a common cause, such as a high level of research funding. The study is also performed on
a small, relatively homogeneous set of studies.
Status: In fulfillment of my Master's thesis requirement, I have completed and published a study addressing Aim
1:
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased
Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
The complete paper will be included in the final dissertation.
Summary
Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for
the investigator who makes his or her data available. We examined the citation history of 85 cancer microarray
clinical trial publications with respect to the availability of their data. As seen in Table 1, trials published in high
impact journals, prior to 2001, or with US authors were more likely to share their data.
Table 3: Characteristics of eligible trials by data sharing.
Reproduced from [10].
The 48% of trials that shared their data received a total of 5334 citations (85% of aggregate), distributed as
shown in Figure 1.
Research consumes considerable resources from the public trust. As data sharing gets easier and benefits are
demonstrated for the individual investigator, hopefully authors will become more apt to share their study data
and thus maximize its usefulness to society.
Proposed Method
I propose to identify shared data using citation links from GEO, as identified by the PubMed filter
“pubmed_gds[filter]”.
Reference Standard and Query Evaluation
It is important to evaluate the recall of GEO links to ensure it is sufficiently high that my analysis for Aim 3 is
not unacceptably biased due to overlooking valid sharing mechanisms. A response rate of 70% is often
considered sufficient to limit bias in surveys[128-130], so I will adopt the same acceptability criterion.
My proposed method for estimating recall is summarized in Figure 3; using 300 random articles from the
Ochsner et al[8] review as a reference standard, I will calculate:
Recall = the number of articles identified by Ochsner et al as having shared data that are also linked from GEO,
divided by the total number of articles identified by Ochsner et al as having shared data.
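In code, the recall estimate and the 70% acceptability check amount to a simple set intersection; a sketch using hypothetical PubMed IDs rather than the real 300-article Ochsner sample:

```python
# Hypothetical PubMed ID sets; the real calculation would use the 300
# sampled Ochsner et al articles and the GEO citation links.
ochsner_shared = {101, 102, 103, 104, 105}  # judged by Ochsner et al to share data
geo_linked = {102, 103, 105, 999}           # articles with a GEO citation link

recall = len(ochsner_shared & geo_linked) / len(ochsner_shared)
acceptable = recall >= 0.70                 # the survey-response-rate criterion
print(f"recall = {recall:.0%}, acceptable = {acceptable}")  # recall = 60%, acceptable = False
```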
Risks and Contingency Plans
If the query evaluation suggests that GEO links provide a recall less than 70%, I will supplement the
identification of data sharing by using article submission links from ArrayExpress and the Stanford Microarray
Database. If recall is still less than 70%, I will develop and apply NLP filters such as the one developed in
[115] to sacrifice some precision for recall.
Limitations and Assumptions
This approach counts sharing to the predominant centralized database and excludes GEO submissions for
which there is no citation link within the submission entry. Although this will lead to underestimating the
prevalence of data sharing, I do not expect it to bias the estimates of data sharing patterns.
Another limitation is that the gold standard covers only articles that refer to data sharing. It is possible that
datasets may be shared without mention in the article, though my preliminary data suggests this is rare.[10]
D.3 Aim 3 – How often is data shared? What predicts sharing? How can we model sharing
behavior?
Working Goal: Measure current data sharing and withholding behavior, and associate these sharing decisions
with features that may predict or influence an investigator's choice. This will be done through three sub-aims:
Aim 3a: estimate prevalence of data sharing
Aim 3b: assess individual contributions
Aim 3c: investigate multidimensional factors
Importance: Understanding the prevalence and patterns with which datasets are shared is a key step in
evaluating and refining policies that encourage data sharing. To our knowledge, this will be the first extensive
evaluation of observed data sharing behavior in the biomedical literature.
Dataset: As discussed in D.0, the dataset will comprise all articles reachable by full text query within
PubMed Central, Highwire Press, and Scirus (NPG + Elsevier) using the query developed in Aim 2a. Articles
cited from within GEO, plus those found by any supplemental methods added in Aim 2b, will be considered to
have shared data; the rest of the articles will be considered to have withheld data.
• Raw number of data creation articles = Number of articles identified by Data Creation Query on Patterns Cohort
• Precision-adjusted number of data creation articles = Raw number of data creation articles * Precision of Data Creation Query
• Fully-adjusted number of data creation articles = Precision-adjusted number of data creation articles / Recall of Data Creation Query
• Raw prevalence = Number of articles identified by Data Sharing Query on Patterns Cohort / Number of articles identified by Data Creation Query on Patterns Cohort
• Adjusted prevalence = Fully-adjusted number of data sharing articles / Fully-adjusted number of data creation articles
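A small sketch of this adjustment arithmetic, using hypothetical counts and query performance figures; the real precision and recall values would come from the Aim 2 evaluations:

```python
# Hypothetical counts and query performance, to illustrate the adjustment.
raw_creation = 1000   # articles matched by the Data Creation Query
raw_sharing = 200     # articles matched by the Data Sharing Query
creation_precision, creation_recall = 0.90, 0.80
sharing_precision, sharing_recall = 0.85, 0.75

def fully_adjusted(raw, precision, recall):
    """Remove estimated false positives, then add back estimated false negatives."""
    return raw * precision / recall

raw_prevalence = raw_sharing / raw_creation
adj_prevalence = (fully_adjusted(raw_sharing, sharing_precision, sharing_recall)
                  / fully_adjusted(raw_creation, creation_precision, creation_recall))
print(f"raw = {raw_prevalence:.1%}, adjusted = {adj_prevalence:.1%}")
```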
I will compare this estimate to the recent sample by Ochsner et al.[8]. However, because the two samples were
selected with different criteria, I do not necessarily expect the prevalence estimates to be identical.
Aim 3b – Assessing individual contributions
Proposed features
I selected a set of features to collect and analyze, chosen based on how directly each serves as a proxy, how
completely each is available, and ease of collection within the scope of this project. I hypothesize that the
following variables will be associated with an increased prevalence of data sharing:
Author characteristics (for both the first and last authors, separately)
Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
number of prior gene expression publications | more gene expression publications | PubMed with gene expression MeSH filter | author name*, inexact filter
number of prior publications | more prior publications | PubMed | author name*
career citations in PMC | more citations | PubMed with LinkOut to PMC citations | author name*, limited to PMC citations
years since first publication | more years since first publication | PubMed | author name*
published in open access journals before? | the author has published a paper in a gold OA journal | PubMed Central open access filter | author name*
previously reused gene expression datasets from GEO | the author has published a paper in the GEO data reuse catalog | GEO data reuse catalog | author name*, low recall because set is very incomplete
published papers with shared microarray data before | the author has published papers with shared data before | PubMed with GEO datasets filter | misses datasets without GEO links
personally shared data before | the author has shared data before | GEO database submitter list | author name*, misses datasets without GEO links
NIH PI | author has a current NIH grant | NIH CRISP download | author name
gender | the author is female | given name gender database | low recall for non-Western names
*The author name issue involves missing data because the same author can have different names (or different
representations via initials), and different authors can have the same name. I suspect this will occur often;
however, I believe it will not bias the results, since I have no reason to believe it occurs at a different rate
between articles with shared data and those without.
Study characteristics
Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
organism under study | non-human research [11, 116] | MeSH terms |
disease under study | non-cancer research [116] | MeSH terms |
Will also include a human*cancer interaction variable. [116]
Environmental characteristics
Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
year of publication | most recent | MEDLINE |
number of co-authors | more co-authors | MEDLINE |
author country of residence | US address | MEDLINE | only captures corresponding author country
number of funding sources | more funding sources [116] | MeSH terms for funding sources | limited to those mentioned within PubMed
journal prestige | higher impact factor [17] | ISI Journal Citation Reports for 2007 | impact factor changes with time, some journals not indexed
open-access journal | gold open-access | PMC OA list | does not include author-archiving (preprint archiving, or green open access)
research-orientation of university | top 25 NIH-funded university | Carnegie classification reports (http://www.carnegiefoundation.org/classifications), Hendrix medical school data[111] | limited to US
number of datasets submitted by this institution | more datasets | GEO submitter list, organization | limited to GEO
university vs. other type of institution | university | AUTM tech transfer report [131] | limited to US
relative amount of tech transfer from this institution | less tech transfer | AUTM tech transfer report [131] | limited to US
Policy characteristics
Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
journal data sharing policy | policies that require a database accession number | Categorization in [17] | not all journals listed
funder data sharing policy | NIH grants after 2004 | MeSH terms | limited to NIH within the US
To help identify confounders and spurious results related to my measurement methods, I will include several
additional variables: indicators of which query engine (PubMed Central, Highwire Press, and/or Scirus) was
used to find each paper, and flags for when a particular variable was outside the scope of the available data
source (for example, articles with a non-US address are not within the scope of my institution ranking data
source).
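One way to carry these flags through the analysis is an explicit record per article; a hypothetical layout (the field names and the PMID are illustrative, not part of the proposed schema):

```python
# Hypothetical feature record for one article, with query-engine indicators
# and an explicit out-of-scope flag for a US-only variable.
article = {
    "pmid": "17000000",                  # illustrative ID, not a real study
    "found_via_pmc": True,
    "found_via_highwire": False,
    "found_via_scirus": True,
    "institution_rank": None,            # value missing...
    "institution_rank_in_scope": False,  # ...because the address is non-US
}
print(article["institution_rank_in_scope"])
```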
Features outside current scope
It would be valuable to measure the impact of additional variables, some of which are listed below.
Unfortunately, these are difficult to systematically extract and therefore likely outside the scope of the current
study:

characteristics of all/any of the authors:
• age
• training location [11]
• institution location
• trained in informatics, medicine, and/or biology
• received training on data sharing [11]
• have previous positive or negative experiences with data sharing [11]
• involved in commercial activities [11]
• believe that data sharing benefits themselves
• funded by industry [11]
• have any patents
• have a particularly high/low workload
• know how to share data
• have plans to use the data again
• have plans for IP or commercial spinoffs
• degree of social pressure for commercial activities vs. openness at their institution
• believe that data sharing benefits others

characteristics of the environment and study:
• reuses data
• relative funding level
• data sharing required by other funders
• data sharing plan included in study proposal
• data sharing plan funded specifically
• article has been self-archived (green open access)
• attributes of the dataset (trial size, Affymetrix platform, …)
Statistical Analysis
I will compute the univariate odds ratio for each feature to assess the degree to which it is associated with
sharing of the datasets that have been produced. Following the methods of Eysenbach[69], I will use the
nonparametric Wilcoxon Mann-Whitney test for continuous variables that are not normally distributed, and
compare proportions using Fisher's exact test for dichotomous variables and the Freeman-Halton test for
variables with more than two levels.
Finally, I will use multivariate logistic regression to compute the independent association of each variable to the
probability of sharing. I will report the coefficients and 95% confidence intervals.
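As a concrete example of the univariate step, the odds ratio for the funding contrast reported in Section C.2 (282 of 1556 funded studies shared vs. 28 of 433 studies with no listed funding source) can be computed with a Woolf-type 95% confidence interval; a pure-Python sketch:

```python
import math

def odds_ratio(shared_a, total_a, shared_b, total_b):
    """Odds ratio for sharing in group A vs. group B, with a 95% CI
    from the log-odds standard error (Woolf method)."""
    a, b = shared_a, total_a - shared_a   # group A: shared, withheld
    c, d = shared_b, total_b - shared_b   # group B: shared, withheld
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, (lo, hi)

# Funded (NIH/government) vs. no listed funding source, from Section C.2
or_, ci = odds_ratio(282, 1556, 28, 433)
print(f"OR = {or_:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```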
Risks and Contingency Plans
Some of the variables may prove more difficult to extract than expected. In this case, if the particular variable
is not essential, I will defer its analysis to future work.
Limitations and Assumptions
An important limitation of this proposal: associations do not imply causation. The research here will not be
sufficient to conclude, for example, that a policy change associated with increased data sharing will in fact
cause increased sharing. It would be possible that both factors stem from a common cause.
This study has several additional limitations. Although restricting the study to microarray data alone allows an
in-depth analysis of specific facets of data sharing, future work should apply the methodology and lessons
learned to other datatypes to quantify generalizability. The study is limited by the accuracy with which I can
identify dataset creation and data sharing. The study is limited to published articles with queriable full text, and
thus will omit some older articles or those published in more obscure journals. I am not considering datasets
as shared if they are available upon request or published online in another venue than a major database, and
may thereby discount an important and effective sharing mechanism. I will be unable to unambiguously
identify authors, and thus my estimations of previous publishing, data sharing behavior, and grant information
will contain errors. This analysis assumes that the first and last authors are the main decision makers about
whether or not to share datasets. This may not be true. Finally, many variables are US-centric, which erodes
our ability to understand the influences of institutional factors or funding levels, for example, in the rest of the
world (about 46% of microarray papers have author addresses within the US).
… PubMed Central not in Highwire or Elsevier or NPG
  PubMed links for PMC query: (Search gene[text] AND expression[text] AND (microarrays[text] OR microarray[text]) AND ("2007"[EDate] : "2007"[EDate])) NOT ("loprovhighwire"[Filter] OR "loftextnpg"[Filter] OR "loftextes"[Filter])
  Estimated articles: (3866 of 10000) * 14888 = 5756
Totals: 60796
Table 7: Number of articles available through full-text query interfaces across various portals

Literature Database | "gene expression" microarray | "gene expression" microarray hybridiz* | "gene expression" microarray hybridiz* accession | + links to GEO collection through centralized data portal
PubMed (title and abstract) | 2265 | 213 (9.40%) | 2 (0.94%) | 0
Google Scholar (using "hybridized") | 21100 | 4620 (21.90%) | 1870 (40.48%) |
PubMed Central | 3148 | 1851 (58.80%) | 839 (45.33%) | 311 (58.8%)
PubMed Central Open Access | 2063 | 1203 (58.31%) | 542 (45.05%) | 188 (58.3%)
HighWire Press hosted | 7543 | 3601 (47.74%) | 1450 (40.27%) | 61 of 115 (53%; subset of 1450; 30 of 61 in PMC)
HighWire Press subset | 2028 | 1048 (51.68%) | 436 (41.60%) |
Scirus articles (includes Science Direct) | 5153 | 2437 (47.29%) | 916 (37.59%) | 191
1 Bioinformatics Highwire
2 BMC Bioinformatics PMC
3 Cancer Research Highwire
4 BMC Genomics PMC
5 PNAS PMC
6 J Biol Chem Highwire
7 Nucleic Acids Research PMC
8 Oncogene Scirus: NPG
9 Clinical Cancer Research Highwire
10 Physiol Genomics Highwire
41 Faseb J Highwire
42 Gene Scirus: Elsevier
43 Carcinogenesis Highwire
44 Brit J Cancer Scirus: NPG
45 J Neurochem none (Wiley)
46 Dev Biol Scirus: Elsevier
47 Nat Genetics Scirus: NPG
48 Plant Cell Highwire
49 Mol Cancer Ther Highwire
50 J Neurosci Highwire
Science
ArrayExpress 24 6%
SMD 4 1%
journal 11 3%
other 9 2%
Grand Total 397 100%
Table 11: Sample size required for confidence levels by population size
Population size +/- 3% +/- 5% +/- 7.5% +/- 10%
2000 696 323 157 92
3000 788 341 162 94
5000 880 357 165 95
10000 965 370 168 96
20000 1,014 377 169 96
50000 1,045 382 170 96
100000 1,058 383 170 96
From http://www.zoomerang.com/MKT/samplesize-calculator/step3.html and also see
http://www.surveysystem.com/sscalc.htm
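The table values follow the standard sample-size formula for estimating a proportion at 95% confidence (z = 1.96, worst-case p = 0.5) with a finite population correction; a sketch that reproduces them:

```python
import math

def sample_size(population, margin, z=1.96, p=0.5):
    """Required sample size for estimating a proportion at the given
    margin of error, with finite population correction."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2          # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(2000, 0.05))   # 323, matching the table's +/- 5% row
```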