
FOUNDATIONAL STUDIES FOR MEASURING
THE IMPACT, PREVALENCE, AND PATTERNS
OF PUBLICLY SHARING BIOMEDICAL RESEARCH DATA

A. Specific Aims
Many initiatives encourage research data sharing in hopes of increasing research efficiency and quality, but
the effectiveness of these early initiatives is not well understood. Sharing and reusing scientific datasets have
many potential benefits: in addition to providing detail for original analyses, raw data can be used to explore
related or new hypotheses, particularly when combined with other publicly available data sets. Real data is
indispensable when investigating and developing study methods, analysis techniques, and software
implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives,
helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of
funding and patient population resources by avoiding duplicate data collection.
Eager to encourage the realization of such benefits, funders, publishers, societies, and individual research
groups have developed tools, resources, and policies to encourage investigators to make their data publicly
available. Despite these investments of time and money, we do not yet understand the rewards, prevalence or
patterns of data sharing and reuse, the effectiveness of initiatives, or the costs, benefits, and impact of
repurposing biomedical research data.
Studies examining current data sharing behavior would be useful in three ways. First, an estimate of the
prevalence with which data is shared, either voluntarily or under mandate, would provide a valuable baseline
for assessing future adoption and continued intervention. Second, analyses of current behavior will likely
identify subfields (perhaps research areas with a particular disease or organism focus, or those in well funded
research groups) with relatively high prevalence of data sharing; digging into these can illuminate valuable best
practices. Third, the same analyses will likely reveal subareas in which researchers rarely share their research
datasets. Future research could focus on these challenging areas, to understand their unique obstacles for
data sharing and refine future initiatives accordingly. You cannot manage what you do not measure:
understanding the rewards, prevalence, and patterns of data sharing and withholding will facilitate effective
refinement of data sharing initiatives to better address real-world needs.
The long-term goal of this research is to accelerate research progress by increasing effective data reuse
through informed improvement of data sharing and reuse tools and policies. The objective of this proposal is to
examine the feasibility of evaluating data sharing behavior based on examination of the biomedical literature.
The central hypothesis of this proposal is:
Analysis of the impact, prevalence, and patterns with which investigators share and withhold gene
expression microarray research data can uncover rewards, best practices, and opportunities for increased
adoption of data sharing.
To evaluate the central hypothesis, I will perform the following specific aims:

Aim 1: Does sharing have benefit for those who share?


I will investigate the association between sharing raw microarray data and subsequent citation rate of
published studies. I will use datasets generated by a small, relatively homogeneous set of cancer gene
expression microarray clinical trials. Multivariate analysis will be used to statistically control for potential
confounders. The results of Aim 1 provide motivation for Aim 2 and preliminary work for Aim 3.
Note: This study has already been completed.

Aim 2: Can sharing and withholding be systematically measured?


Because the manual methods used to conduct Aim 1 do not scale to larger analyses, I will investigate
automatic methods for measuring data sharing and withholding behavior. First, articles that generate gene
expression microarray data will be identified using NLP on full-text research articles. Second, to assess whether the
authors of these data-generating studies share or withhold their data, I will investigate using database
submission citation links as evidence of data sharing. The results of Aim 2 will be used to generate a dataset
for use in Aim 3.
Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?
First, I will apply the classification systems described in Aim 2 to a wide spectrum of the biomedical literature to
identify articles that have generated gene expression microarray data and, subsequently, which of the articles
that generated data also shared it. Then, for each of the articles, I will collect and analyze features related to
the authors, their institutional and funding environment, the study itself, and the publishing mechanism. I will
use univariate and multivariate statistics to investigate which of these features are associated with dataset
sharing. Finally, I will use exploratory factor analysis to derive a model that could be used to explain data
sharing decisions based on my measured variables.
This proposal describes a new, exploratory, and innovative research project that could radically impact the
adoption of data sharing in biomedical research. Expected contributions for this proposal include (a) an
assessment of the observed and measured rewards, prevalence, and patterns of gene expression microarray
dataset sharing, (b) a large, publicly available dataset associating microarray study publications with data
sharing status, and (c) tools and methods for continued research in this area. This developmental work will
provide a strong foundation for refining initiatives to efficiently and effectively encourage data sharing.

B. Background and Significance


Widespread adoption of the Internet now allows research results to be shared more readily than ever before.
This is true not only for published research reports, but also for the raw research data points that underlie the
reports. Investigators who collect and analyze data can submit their datasets to online databases, post them
on websites, and include them as electronic supplemental information – thereby making the data easy to
examine and reuse by other researchers.
Reusing research data has many benefits for the scientific community. New research hypotheses can be
tested more quickly and inexpensively when duplicate data collection is reduced. Data can be aggregated to
study otherwise-intractable issues, and a more diverse set of scientists can become involved when analysis is
opened beyond those who collected the original data. Ethically, it has long been considered a tenet of
scientific behavior to share results[1], thereby allowing close examination of research conclusions and
facilitating others to build directly on previous work. The ethical position is even stronger when the research
has been funded by public money[2], or the data are donated by patients and so should be used to advance
science by the greatest extent permitted by the donors[3].
However, while the general research community benefits from shared data, much of the burden for sharing
falls on the primary data-producing investigators, the stakeholders whom these advantages benefit only
indirectly. A major cost
is time: the data have to be formatted, documented, and released. Further, it is sometimes complicated to
decide where to best publish data, since supplementary information and laboratory sites are transient[4-6].
Beyond a time investment, releasing data can induce fear. There is a possibility that the original conclusions
may be challenged by a re-analysis, whether due to possible errors in the original study[7], a misunderstanding
or misinterpretation of the data[26], or simply more refined analysis methods. Future data miners might
discover additional relationships in the data, some of which could disrupt the planned research agenda of the
original investigators. Investigators may fear they will be deluged with requests for assistance, or need to
spend time reviewing and possibly rebutting future re-analyses. They might feel that sharing data decreases
their own competitive advantage, whether future publishing opportunities, information trade-in-kind offers with
other labs, or potentially profit-making intellectual property. Finally, it can be complicated to release data. If not
well-managed, data can become disorganized and lost. Some informed consent agreements may not
obviously cover subsequent uses of data. De-identification can be complex. Study sponsors, particularly from
industry, may not agree to release raw detailed information. Data sources may be copyrighted such that the
data subsets cannot be freely shared.
Recognizing that these disincentives make the establishment of a voluntary data sharing culture unlikely
without policy guidance, many initiatives actively encourage or require that investigators make their raw data
available for other researchers. There is a well-known adage: you cannot manage what you do not measure.
For those with a goal of promoting responsible data sharing, it would be helpful to evaluate the effectiveness of
requirements, recommendations, and tools. When data sharing is voluntary, insights could be gained by
learning which datasets are shared, on what topics, by whom, and in what locations. When policies make data
sharing mandatory, monitoring is useful to understand compliance and unexpected consequences.
Unfortunately, it is difficult to monitor data sharing because data can be shared in so many different ways.
Previous assessments of data sharing have included manual curation[8-10] and investigator self-reporting[11].
These methods are only able to identify instances of data sharing and data withholding in a limited number of
cases, and therefore are unable to support inquiry into patterns of data sharing behavior.
This proposal addresses three phases of research critical to a full evaluation of data sharing behavior:
Aim 1: Does sharing have benefit for those who share?
Aim 2: Can sharing and withholding be systematically measured?
Aim 3: How often is data shared? What predicts sharing? How can we model sharing behavior?

B.1 Significance of this proposal


This proposal describes a new, exploratory, and innovative research project that could radically impact the
adoption of data sharing in biomedical research.
This work directly supports NIH strategic initiatives. This work will provide a foundation for evaluating the
effectiveness of the NIH’s data sharing policies. Further, the current proposal will contribute to the NLM’s
goals of “contributing to comprehensive strategies for preservation of biomedical information in the US and
worldwide”[12] and “developing linked databases for discovering relationships between clinical data, genetic
information, and environmental factors”[13]. This work would also be relevant for the NCRR in its work to
“facilitate information sharing among biomedical researchers” as part of its current strategic plan[14].

Progress will also be of significant value to a broad cross-section of disciplines, including:


• Funders, policy makers and thought leaders. Although some results of this analysis may be intuitive
(a stronger journal data sharing policy results in more data sharing, or shared data permits reuse and
thus supports a higher citation rate), these relationships have not yet been demonstrated. Concrete,
supporting – or contradictory! – evidence will be of value to a wide spectrum of policy makers and
thought leaders. For example, funders can improve the impact and efficiency of their investments
through applying lessons learned from communities with high data sharing adoption to those with room
for improvement.
• Database, software, and data standard developers. The usage patterns of those who share data
provide critical requirement specification feedback for developing and refining databases, software, and
standards to support data sharing and reuse. Learning who does not currently share data can provide
insight into failings of current tools and opportunities for improvements.
• Biomedical informatics community. Informatics involves evaluation of the generation, use, and
value of information resources; this research addresses this topic from a novel perspective. The
biomedical informatics field will also benefit by exposure to methods it does not commonly apply. For
example, my plans to apply NLP techniques to the biomedical literature through full-text portals could
have wide applicability for information retrieval. Finally, the general biomedical informatics community
will benefit if and when this research leads to initiatives that increase the rate of data sharing.
• Information science and digital library community. Data use behavior and resource usage metrics
are active research topics in information science and digital library research. Several ongoing projects
are investigating data use in the social sciences, but there has been little recent measurement of research
data sharing in the biomedical arena. Furthermore, most of the information
science studies have used survey approaches; my emphasis on measured variables will provide a
diverse perspective.
• Open Science community. Grassroots movements to increase openness and transparency in
science will benefit from rigorous, quantitative assessments of current data sharing behavior.
• Primary Investigators. Last but not least, I expect that this research will help inspire investigators to
share their data and help inform the creation of tools that help them. As data sharing is evaluated and
policies and incentives improved, hopefully investigators will become more apt to share and reuse
study data and thus maximize its usefulness to society.

Expected contributions, taking the form of papers and associated datasets, include:
1. an assessment of the observed and measured rewards, prevalence, and patterns of gene expression
microarray dataset sharing
2. a publicly available dataset associating microarray study publications with data sharing status
3. a generalizable approach for developing practical, real-world natural language tools for information
retrieval and extraction within a wide selection of biomedical literature
4. preliminary models of data sharing behavior
Although limiting this study to one datatype allows an in-depth analysis of many specific facets of data
sharing and reuse, I believe the approach and many of the results will be generalizable across domains.
As further support for the significance of this work, our preliminary work has been enthusiastically welcomed by
peer reviewers; reviews have frequently declared the research to be “relevant and timely.” I believe this
developmental work will provide a strong foundation for refining initiatives to efficiently and effectively
encourage data sharing.

B.2 The potential benefits of data sharing


Sharing information facilitates science. Reusing previously-collected data in new studies allows these valuable
resources to contribute far beyond their original analysis[15]. In addition to being used to confirm original
results, raw data can be used to explore related or new hypotheses, particularly when combined with other
publicly available data sets. Real data is indispensable when investigating and developing study methods,
analysis techniques, and software implementations. The larger scientific community also benefits: sharing data
encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new
researchers, and increases efficient use of funding and patient population resources by avoiding duplicate data
collection.
Believing that these benefits outweigh the costs of sharing research data, many initiatives actively
encourage investigators to make their data available. Some journals require the submission of detailed
biomedical data to publicly available databases as a condition of publication[16, 17]. Since 2003, the NIH has
required a data sharing plan for all large funding grants and has more recently introduced stronger
requirements for genome-wide association studies[18, 19]; other funders have similar policies. Several
government whitepapers[15, 20] and high-profile editorials[21-26] call for responsible data sharing and reuse,
large-scale collaborative science is providing the opportunity to share datasets within and outside of the
original research projects[27, 28], and tools, standards, and databases are developed and maintained to
facilitate data sharing and reuse.

B.3 Current data sharing practice


As highlighted above, sharing research data has many potential benefits to society. Although sharing of data
has always been an aspiration of the scientific enterprise, it has only been common in a few subdisciplines.
Forces are now converging to make it an achievable and everyday practice.
Forces in support of increased data sharing
Datasets are larger than they have ever been – and larger than any single team of scientists can analyze
exhaustively. The ubiquitous sharing and reuse of DNA sequences in Genbank has clearly demonstrated the
power of openly shared data. Other high-throughput hypothesis-generating datasets, such as genome-wide
association studies [29, 30], gene expression microarrays[31], proteomics mass spectra[23], and brain
images[32] allow data to be repurposed to answer multiple research questions. Extensive datasets are also
generated within the clinical setting, particularly through the adoption of electronic health records. Stakeholders
have begun to develop recommendations and guidelines for the complex ethical, legal, and technical issues
surrounding the reuse and sharing of health data beyond primary healthcare[33].
Research is increasingly performed within networks of multi-disciplinary teams. The NIH Roadmap[34] and
other initiatives[28, 35-37] have recognized that significant scientific progress requires collaboration.
Collaborations develop and adopt frameworks, standards, tools, and policies to share data among
investigators. This work can facilitate sharing their data beyond the boundaries of the original research
partners.
Today’s collaborative science on large datasets is performed within an extremely tight biomedical funding
environment. Many funding agencies have instituted data-sharing policies,[21, 38-40] hoping to accelerate
scientific progress while avoiding the cost of duplicative collection efforts. The NIH Data Sharing Policy,
adopted in 2003, requires a data sharing plan for all research grants over $500K[41]. The NIH stipulates
additional requirements for specific domains. For example, all funded genome-wide association studies
(GWAS) are now expected to share their data in the centralized NCBI database, dbGaP[19, 30].
Complementing and extending these funding agency requirements, many biomedical journals require or
recommend that data be shared as a condition of publication[16, 17, 20]. Some journals delineate the
responsibilities in detail and include procedures for addressing data sharing noncompliance[25].
Open, centralized databases such as Genbank, Uniprot, and the Gene Expression Omnibus have evolved into
de facto homes for specific types of data[42]. Standards for minimum data inclusion and data formats have
been developed for many types of datasets. The challenge of integrating datasets has spurred research
progress on ontologies and semantic description methods. Projects such as NCBI’s Entrez database suite [43],
the Semantic Web for Life Sciences [44], the National Center for Biomedical Ontology’s Bioportal framework
[45], and caBIG[36] provide visions for the future of research when data is more universally available and
interoperable.
Data sharing and integration are being actively pursued outside of biomedical research, in other scientific fields
(physics, astronomy, environmental science) and also by the general public[46]. Several websites encourage
uploading and visualizing all sorts of data: the “Tasty Data Goodies” at Swivel (http://www.swivel.com) and
IBM’s Many Eyes (http://www.many-eyes.com) are popular examples. Widespread adoption of Web 2.0
technologies, including blogging, tagging, wikis, and mashups, suggest that our next generation of scientists
will expect and embrace a world of research remixes[46].
Finally, I note the complementary forces of open access and pre-print publications, open notebook science
projects[47], open source code[48], Creative Commons copyright licenses (http://creativecommons.org/) for
many kinds of original content (including data), and two recent public access policies. The NIH Public Access
Policy will require all NIH-funded investigators to submit their peer-reviewed manuscripts to PubMed Central to
ensure public access, beginning in April 2008.[49] In February 2008, the faculty of Harvard University voted to
make all faculty scholarly publications freely available in an online open-access repository[50], the first such
resolution by a university in the United States. While these policies do not apply to data beyond that provided
within the manuscripts, they clearly demonstrate a political will to support sharing research results “to help
advance science and improve human health” (http://publicaccess.nih.gov) and “promote free and open access
to significant, ongoing research”[50].
Forces opposing data sharing
While many forces are converging to enhance our ability to share data, there are significant social,
organizational, technical and legislative factors that may impede them.
Investigators may restrict access to data to maximize the professional and economic benefit that they accrue
from data they generate, even though they also gain advantage by accessing data produced by others.
A recent review of genomic data sharing highlighted the complexity of stakeholder interests both for and
against data sharing[51], beyond those of the original investigators. Study subjects may have personal
interests in privacy and confidentiality that exceed their personal interests in contributing to new methods of
detecting and treating disease. Academic health centers may view data sharing as a threat to intellectual
property, with the potential to impede spin-offs and start-ups that bring revenue and act as incubators for future
research. Industrial sponsorship may hinder plans for sharing data. Changes in the regulatory environment
make the sharing of data more complex, and may necessitate more stringent oversight to ensure compliance
and minimize risk. Finally, limitations imposed by specific technologies undermine the ability of a uniform
approach to generalize across different data types and regulatory requirements.
It is often difficult to effectively incentivize and mandate data sharing. Mandates are often controversial[52-54],
while requests and unenforced mandates are often ignored[55]. The effect of funder policies like the NIH Data
Sharing Policy has not been systematically studied, but anecdotal evidence suggests that many researchers
view funder policies as optional, since data sharing plans are not considered as part of scientific
evaluation, and the mandate is only for a plan, not for the sharing itself. [Personal communication with Jenny
Tucker, pending permission to include]
I believe that a critical element in balancing these opposing forces is a better understanding of current data
sharing behavior, patterns, and predictors to be used for communicating and refining sharing best-practices.
B.4 Related research on data sharing behavior
A few investigations into data sharing behavior and attitudes have initiated work in this area. Findings and
outstanding challenges are highlighted below.
Measuring and modeling data sharing behavior
Most measurements of data sharing prevalence have manually searched for shared datasets across a subset
of journals[8, 9, 55], or systematically contacted authors to ask for shared datasets[56]. These studies have
found that data sharing levels are high (but less than 100%) in a few cases, but overall prevalence is low. For
example, Ochsner et al[8] found that despite the maturity of gene expression microarray data sharing
infrastructure and the multitude of funder and journal mandates, overall data sharing across 20 journals in 2007 was
about 50%.
These analyses have not correlated their prevalence findings with other variables to detect patterns.
Multivariate analyses have relied upon surveyed attitudes and intentions (described below), rather than
measured characteristics.
Measuring and modeling data sharing attitudes and intentions
The largest body of knowledge about motivations and predictors for biomedical data sharing and withholding
comes from Campbell and co-authors. They surveyed researchers, asking whether they have ever requested
data and been denied, or themselves denied other researchers from access to data. Results indicated that
participation in relationships with industry, mentors’ discouragement of data sharing, negative past experience
with data sharing, and male gender were associated with data withholding.[11] In another survey, among
geneticists who said they intentionally withheld data related to their published work, 80% said it was too much
effort to share the data, 64% said they withheld data to protect the ability of a junior team member to publish,
and 53% withheld data to protect their own publishing opportunities.[57]
Occasionally, the administrators of centralized data servers publish feedback surveys of their users. As an
example, Ventura reports a survey of researchers who submitted and reviewed microarray studies in the
Physiological Genomics journal after its mandatory data submission policy had been in place for two years.
Almost all (92%) authors said that they believed depositing microarray data was of value to the scientific
community and about half (55%) were aware of other researchers reusing data from the database.[58]
In related research, the information science and management of information systems communities have
developed several models of knowledge sharing. These models often use either case studies [59] or opinions
and attitudes gathered through validated survey instruments([60-63], and many more). Studied domains
include knowledge sharing within an organization, volunteering knowledge in open social networks, physician
knowledge sharing in hospitals, participation in open source projects, academic contributions to institutional
archives, and other related activities.
Identifying instances of data sharing
While surveys have provided insight into sharing and reuse behavior, other issues are best examined by
studying the demonstrated behavior of scientists. Unfortunately, observed measurement of data sharing behavior is
difficult because of the complexity in identifying all episodes of data sharing and reuse. Although indications of
sharing and reuse usually exist within a published research report, the descriptions are in unstructured free text
and thus complex to extract.
Most studies of data sharing to date have used a manual review to identify shared datasets (e.g. [8, 9, 55]).
One automated approach for identifying data sharing behavior is to follow the “primary citation” field of
database submission entries. Unfortunately, this is imperfect, since these references are often missing when data
is submitted prior to study publication. Populating the submission citation fields retrospectively requires
intensive manual effort, as demonstrated by the recent Protein Data Bank remediation project[64], and thus is
not usually performed. No effective way exists to automatically retrieve and index data housed on personal or
lab websites or journal supplementary information.
Related research has examined the degree to which data remains available after it has been shared. Multiple
studies underscore the transience of supplementary information[5], website URLs[6], and corresponding author
email addresses[65].
Evaluating the impact of data sharing policies
Despite many funder and journal policies requesting and requiring data sharing, the impact of these policies
has only been measured in small and disparate studies. McCain manually categorized the journal “Instruction
to Author” statements in 1995.[16] A more recent manual review of gene sequence papers found that, despite
requirements, up to 15% of articles did not submit their datasets to Genbank[9]. Analyses of reproducibility in
the political science literature suggest that only actively enforced journal policies are effective[55].
Studying the impact of data sharing policies is difficult because policies are often confounded with other
variables. If, for example, impact factor is positively correlated with a strong journal data sharing policy as well
as a large research impact, it is difficult to distinguish the direction of causation. Evaluating data sharing
policies would ideally involve a randomized controlled trial, but unfortunately this is impractical.
In related work, evaluations have been done to estimate the impact of reporting guidelines[66].
Estimating the costs and benefits of data sharing
Estimating the costs and benefits of data sharing would be challenging even with a comprehensive dataset of
occurrences. A complete evaluation would require comparing projects that shared with other similar projects
that did not, across a wide variety of variables including person-hours-till-completion, total project cost,
received citations and their impact, the number and impact of future publications, promotion, success in future
grant proposals, and general recognition and respect in the field.
Pienta[67] is currently investigating these questions with respect to social science research data and
publications. Zimmerman[68] has studied the ways in which ecologists find and validate datasets to overcome
the personal costs and risks of data reuse.
Examining variables for their benefits on research impact is a common theme within the field of bibliometrics.
Research impact is usually approximated by citation metrics, despite their recognized limitations.
Related research fields
Evaluation of data sharing and reuse behavior is related to a number of other active research fields: code
reusability in software engineering, motivation in open source projects, online knowledge sharing communities,
corporate knowledge sharing, tools for collaboration, evaluating research output, the sociological study of
altruism, information retrieval, usage metrics, data standards, the semantic web, open access, and open
notebook science.

B.5 Related research applications of methods


Citation analysis for adoption and impact of open science
Citation analysis has been used to assess several aspects of the adoption and impact of open science,
particularly literature open access. Eysenbach[69] found that authors who chose to make their articles open
access in the Proceedings of the National Academy of Sciences received more citations within the first year
after publication; Wren[70] correlated journal impact factor with the adoption rate of author-shared reprints; and
there are many other examples. Other research has used citations to examine how scientists use each other’s
work[71] and the relative impact of various study designs[72].
Many authors study factors that underlie citation rate; these highlight important factors to include as potential
confounders whenever performing a detailed citation analysis of a new variable[73, 74].
identify the best way to represent various issues such as author ambiguity[75], author productivity[76, 77],
institutional environment[78], journal impact factor[79-83] and clean and comprehensive citation data[84].
Finally, several researchers have proposed methods for citations of data to make studying the issue of reuse
easier in the long run, such as [26] and [85], and examined the extent to which citations are an accurate proxy
for peer ratings of quality [86].
Natural language processing of the biomedical literature
Natural language processing of the biomedical literature is traditionally organized into information retrieval,
entity recognition, information extraction, hypothesis generation, and heterogeneous integration[87]. Most
work has been on abstracts, because they are free, easy to obtain, and in a standardized format from
PubMed. Unfortunately, a great deal of information resides only in article full text. The TREC Genomics
2006/2007 tasks opened up a selection of free text for Information Retrieval research, and the Open Access
subset at PubMed Central is another homogeneous, free, easily obtained subset. Consequently, more research
is beginning to focus on full-text[88].
Most research has focused on the needs of biologists or curators[89], but investigations are beginning into
automated techniques to help find articles for review based on the text[90-93], identification of the
relationships between citing and cited articles[94, 95], and analysis of the methods section to enumerate the
diversity of wet lab method use[96].
Techniques vary depending on the task, but stemming, synonyms, and n-grams are a mainstay[97]. Query
expansion to include all query aspects has also been shown to help[98]. The availability of full text articles in
PMC, Google Scholar, and other portals is spurring new approaches[99].
Finally, NLP techniques applied to clinical text might also be informative. For example, Melton et al[100]
face the issue of identifying records based on snippets of full text, though in their case the target is adverse reactions
in clinical discharge summaries.
Regression and factor analysis for deriving and evaluating models of sharing behavior
Most models of sharing behavior are based on established surveys, and thus evaluate their models using
confirmatory analysis[101-106]. However, a few research projects instead use linear regression, such as [11,
63, 107-109]. Siemsen et al[110] compare a regression model to one derived from constrained factor
analysis. Finally, several studies involve exploratory factor analysis [111-113].

C. Preliminary Studies
In this section, I describe my preliminary results leading to this proposal. They include pilot work for Aims 2
(Section C.1) and 3 (Sections C.2, C.3, C.6), and a few publications that illustrate future application of the
results (Sections C.4, C.5).
C.1 Identifying data sharing in the biomedical literature
In anticipation of Aim 2, I performed preliminary work examining the feasibility of identifying statements of
data sharing in full-text research articles:
A pilot NLP system has been developed and validated for identifying data sharing from statements within
article full text. Using regular expression patterns and machine learning algorithms on open access
biomedical literature published in 2006, our system was able to identify 61% of articles with shared
datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower
precision (49%). These results demonstrate the feasibility of using an NLP approach to automatically
identify instances of data sharing from biomedical full text research articles.[114]
I extended this work to investigate the feasibility of retrieval through established query interfaces:
In this study, we explore the possibility that deep analysis of full text may not be necessary, thereby
enabling the querying of all reports in PubMed Central. We trained machine learning tree and rule-based
classifiers on full-text open-access article unigram vectors, with the existence of a primary citation link from
NCBI’s Gene Expression Omnibus (GEO) database submission records as the binary output class. We
manually combined and simplified the classifier trees and rules to create a query compatible with the
interface for PubMed Central. The query identified 40% of non-OA articles with dataset submission links
from GEO (recall), and 65% of the returned articles without dataset submission links were manually judged
to include statements of dataset deposit despite having no link from the database (applicable precision).
[115]
I conclude that such approaches allow identification of articles with shared data sets with promising levels of
precision and recall. However, I suspect that for the goal of this proposal, identifying data sharing through
database links may be sufficient and preferable. Nonetheless, the experience gained in developing these
classifiers will be valuable in developing NLP classifiers to identify dataset creation as part of Aim 2.
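To make this approach concrete, the short sketch below shows the kind of regular-expression pass with which such a classifier can begin. The patterns and accession formats here are illustrative stand-ins, not the actual rules from [114]:

    import re

    # Illustrative patterns only; the pilot system in [114] combined a larger
    # set of regular expressions with machine learning classifiers.
    SHARING_PATTERNS = [
        re.compile(r"\bdata\s+(?:has|have)\s+been\s+deposited\s+(?:in|into|at)\b", re.I),
        re.compile(r"\bdata\s+(?:was|were)\s+(?:deposited|submitted)\s+(?:in|into|to)\b", re.I),
        re.compile(r"\b(?:GEO|Gene Expression Omnibus)\s+accession\b", re.I),
        re.compile(r"\bGSE\d{3,}\b"),          # GEO series accession numbers
        re.compile(r"\bArrayExpress\b", re.I),
    ]

    def mentions_data_sharing(full_text):
        """Flag an article whose full text contains a data-sharing statement."""
        return any(p.search(full_text) for p in SHARING_PATTERNS)

    snippet = ("The microarray data have been deposited in the Gene Expression "
               "Omnibus (GEO accession GSE1133).")
    print(mentions_data_sharing(snippet))  # True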
C.2 Preliminary analysis of prevalence and patterns of microarray data sharing
I have conducted preliminary work in assessing the prevalence and patterns of data sharing. This work
confirms the feasibility of our approach and suggests some interesting findings. However, as preliminary work
it has a major limitation: the article cohort was not filtered to only include articles that create data, and thus the
results may be biased by reuse studies. Our current proposal addresses this limitation in Aim 2, and proposes
a wider and deeper analysis in Aim 3.
We assessed the prevalence and patterns of Dataset-Sharing, using only links from within the GEO or
ArrayExpress database[116]. Of 2503 articles about gene expression microarrays, we found that 440
(18%) had primary-citation data source links from a major microarray database, suggesting that the authors
of these papers shared their microarray data. Interestingly, studies with free full text at PubMed were twice
(OR=2.1) as likely to be linked as a data source as those without free full text, as illustrated in Figure 1.
Studies with human data were less likely to have a link (OR=0.8) than studies with only non-human data.
The proportion of articles identified as a data source has increased over time: the odds of a data-source
link for studies was 2.5 times greater for studies published in 2006 than 2002. As might be expected,
studies with the fewest funding sources had the fewest data-sharing links: only 28 (6%) of the 433 studies
with no funding source were listed within the databases. In contrast, studies funded by the NIH, the US
government, or a non-US government source had data-sharing links in 282 of 1556 cases (18%), while
studies funded by two or more of these mechanisms were listed in the databases in 130 out of 514 cases
(25%).
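As a concrete illustration of the statistic behind these patterns, the sketch below recomputes an odds ratio with a 95% confidence interval from the funding-source counts quoted above. The Wald interval on the log odds ratio is my assumption here; [116] may have used a different method:

    import math

    def odds_ratio_ci(shared_a, total_a, shared_b, total_b, z=1.96):
        """Odds ratio for group A vs. group B, with a Wald 95% CI, from
        counts of shared vs. not-shared articles in each group."""
        a, b = shared_a, total_a - shared_a
        c, d = shared_b, total_b - shared_b
        or_ = (a / b) / (c / d)
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log odds ratio
        lo = math.exp(math.log(or_) - z * se)
        hi = math.exp(math.log(or_) + z * se)
        return or_, lo, hi

    # Funded (NIH/US/non-US government) vs. no funding source, from above:
    # 282 of 1556 shared vs. 28 of 433 shared.
    print(odds_ratio_ci(282, 1556, 28, 433))  # roughly (3.2, 2.1, 4.8)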

Figure 1: Preliminary data sharing patterns.


Preliminary Indications of Data-Sharing Patterns with 95% confidence intervals. From [116].

C.3 A review of journal policies for sharing research data


To confirm the feasibility of extracting journal data sharing policies from “Instruction to Author” statements, I
conducted a review of journal policies for sharing microarray data. The results and methods from this study will
be directly useful in Aim 3.
We examined the relationship between data sharing behavior and the strictness of a journal’s data sharing
policy.[17] As expected, we found that journals with the strongest data sharing policies had the highest
proportion of papers with shared datasets. As seen in Figure 2, the journals with no data sharing policy, a
weak policy, and a strong policy had a median data sharing prevalence of 8%, 20%, and 25% respectively.
However, this study lacked a method of determining which articles were data producing, and so, these
proportions should be interpreted relative to each other rather than to a theoretical maximum of 100%.

Figure 2: Relative data sharing prevalence by journal policy strength.


A boxplot of the relative data-sharing prevalence for various journals, grouped by the strength of the
journal’s data-sharing policy. For each group, the heavy line indicates the median, the box
encompasses the interquartile range (IQR, 25th to 75th percentiles), the whiskers extend to data
points within 1.5xIQR from the box, and the notches approximate the 95% confidence interval of the
median. From [17]

C.4 Recommendations for best practice initiatives and incentives


In collaboration with others in the Department of Biomedical Informatics and the caBIG DSIC working group, I
recently published a paper highlighting ways in which Academic Health Centers can and should refine their
initiatives and incentives for data sharing:
Piwowar HA, Becich MJ, Bilofsky H, Crowley RS, on behalf of the caBIG Data Sharing and Intellectual
Capital Workspace (2008) Towards a Data Sharing Culture: Recommendations for Leadership from
Academic Health Centers. PLoS Med 5(9): e183 doi:10.1371/journal.pmed.0050183
This paper demonstrates there is an audience for discussion on data sharing policy; the results of the current
proposal will serve to direct and strengthen such perspective pieces in the future.
C.5 Preliminary analysis and vision for the evaluation of data reuse
I have conducted preliminary work in evaluating data reuse.[117, 118] Although studying reuse is outside the
scope of the current proposal, the methods and results from this study will inspire and facilitate future work in
this innovative and important domain.

D. Research Design and Methods


The central hypothesis of this proposal is:
Analysis of the impact, prevalence, and patterns with which investigators share and withhold gene
expression microarray research data can uncover rewards, best-practices, and opportunities for increased
adoption of data sharing.
To evaluate the central hypothesis, I will perform the following specific aims:
Aim 1: Does sharing have benefit for those who share?
Aim 2: Can sharing and withholding be systematically measured?
Aim 3: How often is data shared? What predicts sharing? How can we model data sharing behavior?
The purpose of this proposal is not to assess all data sharing behavior in biomedical research, but rather to
explore three aspects of such an evaluation: (1) measure whether sharing data is associated with a citation
benefit within a small cohort of clinical trials, (2) enable larger-scale future analyses by developing a system to
automatically identify instances of dataset production and sharing, and (3) analyze the instances of dataset
production and sharing for patterns associated with sharing behavior. These results will provide a strong
foundation for future data sharing evaluations.
To enhance my chances of success in this proposal, I will limit the scope of research as follows:
• I will consider these questions within the context of gene expression microarray data. Microarray data
provides a useful environment for investigation: despite being valuable for reuse, costly to collect, and
mature in data standard and repository frameworks, it is often, but not yet universally, shared.
• Shared data will be defined as datasets that have been submitted to major, centralized databases. This
excludes data shared upon request, included as supplementary information, submitted to small databases,
or posted to a lab webpage: finding these resources is beyond the scope of the current project.
• Citation count will serve as a proxy for research impact.
• Studies will be limited to those indexed within PubMed whose English full text can be queried through a
centralized portal (see the Data Sets section below for discussion).
• Analysis of data sharing predictors will be limited to variables that can be automatically derived from
PubMed, article full text, and other database sources. I hope to consider features that require more
manual interpretation and integration in the future.
The methods I will use to complete Aim 1 are fairly simple and are described in Section D.1. Figure 3 illustrates
the method I will use for Aims 2 and 3; details follow in Sections D.0, D.2, and D.3.
Figure 3: Method overview for Aims 2 and 3

D.0 Data Sets


This section describes the datasets I will assemble and use in the course of this project. The use of these
datasets will be explained in more detail in sections D1-D3.

Aim 1
The dataset used for Aim 1 consists of all cancer gene expression microarray articles identified in the 2003
systematic review by Ntzani and Ioannidis[119]. Data sharing status was found through manual investigation
of the research articles, predominant gene expression databases, and Google.

Aim 2 and 3
I would ideally measure prevalence and patterns within a comprehensive set of articles that generated
microarray data, manually annotated with data sharing status. Unfortunately, the method for Aim 1 is not
feasible: no systematic review covers all published microarray articles, and the manual approach for
identifying data sharing that was used in Aim 1 is too time consuming. Instead, I propose to develop, evaluate,
and use automated methods to create a large annotated corpus on data sharing behavior.
Reference standard for annotations
Fortuitously, a recently published letter to the editor provides a useful independent reference standard with
annotations on microarray dataset creation and sharing[8]. The authors, Ochsner et al, manually reviewed all
eligible articles published in 20 journals in 2007, and annotated each article for whether it produced original
gene expression microarray data, and whether there was evidence that they shared this data in a database or
on a website. Ochsner et al found almost 400 eligible studies, of which almost 200 had evidence of shared
microarray data. Ochsner et al made their own review dataset available: their initial query, plus the PubMed
IDs for the 400 articles that they considered to have generated microarray data, and links to all identified
microarray datasets. I propose to use this dataset as a reference standard for evaluating the performance of
automated annotation.
Specifically, I propose to assemble a large set of gene expression microarray articles by querying the full text
of research articles for indications that the study produced gene expression microarray data, and verify the
precision and recall of this automatic identification using the Ochsner annotations. I will then use a
combination of database links and full-text queries to automatically identify the data sharing status of each
article, and again use the Ochsner study to ensure that the automatic identification of data sharing status is of
sufficient accuracy.
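The verification step itself is straightforward. A minimal sketch, assuming the automated query results and the Ochsner reference standard are both available as sets of PubMed IDs (the IDs shown are hypothetical):

    def precision_recall(predicted_ids, reference_ids):
        """Precision and recall of an automatically identified article set
        against a manually annotated reference standard."""
        predicted, reference = set(predicted_ids), set(reference_ids)
        true_positives = predicted & reference
        precision = len(true_positives) / len(predicted) if predicted else 0.0
        recall = len(true_positives) / len(reference) if reference else 0.0
        return precision, recall

    # Hypothetical PubMed IDs, for illustration only
    print(precision_recall({"18276890", "18276891", "18276892"},
                           {"18276891", "18276892", "18276893"}))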
Full text
Access to the full text of a research article is needed for both of our annotations: whether a study has performed a
particular wet lab experiment, and also whether the authors declare that they shared their research data.
Queries of abstracts or MeSH terms, for example, have inadequate recall, retrieving only about 30% and 60%,
respectively, of all articles known to have gene expression data deposited in GEO (Table 1). In contrast, a full
text filter retrieves 96% of all articles known to have data deposited in GEO. Admittedly, the precision is likely
extremely low for the simple query presented in Table 1, but a more refined query should hopefully be able to
maintain relatively high recall while improving precision.
Table 1: Poor recall of abstract and MeSH filters for identifying papers with shared microarray data

Literature subset: PMC articles published in 2007 linked from GEO datasets
PubMed Central query: pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
Number of articles: 550 (reference)

Literature subset: a filter of abstracts and titles
PubMed Central query: (gene[Title] OR gene[Abstract]) AND (expression[title] OR expression[abstract]) AND
((microarrays[title] OR microarrays[abstract]) OR (microarray[title] OR microarray[abstract])) AND
pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
Number of articles: 175; Recall: 175/550 = 32%

Literature subset: a MeSH filter
PubMed Central query: ("microarray analysis"[mesh] OR "gene expression profiling"[mesh]) AND
pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
Number of articles: 335; Recall: 335/550 = 61%

Literature subset: a full-text filter
PubMed Central query: gene[text] AND expression[text] AND (microarrays[text] OR microarray[text]) AND
pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])
Number of articles: 526; Recall: 526/550 = 96%
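The counts in Table 1 can be reproduced programmatically. The sketch below issues the reference query and the full-text filter against PubMed Central through NCBI's E-utilities via Biopython, assuming the pmc_gds[filter] syntax from Table 1 carries over to the esearch endpoint; current counts will differ from the 2007 figures above, and the email address is a placeholder:

    from Bio import Entrez  # Biopython wrapper around NCBI E-utilities

    Entrez.email = "your.name@example.edu"  # NCBI requires a contact address

    BASE = 'pmc_gds[filter] AND ("2007"[EDate] : "2007"[EDate])'
    FULL_TEXT = ('gene[text] AND expression[text] AND '
                 '(microarrays[text] OR microarray[text]) AND ' + BASE)

    def count_hits(query):
        """Return the number of PubMed Central records matching a query."""
        handle = Entrez.esearch(db="pmc", term=query, retmax=0)
        count = int(Entrez.read(handle)["Count"])
        handle.close()
        return count

    denominator = count_hits(BASE)      # articles linked from GEO datasets
    numerator = count_hits(FULL_TEXT)   # ...that also match the full-text filter
    print("recall = %d/%d = %.0f%%" % (numerator, denominator,
                                       100.0 * numerator / denominator))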

Full text retrieval access through portals


Two options exist for querying full text: I either need to download an extensive collection of full text articles and
execute my queries locally, or I can issue online queries through one of several “full text query portal”
environments. Unfortunately, downloading computable full text articles is complex, time consuming, and
difficult to maintain due to licensing restrictions and a lack of access standards. In contrast, issuing a full text
query online is simple, scalable, and the growth of Highwire Press and the flood of NIH research into PubMed
Central suggest that online query is a promising approach for the future. Consequently, I propose to query full
text through online portals.
This choice will have several impacts on query development and results. First, it will limit recall to articles that
are indexed in the full text query portals. I will combine the query results from PubMed Central, Highwire Press
and Scirus: they have adequate coverage, clean data, an interface that allows sufficient filtering, and an output
format that can be aggregated with reasonable manual effort. An analysis of relevant articles in PubMed and
datasets in the Gene Expression Omnibus suggests that these portals have access to more than 85% of the
articles licensed to the University of Pittsburgh library (see Table 5 in the Appendix).
Second, online portals offer a limited interface for queries. Only traditional Boolean queries are accepted, most
portals have a fixed stop-word list, and n-gram “phrase” support appears to be weak in some engines
(probably due to lack of indexing of stop words). This will exclude several common NLP techniques, such as
term weighting and part of speech tagging. That said, accessing full text through these interfaces is an
underserved research area; our experiences will provide a valuable contribution.
Full text computational access for development
I wish to use statistical and lexical NLP techniques to derive queries that have high recall and precision. The
statistical and lexical NLP analysis requires computational access to articles; access through full text portals is
not sufficient for development analysis. I propose to use the following two bundles of full text for development:
• TREC Gen: The Highwire Press TREC Genomics 2006 cohort includes 162259 full text articles[120],
of which at least 4168 include the words gene, expression, and microarray(s).
• Gold OA: The PMC Open Access subset includes 8816 articles published between 2000 and 2006, of
which 3997 include the words gene, expression, and microarray(s) (there is a small amount of overlap
between the two sets).
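A sketch of how the development bundles might be filtered for these terms, assuming each article has been downloaded as a local plain-text file (the directory name is hypothetical):

    import re
    from pathlib import Path

    TERMS = [re.compile(r"\bgene\b", re.I),
             re.compile(r"\bexpression\b", re.I),
             re.compile(r"\bmicroarrays?\b", re.I)]

    def matching_articles(corpus_dir):
        """Yield full-text files mentioning gene, expression, and microarray(s)."""
        for path in Path(corpus_dir).glob("*.txt"):
            text = path.read_text(errors="ignore")
            if all(term.search(text) for term in TERMS):
                yield path

    # Hypothetical local copy of the TREC Genomics 2006 full-text bundle
    print(sum(1 for _ in matching_articles("trec_genomics_2006")))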
Proposed Corpora
This approach will require several corpora for query development and evaluation, as well as the final patterns
study. The various corpora have differing requirements (in terms of annotation, size, and scope). I list the
needed corpora in Table 2 with their requirements, then discuss related issues and decisions.

Table 2: Proposed Corpora

Aim 2a — DCQ Dev (legend for Figure 4): Corpus for developing the microarray Dataset-Creation Query
Requirements: must contain a relatively large number of articles across a wide spectrum of journals; must
consist of articles that can be downloaded in machine-computable text for natural language processing
(NLP) analysis
Corpus and annotations: downloaded full text from the PMC OA + TREC IR corpora; no annotations on most
of it; 100 articles in the Ochsner et al[8] review with data creation annotation

Aim 2a — DCQ Eval: Corpus for evaluating the microarray Dataset-Creation Query
Requirements: must not overlap the DCQ Dev Corpus; must be >= 300 articles (100 to be used for evaluation
during development, 200 to be used for validation of the final query); must be annotated with whether or not
the studies produce their own gene expression microarray data; must have full text available for query
Corpus and annotations: articles that can be retrieved by the Ochsner et al approach (20 journals + PubMed
abstract query for “microarray/s OR genome-wide OR expression profile/s OR transcription
profile/profiling”); annotated for data creation via inclusion within the list of articles by Ochsner et al[8]

Aim 2b — DSQ Dev: Corpus for developing the microarray Dataset Sharing Query
Requirements: must be >= 100 articles; must be annotated with whether or not the studies share their gene
expression microarray data; must have full text available for query
Corpus and annotations: 100 articles in the Ochsner et al[8] review; data sharing annotation

Aim 2b — DSQ Eval: Corpus for evaluating the microarray Dataset Sharing Query
Requirements: must not overlap the DSQ Dev Corpus; must be >= 200 articles; must be annotated with
whether or not the studies share their gene expression microarray data; must have full text available for query
Corpus and annotations: 300 articles in the Ochsner et al[8] review; data sharing annotation

Aim 3 — Patterns: Corpus for estimating Data Sharing Prevalence, Patterns, and Modeling
Requirements: must contain as broad a spectrum of articles as possible; must have full text available for query
Corpus and annotations: articles retrieved from the full-text portals using the data-creation query, partitioned
into those that ARE retrieved by the data-sharing query vs. those that are NOT retrieved by the data-sharing
query

The relationship between all published articles, those I will use for the study of prevalence and model building,
and those I will use for query development is shown in Figure 4. Estimates for the relative sizes of the subsets
are given in Table 5 in the Appendix.

Figure 4: Relationship between all PubMed articles and those included in study. Legend for proposed
corpora is given in Table 2.

D.1 Aim 1 – Does sharing have benefit for those who share?
Goal: Measure the association between an article’s publication citation rate and whether its authors made their
gene expression datasets publicly available.
Importance: While the general research community benefits from shared data, much of the burden for sharing
the data falls to the study investigator. Demonstrating a boost in citation rate would be a potentially important
motivator for publication authors. To my knowledge, this is the first study to investigate a relationship between
citation rate and biomedical data availability. This work also serves as preliminary work for measuring sharing
prevalence and patterns.
Dataset and Methods: We examined the citation history of 85 cancer microarray clinical trial publications with
respect to the availability of their data.
Findings: The 48% of trials with publicly available microarray data received 85% of the aggregate citations.
Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations in a multivariate
linear regression, independent of journal impact factor, date of publication, and author country of origin.
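For illustration, the sketch below shows the shape of this multivariate model in Python. The file and column names are hypothetical placeholders; the published analysis in [10] may differ in detail:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical input: one row per trial, with 2004-2005 citation counts,
    # a 0/1 data-sharing flag, and the covariates controlled for in [10].
    trials = pd.read_csv("trials.csv")
    trials["log_citations"] = np.log(trials["citations_2004_2005"] + 1)

    model = smf.ols("log_citations ~ data_shared + impact_factor"
                    " + pub_year + us_authors", data=trials).fit()
    print(model.summary())
    # 100 * (exp(coefficient on data_shared) - 1) estimates the percent change
    # in citation count associated with publicly available data.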
Limitations: An important limitation of this proposal: associations do not imply causation. The research here
will not be sufficient to conclude that data sharing causes increased citations. It would be possible that both
factors stem from a common cause, such as a high level of research funding. The study is also performed on
a small, relatively homogeneous set of studies.
Status: In fulfillment of my Master thesis requirement, I have completed and published a study addressing Aim
1:
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased
Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
The complete paper will be included in the final dissertation.
Summary
Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for
the investigator who makes his or her data available. We examined the citation history of 85 cancer microarray
clinical trial publications with respect to the availability of their data. As seen in Table 3, trials published in high
impact journals, prior to 2001, or with US authors were more likely to share their data.
Table 3: Characteristics of eligible trials by data sharing.
Reproduced from [10].

The 48% of trials which shared their data received a total of 5334 citations (85% of aggregate), distributed as
shown in Figure 5.

Figure 5: Distribution of 2004-2005 citation counts of 85 trials by data availability.


Reproduced from [10].
Whether a trial's dataset was made publicly available was significantly associated with the log of its 2004–2005
citation rate (69% increase in citation count; 95% confidence interval: 18 to 143%, p = 0.006), independent of
journal impact factor, date of publication, and US authorship. Detailed results of this multivariate linear
regression are given in Table 4. This result held even for lower-profile publications and thus is relevant to
authors of all trials.
Table 4: Multivariate regression on citation count for 85 publications.
Reproduced from [10].

Research consumes considerable resources from the public trust. As data sharing gets easier and benefits are
demonstrated for the individual investigator, hopefully authors will become more apt to share their study data
and thus maximize its usefulness to society.

D.2 Aim 2 – Can sharing and withholding be systematically measured?


Working Goal: Develop and evaluate methods for identifying biomedical research data sharing and
withholding. This will involve two sub-aims, discussed separately below:
Aim 2a: identify studies that create data, and
Aim 2b: identify the subset of these studies that share their data.
For the purposes of Aim 3, articles identified as creating data (under Aim 2a) but not sharing data (under Aim
2b) will be considered to be withholding data.

Aim 2a – Identify studies that create data


Background
Although MeSH terms (“gene expression profiling” OR “oligonucleotide array sequence analysis”) provide a
useful filter for identifying articles about gene expression microarray data, they are not specific enough to find
only those that generate gene expression microarray data. This is true for two reasons. First, these MeSH
terms are sometimes used to annotate papers about other data types, such as RT-PCR or SAGE. My
preliminary attempts to develop a refined MeSH query that excludes these other datatypes have not been
successful. Second, even if the studies are about gene expression microarray data, annotation with these
MeSH terms does not imply that the study generated its own microarray data. The study could, for example,
be reusing the shared data of other researchers. A reuse study would not be in a position to share the raw
microarray data, since its authors are not the primary data generators. I want to exclude such cases from our “dataset creation” corpus to avoid counting them as cases of data withholding.
The information needed to determine whether a study created microarray data is often found only in the article full
(usually in the methods section, but also elsewhere).
Proposed Method
I propose to use statistical and lexical NLP approaches to design a query that can be run against full-text articles via a portal, retrieving articles that have run gene expression microarray experiments. The method is illustrated in Figure 3.
NLP Approaches
I plan to use NLP techniques such as those explored in the preliminary work described in Section C.3 to create
a classifier that identifies articles that have created gene expression microarray datasets. I anticipate that “wet
lab” words such as isolate, hybridize, and probe will be relevant. Statements of data generation, such as "we
generated gene expression" may also have sufficient precision, though may not be practical due to stop word
exclusions within the portals. In general, I plan to investigate the development set for statistically predictive n-
grams as well as develop manual rules.
If this is not sufficient, I can explore approaches that require more manual intervention. For example, a multi-step query approach, in which the portals are queried multiple times, each time with a different string, would allow feature weighting. I could involve MeSH terms as features if necessary. I could also experiment
with additional NLP techniques such as semi-supervised training[121], bootstrapping cue phrases[99],
patterns[122], and regular expressions[123].
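To make the statistical n-gram idea concrete, below is a minimal sketch (assuming scikit-learn and a hand-labeled development set; the texts, labels, and feature settings are placeholders, not the final pipeline) that trains a bag-of-n-grams classifier and lists the most predictive n-grams as candidates for a portable boolean query:

    # A minimal sketch, assuming a hand-labeled development set of methods-section
    # texts (1 = the article created its own microarray data). Not the final pipeline.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["total RNA was isolated and hybridized to arrays",     # hypothetical
             "we reanalyzed publicly available profiles from GEO"]  # examples
    labels = [1, 0]

    pipeline = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
        LogisticRegression(max_iter=1000),
    )
    pipeline.fit(texts, labels)

    # The most positively weighted n-grams ("isolated", "hybridized to", ...)
    # are candidates for a boolean query that the full-text portals can run.
    vocab = pipeline.named_steps["countvectorizer"].get_feature_names_out()
    weights = pipeline.named_steps["logisticregression"].coef_[0]
    print(sorted(zip(weights, vocab), reverse=True)[:10])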
Reference Standard
As discussed in Section D.0, I will use 300 random articles from the Ochsner et al[8] review as a reference
standard for the performance of the query.
Query Evaluation
Recall and precision will be calculated for the query responses given the reference standard.
The contribution of this filter will be assessed by comparing its performance to a baseline filter in which one of
the following words occurs in the article’s full text: isolate*, hybridiz*, or probe*.
Since there are no established performance requirements, I will consider performance adequate if precision is
above 70% and recall is high enough that use of the filter in Aim 3 will result in sufficient datapoints to power
the subsequent analysis. I estimate having about 30 variables in my regression (see Aim 3). Opinions differ
on how many datapoints are needed to adequately power a regression analysis[124]. A rule of thumb for medium effect sizes is about 8 datapoints per variable plus 50 [124], which suggests that roughly 300 articles are necessary
to estimate covariates for 30 variables. Some statisticians suggest that “the cases-to-Independent Variables
(IVs) ratio should ideally be 20:1”[125] or even 30:1 to limit bias[126]. These requirements would suggest a
need for between 600 and 900 data points. Alternatively, a regression power analysis (power = 80%, alpha =
0.05) suggests that about 1250 datapoints are required to detect a small effect size across 30 variables[124]. I
would like to take the conservative case and ensure that the NLP query retrieves at least 1250 articles that
create microarray datasets.
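These competing requirements reduce to simple arithmetic; a quick sketch for 30 variables:

    # Sample-size rules of thumb from the text, for k = 30 regression variables.
    k = 30
    print(8 * k + 50)   # Green's rule for medium effects [124] -> 290 (~300)
    print(20 * k)       # 20:1 cases-to-IVs ratio [125]         -> 600
    print(30 * k)       # 30:1 ratio to limit bias [126]        -> 900
    print(1250)         # power analysis, small effect, power=80%, alpha=0.05 [124]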
Risks and Contingency Plans
The largest technical risk in the research plan is that it may be unexpectedly difficult to automatically identify
dataset production with acceptable precision and recall. In this case, I plan to supplement the automated
classification with manual curation, possibly resulting in a smaller cohort of articles for analysis.
Limitations and Assumptions
This approach assumes that data sharing is accompanied by mention in the text. Our preliminary work
suggests this is usually true[10], but it may miss a small-but-important subset of circumstances where data is
shared after publication.

Aim 2b – Identify studies that share their data


Background
The Gene Expression Omnibus (GEO)[127] has emerged as the dominant centralized repository for sharing
gene expression microarray data, with many journal policies requiring submission to it specifically[17]. It is well
integrated with PubMed query results and contains links from submitted datasets to primary citation reports.
Our preliminary work, reported in Section C.1, suggests that database submission links have high recall for
retrieving articles with data shared in centralized databases. Database submission links have the added
benefits of almost-perfect precision, a wide scope without the need for access to full text, and no bias
introduced through community norms in lexical statements of data sharing within full text. The relationships between articles with shared data and those I will consider to have shared or withheld data for the purposes of this study are illustrated in Figure 6.

Figure 6: Data sharing classifications used in this study

Proposed Method
I propose to identify shared data using citation links from GEO, as identified by the PubMed filter
“pubmed_gds[filter]”.
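As a sketch of how this lookup could be automated against the NCBI E-utilities, here using Biopython's Entrez wrapper (an assumption about tooling; any E-utilities client would serve):

    # A minimal sketch, assuming Biopython; any NCBI E-utilities client would work.
    from Bio import Entrez

    Entrez.email = "researcher@example.edu"  # hypothetical; NCBI requires an email

    query = ('gene[text] AND expression[text] AND '
             '(microarray[text] OR microarrays[text]) AND '
             '"2000"[PDAT] : "2007"[PDAT] AND pubmed_gds[filter]')

    handle = Entrez.esearch(db="pubmed", term=query, retmax=10000)
    record = Entrez.read(handle)
    handle.close()

    # PMIDs with GEO submission links; page with retstart for larger result sets.
    shared_pmids = set(record["IdList"])
    print(record["Count"], "articles with GEO links")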
Reference Standard and Query Evaluation
It is important to evaluate the recall of GEO links to ensure it is sufficiently high that my analysis for Aim 3 is
not unacceptably biased due to overlooking valid sharing mechanisms. A response rate of 70% is often
considered sufficient to limit bias in survey research[128-130], so I will adopt the same acceptability criterion.
My proposed method for estimating recall is summarized in Figure 3; using 300 random articles from the
Ochsner et al[8] review as a reference standard, I will calculate:
Recall = (the number of articles that Ochsner identified as having shared data and that are also linked from GEO) / (the total number of articles that Ochsner identified as having shared data).
Risks and Contingency Plans
If the query evaluation suggests that GEO links provide a recall less than 70%, I will supplement the
identification of data sharing by using article submission links from ArrayExpress and the Stanford Microarray
Database. If recall is still less than 70%, I will develop and apply NLP filters such as the one developed in
[115] to sacrifice some precision for recall.
Limitations and Assumptions
This approach captures only sharing to the predominant centralized database, and misses GEO submissions for which there is no citation link within the submission entry. Although this is unfortunate and will lead to underestimating the prevalence of data sharing, I don't expect it to bias our estimates of data sharing patterns.
Another limitation is that the gold standard covers only articles that mention data sharing in their text. It is possible that datasets may be shared without mention in the article, though my preliminary data suggests this is rare.[10]
D.3 Aim 3 – How often is data shared? What predicts sharing? How can we model sharing
behavior?
Working Goal: Measure current data sharing and withholding behavior, and associate these sharing decisions
with features that may predict or influence an investigator’s choice. This will be done through three sub-aims:
Aim 3a: estimate prevalence of data sharing
Aim 3b: assess individual contributions
Aim 3c: investigate multidimensional factors
Importance: Understanding the prevalence and patterns with which datasets are shared is a key step in
evaluating and refining policies that encourage data sharing. To our knowledge, this will be the first extensive
evaluation of observed data sharing behavior in the biomedical literature.
Dataset: As discussed in D.0, the dataset will comprise all articles reachable by full-text query within
PubMed Central, Highwire Press, and Scirus (NPG + Elsevier) using the query developed in Aim 2a. Articles
cited from within GEO, plus those found by any supplemental methods added in Aim 2b, will be considered to
have shared data; the rest of the articles will be considered to have withheld data.

Aim 3a – Estimating prevalence


Background
Although preliminary work and a recent survey[8] have quantified the prevalence of gene expression data
sharing, this has yet to be done on a large-scale basis, across a wide range of years and journals.
Method
Calculate the prevalence of GEO-link data sharing within our full sample as described in Section D.0, then adjust this raw estimate with the relevant precision and recall values to account for over- and under-estimates in retrieval numbers due to query imprecision:
• Raw number of data sharing articles = Number of articles identified by Data Sharing Query on Patterns
Cohort
• Precision-adjusted number of data sharing articles = Raw number of data sharing articles * Precision of
Data Sharing Query
• Fully-adjusted number of data sharing articles = Precision-adjusted number of data sharing articles /
Recall of Data Sharing Query

• Raw number of data creation articles = Number of articles identified by Data Creation Query on
Patterns Cohort
• Precision-adjusted number of data creation articles = Raw number of data creation articles * Precision of
Data Creation Query
• Fully-adjusted number of data creation articles = Precision-adjusted number of data creation articles /
Recall of Data Creation Query
Raw prevalence = Number of articles identified by Data Sharing Query on Patterns Cohort / Number of articles identified by Data Creation Query on Patterns Cohort
Adjusted prevalence = Fully-adjusted number of data sharing articles / Fully-adjusted number of data creation articles
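In code, the adjustment is a single scaling step; the sketch below uses hypothetical counts and accuracy figures purely for illustration:

    def adjusted(raw, precision, recall):
        """Scale a raw retrieval count down by precision (false positives)
        and up by recall (missed articles), per the formulas above."""
        return raw * precision / recall

    # Hypothetical counts and query accuracies, for illustration only:
    sharing = adjusted(raw=1500, precision=0.95, recall=0.80)
    creation = adjusted(raw=4000, precision=0.75, recall=0.70)

    print(f"raw prevalence: {1500 / 4000:.0%}")              # 38%
    print(f"adjusted prevalence: {sharing / creation:.0%}")  # ~42%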

I will compare this estimate to the recent sample by Ochsner et al.[8]. However, because the samples were selected with different criteria, I do not necessarily expect the prevalence rates to be identical.
Aim 3b – Assessing individual contributions
Proposed features
I selected a set of features to collect and analyze, chosen based on how directly each serves as a proxy for an underlying influence, how completely it is available, and how easily it can be collected within the scope of this project. I hypothesize that the following variables will be associated with an increased prevalence of data sharing:

Author characteristics (for both the first and last authors, separately)

Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
number of prior gene expression publications | more gene expression publications | PubMed with gene expression MeSH filter | author name*, inexact filter
number of prior publications | more prior publications | PubMed | author name*
career citations in PMC | more citations | PubMed with LinkOut to PMC citations | author name*, limited to PMC citations
years since first publication | more years since first publication | PubMed | author name*
published in open access journals before | the author has published a paper in a gold OA journal | PubMed Central open access filter | author name*
previously reused gene expression datasets from GEO | the author has published a paper in the GEO data reuse catalog | GEO data reuse catalog | author name*, low recall because set is very incomplete
published papers with shared microarray data before | the author has published papers with shared data before | PubMed with GEO datasets filter | misses datasets without GEO links
personally shared data before | the author has shared data before | GEO database submitter list | author name*, misses datasets without GEO links
NIH PI | the author has a current NIH grant | NIH CRISP download | author name*
gender | the author is female | given name gender database | low recall for non-Western names

*The author name issue involves missing data, because the same author can have different names (or different representations via initials) and different authors can have the same name. I suspect this will occur often; however, I believe it will not impact the results, since I have no reason to believe it would occur at a different rate between articles with shared data and those without.

Study characteristics

Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
organism under study | non-human research [11, 116] | MeSH terms |
disease under study | non-cancer research [116] | MeSH terms |

I will also include a human*cancer interaction variable. [116]

Environmental characteristics

Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
year of publication | most recent | MEDLINE |
number of co-authors | more co-authors | MEDLINE |
author country of residence | US address | MEDLINE | only captures corresponding author country
number of funding sources | more funding sources [116] | MeSH terms for funding sources | limited to those mentioned within PubMed
journal prestige | higher impact factor [17] | ISI Journal Citation Reports for 2007 | impact factor changes with time; some journals not indexed
open-access journal | gold open access | PMC OA list | does not include author-archiving (preprint archiving, or green open access)
research-orientation of university | top 25 NIH-funded university | Carnegie classification reports (http://www.carnegiefoundation.org/classifications), Hendrix medical school data[111] | limited to US
number of datasets submitted by this institution | more datasets | GEO submitter list, organization | limited to GEO
university vs. other type of institution | university | AUTM tech transfer report [131] | limited to US
relative amount of tech transfer from this institution | less tech transfer | AUTM tech transfer report [131] | limited to US

Policy characteristics

Feature | Hypothesized direction for more probable data sharing | Proposed data source | Limitations
journal data sharing policy | policies that require a database accession number | Categorization in [17] | not all journals listed
funder data sharing policy | NIH grants after 2004 | MeSH terms | limited to NIH within the US

To help identify confounders and spurious results related to my measurement methods, I will include several additional variables identifying which query engine (PubMed Central, Highwire Press, and/or Scirus) was used to find each paper, and variables flagging when data for a particular variable falls outside the scope of an available data source (for example, articles with a non-US address are not within the scope of my institution ranking data source).
Features outside current scope
It would be nice to measure the impact of additional variables. Some are listed below. Unfortunately, these are difficult to systematically extract and therefore likely outside the scope of the current study:

Characteristics of all/any of the authors:
• age
• training location [11]
• institution location
• trained in informatics, medicine, and/or biology
• received training on data sharing [11]
• have previous positive or negative experiences with data sharing [11]
• involved in commercial activities [11]
• believe that data sharing benefits themselves
• funded by industry [11]
• have any patents
• have a particularly high/low workload
• know how to share data
• have plans to use the data again
• have plans for IP or commercial spinoffs
• degree of social pressure for commercial activities vs. openness at their institution
• believe that data sharing benefits others

Characteristics of the environment and study:
• reuses data
• relative funding level
• data sharing required by other funders
• data sharing plan included in study proposal
• data sharing plan funded specifically
• article has been self-archived (green open access)
• attributes of the dataset (trial size, Affymetrix platform, …)

Statistical Analysis
I will compute the univariate odds for each feature to assess the degree to which it is associated with sharing of produced datasets. Following the methods of Eysenbach[69], I will use the nonparametric Wilcoxon Mann-Whitney test for continuous variables that are not normally distributed, Fisher's exact test for comparing proportions of dichotomous variables, and the Freeman-Halton test for variables with more than two levels.
Finally, I will use multivariate logistic regression to compute the independent association of each variable to the
probability of sharing. I will report the coefficients and 95% confidence intervals.
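A minimal sketch of both analysis steps, using scipy and statsmodels on toy data (note the Freeman-Halton r×c exact test is not available in scipy and would require another tool, such as R's fisher.test):

    # A minimal sketch on toy data; `shared` is the outcome, columns are features.
    import pandas as pd
    import statsmodels.api as sm
    from scipy.stats import fisher_exact, mannwhitneyu

    df = pd.DataFrame({                      # hypothetical toy data
        "shared":         [1, 0, 1, 1, 0, 0, 1, 0],
        "num_prior_pubs": [40, 5, 22, 3, 8, 31, 9, 12],
        "us_author":      [1, 0, 1, 1, 1, 0, 0, 0],
    })

    # Univariate: Wilcoxon-Mann-Whitney for a skewed continuous feature ...
    u_stat, p_mw = mannwhitneyu(df.loc[df.shared == 1, "num_prior_pubs"],
                                df.loc[df.shared == 0, "num_prior_pubs"])
    # ... and Fisher's exact test for a dichotomous feature.
    odds_ratio, p_f = fisher_exact(pd.crosstab(df.us_author, df.shared))

    # Multivariate logistic regression: coefficients with 95% CIs.
    X = sm.add_constant(df[["num_prior_pubs", "us_author"]])
    fit = sm.Logit(df.shared, X).fit(disp=0)
    print(fit.params, fit.conf_int(alpha=0.05), sep="\n")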
Risks and Contingency Plans
Some of the variables may prove more difficult to extract than expected. In this case, if the particular variable
is not essential, I will defer its analysis to future work.
Limitations and Assumptions
An important limitation of this proposal: associations do not imply causation. The research here will not be
sufficient to conclude, for example, that a policy change associated with increased data sharing will in fact
cause increased sharing. It would be possible that both factors stem from a common cause.
This study has several additional limitations. Although restricting the study to only microarray data allows an in-
depth analysis of specific facets of data sharing, future work should apply the methodology and lessons
learned to other datatypes to quantify generalizability. The study is limited by the accuracy with which I can
identify dataset creation and data sharing. The study is limited to published articles with queriable full text, and
thus will omit some older articles or those published in more obscure journals. I am not considering datasets
as shared if they are available upon request or published online in a venue other than a major database, and
may thereby discount an important and effective sharing mechanism. I will be unable to unambiguously
identify authors, and thus my estimations of previous publishing, data sharing behavior, and grant information
will contain errors. This analysis assumes that the first and last authors are the main decision makers about
whether or not to share datasets. This may not be true. Finally, many variables are US-centric, which erodes
our ability to understand the influences of institutional factors or funding levels, for example, in the rest of the
world (about 46% of microarray papers have author addresses within the US).

Aim 3c – Factor model of data sharing behavior


Background and Method
As the last component of this project, I intend to use the data collected in Aim 3b to derive a model of data
sharing behavior using exploratory factor analysis. Many models have been explored for data sharing attitudes
and intentions (highlighted in Section B.4), but to my knowledge this will be the first model to explain
demonstrated data sharing actions from observed variables. As such, an exploratory factor analysis is
appropriate.
I plan to use standard techniques to assess how many factors are appropriate based on the data, and attempt a general interpretation of the resulting factors (following, for example, Kolekofski et al’s exploratory factor analysis of information sharing within an organization[113]).
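As a sketch of the intended workflow, here using scikit-learn's FactorAnalysis as a stand-in (the final implementation and rotation choice remain open, and the feature matrix below is a random placeholder):

    # A minimal sketch; X stands in for the (standardized) Aim 3b feature matrix.
    import numpy as np
    from sklearn.decomposition import PCA, FactorAnalysis
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(0).normal(size=(500, 12))   # placeholder data
    X = StandardScaler().fit_transform(X)

    # One standard way to pick the number of factors: Kaiser criterion,
    # keeping components whose eigenvalue exceeds 1 on the PCA scree.
    eigenvalues = PCA().fit(X).explained_variance_
    n_factors = int((eigenvalues > 1).sum())

    fa = FactorAnalysis(n_components=n_factors).fit(X)
    loadings = fa.components_.T     # variables x factors, for interpretation
    print(n_factors, loadings.shape)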
Expected Results
I expect that the resulting model will provide some insight into data sharing behavior, and provide a useful
springboard for confirmatory analysis in a different domain, with different proxy variables, or with survey
opinion data.
Risks and Contingency Plans
The analysis may fail to find a robust or interpretable model, in which case I will attempt principal component analysis (e.g. [78]) and/or other exploratory clustering techniques to form homogeneous composite variables, similar to the approach taken by [112].
Limitations and Assumptions
This approach assumes that my list of observed variables includes proxies for many of the actual decision-
making influences. It will be important to emphasize that the model does not imply causality, but only
association.

D.4 Future Directions


This work will provide valuable experience, tools, and results for future explorations. Possible directions
include:
• Confirming validity of the data model with an independent data set
• Supplementing observed characteristics with opinions and attitudes gathered through an opinion survey
of authors
• Studying the sharing of additional datatypes
• Identifying datasets shared outside of GEO databases
• Identifying and analyzing data reuse
• Implementing, evaluating, and analyzing a Data Reuse Registry[117]
• Social network analysis of data sharing and reuse behavior
• Identifying attributes associated with which author decides whether study data will be shared, and
which author does the submitting work
• Articulating and advocating best-practices, to hopefully reduce the “activation energy” of data
sharing[132]

D.5 Time Table


Bibliography and References Cited
1. Merton, R.K., The Sociology of Science: Theoretical and Empirical Investigations., 1973.
2. Gass, A., Open Access As Public Policy. PLoS Biology, 2004. 2(10): p. e353.
3. Vickers, A., Whose data set is it anyway? Sharing raw data from randomized trials. Trials, 2006. 7: p.
15.
4. Santos, C., J. Blake, and D.J. States, Supplementary data need to be kept in public repositories.
Nature, 2005. 438(7069): p. 738.
5. Evangelou, E., T.A. Trikalinos, and J.P. Ioannidis, Unavailability of online supplementary scientific
information from articles published in major journals. Faseb J, 2005. 19(14): p. 1943-4.
6. Wren, J.D., URL decay in MEDLINE--a 4-year follow-up study. Bioinformatics, 2008. 24(11): p. 1381-5.
7. Sullivan, M.G., Controversy Erupts Over Proteomics Studies. Ob. Gyn. News, 2005.
8. Ochsner, S.A., et al., Much room for improvement in deposition rates of expression microarray
datasets. Nature Methods, 2008. 5(12): p. 991.
9. Noor, M.A., K.J. Zimmerman, and K.C. Teeter, Data Sharing: How Much Doesn't Get Submitted to
GenBank? PLoS Biol, 2006. 4(7).
10. Piwowar, H.A., R.S. Day, and D.B. Fridsma, Sharing detailed research data is associated with
increased citation rate. PLoS ONE, 2007. 2(3).
11. Blumenthal, D., et al., Data withholding in genetics and the other life sciences: prevalences and
predictors. Acad Med, 2006. 81(2): p. 137-145.
12. NLM's Long Range Plan 2006-2016 - Goal One.
13. NLM's Long Range Plan 2006-2016 - Goal Three.
14. National Center for Research Resources - Strategic Plan 2009-2013 - Strategic Initiatives - IV.
Informatics Approaches to Support Research.
15. Fienberg, S.E., M.E. Martin, and M.L. Straf, Sharing research data. 1985, Washington, D.C.: National
Academy Press. viii, 225 p.
16. McCain, K., Mandating Sharing: Journal Policies in the Natural Sciences. Science Communication,
1995. 16(4): p. 403-431.
17. Piwowar, H.A. and W.W. Chapman, A review of journal policies for sharing research data. 2008,
Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1700.1>.
18. NIH. Availability of Information from the NIH Grants Policy Statement (12/03) - Part II: Terms and
Conditions of NIH Grant Awards - Subpart A: General -- File 2 of 5. 2003; Available from:
http://grants.nih.gov/grants/policy/nihgps_2003/NIHGPS_Part5.htm#_Toc54600098.
19. NIH. NOT-OD-08-013: Implementation Guidance and Instructions for Applicants: Policy for Sharing of
Data Obtained in NIH-Supported or Conducted Genome-Wide Association Studies (GWAS). 2007;
Available from: http://grants.nih.gov/grants/guide/notice-files/NOT-OD-08-013.html.
20. Cech, T., Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life
Sciences. 2003, Washington: National Academies Press.
21. Got data? Nat Neurosci, 2007. 10(8): p. 931-931.
22. Compete, collaborate, compel. Nat Genet, 2007. 39(8).
23. Democratizing proteomics data. Nat Biotech, 2007. 25(3): p. 262-262.
24. Time for leadership. Nat Biotech, 2007. 25(8): p. 821-821.
25. How to encourage the right behaviour. Nature, 2002. 416(6876): p. 1-1.
26. Altman, M. and G. King, A proposed standard for the scholarly citation of quantitative data. D-Lib
Magazine, 2007. 13(3/4).
27. Kakazu, K.K., L.W. Cheung, and W. Lynne, The Cancer Biomedical Informatics Grid (caBIG):
pioneering an expansive network of information and tools for collaborative cancer research. Hawaii
Med J, 2004. 63(9): p. 273-5.
28. New models of collaboration in genome-wide association studies: the Genetic Association Information
Network. Nat Genet, 2007. 39(9): p. 1045-1051.
29. NIH., NOT-OD-08-013: Implementation Guidance and Instructions for Applicants: Policy for Sharing of
Data Obtained in NIH-Supported or Conducted Genome-Wide Association Studies (GWAS). 2007.
30. Mailman, M., et al., The NCBI dbGaP database of genotypes and phenotypes. Nat Genet, 2007.
39(10): p. 1181-1186.
31. Geschwind, D.H., Sharing gene expression data: an array of options. Nat Rev Neurosci, 2001. 2(6): p.
435-8.
32. Martone, M.E., A. Gupta, and M.H. Ellisman, E-neuroscience: challenges and triumphs in integrating
distributed data from molecules to brains. Nat Neurosci, 2004. 7(5): p. 467-472.
33. Safran, C., et al., Toward a national framework for the secondary use of health data: an American
Medical Informatics Association White Paper. J Am Med Inform Assoc, 2007. 14(1): p. 1-9.
34. Zerhouni, E., Medicine. The NIH Roadmap. Science, 2003. 302(5642): p. 63-72.
35. Nass, S. and B. Stillman (eds.), Large-Scale Biomedical Science: Exploring Strategies for Future Research. 2003: National Academy Press.
36. The Cancer Biomedical Informatics Grid (caBIG): infrastructure and applications for a worldwide
research community. Medinfo, 2007. 12(Pt 1): p. 330-334.
37. Grethe, J.S., et al., Biomedical informatics research network: building a national collaboratory to hasten
the derivation of new understanding and treatment of disease. Stud Health Technol Inform, 2005. 112:
p. 100-9.
38. Sinnott, R.O., et al., Large-scale data sharing in the life sciences: Data standards, incentives, barriers
and funding models (The Joint Data Standards Study). The Biotechnology and Biological Sciences
Research Council, The Department of Trade and Industry, The Joint Information Systems Committee
for Support for Research, The Medical Research Council, The Natural Environment Research Council
and The Wellcome Trust, 2005.
39. BBSRC's Data Sharing Policy. Available from:
http://www.bbsrc.ac.uk/publications/policy/data_sharing_policy.pdf.
40. Lowrance, W., Access to Collections of Data and Materials for Heath Research: A report to the Medical
Research Council and the Wellcome Trust. 2006.
41. NIH Data Sharing Policy and Implementation Guidance. 2003; Available from:
http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm.
42. Ball, C.A., et al., Submission of microarray data to public repositories. PLoS Biol, 2004. 2(9).
43. Geer, R.C. and E.W. Sayers, Entrez: Making use of its power. Briefings in Bioinformatics, 2003.
44. Ruttenberg, A., et al., Advancing translational research with the Semantic Web. BMC Bioinformatics,
2007. 8(Suppl 3).
45. Li, K., et al., BioPortal: A Portal for Deployment of Bioinformatics Applications on Cluster and Grid
Environments. LECTURE NOTES IN COMPUTER SCIENCE, 2007.
46. Butler, D., Data sharing: the next generation. Nature, 2007. 446(7131): p. 10-11.
47. Bradley, J., Open Notebook Science Using Blogs and Wikis. Available from Nature Precedings, 2007.
http://dx.doi.org/10.1038/npre.2007.39.1.
48. Social software. Nat Meth, 2007. 4(3): p. 189-189.
49. NOT-OD-08-033 Revised Policy on Enhancing Public Access to Archived Publications Resulting from
NIH-Funded Research.
50. Harvard to collect, disseminate scholarly articles for faculty — The Harvard University Gazette.
51. Foster, M. and R. Sharp, Share and share alike: deciding how to distribute the scientific and social
benefits of genomic data. Nat Rev Genet, 2007. 8(8): p. 633-639.
52. Campbell, P., Controversial Proposal on Public Access to Research Data Draws 10,000 Comments.
The Chronicle of Higher Education, 1999: p. A42.
53. Melton, G.B., Must Researchers Share their Data? Law and Human Behavior, 1988. 12(2): p. 159-162.
54. Gleditsch, N.P. and C. Metelitis, The Replication Debate. International Studies Perspectives, 2003.
4(1): p. 72-79.
55. McCullough, B.D., K.A. McGeary, and T.D. Harrison, Do Economics Journal Archives Promote
Replicable Research? papers.ssrn.com.
56. Reidpath, D.D. and P.A. Allotey, Data sharing in medical research: an empirical investigation. Bioethics,
2001. 15(2): p. 125-34.
57. Campbell, E.G., et al., Data withholding in academic genetics: evidence from a national survey. JAMA,
2002. 287(4): p. 473-480.
58. Ventura, B., Mandatory submission of microarray data to public repositories: how is it working? Physiol
Genomics, 2005. 20(2): p. 153-6.
59. Lee, C.P., P. Dourish, and G. Mark, The human infrastructure of cyberinfrastructure. Proceedings of the
2006 20th anniversary conference on …, 2006.
60. Hedstrom, M., Producing Archive-Ready Datasets: Compliance, Incentives, and Motivation. IASSIST
Conference 2006: Presentations, 2006.
61. Constant, D., S. Kiesler, and L. Sproull, What’s mine is ours, or is it? A study of attitudes about
information sharing. Information Systems Research, 1994.
62. Ryu, S., S.H. Ho, and I. Han, Knowledge sharing behavior of physicians in hospitals. Expert Systems
With Applications, 2003.
63. Seonghee, K. and J. Boryung, An analysis of faculty perceptions: Attitudes toward knowledge sharing and collaboration in an academic institution. Library & Information Science Research, 2008. 30(4): p. 282-290.
64. PDBj News Letter. in Volume 7, March 2006 <http://www.pdbj.org/NewsLetter/newsletter_vol7_e.pdf>.
2006.
65. Wren, J.D., J.E. Grissom, and T. Conway, E-mail decay rates among corresponding authors in
MEDLINE. The ability to communicate with and request materials from authors is being eroded by the
expiration of e-mail addresses. EMBO Rep, 2006. 7(2): p. 122-7.
66. Plint, A.C., et al., Does the CONSORT checklist improve the quality of reports of randomised controlled
trials? A systematic review. Med J Aust, 2006. 185(5): p. 263-7.
67. Pienta, A., 1R01LM009765-01 Barriers and Opportunities for Sharing Research Data. 2007, NIH.
68. Zimmerman, A., Data Sharing and Secondary Use of Scientific Data: Experiences of Ecologists. 2003.
69. Eysenbach, G., Citation advantage of open access articles. PLoS Biol, 2006. 4(5): p. e157.
70. Wren, J.D., Open access and openly accessible: a study of scientific publications shared via the
internet. Bmj, 2005. 330(7500): p. 1128.
71. Lynne McKechnie, G.R.G., How human information behaviour researchers use each other's work: a
basic citation analysis study.
72. Patsopoulos, N.A., A.A. Analatos, and J.P. Ioannidis, Relative citation impact of various study designs
in the health sciences. Jama, 2005. 293(19): p. 2362-6.
73. Lokker, C., et al., Prediction of citation counts for clinical articles at two years using data available
within three weeks of publication: retrospective cohort study. BMJ, 2008: p. bmj.39482.526713.BE.
74. Fu, L. and C. Aliferis, Models for predicting and explaining citation count of biomedical articles. AMIA
Annual Symposium proceedings / AMIA Symposium AMIA Symposium, 2008: p. 222-6.
75. Torvik, V.I., et al., A probabilistic similarity metric for Medline records: A model for author name
disambiguation. Journal of the American Society for Information Science and …, 2005.
76. Lautrup, B.E., S. Lehmann, and A.D. Jackson, Measures for measures. Nature, 2006.
77. Hirsch, J.E., An index to quantify an individual's scientific research output. Proceedings of the National
Academy of Sciences, 2005.
78. Hendrix, D., An analysis of bibliometric indicators, National Institutes of Health funding, and faculty size
at Association of American Medical Colleges medical schools, 1997-2007. Journal of the Medical
Library Association : JMLA, 2008. 96(4): p. 324-34.
79. Adler, R., J. Ewing, and P. Taylor, Citation Statistics. A Report from the Joint Committee on Quantitative Assessment of Research, 2008.
80. Stringer, M., et al., Effectiveness of Journal Ranking Schemes as a Tool for Locating Information. PLoS
ONE, 2008. 3(2): p. e1683.
81. Taylor, M., P. Perakakis, and V. Trachana, The siege of science. ESEP., 2008. 8: p. 17-40.
82. Coleman, A., Assessing the value of a journal beyond the impact factor. Journal of the American
Society for Information Science and Technology, 2007. 58(8): p. NA.
83. Rodriguez, M., J. Bollen, and H. Van de Sompel, A Practical Ontology for the Large-Scale Modeling of
Scholarly Artifacts and their Usage. 2007.
84. Bakkalbasi, N., et al., Three options for citation tracking: Google Scholar, Scopus and Web of Science.
Biomedical Digital Libraries, 2006. 3: p. 7.
85. Eysenbach, G. and M. Trudel, Going, going, still there: using the WebCite service to permanently
archive cited web pages. J Med Internet Res, 2005. 7(5): p. e60.
86. West, R. and A. McIlwaine, What do citation counts count for in the field of addiction? An empirical
evaluation of citation counts and their link with peer ratings of quality. Addiction, 2002. 97(5): p. 501-
504.
87. Jensen, L., J. Saric, and P. Bork, Literature mining for the biologist: from information retrieval to
biological discovery. Nature Reviews Genetics. 7(2): p. 119-129.
88. Hearst, M., et al., BioText Search Engine: beyond abstract search. Bioinformatics, 2007. 23(16): p.
2196-2197.
89. Karamanis, N., et al., Natural Language Processing in aid of FlyBase curators. BMC Bioinformatics,
2008. 9(1).
90. Cohen, A., Optimizing feature representation for automated systematic review work prioritization. AMIA
Annual Symposium proceedings / AMIA Symposium AMIA Symposium, 2008: p. 121-5.
91. Kilicoglu, H., et al., Towards automatic recognition of scientifically rigorous clinical research evidence.
Journal of the American Medical Informatics Association : JAMIA, 2009. 16(1): p. 25-31.
92. Nakov, P., A.S. Schwartz, and M. Hearst, Citances: Citation sentences for semantic analysis of
bioscience text. Proceedings of the SIGIR’04 workshop on Search and …, 2004.
93. Rubin, D.L., et al., A statistical approach to scanning the biomedical literature for pharmacogenetics
knowledge. J Am Med Inform Assoc, 2005. 12(2): p. 121-9.
94. Siddharthan, A. and S. Teufel. Whose idea was this, and why does it matter? Attributing scientific work
to citations. in Proceedings of NAACL/HLT-07. 2007.
95. Marco, C., F. Kroon, and R. Mercer, Using Hedges to Classify Citations in Scientific Articles, in
Computing Attitude and Affect in Text: Theory and Applications. 2006. p. 247-263.
96. Eales, J.M., et al., Methodology capture: discriminating between the "best" and the rest of community
practice. BMC Bioinformatics, 2008.
97. Rekapalli, H.K., A.M. Cohen, and W.R. Hersh, A comparative analysis of retrieval features used in the
TREC 2006 Genomics Track passage retrieval task. AMIA Annual Symposium proceedings / AMIA
Symposium AMIA Symposium, 2007: p. 620-4.
98. Yoo, S. and J. Choi, Reflecting all query aspects on query expansion. AMIA Annual Symposium
proceedings / AMIA Symposium AMIA Symposium, 2008: p. 1189.
99. Abdalla, R. and S. Teufel. A bootstrapping approach to unsupervised detection of cue phrase variants.
in ACL '06: Proceedings of the 21st International Conference on Computational Linguistics and the 44th
annual meeting of the ACL. 2006: Association for Computational Linguistics.
100. Melton, G.B. and G. Hripcsak, Automated detection of adverse events using natural language
processing of discharge summaries. Journal of the American Medical Informatics Association : JAMIA,
2005. 12(4): p. 448-57.
101. Faraj, S., Why Should I Share? Examining Social Capital and Knowledge Contribution in Electronic
Networks of Practice. MIS Quarterly, 2005. vol 29.
102. Kuo, F. and M. Young, A study of the intention–action gap in knowledge sharing practices. Journal of
the American Society for Information Science and Technology, 2008. 59(8): p. 1224-1237.
103. Bock, G., R.W. Zmud, and J. Lee, Behavioral Intention Formation in Knowledge Sharing: Examining
the Roles of Extrinsic Motivators, Social-Psychological Forces, and Organizational Climate. MIS
Quarterly, 2005. 29(1): p. 87-112.
104. Liang, T.-P., C. Liu, and C.-H. Wu, Can Social Exchange Theory Explain Individual Knowledge Sharing Behavior? A Meta Analysis. 2008: p. 38.
105. Hsu, M., et al., Knowledge sharing behavior in virtual communities: The relationship between trust, self-
efficacy, and outcome expectations. International Journal of Human-Computer Studies, 2007. 65(2): p.
153-169.
106. Chiu, C., M. Hsu, and E. Wang, Understanding knowledge sharing in virtual communities: An
integration of social capital and social cognitive theories. Decision Support Systems, 2006. 42(3): p.
1872-1888.
107. Harder, M.I.E., How Do Rewards and Management Styles Influence the Motivation to Share Knowledge? Center for Strategic Management and Globalization, 2008.
108. Samieh, H.M. and K. Wahba, Knowledge Sharing Behavior From Game Theory And Socio-Psychology Perspectives. Hawaii International Conference on System Sciences, 2007.
109. Cabrera, A., W.C. Collins, and J.F. Salgado, Determinants of individual engagement in knowledge
sharing. The International Journal of Human Resource Management, 2006.
110. Siemsen, E., A. Roth, and S. Balasubramanian, How motivation, opportunity, and ability drive
knowledge sharing: The constraining-factor model. Journal of Operations Management, 2007.
111. Hendrix, D., An analysis of bibliometric indicators, National Institutes of Health funding, and faculty size.
Journal of the Medical Library Association: JMLA, 2008.
112. Åstebro, T., J. Michela, and X. Zhang, The Survival of Innovations: Patterns and Predictors. Manuscript, 2001.
113. Kolekofski, K., Beliefs and attitudes affecting intentions to share information in an organizational setting.
Information & Management, 2003. 40(6): p. 521-532.
114. Piwowar HA, Chapman WW (2008) Identifying data sharing in the biomedical literature. AMIA 2008
Annual Symposium. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1721.1>.
2008.
115. Piwowar, H.A. and W.W. Chapman, Linking database submissions to primary citations with PubMed
Central. BioLINK Workshop at ISMB, 2008: p. 4. Available at
http://www.researchremix.org/data/bioLINK2008%20Piwowar.doc
116. Piwowar, H.A. and W.W. Chapman, Prevalence and Patterns of Microarray Data Sharing. Poster at
PSB 2008, Available from Nature Precedings <http://dx.doi.org/10.1038/npre.2008.1701.1>.
117. Piwowar, H.A. and W.W. Chapman, Envisioning a Biomedical Data Reuse Registry. Poster at AMIA
2008, Available from Google Docs <http://docs.google.com/Doc?docid=dgqz7h9q_136q7rfhrcd>.
118. Piwowar, H.A. and W.W. Chapman, Examining the uses of shared data. 2007, Available from Nature
Precedings <http://dx.doi.org/10.1038/npre.2007.425.3>.
119. Ntzani, E.E. and J.P. Ioannidis, Predictive ability of DNA microarrays for cancer outcomes and
correlates: an empirical assessment. Lancet, 2003. 362(9394): p. 1439-44.
120. Hersh, W., et al., TREC 2006 genomics track overview. The Fifteenth Text Retrieval Conference, 2006.
121. Medlock, B., Exploring hedge identification in biomedical literature. J Biomed Inform, 2008.
122. Riloff, E., Automatically Generating Extraction Patterns from Untagged Text. Proceedings of the
Thirteenth National Conference on Artificial Intelligence, 1996.
123. Ku, C.H., A. Iriberri, and G. Leroy, Natural language processing and e-Government: crime information
extraction from heterogeneous data. Proceedings of the 2008 international conference on Digital
Government Research, 2008.
124. Green, S.B., How Many Subjects Does It Take To Do A Regression Analysis. Multivariate Behavioral
Research, 1991.
125. Abrams, D.R., DSS - Introduction to Regression.
126. Nunnally, J.C. and I.H. Bernstein, Psychometric theory. rds.epi-ucsf.org, 1978.
127. Barrett, T., et al., NCBI GEO: mining tens of millions of expression profiles--database and tools update.
Nucleic Acids Res, 2007. 35(Database issue).
128. Young, J., Mail surveys of general practice physicians: response rates and non-response bias. Swiss
medical weekly : official journal of the Swiss Society of Infectious Diseases, the Swiss Society of
Internal Medicine, the Swiss Society of Pneumology, 2005. 135(13-14): p. 187-8.
129. NLPES Question of the Month. Available at http://www.ncsl.org/nlpes/question/question.htm.
130. Shulman, M., B.C. Gilbert, and A. Lansky, Pregnancy Risk Assessment Monitoring System (PRAMS). Public Health Rep, 2006. 121(1): p. 74-83.
131. Bostrom, D., R. Tieckelmann, and S. Flanigan, Association of University Technology Managers (AUTM) Licensing Survey: Data Overview. papers.ssrn.com.
132. Smith, K.P., L. Seligman, and V. Swarup, Everybody Share: The Challenge of Data-Sharing Systems.
Computer, 2008.
Appendices

Table 5: Recall of articles through full-text query interfaces


All rows share the base text query gene[text] AND expression[text] AND (microarray[text] OR microarrays[text]); counts are abstract text matches (full-text queries are expected to find many more).

Sample | Additional query terms | Articles matching in abstracts | Articles also linked from GEO (AND pubmed_gds[filter]) | GEO as percent of text matches
PubMed articles | "2000"[PDAT] : "2007"[PDAT] | 20880 | 4287 | 21%
PubMed articles with links to full text | AND "loattrfull text"[sb] | 19582 | 4221 | 22%
PubMed articles with links to full text from Pitt | AND "loprovupittlib"[Filter] | 16323 | 3776 | 23%
a) housed at PMC | AND pubmed_pmc[filter] | 4465 | 1604 | 36%
b) housed at Highwire | AND loprovhighwire[filter] | 7239 | 2213 | 31%
c) housed at Elsevier Science or Nature Publishing Group (probably reachable via Scirus) | AND ("loftextnpg"[Filter] OR "loftextes"[Filter]) | 4738 | 969 | 20%
Any of a+b+c, minus overlaps | AND (pubmed_pmc[filter] OR loprovhighwire[filter] OR "loftextnpg"[Filter] OR "loftextes"[Filter]) | 13844 | 3719 | 27%
Reach as a percent of those available via Pitt library | | 13844/16323 = 85%; 13844/20880 = 66% | 3719/3776 = 98%; 3719/4287 = 87% |
Table 6: Estimated maximum number of articles that will be found by NLP data creation filter
Query engine | Query | Number of articles returned for 2000-2007
a) Highwire | All: gene expression (microarray OR microarrays); Citation Year: 2007; Highwire-hosted | 34373
b) Scirus (ScienceDirect and Nature Publishing Group) | gene expression (microarray OR microarrays); Infotype: Articles; Source: SciDirect and NPG; for 2007 | 20667
c) PubMed Central | gene[text] AND expression[text] AND (microarrays[text] OR microarray[text]) AND ("2007"[EDate] : "2007"[EDate]) | 14888
… PubMed Central not in Highwire, Elsevier, or NPG | PubMed Links for PMC (query as in row c) NOT ("loprovhighwire"[Filter] OR "loftextnpg"[Filter] OR "loftextes"[Filter]) | (3866 of 10000) * 14888 = 5756
Totals | | 60796

Table 7: Number of articles available through full-text query interfaces across various Portals
Percentages in parentheses are relative to the preceding column.

Literature database | "gene expression" microarray | + hybridiz* | + hybridiz* accession | + links to GEO
PubMed (title and abstract) | 2265 | 213 (9.40%) | 2 (0.94%) | 0
Google Scholar (using "hybridized") | 21100 | 4620 (21.90%) | 1870 (40.48%) |
PubMed Central | 3148 | 1851 (58.80%) | 839 (45.33%) | 311 (58.8%)
PubMed Central Open Access | 2063 | 1203 (58.31%) | 542 (45.05%) | 188 (58.3%)
HighWire Press hosted | 7543 | 3601 (47.74%) | 1450 (40.27%) | 61 of 115 (subset of 1450; 30 of 61 in PMC) (53%)
HighWire Press subset | 2028 | 1048 (51.68%) | 436 (41.60%) |
Scirus articles (includes Science Direct) | 5153 | 2437 (47.29%) | 916 (37.59%) | 191

Table 8: Full text accessibility across journals


Full-text availability of journals that publish gene expression microarray data frequently (rank 1-10) and less frequently (rank 41-50). "Yes" in the last column indicates that the journal's articles will be included in the data collection set (requires full-text queries through PMC, Highwire, or Scirus NPG+Elsevier).

Rank | Journal | Full text queriable through centralized portal | In data collection set
1 | Bioinformatics | Highwire | yes
2 | BMC Bioinformatics | PMC | yes
3 | Cancer Research | Highwire | yes
4 | BMC Genomics | PMC | yes
5 | PNAS | PMC | yes
6 | J Biol Chem | Highwire | yes
7 | Nucleic Acids Research | PMC | yes
8 | Oncogene | Scirus: NPG | yes
9 | Clinical Cancer Research | Highwire | yes
10 | Physiol Genomics | Highwire | yes
41 | Faseb J | Highwire | yes
42 | Gene | Scirus: Elsevier | yes
43 | Carcinogenesis | Highwire | yes
44 | Brit J Cancer | Scirus: NPG | yes
45 | J Neurochem | none (Wiley) |
46 | Dev Biol | Scirus: Elsevier | yes
47 | Nat Genetics | Scirus: NPG | yes
48 | Plant Cell | Highwire | yes
49 | Mol Cancer Ther | Highwire | yes
50 | J Neurosci | Highwire | yes

Table 9: Availability of full text for journals used in Ochsner et al.


Journal | Full text queriable through centralized portal | In evaluation set
Proc Natl Acad Sci U S A | PMC | yes
Cancer Res | Highwire | yes
J Biol Chem | Highwire | yes
J Immunol | Highwire | yes
Blood | Highwire | yes
Mol Cell Biol | Highwire | yes
Endocrinology | Highwire | yes
Am J Pathol | PMC | yes
Mol Endocrinol | Highwire | yes
FASEB J | Highwire | yes
J Endocrinol | Highwire | yes
Mol Cell | Scirus: Elsevier | yes
Nat Methods | Scirus: NPG |
Nature | Scirus: NPG |
EMBO J | PMC | yes
Nat Genet | Scirus: NPG |
Nat Med | Scirus: NPG |
Nat Cell Biol | Scirus: NPG |
Cell | Scirus: Elsevier | yes
Science | |

Table 10: Data sharing breakdown from Ochsner et al.


Location | Number | Percent
not shared | 211 | 53%
GEO | 138 | 35%
ArrayExpress | 24 | 6%
SMD | 4 | 1%
journal | 11 | 3%
other | 9 | 2%
Grand Total | 397 | 100%

Table 11: Sample size required for confidence levels by population size
Population size | ±3% | ±5% | ±7.5% | ±10%
2000 | 696 | 323 | 157 | 92
3000 | 788 | 341 | 162 | 94
5000 | 880 | 357 | 165 | 95
10000 | 965 | 370 | 168 | 96
20000 | 1014 | 377 | 169 | 96
50000 | 1045 | 382 | 170 | 96
100000 | 1058 | 383 | 170 | 96
From http://www.zoomerang.com/MKT/samplesize-calculator/step3.html and also see
http://www.surveysystem.com/sscalc.htm
