
Managing Data Overload in the Fields of Synthetic Biology and Public Health

Lars Crawford, John Gennari, 5/31/12

Introduction: An amazing aspect of science is that it continually builds upon itself: the longer any form of scientific knowledge exists, the more data can be collected, which allows for the solidification of ideas in the form of theory. The amassing of this data also births novel hypotheses through careful observation, which ultimately gives rise to new fields. However, as the amount of data across this multitude of fields grows, it becomes increasingly difficult to manage. This brings us to one of the major themes in biomedical informatics, and informatics in general: data overload. Humans have been keeping track of information for millennia, but until the invention of computer systems that foster digitization, data was written down and stored manually. The digitization of information has been monumental not only for our ability to store data but also for our ability to categorize, access, interpret, and compare it. Inevitably, as the technologies for managing data continue to develop, so too do the technologies for obtaining it. The question then becomes: can the systems designed for managing data keep pace with those designed to acquire it? This question is particularly pertinent in the fields of synthetic biology and public health; synthetic biology is currently exploding, and as the United States and the rest of the globe continue to develop, more and more information falls under the realm of public health. In this paper I will discuss how vast amounts of data are both beneficial and detrimental, and the methods by which the fields of synthetic biology and public health attempt to overcome the challenges presented by an overload of information.

Effects of a Large Data Pool: One of the key aspects of the scientific method revolves around conducting multiple trials of an experiment to ensure a large sample size. This enables scientists to draw clearer conclusions from the data that is collected, as it dampens the effects of outliers and error and accentuates trends. Although the public health sector does not directly apply the scientific method in its operation, the benefits of a large sample size are noteworthy. Take, for example, the syndromic surveillance system Google Flu Trends (discussed further below): a single query for the symptoms of an influenza-like illness (ILI) provides little information about the health of the public, while thousands of queries for ILI symptoms can accurately represent the rates of disease diagnosis around the country. In this regard, a vast amount of data is advantageous. Beyond the sample size of a single experiment, having data from a large number of experiments is also extremely beneficial.

In a field like biology, which seeks to determine the biochemical pathways of living organisms, or a field like synthetic biology, which seeks to manipulate these pathways for industrial and other purposes, more data creates a clearer picture of how specific molecules function and interact. With a rich understanding of these pathways comes the ability to apply them in unique ways to produce new products. Once again, it is clear that a large data pool is exceedingly valuable. Though it is true that greater amounts of data result in increasingly fruitful systems, as data pools continue to grow, the issue of their management comes into play. Data management is extremely important in keeping these systems functioning efficiently; so important, in fact, that without such management these systems would be nearly impossible to use. However, because the difficulty of managing large data pools is their only real detriment, their benefits can be fully realized once standards and systems are created for that purpose. In this way, data overload, data management, and standardization are closely related themes. It may seem a bit redundant to create systems to manage systems, but without data management the fields of synthetic biology and public health would not function optimally.
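To make the sample-size point concrete, consider a minimal simulation, written here in Python, of estimating the fraction of searches that are ILI-related. The "true" rate and the query counts below are invented for illustration; this is a toy model of signal aggregation, not Google Flu Trends' actual method.

    # Toy simulation: estimates of an ILI query rate sharpen as sample size grows.
    # The 2% "true" rate is invented; this is not Google Flu Trends' model.
    import random

    random.seed(42)
    TRUE_ILI_RATE = 0.02  # assumed fraction of searches that are ILI-related

    def observed_rate(n_queries):
        """Simulate n_queries searches; return the observed ILI fraction."""
        hits = sum(1 for _ in range(n_queries) if random.random() < TRUE_ILI_RATE)
        return hits / n_queries

    for n in (10, 100, 10000, 1000000):
        print(f"{n:>8} queries -> estimated ILI rate {observed_rate(n):.4f}")

With ten queries the estimate is often zero or wildly high; with a million it converges on the underlying rate, which is exactly why thousands of queries can represent national diagnosis rates while a single query says almost nothing.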

Data Overload and Synthetic Biology: The rapid expansion of the field of synthetic biology is a direct result of the ever-increasing amount of data being collected. This data comes from several different techniques utilized across a plethora of experiments. One such technique, microarray hybridization, measures differential gene expression in organisms under different conditions. Microarrays are exceptionally useful for synthetic biology, as they help scientists determine the conditions under which a specific organism manufactures certain molecules. This knowledge then allows for the creation of systems that function optimally and produce desired products. As the number of species whose genomes have been completely sequenced grows, so too does microarray expression data. Once a microarray has been generated for an organism under specific conditions, the logical next step is to catalogue it so that future scientists may make use of the information it contains. Researchers at Stanford University, who have been exploiting microarray techniques since their inception, took it upon themselves to create a system that "stores raw and normalized data from microarray experiments, and provides web interfaces for researchers to retrieve, analyze and visualize their data" [1]. They named this system the Stanford Microarray Database (SMD), and, as it was developed in 1999, it is one of the first of its kind. SMD was initially created for researchers at Stanford to catalogue the various microarrays produced from a variety of experiments, but as it is a free web-based system, other researchers have been able to employ it for their own experiments.

However, as synthetic biology is a relatively new field, the names of the various genes catalogued in the SMD are not universal. This, in combination with the fact that SMD contains over 50,000 genes, with new ones added almost daily, necessitates simple, intuitive interfaces that enable users to narrow down the experiments with which they are dealing. As such, SMD's query model makes use of a record of the researcher who catalogued the microarray as well as several categories that describe the biological nature of the experiment and the organism whose DNA was hybridized. The query field allows one or more of these to be entered at a time in order to narrow the search for specific microarrays. This broad-to-narrow query function is necessary when retrieving data from a large database whose standards are not completely universal. Without the ability to search broadly, one may never find the data one is looking for; conversely, without the ability to search narrowly, one may spend an excessive amount of time searching for the data one needs. This is especially true for a field as new and quickly progressing as synthetic biology.
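The broad-to-narrow pattern is easy to see in miniature. The sketch below filters a handful of invented records by whichever fields the user supplies; the field names and entries are hypothetical and do not reflect SMD's actual schema.

    # Broad-to-narrow querying: each supplied filter narrows the result set.
    # Records and field names are invented, not SMD's real schema.
    RECORDS = [
        {"researcher": "A. Smith", "category": "stress response", "organism": "S. cerevisiae"},
        {"researcher": "B. Jones", "category": "cell cycle", "organism": "S. cerevisiae"},
        {"researcher": "C. Lee", "category": "stress response", "organism": "E. coli"},
    ]

    def query(**filters):
        """Return records matching every supplied filter; omitted fields match everything."""
        return [r for r in RECORDS
                if all(r.get(field) == value for field, value in filters.items())]

    print(query(organism="S. cerevisiae"))                         # broad: two hits
    print(query(organism="S. cerevisiae", category="cell cycle"))  # narrow: one hit

Supplying no filters returns the whole collection; each additional field trims the result set, mirroring how a multi-field query form lets a user start wide and converge on the experiments they need.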

Since the creation of SMD, other array databases have sprung up, such as ArrayExpress, with which SMD collaborates closely. Unlike SMD, which only accepts data input by Stanford researchers, ArrayExpress is a public repository for microarrays that allows for two-way communication of data: researchers outside of Stanford can both access and deposit microarray data. Another notable aspect of ArrayExpress is that data imported into the system is converted to a MAGE-ML file, a format that supports the Minimum Information About a Microarray Experiment (MIAME) standard. The MIAME standard is extremely useful because its definitions are based on "the content and structure of the necessary information rather than the technical format for capturing it" [2]. Generating a standard for a system like this aids in the dissemination of data to the public, as it allows systems whose source code is unrelated to SMD's to access the microarray data. The utilization of the MIAME standard by ArrayExpress demonstrates another key aspect of successful data management: even data pools as massive as SMD become easily accessible once a system that implements standards is put in place. Another microarray database, ArrayDB, created by the National Institutes of Health to categorize gene expression of infectious agents, is an example of a system that is functionally similar to SMD but is instead used to guide public health.
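A content-over-format standard like MIAME can be pictured as a checklist on the information itself rather than on any file format. The sketch below validates a record against required descriptors; the field names paraphrase MIAME's broad areas and are not the complete checklist.

    # Content-over-format validation in the spirit of MIAME: the same check
    # applies whether a record arrived as MAGE-ML, JSON, or anything else.
    # Field names paraphrase MIAME's areas; this is not the full checklist.
    REQUIRED_FIELDS = {"experiment_design", "array_design", "samples",
                       "hybridizations", "measurements", "normalization"}

    def is_miame_compliant(record):
        """True if every required descriptor is present and non-empty."""
        return all(record.get(field) for field in REQUIRED_FIELDS)

    submission = {
        "experiment_design": "heat shock time course",
        "array_design": "yeast ORF array",
        "samples": ["wild type, 30C", "wild type, 37C"],
        "hybridizations": ["chip_001", "chip_002"],
        "measurements": "raw and normalized intensities attached",
        "normalization": "global median scaling",
    }
    print(is_miame_compliant(submission))  # True: all descriptors supplied

Because the check cares only about what information is present, any system can exchange compliant records regardless of how its internals are written, which is the dissemination benefit described above.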

A final aspect of successful data management systems like SMD, ArrayExpress, and ArrayDB is the flexibility to accommodate new statistical data-mining tools as they become available [4]. Systems that are part of a rapidly growing field need to be able to incorporate new methods of data retrieval so that existing systems can be upgraded instead of requiring new systems to be built. The other side of this flexibility is demonstrated by the intuitive query system developed for another database, PROFESS (PROtein Function, Evolution, Structure and Sequence), which incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database [5], as illustrated in the sketch below. This also allows systems to progress with the field with respect to new data, not just new methodologies. Taken together, these aspects of data management systems like SMD and PROFESS enable the scientists working in evolving fields like synthetic biology to reach their full potential.
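The following is a minimal sketch of the pluggable-similarity idea, assuming two invented metrics over toy protein sequences; it illustrates the design principle, not the PROFESS implementation itself.

    # Pluggable similarity search: new metrics register without touching the data.
    # Sequences and metrics are invented; this is not the PROFESS codebase.
    from difflib import SequenceMatcher

    PROTEINS = {"protA": "MKTAYIAKQR", "protB": "MKTAYLAKQR", "protC": "GGHHEELL"}
    SIMILARITY = {}  # metric name -> function(seq_a, seq_b) -> score in [0, 1]

    def register(name):
        def wrap(fn):
            SIMILARITY[name] = fn
            return fn
        return wrap

    @register("sequence")
    def seq_similarity(a, b):
        """Character-level alignment ratio between two sequences."""
        return SequenceMatcher(None, a, b).ratio()

    @register("composition")
    def composition_similarity(a, b):
        """Jaccard overlap of residue alphabets: a crude but swappable metric."""
        return len(set(a) & set(b)) / len(set(a) | set(b))

    def rank(probe, metric):
        fn = SIMILARITY[metric]
        return sorted(PROTEINS, key=lambda p: fn(probe, PROTEINS[p]), reverse=True)

    print(rank("MKTAYIAKQR", "sequence"))
    print(rank("MKTAYIAKQR", "composition"))

Because the stored entries never change when a metric is added, relationships not conceived when the database was built can be surfaced later simply by registering a new function.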

Data Overload and Public Health: Though its goals are vastly different from those of synthetic biology, the systems created to manage the voluminous amount of data in the field of public health share closely related aspects: the successful management of data relies on the development of systems that can retrieve data, categorize it, analyze it, and store it. Unlike in synthetic biology, though, it is of the utmost importance that these systems function optimally and efficiently, as far more is at stake than research; the safety of the public at large is put at risk when these systems do not function properly. Another dissimilarity between data management in synthetic biology and public health is that public health has an organization that centralizes the majority of public health data in America: the Centers for Disease Control and Prevention (CDC). This centralization provides a unique opportunity for a universal standard to be implemented for all the data collected in the field of public health. Centralization benefits the entire public health sector because the CDC can then make this data accessible and useful to public health practitioners at the local, state, national, and even international levels of decision making [6]. Starting in the early 1990s, the CDC created a system, WONDER (Wide-ranging ONline Data for Epidemiologic Research), that does just that. The WONDER system provides access to a wide variety of sources, including surveillance systems, specialized studies, the Morbidity and Mortality Weekly Report (MMWR), and descriptions of state and local health department activities. With this information freely available online, multiple levels of public health care are constantly up to date on current trends in diseases and other notifiable conditions around the country. Because this data can be retrieved in close to real time, health care providers can be better prepared for outbreaks as they occur. Since the creation of the WONDER system, new and powerful syndromic surveillance systems like Google Flu Trends, which uses query data from the Google search engine, have been created to aid in the tracking of ILIs. As both the WONDER and Google Flu Trends systems arose in times of rapid technological advancement, they were designed to evolve with public health [6]. This exemplifies another similarity between synthetic biology and public health with respect to data management: flexible systems have the ability to progress with the field as more data is acquired and new data acquisition methods are developed.
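The value of near-real-time retrieval is easiest to see in a toy detector. The sketch below uses invented weekly case counts and a naive threshold rule (not any actual CDC algorithm) to flag a week that sits far above its trailing baseline.

    # Naive outbreak flag: a week well above its trailing baseline gets flagged.
    # Counts and threshold are invented; this is not a CDC method.
    from statistics import mean, stdev

    weekly_ili_cases = [210, 198, 225, 204, 215, 190, 208, 450]

    def flag_outbreaks(counts, window=4, threshold=3.0):
        """Yield (week, count) where count exceeds baseline mean + threshold * sd."""
        for i in range(window, len(counts)):
            baseline = counts[i - window:i]
            if counts[i] > mean(baseline) + threshold * stdev(baseline):
                yield i, counts[i]

    for week, count in flag_outbreaks(weekly_ili_cases):
        print(f"week {week}: {count} cases flagged for follow-up")

The sooner each week's count arrives, the sooner the final week's spike is flagged, which is the practical payoff of systems like WONDER making surveillance data available close to real time.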

Another similarity arises when considering the application of these fields: both are geared toward providing services to the public. A distinction arises, however, between the sources of the data; public health directly involves the public that it cares for, such as with search queries, whereas synthetic biology does not. Because public health relies on the public itself for its data points, it can acquire data from any realm that directly involves the public, including social media. Just as Google Flu Trends discerns trends from search queries, social media applications like Twitter also have the ability to "[provide] the first clues to influenza outbreaks by [tracking] health trends in real time" [7], says Nicole Lurie, M.D. Lurie, the assistant secretary for preparedness and response at the Department of Health and Human Services, is offering a prize to the first person or team that can develop such a system. Although using the public as a source of data to track disease trends is useful because it facilitates a response to potential outbreaks, such a response cannot be mounted if the pathogen causing an outbreak is unknown and no treatment is available. As such, systems built through the study of pathogens that contain this information are equally beneficial. The PATRIC (Pathosystems Resource Integration Center) system was developed to assist scientists in the study of infectious diseases. PATRIC is a particularly useful system because not only does it categorize many of the major bacterial lineages, but it also carries annotations that allow for the comparative analysis of infectious agents with closely related free-living, symbiotic, and commensal species [8]. This comparison helps scientists determine the major genes associated with the pathogenicity of an organism they are studying. These annotations are standardized through a system called RAST (Rapid Annotation using Subsystem Technology), which contains over 2,800 complete bacterial genomes. RAST predicts genes, assigns gene functions, and reconstructs metabolic pathways based on these annotated genomes. When a new bacterial strain is discovered and sequenced, PATRIC compares it to the vast store of genomic data in RAST to determine its pathogenicity. If a new strain of bacteria is isolated from an ill individual, PATRIC works to relate its genome to an identifiable strain, generating potential treatment options, a comparison sketched below. The PATRIC system and RAST standard demonstrate that successful data storage, retrieval, analysis, and comparison enable public health to protect the public.
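A rough sense of that comparison step can be given in a few lines. The sketch below ranks reference strains by how many annotated gene functions they share with a new isolate; the annotations and the Jaccard scoring are invented for illustration and are not PATRIC's or RAST's actual pipeline.

    # Relate a new isolate to known strains by shared annotated gene functions.
    # All annotations are invented; this is not the PATRIC/RAST pipeline.
    REFERENCE_STRAINS = {
        "Y. pestis CO92": {"iron uptake", "type III secretion", "plasminogen activator"},
        "E. coli K-12": {"iron uptake", "lactose metabolism", "flagellar assembly"},
        "V. cholerae N16961": {"cholera toxin", "type IV pilus", "iron uptake"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def closest_strains(isolate_functions):
        """Rank reference strains by overlap with the isolate's gene functions."""
        return sorted(REFERENCE_STRAINS.items(),
                      key=lambda kv: jaccard(isolate_functions, kv[1]),
                      reverse=True)

    new_isolate = {"iron uptake", "type III secretion", "flagellar assembly"}
    for strain, functions in closest_strains(new_isolate):
        print(f"{strain}: similarity {jaccard(new_isolate, functions):.2f}")

The closer the match, the more likely that treatments effective against the known strain are a sensible starting point for the new one, which is the spirit of the genome comparison described above.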

Personal Opinions and Broader Impressions: Personally, I think that data overload is a bioinformatics issue that is relatively simple to solve, and many current systems already successfully do so. Though it takes a great deal of time and effort to create systems that manage data from all fields, including synthetic biology and public health, and the interconnection of already existing systems requires even more time for the creation of standards, all else being equal, data overload is a relatively benign issue. In comparison to an issue like privacy, which sparks all sorts of ethical debates and dilemmas, the data overload issue has a clear solution that everyone who works in the field of bioinformatics can agree on. Though bioinformaticians may not necessarily agree on the specific design of systems and standards, they do all agree that, in order for fields that utilize digital media to function optimally, data must be managed efficiently. With this unifying mindset and the blinding pace of technological advancement, I believe it is only a matter of time until universal systems and standards are created to connect all informatics fields.

Bibliography:

1) Sherlock, Gavin, et al. "The Stanford Microarray Database." Nucleic Acids Research. Oxford Journals, 28 July 2000. Web. 1 June 2012. http://nar.oxfordjournals.org/content/29/1/152.full

2) Brazma, Alvis, et al. "Minimum Information About a Microarray Experiment (MIAME): Toward Standards for Microarray Data." Genetics.nature.com. Nature Publishing Group, Dec. 2001. Web. 1 June 2012. http://compbio.dfci.harvard.edu/pubs/MIAME.pdf

3) Ball, Catherine, et al. "The Stanford Microarray Database Accommodates Additional Microarray Platforms and Data Formats." Nucleic Acids Research. Oxford Journals, 3 Sept. 2004. Web. 1 June 2012. http://nar.oxfordjournals.org/content/33/suppl_1/D580.full

4) Ermolaeva, Olga, et al. "Data Management and Analysis for Gene Expression Arrays." Genetics.nature.com. Nature Publishing Group, Sept. 1998. Web. 1 June 2012. http://www.markboguski.net/docs/publications/Ermolaeva-etal.pdf

5) Triplet, Thomas, et al. "PROFESS: A PROtein Function, Evolution, Structure and Sequence Database." Database. Oxford Journals, 6 June 2010. Web. 1 June 2012. http://database.oxfordjournals.org/content/2010/baq011.abstract

6) Friede, Andrew, et al. "CDC WONDER: A Comprehensive On-Line Public Health Information System of the Centers for Disease Control and Prevention." American Journal of Public Health, Sept. 1993. Web. 1 June 2012. http://ajph.aphapublications.org/doi/pdf/10.2105/AJPH.83.9.1289

7) "Can Twitter Help with Public Health Surveillance?" Healthdatamanagement.com. SourceMedia, 21 Mar. 2012. Web. 1 June 2012. http://www.healthdatamanagement.com/news/hhs-public-health-surveillance-twitterapp-44208-1.html

8) Gillespie, Joseph, et al. "PATRIC: The Comprehensive Bacterial Bioinformatics Resource with a Focus on Human Pathogenic Species." Infection and Immunity. American Society for Microbiology, 6 Sept. 2011. Web. 1 June 2012. http://iai.asm.org/content/79/11/4286.abstract
