
Managing the 'Billions and Billions' of the Data Deluge

It's pretty safe to say that, unless you've been living in a cave for the past 10 years, your daily actions and behavior have contributed to big data at some point, in some capacity or other.
The term "big data" seems to have evolved over time. The appearance of the term in an October
1997 article by Michael Cox and David Ellsworth, "Application-controlled demand paging for out-of-core visualization," is certainly one of the first.
But big data is by no means new. Scientific research has been generating and publishing large volumes of data sets for well over 150 years. Noted astrophysicist Carl Sagan alluded to numerous
examples of it in his final book, Billions and Billions, published well over 15 years ago. But the
development of the Internet and the shift from an offline world to an online world opened the
floodgates, creating a data deluge that continues to grow at an astounding rate. In the latest Digital
Universe study, technology analyst IDC reveals that the digital universe is expected to double every
two years and will multiply 10-fold between now and 2020 -- from 4.4 trillion gigabytes to 44 trillion
gigabytes. To give you a sense of scale, if you imagine each byte of data as equal to one inch, it
would be like 1 million round trips between Earth and Pluto.
Big data has certainly arrived in a big way, but big insights on effective management of it all lag
behind. In my mind, big data has reached two frontiers. The first is privacy. The recent decision by
the European Court of Justice on the "right to be forgotten" set off a firestorm of debate. Does the
public really have a right to know everything? Or does the right to privacy trump freedom of
information? Scientific research is disputed on a regular basis, and those disputes are considered
part of the discovery process at large, but what if a researcher wanted to delete all references to
heavily disputed papers in order to protect his or her reputation? Should this be allowed? What
policies must be developed to balance big data collection and reputation? Is that even possible?
The second frontier is the analysis and management of research data to unleash the power of the
information that it holds. A recent editorial in Big Data Research argues that the true value of big
data lies in knowledge derived from analysis. By its very nature, scientific big data is self-perpetuating. Researchers generate multiple data sets as a byproduct of their own research, and
those data sets are then used and cited in other research. Digital information solutions providers like
Elsevier, my employer, sit on vast databases of high-quality scientific, technical and medical
research content that has been collected, curated, aggregated, disseminated and published for more than a century. With so much scientific big data available -- and increasing as we speak -- finding the right information is akin to looking for a needle in a pile of needles. Without sophisticated analytics to provide meaning and context, a gigantic pile of data is just that: a pile of data, not particularly useful in itself.
Sources of research information are also expanding and include social media streams, images, audio
and video files and crowdsourced data. We're also seeing bigger files (seismic scans can be 5
terabytes per file) and massive numbers of smaller files (email, social media, etc.), all potentially
valuable data when processed alongside other appropriate and relevant sources. New capture,
search, discovery and analysis tools can provide insights from the increasing pools of unstructured
data, which account for more than 90 percent of the digital universe. It is crucial, then, for those of
us in scholarly publishing to help researchers find relevant data quickly through smart collection
tools, recommended reading lists and data banks that offer a variety of sort and search applications.
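To make the needle-in-a-pile-of-needles problem concrete, here is a minimal sketch of keyword-based relevance ranking over a handful of abstracts. It assumes Python with scikit-learn, and the sample abstracts and query are invented purely for illustration; real discovery tools are far more sophisticated.

    # Minimal sketch: rank a few abstracts against a query by TF-IDF similarity.
    # Assumes scikit-learn is installed; the abstracts and query are invented examples.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    abstracts = [
        "Seismic scan compression for large geophysical surveys",
        "Crowdsourced image labeling for biodiversity research",
        "Predictive analytics on clinical trial outcomes",
    ]
    query = ["predictive models for clinical trials"]

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(abstracts)   # index the collection
    query_vector = vectorizer.transform(query)          # vectorize the query

    # Higher cosine similarity means a more relevant abstract.
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    for score, text in sorted(zip(scores, abstracts), reverse=True):
        print(f"{score:.2f}  {text}")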

Intelligence around data has been part of information and communications technology for over 20 years, at the core of business intelligence (BI) and data warehousing applications. However, it is limited in scope and typically thought of as retrospective, dealing only with structured data to
analyze what has already happened. Big data, by contrast, can be prospective. Through the use of
advanced analytics and predictive tools, it projects potential outcomes by studying structured data as well as an ever-expanding pool of unstructured data from multiple sources, and then shares the results across collaborative platforms to surface meaningful correlations. It can reveal such
insights as:
"This happened because of..." (i.e., diagnostic analytics)"What will happen if...?" (i.e., predictive
analytics)"This can happen / can be avoided by doing..." (i.e., prescriptive analytics)
For example, a study from the McKinsey Global Institute estimates that if the U.S. healthcare sector can exploit and process the vast ocean of electronic information at its disposal, such as data from clinical trials and research experiments alongside insurance data, it could improve the efficiency and effectiveness of U.S. health care by more than US$300 billion a year. Strategically managed, then, big data can yield previously unknown but useful information and insights that spark ideas, drive new discoveries and fuel the cycle of academic research.
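To make the predictive case above ("What will happen if...?") concrete, here is a minimal sketch that fits a linear model to a small structured table of past observations and projects an outcome. The numbers and variable names are invented, and scikit-learn is assumed to be available; real predictive analytics would draw on far larger, curated data sets.

    # Minimal predictive-analytics sketch: project an outcome from past structured data.
    # The observations are invented and serve only to illustrate the fit-then-predict pattern.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Past observations: data sets shared by a research group vs. citations two years later.
    datasets_shared = np.array([[1], [2], [4], [6], [8]])
    citations_later = np.array([3, 7, 12, 19, 25])

    model = LinearRegression().fit(datasets_shared, citations_later)

    # "What will happen if a group shares 10 data sets?"
    projected = model.predict(np.array([[10]]))[0]
    print(f"Projected citations: {projected:.1f}")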

Recent big data management and processing systems, such as high-performance computing cluster
(HPCC) systems and Apache Hadoop, are able to correlate and analyze large and varied types of
data sets. Developed by LexisNexis Risk Solutions, HPCC is used to solve complex data and analytics
challenges by combining proven processing methodologies with proprietary linking algorithms that
turn data into intelligence that can be applied to improving the quality of research outcomes. The
Apache Hadoop software library is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models. The harvesting of big data
by such methods enables information providers to facilitate information exchanges that create
opportunities for serendipitous discoveries by breaking down discipline silos. In short,
information about information is gold.
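To give a feel for the "simple programming models" behind Hadoop-style processing, here is a minimal single-machine sketch of the map, shuffle and reduce steps, counting term occurrences across a few record snippets. The records are invented, and a real Hadoop job would distribute the same steps across a cluster rather than run them in one process.

    # Minimal single-machine sketch of the MapReduce pattern used by Hadoop-style systems.
    # The records are invented; a real job would shard them across many cluster nodes.
    from collections import defaultdict

    records = [
        "clinical trial data linked to insurance claims",
        "seismic scan archive linked to survey metadata",
        "clinical imaging data from crowdsourced studies",
    ]

    # Map: emit (term, 1) pairs from each record, independently of the others.
    mapped = [(term, 1) for record in records for term in record.split()]

    # Shuffle: group the intermediate pairs by key.
    grouped = defaultdict(list)
    for term, count in mapped:
        grouped[term].append(count)

    # Reduce: aggregate the counts for each key.
    totals = {term: sum(counts) for term, counts in grouped.items()}
    print(sorted(totals.items(), key=lambda item: -item[1])[:5])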

As with all new technology, we do not and cannot know what can be fully achieved with big data
solutions. As the century advances, the "billions and billions" of data will quickly grow into "trillions
and trillions"; therefore, when talking about big data, it should be articulated that big data is itself a
journey of discovery.
Carl Sagan loved discoveries. I like to think he would have been pleased.

http://www.huffingtonpost.com/olivier-dumon/innovations-in-science-ma_b_5960306.html
