• Data mining:
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) information or patterns from data in large databases
– Organizations integrate their various databases into
data warehouses. Data warehousing is defined as a
process of centralized data management and retrieval
(data capture, processing power, data transmission,
and storage capabilities).
• Knowledge mining (knowledge discovery in databases):
– Extraction of interesting (previously unknown and potentially useful)
models from data in large databases
Definitions
• Data stream: A sequence of digitally encoded
signals used to represent information in
transmission [Federal Standard 1037C data
stream].
Decimal
Value   Metric
1000    kB   kilobyte
1000²   MB   megabyte
1000³   GB   gigabyte
1000⁴   TB   terabyte
1000⁵   PB   petabyte
1000⁶   EB   exabyte
1000⁷   ZB   zettabyte
1000⁸   YB   yottabyte
Binary
Value   JEDEC          IEC
1024    KB  Kilobyte   KiB  kibibyte
1024²   MB  Megabyte   MiB  mebibyte
1024³   GB  Gigabyte   GiB  gibibyte
1024⁴   TB  Terabyte   TiB  tebibyte
1024⁵                  PiB  pebibyte
1024⁶                  EiB  exbibyte
1024⁷                  ZiB  zebibyte
1024⁸                  YiB  yobibyte
See also: Bit and Byte prefixes
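The gap between the decimal (SI) and binary (IEC) prefixes in the tables above grows with each power, which is why a "1 TB" drive reports as roughly 0.91 TiB. A quick sketch of the arithmetic (helper names are illustrative, not from any standard library):

```python
# Decimal (SI) vs. binary (IEC) byte prefixes:
# 1 kB = 1000 bytes, while 1 KiB = 1024 bytes.

def si_bytes(power: int) -> int:
    """Bytes in 1000**power (kB, MB, GB, TB, ...)."""
    return 1000 ** power

def iec_bytes(power: int) -> int:
    """Bytes in 1024**power (KiB, MiB, GiB, TiB, ...)."""
    return 1024 ** power

# The ratio shrinks as the exponent grows: the prefixes drift apart.
for power, (si, iec) in enumerate(
        [("kB", "KiB"), ("MB", "MiB"), ("GB", "GiB"), ("TB", "TiB")],
        start=1):
    print(f"1 {si} = {si_bytes(power) / iec_bytes(power):.4f} {iec}")
```

At power 1 the ratio is about 0.977; by power 4 (TB vs. TiB) it has dropped to about 0.909.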
Orders of magnitude of data
Background
• Big data is difficult to work with using most relational
database management systems and desktop statistics
and visualization packages; it requires instead
"massively parallel software running on tens,
hundreds, or even thousands of servers".
• The trend to larger data sets is due to the additional
information derivable from analysis of a single large set
of related data, as compared to separate smaller sets
with the same total amount of data, allowing
correlations to be found to "spot business trends,
determine quality of research, prevent diseases, link
legal citations, combat crime, and determine real-time
roadway traffic conditions."
Background- Evolution of Database
Technology
• 1960s: Data collection, database creation, IMS and
network DBMS
• 1970s: Relational data model, relational DBMS
implementation
• 1980s: RDBMS, advanced data models (extended-
relational, OO, deductive, etc.) and application-oriented
DBMS (spatial, scientific, engineering, etc.)
• 1990s—2000s: Data mining and data warehousing,
multimedia databases, and Web databases
• 2000s–2020s: Cloud-based distributed databases, Hadoop
Background
• Business Intelligence uses descriptive statistics
on data with high information density to
measure things, detect trends, etc.
• Big Data uses inductive statistics on data [4]
with low information density, whose huge
volume allows laws (regressions, …) to be
inferred, thus giving Big Data (within the
limits of inferential reasoning) some predictive
capabilities.
Background
• Big data requires exceptional technologies to
efficiently process large quantities of data
within tolerable elapsed times.
• Real or near-real time information delivery is
one of the defining characteristics of big data
analytics.
Background
• In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework for data storage and querying.
Structured, semi-structured, and/or unstructured data is stored and distributed across multiple servers.
Data is queried with a C++-based language called ECL, which applies a schema-on-read method to build the
structure of the stored data at query time.
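The schema-on-read idea mentioned above — storing raw data as-is and applying structure only when a query runs — can be sketched in miniature. This is an illustrative Python analogy, not the actual ECL language; all names here are hypothetical:

```python
# Hypothetical sketch of "schema on read": raw lines are stored untyped,
# and a schema (field name -> type) is applied only at query time.
# This illustrates the concept; it is not the real ECL language.

RAW_STORE = [
    "1001,alice,2400",
    "1002,bob,1750",
]

def query(raw_lines, schema):
    """Apply `schema` to raw CSV lines when the query runs."""
    names = list(schema)
    for line in raw_lines:
        values = line.split(",")
        # Types are imposed here, at read time, not at load time.
        yield {n: schema[n](v) for n, v in zip(names, values)}

rows = list(query(RAW_STORE, {"id": int, "user": str, "amount": int}))
print(rows[0])  # {'id': 1001, 'user': 'alice', 'amount': 2400}
```

The same raw store could be re-queried with a different schema, which is the point: structure is a property of the query, not of the stored data.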
• In 2004, LexisNexis acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. and its high-speed parallel
processing platform. The two platforms were merged into HPCC Systems, which was open-sourced in 2011
under the Apache v2.0 license. Currently, HPCC and the Quantcast File System are the only publicly available
platforms to handle multiple exabytes of data.
• In 2004, Google published a paper on a process called MapReduce that used such an architecture. The
MapReduce framework provides a parallel processing model and an associated implementation to process
huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and
processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The
framework was very successful,[51] so others wanted to replicate the algorithm. Therefore, an
implementation of the MapReduce framework was adopted by an Apache open-source project named
Hadoop.
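The Map and Reduce steps described above can be sketched in a single-process Python miniature. This only illustrates the programming model (map, shuffle by key, reduce); a real cluster runs the map and reduce tasks in parallel across nodes:

```python
# Minimal single-process sketch of the MapReduce model:
# Map emits (key, value) pairs, pairs are grouped by key (shuffle),
# and Reduce aggregates each group. Word count is the classic example.
from collections import defaultdict

def map_step(document):
    """Map: emit (word, 1) for every word in one input split."""
    for word in document.split():
        yield (word, 1)

def reduce_step(word, counts):
    """Reduce: aggregate all values emitted for one key."""
    return (word, sum(counts))

def mapreduce(documents):
    groups = defaultdict(list)      # shuffle: group values by key
    for doc in documents:           # on a cluster, maps run in parallel
        for key, value in map_step(doc):
            groups[key].append(value)
    return dict(reduce_step(k, v) for k, v in groups.items())

print(mapreduce(["big data big", "data lake"]))
# {'big': 2, 'data': 2, 'lake': 1}
```

Because each map call touches only its own split and each reduce call touches only one key's group, both steps distribute across servers with no shared state, which is what makes the model scale.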
• MIKE2.0 is an open approach to information management that acknowledges the need for revisions due
to big-data implications in an article titled "Big Data Solution Offering". The methodology addresses
handling big data in terms of useful permutations of data sources, complexity in interrelationships, and
difficulty in deleting (or modifying) individual records.
• Recent studies show that a multiple-layer architecture is one option for dealing with big data. A
distributed parallel architecture spreads data across multiple processing units, and the parallel processing
units deliver data much faster by improving processing speeds. This type of architecture inserts data into
a parallel DBMS, which implements the use of the MapReduce and Hadoop frameworks. This type of
framework looks to make the processing power transparent to the end user by using a front-end
application server.
Definitions
• The Apache Hadoop software library is a framework
that allows for the distributed processing of large data
sets across clusters of computers using simple
programming models. It is designed to scale up from
single servers to thousands of machines, each offering
local computation and storage.
• Hadoop is a rapidly evolving ecosystem of components
for implementing the Google MapReduce algorithms
[3] in a scalable fashion on commodity hardware.
Hadoop enables users to store and process large
volumes of data and analyze it in ways not previously
possible with less scalable solutions or standard SQL-
based approaches.
Definitions
• Hadoop is a highly scalable compute and storage
platform. While most users will not initially
deploy servers numbered in the hundreds or
thousands, Dell recommends following the design
principles that drive large, hyper-scale
deployments. This ensures that as you start with
a small Hadoop environment, you can easily scale
that environment without rework to existing
servers, software, deployment strategies, and
network connectivity [2].