Beruflich Dokumente
Kultur Dokumente
Lecture 12
Big Data
Outline
Part 1: Concepts of Big Data and its terminology
Part 2: Big Data in Networking
Part 3: A recent work by Microsoft Research Cambridge
Camdoop
Network Requirements
Recent Developments
Google FS
SDN
NFV
Big Tables
MapReduce
Analytics
BIG DATA
Data is measured by 3V's:
Volume: TB
Velocity: TB/sec. Speed
Variety: Type (Text, audio, video, images, geospatial, ...)
[1/2]
1.
2.
3.
4.
5.
7.
8.
9.
Open-source software
ACID Requirements
Atomicity: All or nothing. If anything fails, entire transaction fails.
Example, Payment and ticketing.
Consistency: If there is error in input, the output will not be written to
the database. Database goes from one valid state to another valid
states.
Valid=Does not violate any defined rules.
Terminology
Structured Data: Data that has a pre-set format, e.g., Address Books, product
catalogs, banking transactions,
Unstructured Data: Data that has no pre-set format. Movies, Audio, text files, web
pages, computer programs, social media,
Semi-Structured Data: Unstructured data that can be put into a structure by
available format descriptions
80% of data is unstructured.
Batch vs. Streaming Data
Real-Time Data: Streaming data that needs to analyzed as it comes in. E.g.,
Intrusion detection. Aka Data in Motion
Data at Rest: Non-real time. E.g., Sales analysis.
Metadata: Definitions, mappings, scheme
10
11
Big Table
Distributed storage system built on Google File System
Data stored in rows and columns
Optimized for sparse, persistent, multidimensional sorted
map.
Uses commodity servers
Not distributed outside of Google but accessible via
Google App Engine
12
MapReduce
Software framework to process massive amounts of unstructured data in parallel
Goals:
Map: Takes a set of data and converts it into another set of key-value pairs..
Reduce: Takes the output from Map as input and outputs a smaller set of keyvalue pairs
13
MapReduce Example
100 files with daily temperature in two cities. Each file has 10,000
entries.
For example, one file may have (Toronto 20), (New York 30), ..
Our goal is to compute the maximum temperature in the two cities.
Assign the task to 100 Map processors each works on one file. Each
processor outputs a list of key-value pairs, e.g., (Toronto 30), New
York (65), ...
Now we have 100 lists each with two elements. We give this list to
two reducers one for Toronto and another for New York.
The reducer produce the final answer: (Toronto 55), (New York 65)
14
Optimization in MapReduce
Scheduling:
Synchronization:
Code/Data Collocation:
The data for map jobs should be at the processors that are going to map.
Fault/Error Handling:
15
Hadoop [1/2]
An open source implementation of MapReduce framework
Three components:
Hadoop Common Package (files needed to start Hadoop)
Hadoop Distributed File System: HDFS
MapReduce Engine
HDFS
16
Hadoop [2/2]
MapReduce Engine
Data node: Constantly ask the job tracker if there is
something for them to do
Used to track which data nodes are up or down
Job tracker: Assigns the map job to task tracker nodes that
have the data or are close to the data (same rack)
Task Tracker: Keep the work as close to the data as possible.
17
18
Columnar Database
Relational Databases:
In Relational databases, data in each row of the table is stored together:
Eg: 001:101,Smith,10000; 002:105,Jones,20000; 003:106,John;15000
Easy to find all information about a person.
Difficult to answer queries about the aggregate:
How many people have salary between 12k-15k?
Columnar Databases:
In Columnar databases, data in each column is stored together.
Eg. 101:001,105:002,106:003; Smith:001, Jones:002,003; 10000:001, 20000:002,
150000:003
Easy to get column statistics
Very easy to add columns
Good for data with high variety simply add columns
20
Analytics [1/2]
Analytics: Guide decision making by discovering patterns in data
using statistics, programming, and operations research.
SQL Analytics: Count, Mean
Descriptive Analytics: Analyzing historical data to explain past
success or failures.
Predictive Analytics: Forecasting using historical data.
Prescriptive Analytics: Suggest decision options. Continually update
these options with new data.
Data Mining: Discovering patterns, trends, and relationships using
Association rules, Clustering, Feature extraction
21
Analytics [2/2]
Machine Learning: An algorithm technique for learning
from empirical data and then using those lessons to
predict future outcomes of new data
Web Analytics: Analytics of Web Accesses and Web
users.
Learning Analytics: Analytics of learners (students)
Data Science: Field of data analytics (everything inside
the analytics field)
22
23
2.
Virtualization
3.
4.
25
1. Virtualization [1/6]
Recent networking technologies and standards:
1.
2.
Virtualizing Computation
Virtualizing Storage
3.
4.
5.
26
1. Virtualization [2/6]
1.1 Virtualizing Computation
Initially data centers consisted of multiple IP subnets
Each subnet = One Ethernet Network
Ethernet addresses are globally unique and do not change
IP addresses are locators and change every time you move
If a VM moves inside a subnet No change to IP address
Fast
If a VM moves from one subnet to another Its IP address
changes All connections break Slow Limited VM mobility
27
1. Virtualization [3/6]
1.2 Virtualization Storage
Initially data centers used Storage Area Networks (Fibre Channel)
for server-to-storage communications and Ethernet for server-toserver communication
IEEE added 4 new standards to make Ethernet offer low loss, low
latency service like Fibre Channel:
28
1. Virtualization [4/6]
29
Multi-Root IOV
Multi-root IO Virtualization (MRIOV) standard allows
multiple PCIe cards to server multiple servers and VMs in
the same rack
30
1. Virtualization [5/6]
1.4 Virtualizing Data Center Storage
IEEE 802.1BR-2012 Virtual Bridgeport Extension (VBE) allows
multiple switches to combine in to a very large switch
Storage and computers located anywhere in the data
center appear as if connected to the same switch
31
1. Virtualization [6/6]
1.5 Virtualizing Metro Storage
Data center Interconnection standards:
Virtual Extensible LAN (VXLAN),
Network Virtualization using GRE (NVGRE), and
Transparent Interconnection of Lots of Link (TRILL)
32
Virtualization
Virtualizing Global Storage:
Energy Science Network (ESNet) uses virtual switch to
connect members located all over the world
Virtualization Fluid networks The world is flat You draw
your network Every thing is virtually local
33
34
35
Tomorrows
Datacenter
Tens of tenants
1k of clients
10k of pSwitches
Thousands of servers
1M of VMs
Hundreds of Administrators
Tens of Administrators
100k of vSwitches
36
Internet of things
Big Data generation and analytics
37
38
39
References
MapReduce:
http://static.googleusercontent.com/media/research.google.com/en//archive/
mapreduce-osdi04.pdf
IBMs MapReduce:
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
F. Chang, et al., "Bigtable: A Distributed Storage System for Structured Data," 2006,
http://static.googleusercontent.com/media/research.google.com/en//archive/
bigtable-osdi06.pdf
Wikipedia
40
Questions?
41
Thank you
42