LTCN L13

Future Trends in Computer Network
Lecture 12
Big Data
Raheel Ahmed Memon

1
Outline
Part 1: Concepts of Big Data and its terminology
Part 2: Big Data in Networking
Part 3: A recent work by Microsoft Research Cambridge
Camdoop
Part 1 & 2 Organization

Part [1] Database
Part [2] Networking
What is Big Data?
Big Data in Networking
Why Big Data Now?
Network Requirements
Where Big Data Can be used?
Recent Developments
Some Database concepts
High speed Ethernet

Virtualization
Google FS
SDN
NFV
Big Tables
MapReduce
Big Data Tools
Analytics
Datacenters Present and Future
BIG DATA
Data is measured by 3V's:
Volume: TB
Velocity: TB/sec. Speed
Variety: Type (Text, audio, video, images, geospatial, ...)
Increasing processing power, storage capacity, and

networking have caused data to grow in all 3 dimensions.
Volume, Location, Velocity, Churn, Variety, Veracity
(accuracy, correctness, applicability)
Examples: social network data, sensor networks, Internet
Search, astronomy, ...
5
WHY BIG DATA NOW?
[1/2]
1.
Low cost storage to store data that was discarded earlier
2.
Powerful multi-core processors
3.
Low latency possible by distributed computing: Compute

clusters and grids connected via high-speed networks
4.
Virtualization Partition, Aggregate, isolate resources in

any size and dynamically change it Minimize latency for
any scale
5.
Affordable storage and computing with minimal man

power via clouds
Possible because of advances in Networking
WHY BIG DATA NOW? [2/2]

6.
Better understanding of task distribution (MapReduce)
7.
Advanced analytical techniques
8.
Managed Big Data Platforms: Cloud service providers,

such as Amazon Web Services provide Elastic MapReduce,
Simple Storage Service (S3) and HBase column oriented
database. Google BigQuery and Prediction API.
9.
Open-source software
10. Obama announced $200M for Big Data research.

Distributed via NSF, NIH, DOE, DoD, DARPA, and USGS
(Geological Survey)
Big Data Applications

Monitor premature infants to alert when interventions is
needed
Predict machine failures in manufacturing
Prevent traffic jams, save fuel, reduce pollution
ACID Requirements
Atomicity: All or nothing. If anything fails, entire transaction fails.
Example, Payment and ticketing.
Consistency: If there is error in input, the output will not be written to
the database. Database goes from one valid state to another valid
states.
Valid=Does not violate any defined rules.
Isolation: Multiple parallel transactions will not interfere with each

other.
Durability: After the output is written to the database, it stays there
forever even after power loss, crashes, or errors.
Relational databases provide ACID while non-relational databases
aim for BASE (Basically Available, Soft, and Eventual Consistency)
9
Terminology
Structured Data: Data that has a pre-set format, e.g., Address Books, product
catalogs, banking transactions,
Unstructured Data: Data that has no pre-set format. Movies, Audio, text files, web
pages, computer programs, social media,
Semi-Structured Data: Unstructured data that can be put into a structure by
available format descriptions
80% of data is unstructured.
Batch vs. Streaming Data
Real-Time Data: Streaming data that needs to analyzed as it comes in. E.g.,
Intrusion detection. Aka Data in Motion
Data at Rest: Non-real time. E.g., Sales analysis.
Metadata: Definitions, mappings, scheme
10
Google File System

Commodity computers serve as Chunk Servers and store multiple
copies of data blocks
A master server keeps a map of all chunks of files and location of
those chunks.
All writes are propagated by the writing chunk server to other
chunk servers that have copies.
Master server controls all read-write accesses
11
Big Table
Distributed storage system built on Google File System
Data stored in rows and columns
Optimized for sparse, persistent, multidimensional sorted
map.
Uses commodity servers
Not distributed outside of Google but accessible via
Google App Engine
12
MapReduce
Software framework to process massive amounts of unstructured data in parallel
Goals:
Distributed: over a large number of inexpensive processors

Scalable: expand or contract as needed
Fault tolerant: Continue in spite of some failures
Map: Takes a set of data and converts it into another set of key-value pairs..
Reduce: Takes the output from Map as input and outputs a smaller set of keyvalue pairs
13
MapReduce Example
100 files with daily temperature in two cities. Each file has 10,000
entries.
For example, one file may have (Toronto 20), (New York 30), ..
Our goal is to compute the maximum temperature in the two cities.
Assign the task to 100 Map processors each works on one file. Each
processor outputs a list of key-value pairs, e.g., (Toronto 30), New
York (65), ...
Now we have 100 lists each with two elements. We give this list to
two reducers one for Toronto and another for New York.
The reducer produce the final answer: (Toronto 55), (New York 65)
14
Optimization in MapReduce
Scheduling:
Task is broken into pieces that can be computed in parallel

Map tasks are scheduled before the reduce tasks.
If there are more map tasks than processors, map tasks continue until all of them are
complete.
A new strategy is needed to assign Reduce jobs so that it can be done in parallel
The results are combined.
Synchronization:
The map jobs should be comparable so that they finish together.

Similarly reduce jobs should be comparable.
Code/Data Collocation:
The data for map jobs should be at the processors that are going to map.
Fault/Error Handling:
If a processor fails, its task needs to be assigned to another processor.
15
Hadoop [1/2]
An open source implementation of MapReduce framework
Three components:
Hadoop Common Package (files needed to start Hadoop)
Hadoop Distributed File System: HDFS
MapReduce Engine
HDFS
Requires data to be broken into blocks. Each block is

stored on 2 or more data nodes on different racks.
Name node: Manages the file system name space
keeps track of where each block is.
16
Hadoop [2/2]
MapReduce Engine
Data node: Constantly ask the job tracker if there is
something for them to do
Used to track which data nodes are up or down
Job tracker: Assigns the map job to task tracker nodes that
have the data or are close to the data (same rack)
Task Tracker: Keep the work as close to the data as possible.
17
Apache Hadoop Tools
18
Other Apache Big Data Tools

Big SQL
Apache Elastic MapReduce
Apache HyperTable
File System-in User Space FUSE

19
Columnar Database
Relational Databases:
In Relational databases, data in each row of the table is stored together:
Eg: 001:101,Smith,10000; 002:105,Jones,20000; 003:106,John;15000
Easy to find all information about a person.
Difficult to answer queries about the aggregate:
How many people have salary between 12k-15k?
Columnar Databases:
In Columnar databases, data in each column is stored together.
Eg. 101:001,105:002,106:003; Smith:001, Jones:002,003; 10000:001, 20000:002,
150000:003
Easy to get column statistics
Very easy to add columns
Good for data with high variety simply add columns
20
Analytics [1/2]
Analytics: Guide decision making by discovering patterns in data
using statistics, programming, and operations research.
SQL Analytics: Count, Mean
Descriptive Analytics: Analyzing historical data to explain past
success or failures.
Predictive Analytics: Forecasting using historical data.
Prescriptive Analytics: Suggest decision options. Continually update
these options with new data.
Data Mining: Discovering patterns, trends, and relationships using
Association rules, Clustering, Feature extraction
21
Analytics [2/2]
Machine Learning: An algorithm technique for learning
from empirical data and then using those lessons to
predict future outcomes of new data
Web Analytics: Analytics of Web Accesses and Web
users.
Learning Analytics: Analytics of learners (students)
Data Science: Field of data analytics (everything inside
the analytics field)
22
Big Data and Networking
23
Networking Requirements for Big

Data
Code/Data Collocation: The data for map jobs should be at
the processors that are going to map.
Elastic bandwidth: to match the variability of volume
Fault/Error Handling: If a processor fails, its task needs to be
assigned to another processor.
Security: Access control (authorized users only), privacy
(encryption), threat detection, all in real-time in a highly
scalable manner
Synchronization: The map jobs should be comparable so
that they finish together. Similarly reduce jobs should be
comparable.
24
Recent Developments in Networking

1.
High-Speed: 100 Gbps Ethernet

400 Gbps 1000 Gbps
Cheap storage access. Easy to move big data.
2.
Virtualization
3.
Software Defined Networking
4.
Network Function Virtualization
25
1. Virtualization [1/6]
Recent networking technologies and standards:
1.
2.
Virtualizing Computation
Virtualizing Storage
3.
4.
Virtualizing Rack Storage Connectivity

Virtualizing Data Center Storage
5.
Virtualizing Metro and Global Storage
26
1.1 Virtualizing Computation
Initially data centers consisted of multiple IP subnets
Each subnet = One Ethernet Network
Ethernet addresses are globally unique and do not change
IP addresses are locators and change every time you move
If a VM moves inside a subnet No change to IP address
Fast
If a VM moves from one subnet to another Its IP address
changes All connections break Slow Limited VM mobility
IEEE 802.1ad-2005 Ethernet Provider Bridging (PB), IEEE

802.1ah-2008 Provider Backbone Bridging (PBB) allow
Ethernets to span long distances Global VM mobility
27
1.2 Virtualization Storage
Initially data centers used Storage Area Networks (Fibre Channel)
for server-to-storage communications and Ethernet for server-toserver communication
IEEE added 4 new standards to make Ethernet offer low loss, low
latency service like Fibre Channel:
Priority-based Flow Control (IEEE 802.1Qbb-2011)

Enhanced Transmission Selection (IEEE 802.1Qaz-2011)
Congestion Control (IEEE 802.1Qau-2010)
Data Center Bridging Exchange (IEEE 802.1Qaz-2011)
Result: Unified networking Significant CapEx/OpEx saving
28
1.3 Virtualizing Rack Storage Connectivity

MapReduce jobs are assigned to the nodes that have the
data
Job tracker assigns jobs to task trackers in the rack where
the data is.
High-speed Ethernet can get the data in the same rack.
Peripheral Connect Interface (PCI) Special Interest Group
(SIG)s Single Root I/O virtualization (SR- IOV) allows a
storage to be virtualized and shared among multiple VMs.
29
Multi-Root IOV
Multi-root IO Virtualization (MRIOV) standard allows
multiple PCIe cards to server multiple servers and VMs in
the same rack
30
1.4 Virtualizing Data Center Storage
IEEE 802.1BR-2012 Virtual Bridgeport Extension (VBE) allows
multiple switches to combine in to a very large switch
Storage and computers located anywhere in the data
center appear as if connected to the same switch
31
1.5 Virtualizing Metro Storage
Data center Interconnection standards:
Virtual Extensible LAN (VXLAN),
Network Virtualization using GRE (NVGRE), and
Transparent Interconnection of Lots of Link (TRILL)
Data centers located far away to appear to be on the

same Ethernet
32
Virtualization
Virtualizing Global Storage:
Energy Science Network (ESNet) uses virtual switch to
connect members located all over the world
Virtualization Fluid networks The world is flat You draw
your network Every thing is virtually local
33

Centralized Programmable Control Plane
Allows automated orchestration (provisioning) of a large
number of virtual resources (machines, networks, storage)
Large Hadoop topologies can be created on demand
34
Network Function Virtualization (NFV)

Fast standard hardware Software based Devices Virtual networking modules
(DHCP, Firewall, DNS, ...) running on standard processors
Modules can be combined to create any combination of function for data
privacy, access control, ...
Virtual Machine implementation Quick provisioning
Standard Application Programming Interfaces (APIs)
Networking App Market

Privacy and Security for Big data in the multi-tenant clouds
35
Big Data for Networking

Today's Datacenter
Tomorrows
Datacenter
Tens of tenants
1k of clients
Hundreds of switches and

routers
10k of pSwitches
Thousands of servers
1M of VMs
Hundreds of Administrators
Tens of Administrators
100k of vSwitches
36
Big Data for Networking

Need to monitor traffic patterns and rearrange virtual
networks connecting millions of VMs in real-time
Managing clouds is a real-time big data problem.
Internet of things
Big Data generation and analytics
37
Summary Part [1]

There is large data in world, and we were discarding stuff without analyzing it, but
Big Data allows to analyze and get meaning full results from that data
Now we have several mature processes, we have cheep computing, cheep
networking, we have virtualization so analyzing Big Data is possible
Big data for Big Change
Relational, non relational, columnar databases.
Google FS, Data Nodes, Task Trackers, Job Trackers, Blocks
MapReduce
Big Data Tools
Analytics: Descriptive, predictive, prescriptive
38
Summary Part [2]

I/O virtualization allows all storage in the rack to appear
local to any VM in that rack Solves the co-location
problem of MapReduce
Network virtualization allows storage anywhere in the data
center or even other data centers to appear local
Software defined networking allows orchestration of a large
number of resources Dynamic creation of Hadoop
clusters
Network function virtualization will allow these clusters to
have special functions and security in multi-tenant clouds.
39
References
Hurwitz Big Data for Dummies, Wiley, 2013

http://pdf.th7.cn/down/files/1312/big_data_for_dummies.pdf
MapReduce:
http://static.googleusercontent.com/media/research.google.com/en//archive/
mapreduce-osdi04.pdf
IBMs MapReduce:
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
Google File System:

http://static.googleusercontent.com/media/research.google.com/en//archive/gfssosp2003.pdf
F. Chang, et al., "Bigtable: A Distributed Storage System for Structured Data," 2006,
http://static.googleusercontent.com/media/research.google.com/en//archive/
bigtable-osdi06.pdf
Wikipedia
40
Questions?
41
Thank you
42

LTCN L13

Hochgeladen von

Dokumentinformationen

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

LTCN L13

Hochgeladen von

Copyright:

Verfügbare Formate

Future Trends in Computer Network

Raheel Ahmed Memon

Part 1 & 2 Organization

Part [2] Networking

What is Big Data?

Big Data in Networking

Why Big Data Now?

Where Big Data Can be used?

Some Database concepts

High speed Ethernet

Big Data Tools

Datacenters Present and Future

Increasing processing power, storage capacity, and

WHY BIG DATA NOW?

Low cost storage to store data that was discarded earlier

Powerful multi-core processors

Low latency possible by distributed computing: Compute

Virtualization Partition, Aggregate, isolate resources in

Affordable storage and computing with minimal man

WHY BIG DATA NOW? [2/2]

Better understanding of task distribution (MapReduce)

Advanced analytical techniques

Managed Big Data Platforms: Cloud service providers,

10. Obama announced $200M for Big Data research.

Big Data Applications

Isolation: Multiple parallel transactions will not interfere with each

Google File System

Distributed: over a large number of inexpensive processors

Task is broken into pieces that can be computed in parallel

The map jobs should be comparable so that they finish together.

If a processor fails, its task needs to be assigned to another processor.

Requires data to be broken into blocks. Each block is

Apache Hadoop Tools

Other Apache Big Data Tools

File System-in User Space FUSE

Big Data and Networking

Networking Requirements for Big

Recent Developments in Networking

High-Speed: 100 Gbps Ethernet

Software Defined Networking

Network Function Virtualization

Virtualizing Rack Storage Connectivity

Virtualizing Metro and Global Storage

IEEE 802.1ad-2005 Ethernet Provider Bridging (PB), IEEE

Priority-based Flow Control (IEEE 802.1Qbb-2011)

Result: Unified networking Significant CapEx/OpEx saving

1.3 Virtualizing Rack Storage Connectivity

Data centers located far away to appear to be on the

Software Defined Networking

Network Function Virtualization (NFV)

Networking App Market

Big Data for Networking

Hundreds of switches and

Big Data for Networking

Summary Part [1]

Summary Part [2]

Hurwitz Big Data for Dummies, Wiley, 2013

Google File System:

Das könnte Ihnen auch gefallen