
Big Data Analytics

A crash course
What is Big Data
Large and complex datasets
Structured, semi-structured or unstructured
Typically too large to fit in memory for
processing
Distributed storage structure
3Vs of Big Data
Velocity
Volume
Variety
Velocity
Low-latency, real-time data flow
Examples
Telephone call records
Social media
Retail sales

Volume
Size of dataset
KB, MB, GB, TB, PB
Facebook
40 PB of data
100 TB/day
Twitter
8 TB/day
Yahoo
60 PB of data
Big Data size varies from company to company
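To put the volume figures above in perspective, the following minimal sketch converts between units and estimates how long a given daily ingest rate would take to reach a given total. It assumes decimal (SI) units and a constant daily rate, neither of which is stated in the slides; the helper names are hypothetical.

```python
# Decimal (SI) storage units, expressed in bytes (assumption: the slides do not
# say whether decimal or binary units are meant).
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}

def to_bytes(value, unit):
    """Convert a size such as (40, "PB") to bytes."""
    return value * UNITS[unit]

def days_to_accumulate(total, total_unit, daily, daily_unit):
    """Days of ingest at a constant daily rate needed to reach a total size."""
    return to_bytes(total, total_unit) / to_bytes(daily, daily_unit)

# Figures quoted in the slide: 40 PB in total, 100 TB/day of new data.
facebook_days = days_to_accumulate(40, "PB", 100, "TB")
print(facebook_days)  # 400.0
```

Under these assumptions, 100 TB/day adds up to 40 PB in roughly 400 days, which shows how quickly such datasets outgrow a single machine.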
Variety
Text
Audio
Video
Photos
Documents
Big Data Stack
Physical Infrastructure
Hardware & Network
Performance
Availability
Resilient & redundant
Scalability
Flexibility
Cost

Security Infrastructure
Data access
Application Access
Data encryption
Threat detection
Cloud and Big Data
IaaS: Amazon EC2
PaaS: Heroku, Pagodabox
SaaS: GoToMeeting, Salesforce
DaaS: Amazon
Major Providers
Cloudera, Amazon, Azure, Google, OpenStack
Databases
Organizing Data Services
Distributed File System
Serialization & Coordination
ETL Tools
Workflow
Big Data Applications
Log Data Applications
Splunk, Loggly
Ad/Media Applications
Bluefin, DataXu
Marketing Applications
Bloomreach, Myrrix
Apache Hadoop
Open source framework for processing and
querying vast amounts of data on large
clusters of commodity hardware
Enterprise-ready cloud computing technology
De facto industry standard for Big Data processing
Java-based, but abstractions are available for
various languages
Concurrency, Scalability, Reliability
HDFS
Hadoop Distributed File System
File system to store large datasets
Blocks of 64 MB instead of the 4-32 KB typical of local file systems
Optimized for throughput over latency
High availability through block replication across
nodes instead of redundant hardware
Optimized for read-many and write-once
DataNode and NameNode
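The effect of block size and replication can be sketched with a little arithmetic. The example below is only an illustration: the 64 MB block size comes from the slide, while the replication factor of 3 and the function name are assumptions that the slides do not state.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted in the slide
REPLICATION_FACTOR = 3   # assumption: the slides do not give a replication factor

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    # The file is cut into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is copied to `replication` different DataNodes.
    replicas = blocks * replication
    # A partial last block only occupies its actual size on disk.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

print(hdfs_footprint(1000))  # (16, 48, 3000)
```

A 1000 MB file becomes 16 blocks that the NameNode has to track, stored as 48 block replicas on the DataNodes. With blocks of a few kilobytes the same file would need hundreds of thousands of entries, which is one reason large blocks favor throughput.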
MapReduce
Data processing paradigm
How input data is turned into key-value pairs (Map)
How the pairs are aggregated into output (Reduce)
Works with arbitrarily large datasets
Integrates tightly with HDFS
Parallel processing
Divide and conquer
Key-value pairs instead of RDBMS schemas
JobTracker and TaskTracker
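The map, group, and reduce steps can be illustrated with a word count, the customary first MapReduce example. This is a single-process sketch in plain Python, not Hadoop's API: the function names are hypothetical, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single output pair."""
    return key, sum(values)

def word_count(lines):
    # Each line could be mapped on a different node (divide) ...
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    # ... and each key could be reduced on a different node (conquer).
    return dict(reduce_phase(key, values) for key, values in grouped.items())

lines = ["big data is big", "data is everywhere"]
counts = word_count(lines)
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each map call sees only one line and each reduce call sees only one key, neither step needs the whole dataset in memory, which is what lets the paradigm scale to arbitrarily large inputs.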
Other components
Mahout: Machine learning
Pig: High-level language for interacting with
Hadoop
Hive: Data warehousing
HBase: Distributed, column-oriented DB
Sqoop: Data transfer from SQL databases to Hadoop and vice versa
Ambari: Web-based Hadoop cluster
management
R + Hadoop
Hadoop for data storage and computing power
R for advanced analytics, visualization, data
loading
Cloud-based
RHadoop
Data mining with R
Regression
lm
Classification
glm, ksvm, svm, randomForest, glmnet, knn
Clustering
kmeans, dist, pvclust, Mclust
Recommendation
recommenderlab
Hadoop
Linux-based
Cloudera-based
Java required
Single-node or multi-node

RHIPE
R and Hadoop Integrated Programming
Environment
Divide and Recombine Technique
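The Divide and Recombine idea can be sketched as follows: split the data into subsets, apply the same analysis to every subset independently, then combine the per-subset results. The example is a language-neutral illustration in plain Python and does not use RHIPE's R interface; the function names and the choice of a mean as the analysis are assumptions made for brevity.

```python
from statistics import mean

def divide(records, n_subsets):
    """Divide: split the records into n_subsets round-robin subsets.

    Assumes n_subsets <= len(records), so that no subset is empty.
    """
    return [records[i::n_subsets] for i in range(n_subsets)]

def analyze(subset):
    """Apply the same analysis to each subset; here, its mean and size."""
    return mean(subset), len(subset)

def recombine(results):
    """Recombine: size-weighted average of the per-subset means."""
    total = sum(n for _, n in results)
    return sum(m * n for m, n in results) / total

data = [2, 4, 6, 8, 10, 12, 14]
subsets = divide(data, 3)                 # [[2, 8, 14], [4, 10], [6, 12]]
results = [analyze(s) for s in subsets]   # could run in parallel on a cluster
overall = recombine(results)
print(overall)  # 8.0
```

For a mean, weighting by subset size makes the recombined result equal to the result on the full data. For more complex analyses such as regression, recombination generally gives an approximation rather than the exact full-data answer.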
RHadoop
Developed by Revolution Analytics
rhdfs: HDFS access from R
rmr: MapReduce jobs written in R
rhbase: HBase access from R

MapReduce in R
Real Time Data Streaming
IBM Infosphere
Twitter Storm
Apache S4 (Simple Scalable Streaming System)
Data Management at Enterprise
