Agenda
Big Data
NoSQL databases
Hadoop
MR examples
Case study: genome analysis, prediction of stock prices
Massive volumes of data
New video uploads: 4.5 terabytes per day, 100 terabytes per year
LHC: 40 terabytes per second
Challenge
Big Data's characteristics are challenging conventional information-management architectures: massive and growing amounts of information residing both internal and external to the organization; unconventional semi-structured or unstructured (diverse) data, including web pages, log files, social media, click-streams, instant messages, text messages, emails, sensor data from active and passive systems, etc.; and changing information.
Surveillance analytics
CDR analytics
History
1) 1970s: atmospheric and oceanic environmental data
2) Since 2008: more references
3) Different disciplines, including earth, health, engineering, arts and humanities, and environmental sciences
4) Computer science
[Figure: data sources such as blogs and smart meters, and the four V's of Big Data: Volume, Velocity, Variety, Value]
Volume: gigabyte (10^9), terabyte (10^12), petabyte (10^15), exabyte (10^18), zettabyte (10^21)
Variety: structured, semi-structured, unstructured; text, image, audio, video, records
Velocity: dynamic, sometimes time-varying
[Figure: Big Data comes from everywhere, growing in volume and velocity: data warehouses, OLTP databases, social networks and social media, scientific devices, citizen surveys]
[Figure: projected Big Data impact per sector: increase industry value by $300 B per year; decrease development/assembly costs by 50%; increase service-provider revenue by $100 B per year; increase net industry value by 250 B; increase industry value margin by 60+%]
2. RDBMS Performance
Data Trends
Trend 1: Size
Trend 2: Connectedness
Trend 3: Semi-structure
Trend 4: Architecture
Distributed, open source: data partitioning
Horizontally scalable: new nodes (and columns) can be added
Vertically scalable: more capacity can be added to a node
Schema-free
Replication support, easy API, (eventual) consistency
NoSQL ("Not only SQL"); the name was first used by Carlo Strozzi in the 1990s. With or without ACID properties, no predefined schema, no joins.
NoSQL properties
CONSISTENCY: all database clients see the same data, even with concurrent updates.
AVAILABILITY: every client request receives a response, even while updates are in progress.
PARTITION TOLERANCE: the database keeps operating even when it is split over multiple servers that lose contact with each other.
(Contrast these with the ACID properties of SQL databases.)
Advantages of NoSQL
Cheap and easy to implement
Data are replicated and can be partitioned
Easy to distribute
Don't require a schema
Can scale up and down
Quickly process large amounts of data
Relax the data-consistency requirement (CAP)
Disadvantages
New and sometimes buggy
Data is generally duplicated, with potential for inconsistency
No standardized schema
No standard format for queries
No standard language
Most NoSQL systems avoid in-memory storage
No guarantee of support
It augments enterprise data architectures by providing an efficient way of storing, processing, managing, and analyzing large volumes of semi-structured or unstructured data.
Hadoop Components
HDFS: Distributed File System
Chukwa: monitoring of large distributed systems
Hue: GUI to operate and develop Hadoop applications
Flume: moving large data for post-processing efficiently
Hive: data-warehousing framework
Pig: framework for parallel computation
Oozie: workflow service to manage data-processing jobs
ZooKeeper: coordination service for distributed applications
Many more.
Uses Hadoop for: search optimization, research.
Uses Hadoop for: internal log reporting/parsing systems designed to scale to infinity and beyond; a web-wide analytics platform.
Uses Hadoop for: databasing and analyzing Next Generation Sequencing (NGS) data produced for The Cancer Genome Atlas (TCGA) project and other groups.
And More
A framework for running applications on large clusters of commodity hardware (1000s of nodes), which produce huge data (petabytes to zettabytes), and for processing that data. An open-source Apache Software Foundation project.
Hadoop includes HDFS, a distributed filesystem to distribute the data, and Map/Reduce, a data-parallel programming model implemented on top of HDFS.
HDFS Architecture
NameNode: maps (filename, offset) -> block id, and block -> DataNode
DataNode: maps block -> local disk
Secondary NameNode: periodically merges the edit logs
A block is also called a chunk
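The two NameNode lookup tables above can be sketched as a toy in-memory structure. This is a hypothetical simplification for illustration only: the `NameNode` class, block size, file paths, and node names are all made up, not the real HDFS implementation.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB per block ("chunk"); illustrative

class NameNode:
    def __init__(self):
        self.file_blocks = {}      # filename -> list of block ids
        self.block_locations = {}  # block id -> list of DataNodes holding it

    def add_file(self, name, size, datanodes):
        n_blocks = -(-size // BLOCK_SIZE)  # ceiling division
        blocks = [f"{name}#blk{i}" for i in range(n_blocks)]
        self.file_blocks[name] = blocks
        for b in blocks:                   # every block replicated on each node
            self.block_locations[b] = list(datanodes)

    def lookup(self, name, offset):
        """(filename, offset) -> (block id, DataNodes holding that block)."""
        block = self.file_blocks[name][offset // BLOCK_SIZE]
        return block, self.block_locations[block]

nn = NameNode()
nn.add_file("/logs/day1", 300 * 1024 * 1024, ["dn1", "dn2", "dn3"])
blk, locs = nn.lookup("/logs/day1", 200 * 1024 * 1024)
print(blk, locs)  # the offset falls in the file's second block
```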
Dataflow in Hadoop
[Figure sequence: a client submits the job; map tasks are scheduled across the nodes; each finished map writes its intermediate output to the local file system; reduce tasks fetch these map outputs and write the final results.]
4. MR Examples
CALCULATING PI
The area of the square is As = (2r)^2 = 4r^2. The area of the circle is Ac = pi * r^2.
Hence Ac / As = pi / 4, so pi = 4 * (number of points in the circle) / (number of points in the square).
MAP: generate random points in the square and count those that also fall inside the circle.
REDUCE: sum the counts and evaluate pi = 4 * (points in circle) / (total points).
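The Monte Carlo estimate can be sketched in plain Python (the function names are illustrative, not the actual Hadoop pi example classes): each "map task" counts random points falling inside the unit circle, and the "reduce" step sums the counts.

```python
import random

def map_task(n_points, seed):
    """Count how many of n_points random points land inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            inside += 1
    return inside, n_points  # (points in circle, points in square)

def reduce_task(partials):
    """Sum the per-map counts and evaluate pi = 4 * in-circle / total."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total

partials = [map_task(100_000, seed) for seed in range(4)]
pi_est = reduce_task(partials)
print(pi_est)  # roughly 3.14
```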
File: Hello World Bye World / Hello Hadoop Goodbye Hadoop
Map: for the given sample input, the first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
The output of the first combine:
<Bye, 1> <Hello, 1> <World, 2>
The output of the second combine:
<Goodbye, 1> <Hadoop, 2> <Hello, 1>
Thus the output of the job (reduce) is:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
Map()
Input: <filename, file text>. Parses the file and emits <word, count> pairs,
eg. <Hello, 1>
Reduce()
Sums all values for the same key and emits <word, TotalCount>
e.g. <Hello, (1, 1)> => <Hello, 2>
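The word-count job above can be sketched in plain Python (illustrative function names, not the Hadoop API): each map pre-sums its counts, standing in for the combine step, and the reduce sums values for the same key across maps.

```python
from collections import defaultdict

def map_and_combine(text):
    """Emit <word, 1> pairs for one input split, pre-summed (combine)."""
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1
    return counts

def reduce_all(combined_outputs):
    """Sum all values for the same key and emit <word, TotalCount>."""
    totals = defaultdict(int)
    for counts in combined_outputs:
        for word, n in counts.items():
            totals[word] += n
    return dict(totals)

splits = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
result = reduce_all(map_and_combine(s) for s in splits)
print(result)
# {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```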
I work in a company where the Engineering Department guys produce an amazing amount of CAD (Computer-Aided Design) files. So over the years we ended up having hundreds of thousands, if not millions, of files hosted on different Filer Systems. Quite often the engineers need to access those files to modify, evolve, or consult the information inside. The problem is that even though the engineers know precisely the name of the file they want, it takes quite a while (sometimes more than an hour) for the Filer System to actually find it and send it back to the engineer's PC. That is because no indexing system exists on the Filer Hosting System (the system tests every single inode until the correct one is found). The files are not very big (a couple of dozen MB), but there are so many of them... Is Hadoop suitable?
5. Genome clustering
A genome is a DNA sequence (A, T, G, C). Features characterizing a genome (motifs) are extracted.
MR1 example for AAAGGGTTTCCCAAAG:
Mapper: AAA-1, AAG-1, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1, CCC-1, CCA-1, CAA-1, AAA-1, AAG-1
Reducer: AAA-2, AAG-2, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1, CCC-1, CCA-1, CAA-1
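The motif-counting MR1 step can be sketched as follows, with plain Python generators standing in for the Hadoop mapper and reducer:

```python
from collections import Counter

def mapper(genome, k=3):
    """Emit every overlapping k-mer (motif) with count 1, e.g. ('AAA', 1)."""
    for i in range(len(genome) - k + 1):
        yield genome[i:i + k], 1

def reducer(pairs):
    """Sum the counts per motif."""
    counts = Counter()
    for motif, n in pairs:
        counts[motif] += n
    return counts

counts = reducer(mapper("AAAGGGTTTCCCAAAG"))
print(counts["AAA"], counts["AAG"], counts["GGG"])  # 2 2 1
```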
Read input
K Means
1. Fix initial cluster centres
2. Find the distance between samples and cluster centres
3. Find which cluster centre is closest to a given sample
4. Add the sample to that cluster
5. Re-evaluate the cluster centres
Mapreduce kmeans
Key: cluster id; value: (motif, count).
Map: calculates the distance of each point from the centres; key_out: cluster id, value_out: distance to each cluster centre.
Reducer: finds the closest cluster centre and recalculates new centres for the next iteration. Output: the new centres.
MR K-Means
Let k index the features in a sample; V1(k) is the value of the kth feature of the sample, and V2(k) is the value of the kth feature of a cluster centre.
Mapper: the distance to the ith cluster centre is given by Si = sqrt( sum over k of (V1(k) - V2(k))^2 )
Reducer: min[Si] is evaluated for a species, considering all cluster centres i. The centroid is then evaluated over all features (k) of the samples V1 assigned to each cluster centre V2. This is done to update the centroid.
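One MR k-means iteration as described above can be sketched like this: the mapper assigns each sample (a feature vector of motif counts) to its nearest centre, and the reducer recomputes the centres. The sample vectors and centres below are illustrative, not genome data.

```python
import math
from collections import defaultdict

def distance(v1, v2):
    # Si = sqrt( sum over features k of (V1(k) - V2(k))^2 )
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def map_assign(samples, centres):
    """Emit <cluster id, sample> for the closest centre of each sample."""
    for s in samples:
        cid = min(range(len(centres)), key=lambda i: distance(s, centres[i]))
        yield cid, s

def reduce_centres(assignments, dim):
    """Average the samples in each cluster to get the new centres."""
    grouped = defaultdict(list)
    for cid, s in assignments:
        grouped[cid].append(s)
    return {cid: [sum(v[k] for v in vs) / len(vs) for k in range(dim)]
            for cid, vs in grouped.items()}

samples = [[2, 2, 1], [2, 1, 1], [0, 0, 5], [1, 0, 4]]  # motif-count vectors
centres = [[2, 2, 2], [0, 0, 4]]                        # current centres
new_centres = reduce_centres(map_assign(samples, centres), dim=3)
print(new_centres)  # updated centres for the next iteration
```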
Accuracy comparison
[Figure: accuracy in percentage]
Time Efficiency
[Figure: time-efficiency comparison]
HADOOP ON AWS
[Figure: EMR cluster layout with one Master Instance, Core Instances, and Task Instances]
PERFORMANCE
FILE SIZE | UPLOADING TIME | COMPUTATION TIME | OVERALL TIME
15.9 KB   | 1 sec          | 4 min            | 6 min
1 MB      | 3 sec          | 4 min            | 6 min
54.4 MB   | 5 min          | 6 min            | 9 min
118.9 MB  | 13 min         | 8 min            | 13 min
[Figure: MR time and total time for the instance-type combinations sss, ssm, ssl, mms]
NUMBER OF INSTANCES | TYPE OF INSTANCES | COMPUTATION TIME | OVERALL TIME
M-1, C-2, T-1       | M-s, C-s, T-s     | 8 min            | 13 min
M-1, C-2, T-1       | M-m, C-m, T-m     | 5 min            | 9 min
M-1, C-2, T-1       | M-s, C-s, T-m     | 6 min            | 11 min
M-1, C-2, T-1       | M-s, C-s, T-l     | 7 min            | 11 min
(M = master, C = core, T = task; s, m, l = small, medium, large instance types)
[Figure: Hadoop vs AWS processing times for file sizes 1 KB, 15 KB, 15 MB, 119 MB]
Binary classification can be viewed as the task of separating classes in feature space
f(x) = sign(wTx + b)
wTx + b = 0 on the hyperplane; wTx + b > 0 for positive points; wTx + b < 0 for negative points
where w represents the normal vector to the hyperplane, x the training patterns, b the threshold, α the coefficients, and s the support vectors
w = Σi αi si
Methodology
Steps
1. Data collection: download data sets for a month/year/day
2. Input feature selection: MA and RSI (the volume average labels samples as + or - points)
3. Finding support vectors S1, S2 and coefficients α1, α2
4. Construction of the hyperplane: y = wTx + b
5. Classification of test data, using the hyperplane and the X values, into positive / negative points
2. a) Input feature
Moving Average
MA = (1/n) Σ(i=1..n) CPi, where CPi = closing price on the ith day
Example: MA = 8.3, Vol Avg = 2.8; the total data-set average is 1.54, so this is a + point (an 81.61% increase).
October: 8.81, 31.03
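The moving-average feature above can be checked with a short sketch; the closing prices below are made up for illustration.

```python
def moving_average(closing_prices):
    # MA = (1/n) * sum of CP_i over the last n days
    return sum(closing_prices) / len(closing_prices)

ma = moving_average([8.1, 8.4, 8.2, 8.5])  # hypothetical closing prices
print(round(ma, 2))  # 8.3
```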
3. Support Vectors
Support vectors are found using the Euclidean distance formula. In the Euclidean plane, if X = (x1, x2) and Y = (y1, y2), the distance is D(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 ), where X and Y are the positive and negative points of (Moving Average, Relative Strength Index).
+ive points: (8.30, 79.87), (8.97, 85.81)
-ive points: (10.61, 42.53), (12.32, 31.50), (10.44, 83.97), (12.78, 28.05), (8.81, 31.03)
Distances from each +ive point to the five -ive points (in the order above):
(8.30, 79.87): 37.44, 48.53, 4.62, 52.01, 48.84
(8.97, 85.81): 43.31, 54.41, 2.35 (min), 57.88, 54.77
All distances are calculated and the minimum is found:
S1 = (8.97, 85.81), S2 = (10.44, 83.97)
Augmented with a bias component of 1: S1 = [8.97, 85.81, 1], S2 = [10.44, 83.97, 1]
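The support-vector search can be sketched as follows, using the (MA, RSI) point values from the table above: compute all pairwise distances between positive and negative points and pick the closest pair.

```python
import math

# Positive and negative (MA, RSI) points from the example
pos = [(8.30, 79.87), (8.97, 85.81)]
neg = [(10.61, 42.53), (12.32, 31.50), (10.44, 83.97),
       (12.78, 28.05), (8.81, 31.03)]

def dist(p, q):
    # Euclidean distance D(X, Y) = sqrt((x1-y1)^2 + (x2-y2)^2)
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# The closest positive/negative pair becomes the support vectors S1, S2
s1, s2 = min(((p, q) for p in pos for q in neg), key=lambda pq: dist(*pq))
print(s1, s2, round(dist(s1, s2), 2))  # minimum distance is about 2.35
```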
3. b) Finding α1 and α2
α1 (S1 · S1) + α2 (S1 · S2) = +1
α1 (S1 · S2) + α2 (S2 · S2) = -1
7444.81 α1 + 7300.11 α2 = +1
7300.11 α1 + 7160.95 α2 = -1
α1 = -0.71; α2 = 0.73
4. Construction of Hyperplane
The hyperplane is constructed using the equation y = wTx + b, where w = Σi αi si; w represents the normal vector to the hyperplane, x the training patterns, b the threshold, α the coefficients, and s the support vectors. The α values are calculated using the equations above.
5. Testing set
MA = x1 = 12.44, RSI = x2 = 50; predict the volume average.
Hyperplane: y = [1.25, 0.37]T x + 0.02
Y = 34.05 + 0.02 = 34.07, the predicted value (vol avg)
Overall avg = 18; increase % = (34.07 - 18) / 18 = 89%
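Steps 4 and 5 can be checked numerically as a sketch, taking the α values and augmented support vectors from the worked example above; a small deviation from the slide's rounded w = (1.25, 0.37) and Y = 34.07 is expected.

```python
# Augmented support vectors [MA, RSI, 1]; the last component carries the bias
s1 = [8.97, 85.81, 1.0]   # positive support vector
s2 = [10.44, 83.97, 1.0]  # negative support vector
alpha1, alpha2 = -0.71, 0.73

# w = sum_i alpha_i * s_i ; the last component of w is the threshold b
w = [alpha1 * u + alpha2 * v for u, v in zip(s1, s2)]
w_vec, bias = w[:2], w[2]

# Test point: MA = 12.44, RSI = 50
x = [12.44, 50.0]
y = w_vec[0] * x[0] + w_vec[1] * x[1] + bias
print([round(c, 2) for c in w_vec], round(bias, 2), round(y, 2))
# w ≈ [1.25, 0.37], bias ≈ 0.02, y ≈ 34.25 (vs 34.07 with rounded w)
```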
Multi SVM
Support vector machine is a powerful tool for binary classification, capable of generating very fast classifier functions following a training period
To extend binary class scenario to multi-class scenario, decompose an M-class problem into a series of two-class problems
[Figure: pairwise and binary decision tree (BDT) decompositions of the range -100 to 100: first -100 to 0 vs 0 to 100, then -100 to -50, -50 to 0, 0 to 50, and 50 to 100]
ANALYZE
ORGANIZE
Summary
Big Data, NoSQL, Hadoop, Case Study
Questions