
Design Patterns for Real-Time Streaming Analytics
Sheetal Dolas
Principal Architect, Hortonworks

Feb 2015
Hortonworks Inc. 2011-2014. All Rights Reserved
Who am I?
Principal Architect @ Hortonworks
Most of my career has been in the field, solving real-life business problems
Last 5+ years in Big Data, including Hadoop, Storm etc.
Co-developed Cisco OpenSOC ( http://opensoc.github.io )

sheetal@hortonworks.com
@sheetal_dolas



Agenda
Streaming Architectural Patterns - Overview
Design Patterns
o What
o Why
o Illustrations
Q&A



Streaming Architectural Patterns



Real Time Streaming Architecture
[Diagram: pipeline stages, left to right]
Source Systems: Syslog, Machine Data, External Streams, Other
Data Collection: Flume / Custom (Agent A, Agent B ... Agent N)
Messaging System: Kafka (Topic A, Topic B ... Topic N)
Real Time Processing: Storm (Topology A, Topology B ... Topology N)
Storage: HDFS (historic), Elastic Search / Solr (search), HBase (low latency NoSql)
Access: Analytic Tools (Hive / R / Python), BI Tools, Web Services (REST API), Web Apps, Alerting Systems



Lambda Architecture
[Diagram]
New Data Stream -> Batch Layer: All Data -> Pre-compute Views -> Batch Views
New Data Stream -> Speed Layer: Stream Processing -> Real Time View
Serving Layer: holds the Batch Views
Data Access: Query against the Batch Views and the Real Time View



Kappa Architecture
[Diagram]
Data Source -> Stream Processing System -> Serving DB -> Data Access
Data Stream -> Job Version n -> Output table n
Data Stream -> Job Version n+1 -> Output table n+1
Data Access: Query against the output tables



Design Patterns



Design Pattern - What is it?

A general reusable solution to a commonly occurring problem within a given context in software design.



Design Patterns - Why?
Streaming use cases have distinct characteristics
o Unpredictable incoming data patterns
o Correlating multiple streams
o Out-of-sequence and late events
High scale and continuous streams pose new challenges
o Peaks and valleys
o Data characteristics that change over time
o Maintaining the latency and throughput SLAs



Streaming Patterns

Architectural Patterns
o Real-time Streaming
o Near-real-time Streaming
o Lambda Architecture
o Kappa Architecture
Functional Patterns
o Stream Joins
o Top N (Trending)
o Rolling Windows
Data Management Patterns
o External Lookup
o Responsive Shuffling
o Out-of-Sequence Events
Stream Security Patterns
o Message Encryption
o Authorized Access
o Secure Cluster Authentication



Streaming Patterns Being Discussed

Architectural Patterns
o Real-time Streaming
o Near-real-time Streaming
o Lambda Architecture
o Kappa Architecture
Functional Patterns
o Stream Joins
o Top N (Trending)
o Rolling Windows
Data Management Patterns
o External Lookup
o Responsive Shuffling
o Out-of-Sequence Events
Stream Security Patterns
o Message Encryption
o Authorized Access
o Secure Cluster Authentication



External Lookup
Dynamic, High Speed Enrichments With External Data Lookup



External Lookup - Description

Referencing frequently changing external system data for event enrichment, filtering or validation while minimizing event processing latency and system bottlenecks and maintaining high throughput.

External Lookup - Challenges
Increased latency due to frequent external system calls
Insufficient memory to hold all reference data
Scalability and performance issues with large reference data sets
Dynamic reference data needs frequent cache purges and refreshes
External systems can become a bottleneck

External Lookup - Potential Options

Comparison criteria: Performance, Scalability, Fault Tolerance
Options:
o Always Fetch
o Cache Everything
o Partition and Cache on the go



External Lookup - A Reference Use Case
Real Time Credit Card Fraud Identification and Alert
o Credit card transaction data arrives as a stream (typically through Kafka)
o An external system holds information about the cardholder's recent location
o Each credit card transaction is looked up against the user's current location
o If the geographic distance between the credit card transaction location and the user's recent known location is significant, the credit card transaction is flagged as potential fraud
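
The distance check in the last bullet could be sketched as follows. This is a minimal illustration, not the original implementation: the haversine formula is one common way to compute geographic distance, and the function names and the 500 km threshold are hypothetical choices made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_potential_fraud(txn_location, user_location, threshold_km=500.0):
    """Flag a transaction that is farther than threshold_km (an assumed,
    hypothetical cutoff) from the user's recent known location."""
    return haversine_km(*txn_location, *user_location) > threshold_km
```

What counts as a "significant" distance is a business decision; in practice the threshold would likely also depend on the time elapsed since the user's location was last recorded.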

External Lookup - Topology Overview
[Diagram]
Source Stream -> Storm [Credit Card Transaction Spout -> Partitioner Bolt -> Fraud Analyzer Bolt] -> Alerting System (Fraud Alert Email)
External Reference Data: User Location Information
Partitioner Bolt: partitions data based on the area code of the mobile numbers
Fraud Analyzer Bolt: looks up the user's current location from the external system and finds the geo distance between the transaction location and the user location. Locally caches the user location data; cache validity is time bound.

External Lookup - Peek in the Bolts
[Diagram: inside Storm]
The stream is partitioned based on area code
Partitioner Bolt Instances 1, 2 ... n -> Fraud Analyzer Bolt Instances 1, 2 ... n
Fraud Analyzer Bolt Instance 1: CA, NV, TX
Fraud Analyzer Bolt Instance 2: NY, CT, MA
Fraud Analyzer Bolt Instance n: FL, NC, OH
Each Fraud Analyzer Bolt instance keeps a local, time-sensitive cache (use a lightweight caching solution like Guava)
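
The partitioning step could look roughly like the sketch below. It assumes a stable hash of the area code modulo the number of Fraud Analyzer tasks, which mirrors how a field-based grouping routes tuples; the function name and the use of CRC32 are illustrative assumptions, not details from the slides.

```python
import zlib


def fraud_analyzer_task(area_code: str, num_tasks: int) -> int:
    """Map an area code to a task index. The hash is stable across processes,
    so every transaction with the same area code reaches the same task and
    therefore the same local cache."""
    return zlib.crc32(area_code.encode("utf-8")) % num_tasks


def route(transactions, num_tasks):
    """Group (area_code, payload) pairs by the task that should receive them."""
    per_task = {task: [] for task in range(num_tasks)}
    for area_code, payload in transactions:
        per_task[fraud_analyzer_task(area_code, num_tasks)].append(payload)
    return per_task
```

Because a given area code always lands on the same instance, each instance only ever needs reference data for its own slice of users.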

External Lookup - Benefits of the approach
Only required data is cached (on demand)
Each bolt caches only its own partition of the reference data
Data is cached locally, so trips to the external system are reduced
Cache is time sensitive
Building the cache on the go handles failures gracefully
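
A time-bound, on-demand cache of this kind can be sketched as below. This is a simplified stand-in for a caching library (in Java, Guava's `CacheBuilder.expireAfterWrite` provides comparable behavior); the class name, the `fetch` callback and the injectable clock are assumptions made for illustration.

```python
import time


class TimedLookupCache:
    """Caches external lookups on demand and expires each entry after a TTL."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # hypothetical call to the external system
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]          # fresh hit: no external call
        value = self._fetch(key)     # miss or stale: go to the external system
        self._entries[key] = (value, now + self._ttl)
        return value
```

A restarted bolt starts with an empty cache and simply refills it as events arrive, which is why no separate recovery step is needed. A production cache would usually also bound its size, which this sketch omits.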

External Lookup - Applicability
Stream processing depends on external data
External data is too large to be held in the memory of each task
External data keeps changing
External system has scalability limitations



Responsive Shuffling



Responsive Shuffling - Description

Automatically adjust shuffling for better performance and throughput during peaks and varying data skews in streams



Responsive Shuffling - Challenges
Incoming data stream is unpredictable and can be skewed
Skew can change from time to time
Managing latency and throughput with skews is difficult
Since streams flow continuously, restarting the topology with new shuffling logic is not practical



Shuffling - Potential Options

Comparison criteria: Latency & Throughput, System Reliability, Uptime
Options:
o Static Shuffle
o Responsive Shuffle



Responsive Shuffling - A Reference Use Case
Optimized HBase Inserts
o Event data is stored in HBase after Storm processing
o Group events so that a bolt can insert more events into HBase with fewer trips to region servers
o Over time, HBase regions can split or merge
o Automatically adjust the event grouping as the HBase region layout changes

Example - HBase writes w/o responsive shuffling
[Diagram]
App Bolt Instance 1 (100 events) -> HBase Bolt Instance 1 (100 events) -> Region Server Instance 1 (100 events)
App Bolt Instance 2 (100 events) -> HBase Bolt Instance 2 (100 events) -> Region Server Instance 2 (100 events)
App Bolt Instance 3 (100 events) -> HBase Bolt Instance 3 (100 events) -> Region Server Instance 3 (100 events)
Totals: 300 events sent by App Bolts, 300 events sent by HBase Bolts, 9 trips to region servers, 300 events received
Responsive Shuffling - Design



Example - HBase writes with responsive shuffling
[Diagram]
App Bolt Instance 1 (100 events) -> RS Aware Partitioner -> HBase Bolt Instance 1 (100 events) -> Region Server Instance 1 (100 events)
App Bolt Instance 2 (100 events) -> RS Aware Partitioner -> HBase Bolt Instance 2 (100 events) -> Region Server Instance 2 (100 events)
App Bolt Instance 3 (100 events) -> RS Aware Partitioner -> HBase Bolt Instance 3 (100 events) -> Region Server Instance 3 (100 events)
The partitioner automatically adapts to splitting/merging HBase regions
Totals: 300 events sent by App Bolts, 300 events sent by HBase Bolts, 3 trips to region servers, 300 events received
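
A region-aware partitioner along these lines might be sketched as follows. The sketch assumes regions are described by their sorted start keys (with an empty string for the first region) and that some external mechanism supplies a fresh list when regions split or merge; the class and function names are hypothetical, and no real HBase client call is shown.

```python
import bisect


class RegionAwarePartitioner:
    """Routes row keys to the region that owns them, based on region start keys."""

    def __init__(self, region_start_keys):
        self.update_layout(region_start_keys)

    def update_layout(self, region_start_keys):
        """Swap in a new region layout (after a split or merge) without a restart."""
        self._starts = sorted(region_start_keys)

    def region_for(self, row_key):
        """Index of the region whose key range contains row_key."""
        return max(bisect.bisect_right(self._starts, row_key) - 1, 0)


def count_trips(batches, partitioner):
    """One trip per distinct region touched by each batch."""
    return sum(len({partitioner.region_for(k) for k in batch}) for batch in batches)
```

With three regions and three writers, batches that mix keys from every region cost nine trips, while batches grouped by `region_for` cost three, matching the totals in the two examples above.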
Responsive Shuffling - Benefits
Topology responds to changes in data patterns and adapts accordingly
Maintains a high level of adherence to latency and throughput SLAs
Minimizes the need for maintenance and hence downtime



Responsive Shuffling - Applicability
Change in shuffle pattern does not impact the final outcome
Data stream has varying skews
Target/reference system specifications change over time



Out-of-Sequence Events



Out-of-Sequence Events - Description

An out-of-sequence event is one that's received late, sufficiently late that you've already processed events that should have been processed after the out-of-sequence event was received.



Out-of-Sequence Events - Challenges
Hard to determine whether all events in a given window have been received
Late events need to reference the relevant earlier data
Puts more pressure on processing components
Increased latency and degraded overall system performance



Out-of-Sequence Events - Potential Options

Comparison criteria: Result Accuracy, Latency, Operational Ease
Options:
o Drop
o Wait
o Fan Out



Out-of-Sequence Events - Processing
[Diagram]
Event Source -> Spout -> Filter Bolt
Filter Bolt -> (ordered events) -> Typical Processing Bolt
Filter Bolt -> (out-of-sequence events) -> Special Handling Bolt
Filter Bolt: monitors the events currently being processed and identifies out-of-sequence events
Special Handling Bolt: depending on the complexity of processing, this can be extended into a separate topology
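
One way the Filter Bolt's decision could work is sketched below. The slides do not say how out-of-sequence events are detected; this example assumes each event carries a timestamp and treats an event as out of sequence when it trails the highest timestamp seen so far by more than an allowed lateness. The class name, stream names and lateness parameter are all hypothetical.

```python
class OutOfSequenceFilter:
    """Routes events to the normal path or to special handling based on lateness."""

    def __init__(self, allowed_lateness=0):
        self._allowed_lateness = allowed_lateness
        self._max_seen = None  # highest event time observed so far

    def route(self, event_time):
        """Return the name of the stream the event should be emitted on."""
        if self._max_seen is None or event_time >= self._max_seen:
            self._max_seen = event_time
            return "ordered"
        if self._max_seen - event_time <= self._allowed_lateness:
            return "ordered"          # slightly late, still tolerated
        return "out_of_sequence"      # too late: send to special handling
```

Keeping the check this cheap lets the normal path run at full speed while the slower special handling scales independently.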



Out-of-Sequence Events - Benefits
Separation of concerns
Maintains the overall throughput and latency requirements
Independent scaling of components



Out-of-Sequence Events - Applicability
The order of events matters
Processing out-of-sequence events needs special and complex logic
Stream has a relatively low volume of out-of-sequence events



Thank You!
sheetal@hortonworks.com
@sheetal_dolas



Appendix



Data Security in Kafka



Data Security in Kafka - Description

Ability to use Kafka as a secure data transfer mechanism.

Apache Kafka is a widely used messaging platform in streaming applications. Unfortunately, as of this talk (Feb 2015), Kafka does not have built-in support for authentication and authorization (yet).



Data Security in Kafka - Flow
[Diagram: pipeline stages, left to right]
Source Systems: Sources (Syslog)
Data Collection: Custom Collector (Encrypting Producer)
Messaging System: Kafka (Encrypted Messages)
Real Time Processing: Storm (Kafka Spout -> Decrypting Bolt -> App Bolt)



Data Security in Kafka - Encryption Details
[Diagram]
Data Collection (Event Producer): Event -> encrypt event(s) w/ AES -> encrypt AES key w/ RSA -> Event(s) Envelope
Messaging System (Kafka Topic): carries the Event(s) Envelope
Real Time Processing (Storm Decrypting Bolt): Event(s) Envelope -> decrypt AES key w/ RSA -> decrypt event(s) w/ AES -> Event
Event(s) Envelope contents: Encrypted AES Key (w/ RSA), Encrypted Event (w/ AES)



Data Security in Kafka - Encryption Details
RSA public/private keys are generated ahead of time and securely shared with the topology
The AES key is randomly generated and periodically refreshed
Only a user holding the appropriate RSA private key can read the data
One event or a batch of events can be encrypted together, as needed



Data Security in Kafka - Applicability
Multiple applications want to use Kafka as the source of their streams
Data is sensitive and cannot be shared between applications
Other components in the pipeline are secured



Micro Batching



Micro Batching - Description

Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches or chunks of data.
For incoming streams, the events can be packaged into small batches and delivered to a batch system for processing.



Micro Batching - Challenges
Data delivery reliability
Unnecessary data duplication
Increased latency
Complexity in time-bound batching



Micro Batching - Design Options
Thread-based Model
Controller stream to trigger batch flush
Use of Tick Tuples



Tick Tuples

Tick tuples are system-generated tuples that Storm can send to your bolt if you need to perform some actions at a fixed interval
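
The interplay between tick tuples and micro-batching can be illustrated with the sketch below. It is a plain-Python simulation rather than Storm code: `TICK` stands in for Storm's tick tuple (which Storm delivers from the `__system` component on the `__tick` stream at the interval set by `topology.tick.tuple.freq.secs`), and the `flush` and `ack` callbacks are hypothetical stand-ins for the bulk write and the tuple acknowledgement.

```python
TICK = object()  # stand-in for a system-generated tick tuple


class MicroBatcher:
    """Buffers events and flushes them on a size limit or on a tick."""

    def __init__(self, flush, ack, max_batch_size=100):
        self._flush = flush            # hypothetical bulk write to the target system
        self._ack = ack                # hypothetical per-tuple acknowledgement
        self._max = max_batch_size
        self._pending = []

    def execute(self, tup):
        if tup is TICK:
            self._flush_pending()      # time-bound flush keeps latency bounded
            return
        self._pending.append(tup)
        if len(self._pending) >= self._max:
            self._flush_pending()      # size-bound flush

    def _flush_pending(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._flush(batch)             # if this raises, nothing below is acked
        for tup in batch:
            self._ack(tup)             # ack only after the batch was written
```

Acknowledging only after a successful flush means a failed write leaves the events unacknowledged, so a reliable source can replay them.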



Micro Batching - Benefits
Takes advantage of target system characteristics by batching events together
Adheres to processing latency needs by ensuring that batches are executed at bounded intervals
Prevents data loss by acknowledging events only after successful processing
Simple, elegant and easy-to-maintain code



Micro Batching - Applicability
Target systems are more efficient with bulk transactions
Processing a group of events is more efficient than processing individual events
End-to-end event latency is not highly sensitive



Micro Batching - Sample Code


