
Real-time stream processing using
Apache Kafka
Agenda

What is Apache Kafka?
Why do we need stream processing?
Stream processing using Apache Kafka
Kafka @ Hotstar

Feel free to stop me for questions
$ whoami

Personalisation lead at Hotstar
Led Data Infrastructure team at Grofers and TinyOwl
Kafka fanboy
Usually rant on Twitter: @jayeshsidhwani
What is Kafka?

Kafka is a scalable, fault-tolerant, distributed queue
Producers and consumers

Uses:
Asynchronous communication in event-driven architectures
Message broadcast for database replication

Diagram credits: http://kafka.apache.org
Inside Kafka

Brokers
Heart of Kafka
Store the data
Data is stored in topics

Zookeeper
Manages cluster state information
Leader election

[Diagram: producers (P) write to topics on brokers, consumers (C) read from them; a Zookeeper ensemble coordinates the brokers]
Inside a topic

Topics are partitioned
A partition is an append-only commit-log file
Partitioning achieves horizontal scalability
Messages written to a partition are ordered
Each message gets an auto-incrementing offset #
{user_id: 1, term: GoT} is a message in the topic searched

Diagram credits: http://kafka.apache.org
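The partition model above can be sketched as a tiny in-memory simulation (plain Python for illustration, not the actual Kafka client API): messages appended to a partition's log get consecutive offsets, and ordering holds per partition, not across the topic.

```python
# Minimal in-memory sketch of a partitioned topic (illustration only,
# not the Kafka client API). Messages are appended to a partition's
# commit log and receive an auto-incrementing offset.

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, message):
        # Kafka routes by hashing the key over the partition count;
        # we mimic that here with Python's hash().
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        offset = len(self.partitions[p]) - 1   # auto-incrementing offset
        return p, offset

searched = Topic("searched", num_partitions=3)
p, off = searched.produce(key=1, message={"user_id": 1, "term": "GoT"})
print(f"partition={p} offset={off}")   # offset 0: first message there
```

Messages with the same key always land in the same partition, which is what makes per-key ordering possible.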


How do consumers read?

A consumer subscribes to a topic
Consumers read from the head of the queue
Multiple consumers can read from a single topic

Diagram credits: http://kafka.apache.org


Kafka consumers scale horizontally

Consumers can be grouped into consumer groups
Horizontally scalable
Fault tolerant
Delivery guaranteed

Diagram credits: http://kafka.apache.org
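How a consumer group splits work can be sketched as follows (a hypothetical round-robin assignor for illustration, not Kafka's actual rebalance protocol): each partition is owned by exactly one consumer in the group, so adding consumers scales reads horizontally, up to the partition count.

```python
# Sketch of consumer-group partition assignment (illustration only,
# not Kafka's rebalance protocol). Each partition goes to exactly one
# consumer in the group, so N consumers split the topic between them.

def assign(partitions, consumers):
    """Round-robin the topic's partitions across the group's consumers."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(p)
    return assignment

print(assign(partitions=[0, 1, 2, 3], consumers=["c1", "c2"]))
# With more consumers than partitions, the extras sit idle:
print(assign(partitions=[0, 1], consumers=["c1", "c2", "c3"]))
```

This is why the partition count caps a group's parallelism: a third consumer on a two-partition topic gets nothing to read.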


Stream processing and its use-cases

Discrete data processing models

Request / response processing mode
Processing time: < 1 second
Clients can use this data

Batch processing mode
Processing time: a few hours to a day
Analysts can use this data

[Diagram: apps serving requests; DWH / Hadoop serving batch analytics]
Discrete data processing models

As the system grows, such a synchronous processing model leads to a spaghetti, unmaintainable design

[Diagram: apps writing directly to cache, search and monitoring systems]
Promise of stream processing

Untangle movement of data
Single source of truth
No duplicate writes
Anyone can consume anything
Decouples data generation from data computation

[Diagram: apps → stream processing framework → search, monitoring, cache]
Promise of stream processing

Process, transform and react on the data as it happens
Sub-second latencies
Anomaly detection on bad stream quality
Timely notification to users who dropped off in a live match

[Diagram: filter, window and join operators feeding anomaly detection, actions and intelligence]
Stream processing using Kafka

Stream processing frameworks

Write your own?
Windowing
State management
Fault tolerance
Scalability

Or use frameworks such as Apache Spark, Samza, Storm
Batteries included
Cluster manager to coordinate resources
High memory / CPU footprint
Kafka Streams

Kafka Streams is a simple, low-latency, framework-independent stream processing library
Simple DSL
Same principles as a Kafka consumer (minus the operational overhead)
No cluster manager! yay!
Writing Kafka Streams

Define a processing topology
Source nodes
Processor nodes
One or more
Filtering, windowing, joins, etc.
Sink nodes
Compile it and run like any other Java application
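The source → processor → sink topology can be sketched in plain Python (the real Kafka Streams DSL is Java, e.g. `KStream.filter`/`mapValues`; everything below is illustrative):

```python
# Sketch of a processing topology (illustration only -- the real
# Kafka Streams DSL is Java). A source node feeds processor nodes
# (filter, map), which feed a sink node.

def run_topology(source, processors, sink):
    for record in source:            # source node: read from a topic
        for process in processors:   # processor nodes, applied in order
            record = process(record)
            if record is None:       # a filter dropped the record
                break
        else:
            sink.append(record)      # sink node: write to a topic

searches = [{"user_id": 1, "term": "GoT"}, {"user_id": 2, "term": ""}]
out = []
run_topology(
    searches,
    processors=[
        lambda r: r if r["term"] else None,          # filter empty terms
        lambda r: {**r, "term": r["term"].lower()},  # normalise the term
    ],
    sink=out,
)
print(out)   # only the non-empty, lower-cased search survives
```

The same shape holds in the Java DSL: you declare the chain of operators, and the library runs it against the topic's records as they arrive.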
Demo
Simple Kafka Stream

Kafka Streams architecture and operations

Kafka manages:
Parallelism
Fault tolerance
Ordering
State management

Diagram credits: http://confluent.io


Streaming joins and state-stores

Beyond filtering and windowing
Streaming joins are hard to scale
Kafka scales to 800k writes/sec*
How about your database?
Solution: cache a static stream in-memory
Join it with the running stream
Stream <> table duality
Kafka supports an in-memory cache out of the box
RocksDB
In-memory hash
Persistent / transient

Diagram credits: http://confluent.io
* achieved using the librdkafka C++ library
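The stream <> table duality can be sketched like this (plain Python for illustration; a dict stands in for the RocksDB state store): a changelog stream is compacted into a table keeping the latest value per key, and the running stream is joined against it.

```python
# Sketch of a stream-table join (illustration only). The "table" is
# the latest value per key from a changelog stream -- here a plain
# dict stands in for a RocksDB-backed state store.

def to_table(changelog):
    """Compact a (key, value) changelog: later records win."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table

def join(stream, table):
    """Enrich each stream record with the current table value for its key."""
    return [(key, value, table.get(key)) for key, value in stream]

# Hypothetical example data: user-location changelog, watch events.
users = to_table([("u1", "Delhi"), ("u2", "Mumbai"), ("u1", "Pune")])
watch_events = [("u1", "GoT"), ("u2", "IPL")]
print(join(watch_events, users))   # each event enriched with a location
```

Because the table lives next to the stream processor instead of in a remote database, the join runs at stream speed rather than at database speed.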
Demo

Inputs:
Incoming stream of benchmark stream quality from the CDN provider
Incoming stream quality reported by Hotstar clients
Output:
Locations reporting bad QoS, calculated in real-time

Diagram credits: http://confluent.io
[Diagram: CDN benchmark and client report streams joined; alerts emitted for locations with bad QoS]
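The demo's logic can be sketched as one window of a stream join (hypothetical field names and thresholds, not the actual Hotstar pipeline): per location, average the client-reported quality over a window and flag locations that fall below the CDN benchmark.

```python
# Sketch of the QoS demo (hypothetical schema, not the actual Hotstar
# pipeline): join client reports against CDN benchmarks per location
# and flag locations whose average bitrate falls below the benchmark.

from collections import defaultdict

def bad_qos_locations(cdn_benchmarks, client_reports):
    # cdn_benchmarks: {location: expected_bitrate_kbps}
    # client_reports: iterable of (location, observed_bitrate_kbps),
    #                 one tumbling window's worth of events
    sums = defaultdict(lambda: [0, 0])   # location -> [total, count]
    for location, bitrate in client_reports:
        sums[location][0] += bitrate
        sums[location][1] += 1
    return sorted(
        loc for loc, (total, count) in sums.items()
        if total / count < cdn_benchmarks.get(loc, 0)
    )

benchmarks = {"DEL": 2000, "BLR": 2000}
reports = [("DEL", 1200), ("DEL", 1400), ("BLR", 2100), ("BLR", 2300)]
print(bad_qos_locations(benchmarks, reports))
# DEL averages 1300 kbps against a 2000 kbps benchmark, so it is flagged
```

In the real pipeline the window would advance continuously and the flagged locations would feed the alerts stream shown in the diagram.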
KSQL - Kafka Streams ++

Kafka @ Hotstar

