
Zackathon

January 2015
Agenda

• Big Data Quick Recap

• Proof Of Concept : Background

• Big Data POC technologies overview

• Putting it all together

• Teams and timelines

© Zyme Solutions Inc. 2014


Big Data: The 3 V’s
Big data is high Volume, high Velocity, and/or high Variety information assets that
require new forms of processing to enable enhanced decision making, insight
discovery and process optimization.
- Gartner

Zyme, like most other enterprises, is awash with ever-growing data of all types, and needs to
handle exponential growth in data VOLUMES as it:
- adds many, many more customers and partners
- collects sales data from additional Tiers (Tier 2, Tier 3, …)
- collects sales data not just at SKU level but at Serial # level

Zyme provides rich interfaces for analytics and reporting on "structured data", but it also has lots of
unstructured data, for example:
- raw data reported by partners in many different formats
- logs containing valuable metadata produced as data moves through various sub-systems
Can this data, available in many VARIETIES, be mined for valuable insights for both Zyme and our
customers?

There is an ever-growing demand for "near-real-time" feed-outs:
- even though data volumes are growing,
- even though the variety of data being processed is growing, and
- even though the computations and analytics performed are becoming more demanding,
the time available to process the data keeps shrinking. The ability to process data at high VELOCITY is a
necessity.



Zyme Data Lake

[Diagram: Zyme Data Lake architecture. Source data (POS, TruePay, Inventory, Risk) flows through RT
(real-time) processes and ETL into the Zyme Data Lake and on to the TrueData DW / Data Mart, which
serves Reports and Analytics over JDBC. The Data Lake provides STORE, COMPUTE and ACCESS layers,
with Batch, Real-time, Stream and Interactive access patterns. It is massively scalable, highly
available, and runs on commodity hardware.]
Background to POC: Goods Movement with Invoice #s

[Diagram: goods movement across tiers. Each party is identified by ZID, LOCID and PARTNERCODE.
Flow: Manufacturer → (Sell-In) → T1 → (Sell-Thru) → T2 → (Sell-Out) → EC, with RMA and Returns
flowing in the reverse direction. Distis 1–3 and Resellers 1–3 hold SKUs (SKU-X, SKU-Y, SKU-Z) at
individual Serial # level (SN-1, SN-2, SN-11, SN-111, …), and movements carry key fields such as
Invoice # (e.g. Invoice 1 covering SKU-Y: SN-11).]


Product States (At Serial # level)

[Diagram: four snapshots of one unit (SKU: 123, Serial #: 456) moving along the same tier chain
(Manufacturer → Sell-In → T1 → Sell-Thru → T2 → Sell-Out → EC, with RMA/Returns in reverse):
1. After Sell-In: Available at T1.
2. After Sell-Thru: Unavailable at T1, Available at T2.
3. After a Return: Available at T1, Unavailable at T2.
4. After Sell-Out: Unavailable at both T1 and T2.]
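The status flips above can be sketched as a tiny in-memory model (illustrative only; the data structure and names here are assumptions, not Zyme's actual schema):

```python
# Minimal sketch of per-Serial# status tracking across tiers.
# A unit is Available at exactly one holder; a movement marks it
# Unavailable at the seller and Available at the buyer, and a
# Return reverses that.

def move(positions, sku, serial, seller, buyer):
    """Record a goods movement (Sell-Thru / Sell-Out / Return)."""
    positions[(sku, serial, seller)] = "Unavailable"
    positions[(sku, serial, buyer)] = "Available"

positions = {}
# Sell-In: SKU 123 / Serial 456 becomes Available at T1
positions[("123", "456", "T1")] = "Available"
# Sell-Thru: T1 -> T2
move(positions, "123", "456", "T1", "T2")
# Return: T2 -> T1 (statuses flip back, as in snapshot 3)
move(positions, "123", "456", "T2", "T1")
```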



Questions to be answered by Inventory APIs

All SKUs
- Given a Partner (identified by ZID, LOCID and/or PartnerCode), return all SKUs with
available quantity associated with that Partner
- Given a Partner and a SKU, return available quantity
- Given a Partner, a SKU and a key field (like Invoice #), return available quantity
- Given a Partner and a SKU, return history of “events” which resulted in the current
calculated quantity since the last “reset/opening balance” event
- Given a Partner, a SKU and a key field (like Invoice #), return history as above

Serialized SKUs
- Given a Partner and a SKU, return available Serial #s
- Given a Partner, a SKU and a key field (like Invoice #), return available Serial
#s
- Given a Partner, SKU and Serial #, check whether the Serial # was previously sold by the
Partner (no longer available with the Partner, but was it sold by the Partner
previously? This also answers whether the sale was reported earlier, a.k.a. the dup
check)
- Given a Serial #, fetch the corresponding SKU code
- Given a Serial #, fetch the corresponding SKU code and current location (Partner)
- Given a (CVIS) Transaction ID, fetch the corresponding Serial #s
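A rough in-memory sketch of how a few of these lookups might behave (the data structures, partner/SKU/serial values and function names are hypothetical, not the actual API):

```python
# Hypothetical in-memory indexes answering a few of the queries above.
# positions:    (partner, sku) -> set of available Serial #s
# serial_index: serial -> (sku, current partner/location)
# sold_history: set of (partner, sku, serial) previously sold (dup check)

positions = {("P1", "SKU-X"): {"SN-1", "SN-3"},
             ("P1", "SKU-Z"): {"SN-111"}}
serial_index = {"SN-1": ("SKU-X", "P1"), "SN-3": ("SKU-X", "P1"),
                "SN-111": ("SKU-Z", "P1")}
sold_history = {("P1", "SKU-X", "SN-2")}  # SN-2 already sold by P1

def skus_with_availability(partner):
    """All SKUs with available quantity for a given Partner."""
    return sorted(sku for (p, sku), s in positions.items() if p == partner and s)

def available_quantity(partner, sku):
    """Available quantity for a given Partner and SKU."""
    return len(positions.get((partner, sku), set()))

def was_previously_sold(partner, sku, serial):
    """Was this Serial # previously sold by the Partner (dup check)?"""
    return (partner, sku, serial) in sold_history

def sku_and_location(serial):
    """Given a Serial #, fetch the SKU code and current location."""
    return serial_index.get(serial)
```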



Performance considerations
Taking a large mobile-phone manufacturer as an example:
- Around 5 million Serial #s are manufactured per week
- For every Tier the product hops, an equivalent number of "Goods movement" events is captured. So, if
tracking is done across 4 Tiers, there are 25 million "Goods movement" events every week:
1. Sell-In (5 Million)
2. T1->T2 Sell-Thru (5 Million)
3. T2->T3 Sell-Thru (5 Million)
4. T3->T4 Sell-Thru (5 Million)
5. T4->EC Sell-Out (5 Million)
- Simplifying assumption: Each “goods movement” transaction will include a batch of 10 Serial #s
- Our standard SLA refers to 18 months (72 weeks) of data retention
- This implies tracking 2.5 million “transactions” per week and 180 Million Transactions in 18
months
- This also implies around 1.8 Billion goods-movements events to be captured and tracked at a
Serial # level
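The sizing figures above can be reproduced with quick arithmetic (all inputs are numbers stated on this slide):

```python
# Quick check of the sizing figures above.
serials_per_week = 5_000_000   # Serial #s manufactured per week
movement_hops = 5              # Sell-In + 3 Sell-Thrus + Sell-Out (4 tracked Tiers)
serials_per_txn = 10           # simplifying assumption from the slide
retention_weeks = 72           # 18-month retention SLA

events_per_week = serials_per_week * movement_hops    # 25 million
txns_per_week = events_per_week // serials_per_txn    # 2.5 million
txns_retained = txns_per_week * retention_weeks       # 180 million
events_retained = events_per_week * retention_weeks   # 1.8 billion
```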

Largest current customer in production


- 1 million POS and 1.5 million INV transactions processed per week
- POS table contains 18 months of data totaling up to 72 million transactions and INV 108 million

MySQL performance enhancement approaches


- Optimize the schema
- Add more cores/memory
- Partition/shard the data
- NDB Cluster
- MySQL Fabric
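One of the listed approaches, partitioning/sharding the data, boils down to deterministic key routing; a minimal sketch (the shard count and routing key are hypothetical):

```python
# Minimal hash-sharding sketch: route each transaction to one of N
# MySQL shards by a stable key (e.g. the partner's ZID), so each shard
# holds a bounded slice of the weekly volume.
import hashlib

N_SHARDS = 8  # hypothetical shard count

def shard_for(key: str) -> int:
    # md5 keeps routing stable across processes (unlike built-in hash()).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_SHARDS

s = shard_for("ZID1")
```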



Inventory Tracking

Inventory Tracking History — groups and fields:
- Party: Partner Code, ZymeID, LocationID
- Product: SKU Code, Serial #
- Events: Event, Event Quantity
- Current Balance: Available Quantity
- References: Source (CVIS/TDAWB/?), Transaction ID, Bucket ID (e.g. Sell-In Invoice #???),
  CREATED_TS, MODIFIED_TS

Event types: Sell-In, Sell-Through, Sell-Out, Returns, RMA, Reported Inventory, Opening Balance,
Purges, Adjustments

Inventory Positions — groups and fields:
- Party: Partner Code, ZymeID, LocationID
- Product: SKU Code, Serial #s (each with an Available/Unavailable status)
- Current Balance: Available Quantity
- References: Bucket ID (e.g. Sell-In Invoice #???), CREATED_TS, MODIFIED_TS
Inventory Tracking History Sample

Group    Field         | Step 1      | Step 2      | Step 3    | Step 4a   | Step 4b   | Step 5a   | Step 5b
Party    Partner Code  | PC1         |             | PC1       | PC1       |           |           | PC1
         ZymeID        | ZID1        | ZID2        | ZID1      | ZID1      | ZID2      | ZID2      | ZID1
         LocationID    | ZLOCID1     |             | ZLOCID1   | ZLOCID1   |           |           | ZLOCID1
Product  SKU Code      | SKU1        | SKU1        | SKU1      | SKU1      | SKU1      | SKU1      | SKU1
         Serial #s     | SN1, SN2    | SN5, SN6    | SN3, SN4  | SN1, SN2  | SN1, SN2  | SN2       | SN2
Events   Event         | Opening Bal | Opening Bal | Sell-In   | Sell-Thru | Sell-Thru | Returns   | Returns
         Event Qty     | 1000        | 2000        | 500       | 100       | 100       | 10        | 10
Balance  Available Qty | 1000        | 2000        | 1500      | 1400      | 2100      | 2090      | 1410
Refs     Source        | TDAWB       | TDAWB       | CVIS      | CVIS      | CVIS      | CVIS      | CVIS
         Txn ID        |             |             | 12345     | 54321     | 54321     | 88888     | 88888
         UDF (Bucket)  | I1          | I2          | I3        |           |           |           |
         CREATED       | 5/3 10:15   | 5/3 10:15   | 5/4 10:15 | 5/6 10:15 | 5/6 10:15 | 5/7 10:15 | 5/7 10:15
         MODIFIED      | 5/3 10:15   | 5/3 10:15   | 5/4 10:15 | 5/6 10:15 | 5/6 10:15 | 5/7 10:15 | 5/7 10:15

Resulting Inventory Positions per step (A = Available, U = Unavailable):
- Step 1: SKU1 / Bucket ID (I1) — SN1 A, SN2 A
- Step 2: SKU1 / Bucket ID (I2) — SN5 A, SN6 A
- Step 3: SKU1 / Bucket ID (I1) — SN1 A, SN2 A; Bucket ID (I3) — SN3 A, SN4 A
- Step 4a: SKU1 / Bucket ID (I1) — SN1 U, SN2 U; Bucket ID (I3) — SN3 A, SN4 A
- Step 4b: SKU1 / Bucket ID (I2) — SN5 A, SN6 A; Bucket ID (I3) — SN1 A, SN2 A
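The Available Qty row can be reproduced by replaying the events as a simple per-party ledger (a sketch; the event semantics are inferred from the sample itself):

```python
# Replay the sample steps as signed quantity deltas per ZymeID.
# Opening Balance and Sell-In add to the receiving party; a Sell-Thru
# deducts from the seller and adds to the buyer; a Return reverses that.
steps = [  # (party, delta)
    ("ZID1", +1000),  # Step 1:  Opening Balance
    ("ZID2", +2000),  # Step 2:  Opening Balance
    ("ZID1", +500),   # Step 3:  Sell-In
    ("ZID1", -100),   # Step 4a: Sell-Thru (out of ZID1)
    ("ZID2", +100),   # Step 4b: Sell-Thru (into ZID2)
    ("ZID2", -10),    # Step 5a: Returns (out of ZID2)
    ("ZID1", +10),    # Step 5b: Returns (back into ZID1)
]
balances, trail = {}, []
for party, delta in steps:
    balances[party] = balances.get(party, 0) + delta
    trail.append(balances[party])
```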


Inventory Management Interfaces

[Diagram: the Inventory Mgmt API (GET / POST) fronts the History, Positions and Configs stores for
consuming systems: CVIS (post-finalization, Serial # (ABR)), ZGW, TruePay, TDAWB, ZymeNet Feeds,
POW and the Inventory Mgmt App; Feed-Out and Exceptions Feeds are produced downstream.]

Query (GET) interfaces:
- Quick SISO Check (POS): Given a Partner, a SKU (and BucketID), return available quantity and check
  if it is within threshold
- Reported Inventory Check (INV): Given a Partner, a SKU (and BucketID), return available quantity and
  check if it is within threshold
- Serial # validations (e.g. dup check): Given a Partner, SKU and Serial #, check if the Serial # was
  previously sold by the Partner
- Given a Serial #, fetch the corresponding SKU code
- Given a Serial #, fetch the corresponding SKU code (and current location (Partner))
- Given a (CVIS) Txn ID, fetch the corresponding Serial #s
- Given a Partner/SKU/Fiscal Date, return balance
- Given a Partner, get all SKUs with available quantity
- Given a Partner, a SKU (and BucketID), return "history"
- Given a Partner, a SKU (and BucketID), return available quantity

Update (POST) interfaces:
- Initialize Balances: Reported Inventory, Opening Balance / Reset Balance
- Report Goods Movement: Sell-In, POS (ST/SO), Returns
- Report Purges: Purges
- Adjustments: Adjustments
Performance considerations

Inventory Transaction History table


• 1 Sell-In Transaction will result in 1 entry (added to Seller)
• 1 Sell-Through Transaction will result in 2 entries in this table (deducted from Seller and added to Buyer)
• 1 Returns Transaction will result in 2 entries in this table (added to Seller and deducted from Buyer)
• 1 Sell-Out transaction will result in 1 entry (deducted from Seller)
• With the earlier example of 2.5 million transactions per week, about 4 million entries will be created
per week (1 Sell-In: 0.5 million, 3 Sell-Thrus: 3 million, 1 Sell-Out: 0.5 million)
• This implies about 288 million entries for 18 months (72 weeks) (cumulative size of Sell-In and POS
tables)
• Entries are only “added” to this table (and purged during archival)
• Out of 4 million entries per week, 3 million will be on peak days (Mon/Tue) and within the peak 10 hours
on those days, which implies 150K inserts per hour

Inventory Positions table


- Every serial # will have one entry for every Tier in this table (4 Tiers in this eg. T1, T2, T3, T4)
- 5 million Serial #s manufactured per week implies 20 million entries per week
- This implies about 1.5 billion entries for 18 months (72 weeks)
- Every “Goods movement” signal will result in created/updated/deleted entry
- With the earlier example of 0.5 million transactions per event type per week with 10 Serial #s each, 40
million entries will be created/updated/deleted per week (1 Sell-In: 5 million, 3 Sell-Thrus: 30 million,
1 Sell-Out: 5 million)
- Out of 40 million updates per week, 30 million will be on peak days (Mon/Tue) and within the peak 10
hours on those days, which implies 1.5 Million updates per hour
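The counts on this slide reduce to the following arithmetic (a sketch using the slide's own inputs):

```python
# Entry-count arithmetic behind the two tables.
txns_per_type = 500_000   # transactions per event type per week
serials_per_txn = 10
weeks = 72                # 18-month retention

# History table: Sell-In and Sell-Out write 1 entry each; each of the
# 3 Sell-Thrus writes 2 entries (seller deduction + buyer addition).
history_per_week = txns_per_type * 1 + 3 * txns_per_type * 2 + txns_per_type * 1
history_retained = history_per_week * weeks

# Positions table: the same events, but one entry per Serial #.
positions_per_week = history_per_week * serials_per_txn

# 3/4 of weekly entries land in the 2 x 10 peak hours (Mon/Tue).
peak_history_inserts_per_hour = history_per_week * 3 // 4 // 20
peak_positions_updates_per_hour = positions_per_week * 3 // 4 // 20
```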



Apache Cassandra is an open-source distributed database management
system designed to handle large amounts of data across many commodity servers,
providing high availability with no single point of failure.

Cassandra offers robust support for clusters spanning multiple data centers, with
asynchronous masterless replication allowing low-latency operations for all clients.

Tables may be created, dropped, and altered at runtime without blocking updates and
queries.

Cassandra does not support joins or subqueries, except for batch analysis
via Hadoop. Rather, Cassandra emphasizes denormalization through features like
collections.

• http://cassandra.apache.org/

• http://planetcassandra.org/

• http://projects.spring.io/spring-data-cassandra/



Decentralized
• Every node in the cluster has the same role. There is no single point of failure.
Supports replication and multi data center replication
• Key features of Cassandra’s distributed architecture are specifically tailored for multi-data-center
deployment, redundancy, failover and disaster recovery.
Scalability
• Read and write throughput both increase linearly as new machines are added, with no
downtime or interruption to applications
Fault-tolerant
• Data is automatically replicated to multiple nodes for fault-tolerance. Replication across
multiple data centers is supported. Failed nodes can be replaced with no downtime.
Tunable consistency
• Writes and reads offer a tunable level of consistency, all the way from "writes never fail"
to "block for all replicas to be readable", with the quorum level in the middle.
MapReduce support
• Cassandra has Hadoop integration with MapReduce support. There is also support for
Apache Pig and Apache Hive.
Query language
• Cassandra introduces CQL (Cassandra Query Language), a SQL-like alternative to the
traditional RPC interface. Language drivers are available for Java (JDBC), Python
(DBAPI2), Node.JS (Helenus) and Go (gocql).
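The "tunable consistency" knob follows simple replica arithmetic: with replication factor N, a read at consistency level R and a write at level W are guaranteed to overlap (so reads see the latest acknowledged write) whenever R + W > N, with QUORUM = N/2 + 1 sitting in the middle. A sketch of that rule:

```python
# Replica-overlap arithmetic behind Cassandra-style tunable consistency.

def quorum(replication_factor: int) -> int:
    """Smallest strict majority of replicas."""
    return replication_factor // 2 + 1

def overlaps(replication_factor: int, read_level: int, write_level: int) -> bool:
    """True if every read is guaranteed to see the latest acknowledged write."""
    return read_level + write_level > replication_factor
```

With RF=3, QUORUM reads plus QUORUM writes (2 + 2 > 3) give strong consistency, while ONE/ONE (1 + 1) trades consistency for latency.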



Kafka

Kafka is a distributed, partitioned, replicated commit-log service. It provides
the functionality of a messaging system, but with a unique design.

• Kafka maintains feeds of messages in categories called topics.
• Processes that publish messages to a Kafka topic are called producers.
• Processes that subscribe to topics and process the feed of published
messages are called consumers.
• Kafka is run as a cluster comprising one or more servers, each of which is
called a broker.

So, at a high level, producers send messages over the network to the Kafka
cluster, which in turn serves them up to consumers.
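The topic/partition/offset model can be illustrated with a toy in-memory stand-in (pure Python for illustration, not the real Kafka client API):

```python
# Toy partitioned commit log illustrating Kafka's model: a topic is a set
# of partitions, each an append-only list; a message's offset is its
# position in its partition, and consumers read in offset order.

class TinyLog:
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, message):
        p = hash(key) % len(self.partitions)   # same key -> same partition
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        """Return all messages at or after `offset`, in log order."""
        return self.partitions[partition][offset:]

log = TinyLog()
p1, o1 = log.produce("orders", "M1")
p2, o2 = log.produce("orders", "M2")
```

Because both messages share a key, they land in the same partition with increasing offsets, mirroring Kafka's per-partition ordering guarantee.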



Kafka

At a high-level Kafka gives the following guarantees:


• Messages sent by a producer to a particular topic partition will be
appended in the order they are sent. That is, if a message M1 is sent by
the same producer as a message M2, and M1 is sent first, then M1 will
have a lower offset than M2 and appear earlier in the log.
• A consumer instance sees messages in the order they are stored in the
log.
• For a topic with replication factor N, we will tolerate up to N-1 server
failures without losing any messages committed to the log.

http://kafka.apache.org/

http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/

http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/
Storm
Apache Storm is a free and open source distributed realtime computation system. Storm
makes it easy to reliably process unbounded streams of data, doing for realtime
processing what Hadoop did for batch processing. Storm is simple, can be used with any
programming language, and is a lot of fun to use!

The core abstraction in Storm is the “stream”. A stream is an unbounded sequence of
tuples. Storm provides the primitives for transforming a stream into a new stream in a
distributed and reliable way.

The basic primitives Storm provides for doing stream transformations are “spouts” and
“bolts”.

Spouts and bolts have interfaces that you implement to run your application-specific logic.
A spout is a source of streams. For example, a spout may read tuples off of a Kafka queue
and emit them as a stream. Or a spout may connect to the Twitter API and emit a stream
of tweets.

A bolt consumes any number of input streams, does some processing, and possibly emits
new streams. Complex stream transformations require multiple steps and thus multiple
bolts. Bolts can do anything from running functions and filtering tuples to streaming
aggregations, streaming joins, talking to databases, and more.
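The spout/bolt pipeline can be mimicked with plain generators (an illustrative stand-in for the concepts, not the Storm API; the event lines are made up):

```python
# Generator-based stand-in for a Storm topology: a spout emits a stream
# of tuples; each bolt transforms one stream into a new stream.

def spout():
    """Source of a stream, e.g. lines read off a queue."""
    yield from ["sell-in 5", "sell-thru 3", "sell-out 2"]

def parse_bolt(stream):
    """Bolt: split each raw line into an (event, qty) tuple."""
    for line in stream:
        event, qty = line.split()
        yield event, int(qty)

def total_bolt(stream):
    """Bolt: streaming aggregation, emitting a running total of qty."""
    total = 0
    for _, qty in stream:
        total += qty
        yield total

# Wire the "topology": spout -> parse bolt -> aggregating bolt.
running_totals = list(total_bolt(parse_bolt(spout())))
```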



Storm: Key properties

 Extremely broad set of use cases: Storm can be used for processing messages and updating
databases (stream processing), doing a continuous query on data streams and streaming the results
into clients (continuous computation), parallelizing an intense query like a search query on the fly
(distributed RPC), and more. Storm’s small set of primitives satisfy a stunning number of use cases.
 Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have
to do is add machines and increase the parallelism settings of the topology. As an example of Storm’s
scale, one of Storm’s initial applications processed 1,000,000 messages per second on a 10 node
cluster, including hundreds of database calls per second as part of the topology. Storm’s usage of
Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
 Guarantees no data loss: A realtime system must have strong guarantees about data being
successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees
that every message will be processed, and this is in direct contrast with other systems like S4.
 Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage,
Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of
managing Storm clusters as painless as possible.
 Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as
necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
 Programming language agnostic: Robust and scalable realtime processing shouldn’t be limited to a
single platform. Storm topologies and processing components can be defined in any language, making
Storm accessible to nearly anyone.



Storm
Networks of spouts and bolts are packaged into a “topology”, which is the top-level
abstraction that you submit to Storm clusters for execution.

https://storm.apache.org/
https://storm.apache.org/documentation/Tutorial.html
https://github.com/hmsonline/storm-cassandra
http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/



Apache ZooKeeper



Problem Statement: Master-Worker Application in a Distributed System

 Problems
Master crashes
Worker crashes
Communication failures

 Problems for Master Crashes
Use a backup master
How is the latest state recovered?
The backup master may wrongly suspect that the primary master has crashed
 Problems for Worker Crashes
Master must detect worker crashes
Recover assigned tasks
 Problems for Communication Failures
Execute a same task only once



What is ZooKeeper

 An Open source, High Performance coordination service for distributed


applications
 Reliable, fault tolerant and highly-available
 Simple
• ZooKeeper allows distributed processes to coordinate with each other through
a shared hierarchical namespace which is organized similarly to a standard file
system.
 Replicated
• ZooKeeper itself is intended to be replicated over a set of hosts called an
ensemble
 Distributed Cluster Management
 Node Join/Leave
 Node Status in real time



ZooKeeper Service

 The servers that make up the ZooKeeper service must all know about each other
 As long as a majority of the servers are available, the ZooKeeper service will be available
 All machines store a copy of the data (in memory)
 A leader is elected on service start-up
 Clients connect to a single server and maintain a TCP connection.
A client can read from any server; writes go through the leader and need majority consensus.
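The majority rule is why ensembles are typically odd-sized: a 5-server ensemble tolerates 2 failures, while a 6-server ensemble still tolerates only 2. A sketch of that arithmetic:

```python
# ZooKeeper-style majority availability: the service stays up as long as
# a strict majority of the ensemble's servers are up and connected.

def majority(ensemble_size: int) -> int:
    return ensemble_size // 2 + 1

def is_available(ensemble_size: int, servers_up: int) -> bool:
    return servers_up >= majority(ensemble_size)

def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail before the service loses its majority."""
    return ensemble_size - majority(ensemble_size)
```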



Case Study

 Kafka uses Zookeeper for the following tasks:


• Detecting the addition and the removal of brokers and consumers,
• Triggering a rebalance process in each consumer when the above events happen, and
• Maintaining the consumption relationship and keeping track of the consumed offset of each partition

 Steps:
1. Unzip the Apache Kafka version 0.8.1.1
2. Apache Kafka requires zookeeper to be available and running. Start zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
3. Start Kafka Server using the following command
bin/kafka-server-start.sh config/server.properties
4. Create a topic in Apache Kafka using the following command
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1
--partitions 1 --topic test
5. Ask Zookeeper to list available topics on Apache Kafka.
bin/kafka-topics.sh --list --zookeeper localhost:2181
6. Send test messages to the Apache Kafka topic called test using the command-line producer
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Enter some messages, e.g. "Hello vulab", and press Enter
7. Use the command-line consumer to check for messages on the Apache Kafka topic called test
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning



Proof Of Concept: Inventory Tracking

[Diagram: CVIS and POS data arrive over RT (real-time) channels; an Integrate step and POS Sync feed
the Inventory Tracking process, which calculates Inventory Positions into the POS / Inventory STORE.
An Inventory API serves those positions to the Inventory Mgmt application.]


Deployment Architecture

(Node sizing appears to be cores x memory (GB) x disk (GB).)
- CVIS + CVIS DB: 4x16x500 (172.16.1.194); Inventory API: 4x16x10 (172.1.16.193)
- Shared File System
- Kafka Node 1 & Node 2, ZooKeeper Node 1 & Node 2: 2x4x100 each (172.16.1.195)
- Storm Node 1 & Node 2: 4x16x100 each (172.16.1.196)
- C* (Cassandra) Nodes 1–4: 2x4x300 each (172.16.1.197)


Teams

Sharath, Prasad M, Raju, Amar
Rafi, Soumya, Komali
DB, IT


Inventory Tracking: In-line Inventory Positions Calculations

[Diagram: on CVIS finalization, POS files land on the existing NAS; a new AMQ Listener publishes them
to a Kafka Topic. A Kafka Listener / Kafka Spout feeds the processing pipeline (File Reader → CSV
Reader / Json Parser → Calculate), which calculates and stores POS Inventory History and Inventory
Positions in Cassandra.]



Inventory Tracking: Inventory Positions Consumption

[Diagram: the Inventory Management App calls the new Inventory API over TLS, alongside the existing
data path; the API reads Inventory History, Latest Inventory Positions and Fiscal Week Inventory
Positions data from Cassandra.]
