
Zackathon

January 2015
Agenda

• Big Data Quick Recap

• Proof Of Concept : Background

• Big Data POC technologies overview

• Putting it all together

• Teams and timelines

© Zyme Solutions Inc. 2014


Big Data: The 3 V’s
Big data is high Volume, high Velocity, and/or high Variety information assets that
require new forms of processing to enable enhanced decision making, insight
discovery and process optimization.
- Gartner

Zyme, like most other enterprises, is awash with ever-growing data of all types, and needs to
handle exponential growth in data VOLUMES as it:
- adds many, many more customers and partners
- collects sales data from additional Tiers (Tier 2, Tier 3, …)
- collects sales data not just at SKU level but at Serial # level

Zyme provides rich interfaces for analytics and reporting on "structured data", but it also has lots of
unstructured data, for example:
- raw data reported by partners in many different formats
- logs containing valuable metadata produced as data moves through various sub-systems
Can this data, available in many VARIETIES, be mined for valuable insights for both Zyme and our
customers?

There is an ever-growing demand for "near-real-time" feed-outs:
- even though data volumes are growing,
- even though the variety of data being processed is growing, and
- even though the computations and analytics performed are becoming more demanding,
the time available to process the data keeps shrinking. The ability to process data at high VELOCITY is a
necessity.



Zyme Data Lake

[Diagram: Zyme Data Lake architecture. Source data (POS, TruePay, Inventory, Risk) flows through RT
(real-time) processes and ETL into the Zyme Data Lake and on to the TrueData DW / Data Mart, which
serves Reports and Analytics over JDBC. The Data Lake provides STORE, COMPUTE and ACCESS layers,
with Batch, Real-time, Stream and Interactive access patterns. It is massively scalable, highly
available, and runs on commodity hardware.]
Background to POC: Goods Movement with Invoice #s

[Diagram: goods movement across tiers. Each party is identified by ZID, LOCID and PARTNERCODE.
Flow: Manufacturer → (Sell-In) → T1 → (Sell-Thru) → T2 → (Sell-Out) → EC, with RMA and Returns
flowing in the reverse direction. Distis 1–3 and Resellers 1–3 hold SKUs (SKU-X, SKU-Y, SKU-Z) at
individual Serial # level (SN-1, SN-2, SN-11, SN-111, …), and movements carry key fields such as
Invoice # (e.g. Invoice 1 covering SKU-Y: SN-11).]


Product States (At Serial # level)

[Diagram: four snapshots of one unit (SKU: 123, Serial #: 456) moving along the same tier chain
(Manufacturer → Sell-In → T1 → Sell-Thru → T2 → Sell-Out → EC, with RMA/Returns in reverse):
1. After Sell-In: Available at T1.
2. After Sell-Thru: Unavailable at T1, Available at T2.
3. After a Return: Available at T1, Unavailable at T2.
4. After Sell-Out: Unavailable at both T1 and T2.]
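The status flips above can be sketched as a tiny in-memory model (illustrative only; the data structure and names here are assumptions, not Zyme's actual schema):

```python
# Minimal sketch of per-Serial# status tracking across tiers.
# A unit is Available at exactly one holder; a movement marks it
# Unavailable at the seller and Available at the buyer, and a
# Return reverses that.

def move(positions, sku, serial, seller, buyer):
    """Record a goods movement (Sell-Thru / Sell-Out / Return)."""
    positions[(sku, serial, seller)] = "Unavailable"
    positions[(sku, serial, buyer)] = "Available"

positions = {}
# Sell-In: SKU 123 / Serial 456 becomes Available at T1
positions[("123", "456", "T1")] = "Available"
# Sell-Thru: T1 -> T2
move(positions, "123", "456", "T1", "T2")
# Return: T2 -> T1 (statuses flip back, as in snapshot 3)
move(positions, "123", "456", "T2", "T1")
```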



Questions to be answered by Inventory APIs

All SKUs
- Given a Partner (identified by ZID, LOCID and/or PartnerCode), return all SKUs with
available quantity associated with that Partner
- Given a Partner and a SKU, return available quantity
- Given a Partner, a SKU and a key field (like Invoice #), return available quantity
- Given a Partner and a SKU, return history of “events” which resulted in the current
calculated quantity since the last “reset/opening balance” event
- Given a Partner, a SKU and a key field (like Invoice #), return history as above

Serialized SKUs
- Given a Partner and a SKU, return available Serial #s
- Given a Partner, a SKU and a key field (like Invoice #), return available Serial
#s
- Given a Partner, SKU and Serial #, check whether the Serial # was previously sold by the
Partner (no longer available with the Partner, but was it sold by the Partner
previously? This also answers whether the sale was reported earlier, a.k.a. the dup
check)
- Given a Serial #, fetch the corresponding SKU code
- Given a Serial #, fetch the corresponding SKU code and current location (Partner)
- Given a (CVIS) Transaction ID, fetch the corresponding Serial #s
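A rough in-memory sketch of how a few of these lookups might behave (the data structures, partner/SKU/serial values and function names are hypothetical, not the actual API):

```python
# Hypothetical in-memory indexes answering a few of the queries above.
# positions:    (partner, sku) -> set of available Serial #s
# serial_index: serial -> (sku, current partner/location)
# sold_history: set of (partner, sku, serial) previously sold (dup check)

positions = {("P1", "SKU-X"): {"SN-1", "SN-3"},
             ("P1", "SKU-Z"): {"SN-111"}}
serial_index = {"SN-1": ("SKU-X", "P1"), "SN-3": ("SKU-X", "P1"),
                "SN-111": ("SKU-Z", "P1")}
sold_history = {("P1", "SKU-X", "SN-2")}  # SN-2 already sold by P1

def skus_with_availability(partner):
    """All SKUs with available quantity for a given Partner."""
    return sorted(sku for (p, sku), s in positions.items() if p == partner and s)

def available_quantity(partner, sku):
    """Available quantity for a given Partner and SKU."""
    return len(positions.get((partner, sku), set()))

def was_previously_sold(partner, sku, serial):
    """Was this Serial # previously sold by the Partner (dup check)?"""
    return (partner, sku, serial) in sold_history

def sku_and_location(serial):
    """Given a Serial #, fetch the SKU code and current location."""
    return serial_index.get(serial)
```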



Performance considerations
Taking a large mobile-phone manufacturer as an example:
- Around 5 million Serial #s are manufactured per week
- For every Tier the product hops, an equivalent number of "Goods movement" events is captured. So, if
tracking is done across 4 Tiers, there are 25 million "Goods movement" events every week:
1. Sell-In (5 Million)
2. T1->T2 Sell-Thru (5 Million)
3. T2->T3 Sell-Thru (5 Million)
4. T3->T4 Sell-Thru (5 Million)
5. T4->EC Sell-Out (5 Million)
- Simplifying assumption: Each “goods movement” transaction will include a batch of 10 Serial #s
- Our standard SLA refers to 18 months (72 weeks) of data retention
- This implies tracking 2.5 million “transactions” per week and 180 Million Transactions in 18
months
- This also implies around 1.8 Billion goods-movements events to be captured and tracked at a
Serial # level
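The sizing figures above can be reproduced with quick arithmetic (all inputs are numbers stated on this slide):

```python
# Quick check of the sizing figures above.
serials_per_week = 5_000_000   # Serial #s manufactured per week
movement_hops = 5              # Sell-In + 3 Sell-Thrus + Sell-Out (4 tracked Tiers)
serials_per_txn = 10           # simplifying assumption from the slide
retention_weeks = 72           # 18-month retention SLA

events_per_week = serials_per_week * movement_hops    # 25 million
txns_per_week = events_per_week // serials_per_txn    # 2.5 million
txns_retained = txns_per_week * retention_weeks       # 180 million
events_retained = events_per_week * retention_weeks   # 1.8 billion
```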

Largest current customer in production


- 1 million POS and 1.5 million INV transactions processed per week
- POS table contains 18 months of data totaling up to 72 million transactions and INV 108 million

MySQL performance enhancement approaches


- Optimize the schema
- Add more cores/memory
- Partition/shard the data
- NDB Cluster
- MySQL Fabric
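One of the listed approaches, partitioning/sharding the data, boils down to deterministic key routing; a minimal sketch (the shard count and routing key are hypothetical):

```python
# Minimal hash-sharding sketch: route each transaction to one of N
# MySQL shards by a stable key (e.g. the partner's ZID), so each shard
# holds a bounded slice of the weekly volume.
import hashlib

N_SHARDS = 8  # hypothetical shard count

def shard_for(key: str) -> int:
    # md5 keeps routing stable across processes (unlike built-in hash()).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_SHARDS

s = shard_for("ZID1")
```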



Inventory Tracking

Inventory Tracking History — groups and fields:
- Party: Partner Code, ZymeID, LocationID
- Product: SKU Code, Serial #
- Events: Event, Event Quantity
- Current Balance: Available Quantity
- References: Source (CVIS/TDAWB/?), Transaction ID, Bucket ID (e.g. Sell-In Invoice #???),
  CREATED_TS, MODIFIED_TS

Event types: Sell-In, Sell-Through, Sell-Out, Returns, RMA, Reported Inventory, Opening Balance,
Purges, Adjustments

Inventory Positions — groups and fields:
- Party: Partner Code, ZymeID, LocationID
- Product: SKU Code, Serial #s (each with an Available/Unavailable status)
- Current Balance: Available Quantity
- References: Bucket ID (e.g. Sell-In Invoice #???), CREATED_TS, MODIFIED_TS
Inventory Tracking History Sample

Group    Field         | Step 1      | Step 2      | Step 3    | Step 4a   | Step 4b   | Step 5a   | Step 5b
Party    Partner Code  | PC1         |             | PC1       | PC1       |           |           | PC1
         ZymeID        | ZID1        | ZID2        | ZID1      | ZID1      | ZID2      | ZID2      | ZID1
         LocationID    | ZLOCID1     |             | ZLOCID1   | ZLOCID1   |           |           | ZLOCID1
Product  SKU Code      | SKU1        | SKU1        | SKU1      | SKU1      | SKU1      | SKU1      | SKU1
         Serial #s     | SN1, SN2    | SN5, SN6    | SN3, SN4  | SN1, SN2  | SN1, SN2  | SN2       | SN2
Events   Event         | Opening Bal | Opening Bal | Sell-In   | Sell-Thru | Sell-Thru | Returns   | Returns
         Event Qty     | 1000        | 2000        | 500       | 100       | 100       | 10        | 10
Balance  Available Qty | 1000        | 2000        | 1500      | 1400      | 2100      | 2090      | 1410
Refs     Source        | TDAWB       | TDAWB       | CVIS      | CVIS      | CVIS      | CVIS      | CVIS
         Txn ID        |             |             | 12345     | 54321     | 54321     | 88888     | 88888
         UDF (Bucket)  | I1          | I2          | I3        |           |           |           |
         CREATED       | 5/3 10:15   | 5/3 10:15   | 5/4 10:15 | 5/6 10:15 | 5/6 10:15 | 5/7 10:15 | 5/7 10:15
         MODIFIED      | 5/3 10:15   | 5/3 10:15   | 5/4 10:15 | 5/6 10:15 | 5/6 10:15 | 5/7 10:15 | 5/7 10:15

Resulting Inventory Positions per step (A = Available, U = Unavailable):
- Step 1: SKU1 / Bucket ID (I1) — SN1 A, SN2 A
- Step 2: SKU1 / Bucket ID (I2) — SN5 A, SN6 A
- Step 3: SKU1 / Bucket ID (I1) — SN1 A, SN2 A; Bucket ID (I3) — SN3 A, SN4 A
- Step 4a: SKU1 / Bucket ID (I1) — SN1 U, SN2 U; Bucket ID (I3) — SN3 A, SN4 A
- Step 4b: SKU1 / Bucket ID (I2) — SN5 A, SN6 A; Bucket ID (I3) — SN1 A, SN2 A
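The Available Qty row can be reproduced by replaying the events as a simple per-party ledger (a sketch; the event semantics are inferred from the sample itself):

```python
# Replay the sample steps as signed quantity deltas per ZymeID.
# Opening Balance and Sell-In add to the receiving party; a Sell-Thru
# deducts from the seller and adds to the buyer; a Return reverses that.
steps = [  # (party, delta)
    ("ZID1", +1000),  # Step 1:  Opening Balance
    ("ZID2", +2000),  # Step 2:  Opening Balance
    ("ZID1", +500),   # Step 3:  Sell-In
    ("ZID1", -100),   # Step 4a: Sell-Thru (out of ZID1)
    ("ZID2", +100),   # Step 4b: Sell-Thru (into ZID2)
    ("ZID2", -10),    # Step 5a: Returns (out of ZID2)
    ("ZID1", +10),    # Step 5b: Returns (back into ZID1)
]
balances, trail = {}, []
for party, delta in steps:
    balances[party] = balances.get(party, 0) + delta
    trail.append(balances[party])
```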


Inventory Management Interfaces

[Diagram: the Inventory Mgmt API (GET / POST) fronts the History, Positions and Configs stores for
consuming systems: CVIS (post-finalization, Serial # (ABR)), ZGW, TruePay, TDAWB, ZymeNet Feeds,
POW and the Inventory Mgmt App; Feed-Out and Exceptions Feeds are produced downstream.]

Query (GET) interfaces:
- Quick SISO Check (POS): Given a Partner, a SKU (and BucketID), return available quantity and check
  if it is within threshold
- Reported Inventory Check (INV): Given a Partner, a SKU (and BucketID), return available quantity and
  check if it is within threshold
- Serial # validations (e.g. dup check): Given a Partner, SKU and Serial #, check if the Serial # was
  previously sold by the Partner
- Given a Serial #, fetch the corresponding SKU code
- Given a Serial #, fetch the corresponding SKU code (and current location (Partner))
- Given a (CVIS) Txn ID, fetch the corresponding Serial #s
- Given a Partner/SKU/Fiscal Date, return balance
- Given a Partner, get all SKUs with available quantity
- Given a Partner, a SKU (and BucketID), return "history"
- Given a Partner, a SKU (and BucketID), return available quantity

Update (POST) interfaces:
- Initialize Balances: Reported Inventory, Opening Balance / Reset Balance
- Report Goods Movement: Sell-In, POS (ST/SO), Returns
- Report Purges: Purges
- Adjustments: Adjustments
Performance considerations

Inventory Transaction History table


• 1 Sell-In Transaction will result in 1 entry (added to Seller)
• 1 Sell-Through Transaction will result in 2 entries in this table (deducted from Seller and added to Buyer)
• 1 Returns Transaction will result in 2 entries in this table (added to Seller and deducted from Buyer)
• 1 Sell-Out transaction will result in 1 entry (deducted from Seller)
• With the earlier example of 2.5 million transactions per week, about 4 million entries will be created
per week (1 Sell-In: 0.5 million, 3 Sell-Thrus: 3 million, 1 Sell-Out: 0.5 million)
• This implies about 288 million entries for 18 months (72 weeks) (cumulative size of Sell-In and POS
tables)
• Entries are only “added” to this table (and purged during archival)
• Out of 4 million entries per week, 3 million will be on peak days (Mon/Tue) and within the peak 10 hours
on those days, which implies 150K inserts per hour

Inventory Positions table


- Every serial # will have one entry for every Tier in this table (4 Tiers in this eg. T1, T2, T3, T4)
- 5 million Serial #s manufactured per week implies 20 million entries per week
- This implies about 1.5 billion entries for 18 months (72 weeks)
- Every “Goods movement” signal will result in created/updated/deleted entry
- With the earlier example of 0.5 million transactions per event type per week with 10 Serial #s each, 40
million entries will be created/updated/deleted per week (1 Sell-In: 5 million, 3 Sell-Thrus: 30 million,
1 Sell-Out: 5 million)
- Out of 40 million updates per week, 30 million will be on peak days (Mon/Tue) and within the peak 10
hours on those days, which implies 1.5 Million updates per hour
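The counts on this slide reduce to the following arithmetic (a sketch using the slide's own inputs):

```python
# Entry-count arithmetic behind the two tables.
txns_per_type = 500_000   # transactions per event type per week
serials_per_txn = 10
weeks = 72                # 18-month retention

# History table: Sell-In and Sell-Out write 1 entry each; each of the
# 3 Sell-Thrus writes 2 entries (seller deduction + buyer addition).
history_per_week = txns_per_type * 1 + 3 * txns_per_type * 2 + txns_per_type * 1
history_retained = history_per_week * weeks

# Positions table: the same events, but one entry per Serial #.
positions_per_week = history_per_week * serials_per_txn

# 3/4 of weekly entries land in the 2 x 10 peak hours (Mon/Tue).
peak_history_inserts_per_hour = history_per_week * 3 // 4 // 20
peak_positions_updates_per_hour = positions_per_week * 3 // 4 // 20
```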



Apache Cassandra is an open-source distributed database management
system designed to handle large amounts of data across many commodity servers,
providing high availability with no single point of failure.

Cassandra offers robust support for clusters spanning multiple data centers, with
asynchronous masterless replication allowing low-latency operations for all clients.

Tables may be created, dropped, and altered at runtime without blocking updates and
queries.

Cassandra does not support joins or subqueries, except for batch analysis
via Hadoop. Rather, Cassandra emphasizes denormalization through features like
collections.

• http://cassandra.apache.org/

• http://planetcassandra.org/

• http://projects.spring.io/spring-data-cassandra/



Decentralized
• Every node in the cluster has the same role. There is no single point of failure.
Supports replication and multi data center replication
• Key features of Cassandra’s distributed architecture are specifically tailored for multi-data-center
deployment, redundancy, failover and disaster recovery.
Scalability
• Read and write throughput both increase linearly as new machines are added, with no
downtime or interruption to applications
Fault-tolerant
• Data is automatically replicated to multiple nodes for fault-tolerance. Replication across
multiple data centers is supported. Failed nodes can be replaced with no downtime.
Tunable consistency
• Writes and reads offer a tunable level of consistency, all the way from "writes never fail"
to "block for all replicas to be readable", with the quorum level in the middle.
MapReduce support
• Cassandra has Hadoop integration with MapReduce support. There is also support for
Apache Pig and Apache Hive.
Query language
• Cassandra introduces CQL (Cassandra Query Language), a SQL-like alternative to the
traditional RPC interface. Language drivers are available for Java (JDBC), Python
(DBAPI2), Node.JS (Helenus) and Go (gocql).
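The "tunable consistency" knob follows simple replica arithmetic: with replication factor N, a read at consistency level R and a write at level W are guaranteed to overlap (so reads see the latest acknowledged write) whenever R + W > N, with QUORUM = N/2 + 1 sitting in the middle. A sketch of that rule:

```python
# Replica-overlap arithmetic behind Cassandra-style tunable consistency.

def quorum(replication_factor: int) -> int:
    """Smallest strict majority of replicas."""
    return replication_factor // 2 + 1

def overlaps(replication_factor: int, read_level: int, write_level: int) -> bool:
    """True if every read is guaranteed to see the latest acknowledged write."""
    return read_level + write_level > replication_factor
```

With RF=3, QUORUM reads plus QUORUM writes (2 + 2 > 3) give strong consistency, while ONE/ONE (1 + 1) trades consistency for latency.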



Kafka

Kafka is a distributed, partitioned, replicated commit-log service. It provides
the functionality of a messaging system, but with a unique design.

• Kafka maintains feeds of messages in categories called topics.
• Processes that publish messages to a Kafka topic are called producers.
• Processes that subscribe to topics and process the feed of published
messages are called consumers.
• Kafka is run as a cluster comprising one or more servers, each of which is
called a broker.

So, at a high level, producers send messages over the network to the Kafka
cluster, which in turn serves them up to consumers.
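The topic/partition/offset model can be illustrated with a toy in-memory stand-in (pure Python for illustration, not the real Kafka client API):

```python
# Toy partitioned commit log illustrating Kafka's model: a topic is a set
# of partitions, each an append-only list; a message's offset is its
# position in its partition, and consumers read in offset order.

class TinyLog:
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, message):
        p = hash(key) % len(self.partitions)   # same key -> same partition
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        """Return all messages at or after `offset`, in log order."""
        return self.partitions[partition][offset:]

log = TinyLog()
p1, o1 = log.produce("orders", "M1")
p2, o2 = log.produce("orders", "M2")
```

Because both messages share a key, they land in the same partition with increasing offsets, mirroring Kafka's per-partition ordering guarantee.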



Kafka

At a high-level Kafka gives the following guarantees:


• Messages sent by a producer to a particular topic partition will be
appended in the order they are sent. That is, if a message M1 is sent by
the same producer as a message M2, and M1 is sent first, then M1 will
have a lower offset than M2 and appear earlier in the log.
• A consumer instance sees messages in the order they are stored in the
log.
• For a topic with replication factor N, we will tolerate up to N-1 server
failures without losing any messages committed to the log.

http://kafka.apache.org/

http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/

http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/
Storm
Apache Storm is a free and open source distributed realtime computation system. Storm
makes it easy to reliably process unbounded streams of data, doing for realtime
processing what Hadoop did for batch processing. Storm is simple, can be used with any
programming language, and is a lot of fun to use!

The core abstraction in Storm is the “stream”. A stream is an unbounded sequence of
tuples. Storm provides the primitives for transforming a stream into a new stream in a
distributed and reliable way.

The basic primitives Storm provides for doing stream transformations are “spouts” and
“bolts”.

Spouts and bolts have interfaces that you implement to run your application-specific logic.
A spout is a source of streams. For example, a spout may read tuples off of a Kafka queue
and emit them as a stream. Or a spout may connect to the Twitter API and emit a stream
of tweets.

A bolt consumes any number of input streams, does some processing, and possibly emits
new streams. Complex stream transformations require multiple steps and thus multiple
bolts. Bolts can do anything from running functions and filtering tuples to streaming
aggregations, streaming joins, talking to databases, and more.
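The spout/bolt pipeline can be mimicked with plain generators (an illustrative stand-in for the concepts, not the Storm API; the event lines are made up):

```python
# Generator-based stand-in for a Storm topology: a spout emits a stream
# of tuples; each bolt transforms one stream into a new stream.

def spout():
    """Source of a stream, e.g. lines read off a queue."""
    yield from ["sell-in 5", "sell-thru 3", "sell-out 2"]

def parse_bolt(stream):
    """Bolt: split each raw line into an (event, qty) tuple."""
    for line in stream:
        event, qty = line.split()
        yield event, int(qty)

def total_bolt(stream):
    """Bolt: streaming aggregation, emitting a running total of qty."""
    total = 0
    for _, qty in stream:
        total += qty
        yield total

# Wire the "topology": spout -> parse bolt -> aggregating bolt.
running_totals = list(total_bolt(parse_bolt(spout())))
```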



Storm: Key properties

 Extremely broad set of use cases: Storm can be used for processing messages and updating
databases (stream processing), doing a continuous query on data streams and streaming the results
into clients (continuous computation), parallelizing an intense query like a search query on the fly
(distributed RPC), and more. Storm’s small set of primitives satisfy a stunning number of use cases.
 Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have
to do is add machines and increase the parallelism settings of the topology. As an example of Storm’s
scale, one of Storm’s initial applications processed 1,000,000 messages per second on a 10 node
cluster, including hundreds of database calls per second as part of the topology. Storm’s usage of
Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
 Guarantees no data loss: A realtime system must have strong guarantees about data being
successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees
that every message will be processed, and this is in direct contrast with other systems like S4.
 Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage,
Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of
managing Storm clusters as painless as possible.
 Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as
necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
 Programming language agnostic: Robust and scalable realtime processing shouldn’t be limited to a
single platform. Storm topologies and processing components can be defined in any language, making
Storm accessible to nearly anyone.



Storm
Networks of spouts and bolts are packaged into a “topology”, which is the top-level
abstraction that you submit to Storm clusters for execution.

https://storm.apache.org/
https://storm.apache.org/documentation/Tutorial.html
https://github.com/hmsonline/storm-cassandra
http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/



Apache ZooKeeper



Problem Statement: Master-Worker Application in a Distributed System

 Problems
Master crashes
Worker crashes
Communication failures

 Problems for Master Crashes
Use a backup master
How is the latest state recovered?
The backup master may wrongly suspect that the primary master has crashed
 Problems for Worker Crashes
Master must detect worker crashes
Recover assigned tasks
 Problems for Communication Failures
Execute a same task only once



What is ZooKeeper

 An Open source, High Performance coordination service for distributed


applications
 Reliable, fault tolerant and highly-available
 Simple
• ZooKeeper allows distributed processes to coordinate with each other through
a shared hierarchical namespace which is organized similarly to a standard file
system.
 Replicated
• ZooKeeper itself is intended to be replicated over a set of hosts called an
ensemble
 Distributed Cluster Management
 Node Join/Leave
 Node Status in real time



ZooKeeper Service

 The servers that make up the ZooKeeper service must all know about each other
 As long as a majority of the servers are available, the ZooKeeper service will be available
 All machines store a copy of the data (in memory)
 A leader is elected on service start-up
 Clients connect to a single server and maintain a TCP connection.
A client can read from any server; writes go through the leader and need majority consensus.
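The majority rule is why ensembles are typically odd-sized: a 5-server ensemble tolerates 2 failures, while a 6-server ensemble still tolerates only 2. A sketch of that arithmetic:

```python
# ZooKeeper-style majority availability: the service stays up as long as
# a strict majority of the ensemble's servers are up and connected.

def majority(ensemble_size: int) -> int:
    return ensemble_size // 2 + 1

def is_available(ensemble_size: int, servers_up: int) -> bool:
    return servers_up >= majority(ensemble_size)

def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail before the service loses its majority."""
    return ensemble_size - majority(ensemble_size)
```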



Case Study

 Kafka uses Zookeeper for the following tasks:


• Detecting the addition and the removal of brokers and consumers,
• Triggering a rebalance process in each consumer when the above events happen, and
• Maintaining the consumption relationship and keeping track of the consumed offset of each partition

 Steps:
1. Unzip the Apache Kafka version 0.8.1.1
2. Apache Kafka requires zookeeper to be available and running. Start zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
3. Start Kafka Server using the following command
bin/kafka-server-start.sh config/server.properties
4. Create a topic in Apache Kafka using the following command
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1
--partitions 1 --topic test
5. Ask Zookeeper to list available topics on Apache Kafka.
bin/kafka-topics.sh --list --zookeeper localhost:2181
6. Send test messages to the Apache Kafka topic called test using the command-line producer
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Enter some messages, e.g. "Hello vulab", and press Enter
7. Use the command-line consumer to check for messages on the Apache Kafka topic called test
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning



Proof Of Concept: Inventory Tracking

[Diagram: CVIS and POS data arrive over RT (real-time) channels; an Integrate step and POS Sync feed
the Inventory Tracking process, which calculates Inventory Positions into the POS / Inventory STORE.
An Inventory API serves those positions to the Inventory Mgmt application.]


Deployment Architecture

(Node sizing appears to be cores x memory (GB) x disk (GB).)
- CVIS + CVIS DB: 4x16x500 (172.16.1.194); Inventory API: 4x16x10 (172.1.16.193)
- Shared File System
- Kafka Node 1 & Node 2, ZooKeeper Node 1 & Node 2: 2x4x100 each (172.16.1.195)
- Storm Node 1 & Node 2: 4x16x100 each (172.16.1.196)
- C* (Cassandra) Nodes 1–4: 2x4x300 each (172.16.1.197)


Teams

Sharath, Prasad M, Raju, Amar
Rafi, Soumya, Komali
DB, IT


Inventory Tracking: In-line Inventory Positions Calculations

[Diagram: on CVIS finalization, POS files land on the existing NAS; a new AMQ Listener publishes them
to a Kafka Topic. A Kafka Listener / Kafka Spout feeds the processing pipeline (File Reader → CSV
Reader / Json Parser → Calculate), which calculates and stores POS Inventory History and Inventory
Positions in Cassandra.]



Inventory Tracking: Inventory Positions Consumption

[Diagram: the Inventory Management App calls the new Inventory API over TLS, alongside the existing
data path; the API reads Inventory History, Latest Inventory Positions and Fiscal Week Inventory
Positions data from Cassandra.]
