January 2015
Agenda
Zyme and most other enterprises are awash with ever-growing data of all types, and Zyme needs to
handle exponential growth in data VOLUMES as it:
- adds many, many more customers and partners
- collects sales data from additional tiers (Tier 2, Tier 3, …)
- collects sales data not just at SKU level but at serial number level
Zyme provides rich interfaces for analytics and reporting on "structured data", but Zyme also has lots of
unstructured data, for example:
- raw data reported by partners in many different formats
- logs containing valuable metadata as the data is processed through various sub-systems
Can this data, available in so many VARIETIES, be mined to extract valuable insights for
both Zyme and our customers?
[Diagram: POS, STORE, and INV (inventory) data feeds flowing into reporting (R) and tracking (T) components]
Background to POC: Goods Movement with Invoice #s
[Diagram: goods movement across the channel. Disti 1, Disti 2, and Disti 3 hold serialized stock of SKU-X, SKU-Y, and SKU-Z (e.g., SKU-X: SN-1, SN-2, SN-3; SKU-Z: SN-111, SN-222, SN-333, SN-444); invoices move individual serial #s downstream to Reseller 1, Reseller 2, and Reseller 3 (e.g., Invoice 1 moves SKU-Y: SN-11, and SKU-X: SN-2 moves to Reseller 3), so each serial # can be traced to its current location.]
All SKUs
- Given a Partner (identified by ZID, LOCID and/or PartnerCode), return all SKUs with
available quantity associated with that Partner
- Given a Partner and a SKU, return available quantity
- Given a Partner, a SKU and a key field (like Invoice #), return available quantity
- Given a Partner and a SKU, return history of “events” which resulted in the current
calculated quantity since the last “reset/opening balance” event
- Given a Partner, a SKU and a key field (like Invoice #), return history as above
Serialized SKUs
- Given a Partner and a SKU, return available Serial #s
- Given a Partner, a SKU and a key field (like Invoice #), return available Serial #s
- Given a Partner, SKU and Serial #, check whether the Serial # was previously sold by the
Partner (it is no longer available with the Partner, but was it sold by the Partner
earlier? This also answers whether the sale was already reported, i.e., a dup check)
- Given a Serial #, fetch the corresponding SKU code
- Given a Serial #, fetch the corresponding SKU code and current location (Partner)
- Given a (CVIS) Transaction ID, fetch the corresponding Serial #s
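A minimal sketch of how the two lists above could be expressed as a service interface. All names and signatures here are illustrative assumptions, not an existing Zyme API:

import java.util.Date;
import java.util.List;
import java.util.Map;

// Hypothetical interface covering the lookups listed above.
public interface InventoryQueryService {

    // All SKUs
    Map<String, Long> availableQuantityBySku(PartnerKey partner);
    long availableQuantity(PartnerKey partner, String sku);
    long availableQuantity(PartnerKey partner, String sku, String invoiceNo);
    List<InventoryEvent> historySinceLastReset(PartnerKey partner, String sku);
    List<InventoryEvent> historySinceLastReset(PartnerKey partner, String sku, String invoiceNo);

    // Serialized SKUs
    List<String> availableSerials(PartnerKey partner, String sku);
    List<String> availableSerials(PartnerKey partner, String sku, String invoiceNo);
    boolean wasPreviouslySold(PartnerKey partner, String sku, String serialNo); // dup check
    String skuForSerial(String serialNo);
    SerialLocation skuAndLocationForSerial(String serialNo);
    List<String> serialsForTransaction(String cvisTxnId);
}

// A Partner is identified by ZID, LOCID and/or PartnerCode.
class PartnerKey { String zid; String locId; String partnerCode; }
class InventoryEvent { String eventType; Date occurredAt; long quantityDelta; }
class SerialLocation { String sku; PartnerKey currentPartner; }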
[Table: time-stamped bucket snapshots. Each snapshot carries CREATED/MODIFIED timestamps (5/3 10:15 through 5/7 10:15) and records, per Bucket ID (I1, I2, I3) and SKU (e.g., SKU1), the Serial #s held with a Status flag; e.g., SN1 and SN2 in bucket I1 have status A in the 5/3 and 5/4 snapshots and status U by 5/6, while SN5 and SN6 in bucket I2 stay at status A. Bucket I3 carries a Serial # (ABR) column.]
[Diagram: Inventory Mgmt API, reconstructed as text]
GET
- Quick SISO Check (POS): given a Partner, a SKU (and BucketID), return available quantity and check if it is within threshold
- Reported Inventory Check (INV): given a Partner, a SKU (and BucketID), return available quantity and check if it is within threshold
- Serial # validations (e.g., dup Serial # check): given a Partner, SKU and Serial #, check if the Serial # was previously sold by the Partner; given a Serial #, fetch the corresponding SKU code (and current location (Partner))
- History / Positions / Configs: given a Partner, get all SKUs with available quantity; given a Partner, a SKU (and BucketID), return available quantity or "history"; given a Partner/SKU/Fiscal Date, return balance
- CVIS Post-Finalization: given a (CVIS) Txn ID, fetch the corresponding Serial #s
POST
- Initialize Balances: reported inventory; Reset Balance Adjustments / Opening Balance Adjustments
- Report Goods Movement: Sell-In, POS (ST/SO), Returns
- Report Purges: purges
Feed-Out and Feeds connect the Inventory Mgmt API and Inventory Mgmt App to the surrounding systems: ZGW, TruePay, TDAWB, ZymeNet, POW, and Exceptions Feeds.
Performance considerations
Cassandra offers robust support for clusters spanning multiple data centers, with
asynchronous masterless replication allowing low-latency operations for all clients.
Tables may be created, dropped, and altered at runtime without blocking updates and
queries.
Cassandra does not support joins or subqueries, except for batch analysis
via Hadoop. Rather, Cassandra emphasizes denormalization through features like
collections.
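A sketch of that denormalization style applied to this POC's serial # lookups: one table per query path, with writes going to both, instead of a join. This uses the DataStax Java driver; the keyspace and table names are assumptions, not an existing Zyme schema:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Sketch: denormalized Cassandra tables for serial # lookups.
public class SerialLookupSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS poc WITH replication = "
                + "{'class':'SimpleStrategy','replication_factor':1}");
        // One table per query: serials by (partner, sku), and sku/location by serial.
        session.execute("CREATE TABLE IF NOT EXISTS poc.serials_by_partner_sku ("
                + "partner_id text, sku text, serial_no text, status text, "
                + "PRIMARY KEY ((partner_id, sku), serial_no))");
        session.execute("CREATE TABLE IF NOT EXISTS poc.location_by_serial ("
                + "serial_no text PRIMARY KEY, sku text, partner_id text)");

        // Writes hit both tables (denormalization instead of a join).
        session.execute("INSERT INTO poc.serials_by_partner_sku "
                + "(partner_id, sku, serial_no, status) VALUES ('Disti1','SKU-X','SN-1','A')");
        session.execute("INSERT INTO poc.location_by_serial "
                + "(serial_no, sku, partner_id) VALUES ('SN-1','SKU-X','Disti1')");

        // "Given a Serial #, fetch the corresponding SKU code and current location".
        Row row = session.execute(
                "SELECT sku, partner_id FROM poc.location_by_serial WHERE serial_no='SN-1'").one();
        System.out.println(row.getString("sku") + " @ " + row.getString("partner_id"));
        cluster.close();
    }
}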
• http://cassandra.apache.org/
• http://planetcassandra.org/
• http://projects.spring.io/spring-data-cassandra/
Kafka
At a high level, producers send messages over the network to the Kafka
cluster, which in turn serves them up to consumers.
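As a sketch, sending a message with the 0.8-era Java producer API might look like this (the broker address and topic match the setup steps later in this deck; the class name is hypothetical):

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

// Minimal sketch of a Kafka 0.8.x producer sending to the "test" topic.
// Assumes a broker on localhost:9092, as in the setup steps in this deck.
public class PosEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        // Each message lands on a partition of the topic and is served to consumers.
        producer.send(new KeyedMessage<String, String>("test", "Hello vulab"));
        producer.close();
    }
}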
http://kafka.apache.org/
http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/
http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/
Storm
Apache Storm is a free and open source distributed realtime computation system. Storm
makes it easy to reliably process unbounded streams of data, doing for realtime
processing what Hadoop did for batch processing. Storm is simple, can be used with any
programming language, and is a lot of fun to use!
The basic primitives Storm provides for doing stream transformations are “spouts” and
“bolts”.
Spouts and bolts have interfaces that you implement to run your application-specific logic.
A spout is a source of streams. For example, a spout may read tuples off of a Kafka queue
and emit them as a stream. Or a spout may connect to the Twitter API and emit a stream
of tweets.
A bolt consumes any number of input streams, does some processing, and possibly emits
new streams. Complex stream transformations require multiple steps and thus multiple
bolts. Bolts can run functions, filter tuples, do streaming aggregations, do
streaming joins, talk to databases, and more.
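A minimal sketch of a spout and bolt in Storm's 0.9-era Java API: the spout emits a stream of lines and the bolt uppercases each one. All class names here are illustrative, not part of the POC:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class UppercaseTopology {

    // A spout is a source of streams; this one just keeps emitting the same line.
    public static class LineSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            collector.emit(new Values("hello storm"));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // A bolt consumes input streams and may emit new ones; this one uppercases.
    public static class UppercaseBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getStringByField("line").toUpperCase()));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("upper"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("lines", new LineSpout());
        builder.setBolt("upper", new UppercaseBolt()).shuffleGrouping("lines");
        new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
    }
}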
Extremely broad set of use cases: Storm can be used for processing messages and updating
databases (stream processing), doing a continuous query on data streams and streaming the results
into clients (continuous computation), parallelizing an intense query like a search query on the fly
(distributed RPC), and more. Storm's small set of primitives satisfies a stunning number of use cases.
Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have
to do is add machines and increase the parallelism settings of the topology (see the sketch after this
list). As an example of Storm's scale, one of Storm's initial applications processed 1,000,000 messages
per second on a 10-node cluster, including hundreds of database calls per second as part of the
topology. Storm's use of ZooKeeper for cluster coordination lets it scale to much larger cluster sizes.
Guarantees no data loss: A realtime system must have strong guarantees about data being
successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees
that every message will be processed, and this is in direct contrast with other systems like S4.
Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage,
Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of
managing Storm clusters as painless as possible.
Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as
necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
Programming language agnostic: Robust and scalable realtime processing shouldn’t be limited to a
single platform. Storm topologies and processing components can be defined in any language, making
Storm accessible to nearly anyone.
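As a sketch of what "increase the parallelism settings" means in code, reusing the hypothetical spout and bolt from the earlier sketch:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

// Sketch: scaling a topology by raising parallelism hints and the worker count;
// Storm spreads the extra executors across the cluster.
public class ScaledTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("lines", new UppercaseTopology.LineSpout(), 2);   // 2 spout executors
        builder.setBolt("upper", new UppercaseTopology.UppercaseBolt(), 8) // 8 bolt executors
               .shuffleGrouping("lines");

        Config conf = new Config();
        conf.setNumWorkers(4); // spread the executors across 4 worker JVMs
        new LocalCluster().submitTopology("demo-scaled", conf, builder.createTopology());
    }
}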
https://storm.apache.org/
https://storm.apache.org/documentation/Tutorial.html
https://github.com/hmsonline/storm-cassandra
http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
ZooKeeper
Problems it addresses:
- Master crashes
- Worker crashes
- Communication failures
How the service works:
- The servers that make up the ZooKeeper service must all know about each other.
- As long as a majority of the servers are available, the ZooKeeper service will be available.
- All machines store a copy of the data (in memory).
- A leader is elected on service start-up.
- Each client connects to a single server and maintains a TCP connection.
- A client can read from any server; writes go through the leader and need majority consensus.
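A sketch of that client behavior using the standard ZooKeeper Java client (the ensemble addresses and znode path are arbitrary):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: a client connects to one server from the ensemble list and keeps a
// single TCP session; writes route through the leader, reads are served locally.
public class ZkSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181", 3000, null);

        // Write: forwarded to the leader, committed once a majority acknowledges.
        zk.create("/poc-flag", "on".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read: answered by whichever server this client happens to be connected to.
        byte[] data = zk.getData("/poc-flag", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}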
Steps:
1. Unzip the Apache Kafka 0.8.1.1 distribution.
2. Apache Kafka requires ZooKeeper to be available and running. Start ZooKeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
3. Start the Kafka server using the following command:
bin/kafka-server-start.sh config/server.properties
4. Create a topic in Apache Kafka using the following command:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
5. Ask ZooKeeper to list the available topics on Apache Kafka:
bin/kafka-topics.sh --list --zookeeper localhost:2181
6. Send test messages to the Apache Kafka topic called "test" using the command-line producer:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Enter some messages like "Hello vulab" and press Enter.
7. Use the command-line consumer to check for messages on the Apache Kafka topic called "test":
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
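The same check can also be done programmatically; a sketch using the 0.8 high-level Java consumer (the group id is arbitrary):

import java.util.Collections;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

// Sketch: reading the "test" topic with the Kafka 0.8 high-level consumer.
// The ZooKeeper address matches the setup steps above.
public class TestTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "vulab-demo");
        props.put("auto.offset.reset", "smallest"); // like --from-beginning

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        KafkaStream<byte[], byte[]> stream = connector
                .createMessageStreams(Collections.singletonMap("test", 1))
                .get("test").get(0);

        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
            System.out.println(new String(it.next().message()));
        }
    }
}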
[Diagram: integrating POS Sync with Inventory Tracking (IT); multiple POS feeds flow through reporting (R) and tracking (T) components into a shared DB.]
[Diagram: POC pipeline. POS files (POS-1, POS-2) arrive via the existing NAS/AMQ path; a new Kafka listener publishes them to a Kafka topic. A Kafka spout feeds a topology (File Reader, CSV Reader, JSON Parser) that calculates POS inventory history and inventory positions and stores them in Cassandra, alongside the existing data.]
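A sketch of how this pipeline could be wired with the storm-kafka spout and the DataStax driver. The topic name, CSV layout, and the poc.positions counter table are all assumptions for illustration:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

// Sketch of the POC pipeline: Kafka spout -> parse POS line -> store in Cassandra.
public class PosPipelineTopology {

    public static class StoreInCassandraBolt extends BaseBasicBolt {
        private transient Session session;

        public void prepare(Map conf, TopologyContext context) {
            session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("poc");
        }

        public void execute(Tuple input, BasicOutputCollector collector) {
            // Assumed CSV layout: partner_id,sku,quantity.
            // Assumes poc.positions has a counter column "quantity" keyed by (partner_id, sku).
            String[] f = input.getStringByField("str").split(",");
            session.execute("UPDATE positions SET quantity = quantity + " + Long.parseLong(f[2])
                    + " WHERE partner_id = '" + f[0] + "' AND sku = '" + f[1] + "'");
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {}
    }

    public static void main(String[] args) {
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("localhost:2181"), "pos-feed", "/kafka-pos", "pos-reader");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka", new KafkaSpout(spoutConfig));
        builder.setBolt("store", new StoreInCassandraBolt()).shuffleGrouping("kafka");
        new LocalCluster().submitTopology("pos-pipeline", new Config(), builder.createTopology());
    }
}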