End to End Application for Monitoring Real-Time Uber Data Using Apache APIs, Part 2: Kafka and Spark Streaming
By Carol McDonald (/blog/author/carol-mcdonald/)
Part 1: Spark Machine Learning (https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/)
This post is the second part in a series where we will build a real-time example for
analysis and monitoring of Uber car GPS trip data (http://data.beta.nyc/dataset/uber-
trip-data-foiled-apr-sep-2014). If you have not already read the first part of this series
(/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-
kafka-api-part-1), you should read that first.
The first post discussed creating a machine learning model using Apache Spark’s K-
means algorithm to cluster Uber data based on location. This second post will discuss
using the saved K-means (http://spark.apache.org/docs/latest/ml-clustering.html#k-
means) model with streaming data to do real-time analysis of where and when Uber cars
are clustered.
The example data set is Uber trip data, which you can read more about in part 1 of this
series (/blog/monitoring-real-time-uber-data-using-spark-machine-learning-
streaming-and-kafka-api-part-1). The incoming data is in CSV format; an example, with the header, is shown below:
date/time, latitude,longitude,base
2014-08-01 00:00:00,40.729,-73.9422,B02598
The enriched data records are in JSON format. An example line is shown below:
{"dt":"2014-08-01 00:00:00","lat":40.729,"lon":-73.9422,"base":"B02598","cluster":7}
Spark Kafka Consumer Producer Code
A Scala Uber case class defines the schema corresponding to the CSV records. The
parseUber function parses the comma separated values into the Uber case class.
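The case class and parser described above can be sketched in plain Scala; the field names below follow the CSV header (date/time, latitude, longitude, base) and are an assumption, not necessarily the names in the original code:

```scala
// Schema for one Uber trip record, matching the CSV header.
case class Uber(dt: String, lat: Double, lon: Double, base: String)

// Parse one comma-separated line into an Uber record.
def parseUber(str: String): Uber = {
  val p = str.split(",")
  Uber(p(0), p(1).toDouble, p(2).toDouble, p(3))
}
```

For the sample line above, `parseUber("2014-08-01 00:00:00,40.729,-73.9422,B02598")` yields `Uber("2014-08-01 00:00:00", 40.729, -73.9422, "B02598")`.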
These are the basic steps for the Spark Streaming Consumer Producer code:
1. Configure Kafka Consumer Producer properties.
2. Initialize a Spark StreamingContext object. Using this context, create a DStream
which reads messages from a topic.
3. Apply transformations (which create new DStreams).
4. Write messages from the transformed DStream to a Topic.
5. Start receiving data and processing. Wait for the processing to be stopped.
We will go through each of these steps with the example application code.
1. Configure Kafka Consumer Producer properties
The first step is to set the KafkaConsumer and KafkaProducer configuration properties,
which will be used later to create a DStream for receiving/sending messages to topics.
You need to set the following parameters:
For more information on the configuration parameters, see the MapR Streams
documentation
(/documentation/v5.2.0/MapR_Streams/differences_in_configuration_parameters_for_prod
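As a rough sketch, the consumer and producer properties can be held in simple maps; the property names below follow the standard Kafka API, but the broker address and group id are placeholders, not the values from the original code:

```scala
// Hypothetical Kafka configuration (placeholder broker and group id).
val brokers = "localhost:9092"
val groupId = "uber-stream-group"

// Consumer side: how to reach the cluster and decode message keys/values.
val consumerProps = Map[String, String](
  "bootstrap.servers" -> brokers,
  "group.id" -> groupId,
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "auto.offset.reset" -> "earliest"
)

// Producer side: how to encode the enriched messages we send back out.
val producerProps = Map[String, String](
  "bootstrap.servers" -> brokers,
  "key.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer" -> "org.apache.kafka.common.serialization.StringSerializer"
)
```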
We use the DStream foreachRDD method to apply processing to each RDD in this
DStream. We parse the message values into Uber objects with the map operation on the DStream. Then we convert the RDD to a DataFrame, which allows us to use DataFrame and SQL operations on the streaming data.
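A sketch of this step, assuming an input DStream of String message values named `messages` has already been created from the topic, and reusing the `Uber` case class and `parseUber` parser described earlier (names are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// For each micro-batch RDD in the DStream:
messages.foreachRDD { rdd =>
  // Obtain (or reuse) the SparkSession so we can convert RDDs to DataFrames.
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  // Parse each CSV message value into an Uber object ...
  val uberRdd = rdd.map(parseUber)
  // ... then convert to a DataFrame, enabling DataFrame and SQL operations.
  val df = uberRdd.toDF()
  df.show()
}
```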
Here is example output from the df.show:
A VectorAssembler is used to transform and return a new DataFrame with the latitude
and longitude feature columns in a vector column.
Then the model is used to get the clusters from the features with the model transform
method, which returns a DataFrame with the cluster predictions.
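Assuming `df` is the DataFrame of parsed records and the K-means model from part 1 was saved to a path like the placeholder below, the feature-assembly and clustering steps might look like this:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeansModel

// Put the latitude and longitude columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("lat", "lon"))
  .setOutputCol("features")
val featureDF = assembler.transform(df)

// Load the K-means model saved in part 1 (the path is a placeholder) and
// apply it; transform adds a "prediction" column holding the cluster id.
val model = KMeansModel.load("/path/to/saved/kmeans-model")
val clusterDF = model.transform(featureDF)

// Register as a table so it can be queried with SQL.
clusterDF.createOrReplaceTempView("uber")
```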
The DataFrame is then registered as a table so that it can be used in SQL statements.
The output of the SQL query is shown below:
Example message values (the output of temp.take(2)) are shown below:
{"dt":"2014-08-01
00:00:00","lat":40.729,"lon":-73.9422,"base":"B02598","cluster":7}
{"dt":"2014-08-01
00:00:00","lat":40.7406,"lon":-73.9902,"base":"B02598","cluster":7}
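A minimal sketch of how such a JSON message value could be produced from an enriched record, using hand-rolled string formatting (the original code may use a JSON library; the case class and field names are assumptions mirroring the JSON keys above):

```scala
// Hypothetical enriched record: the parsed trip plus its cluster id.
case class UberCluster(dt: String, lat: Double, lon: Double,
                       base: String, cluster: Int)

// Format one record as a JSON message value like the examples above.
def toJson(u: UberCluster): String =
  s"""{"dt":"${u.dt}","lat":${u.lat},"lon":${u.lon},"base":"${u.base}","cluster":${u.cluster}}"""
```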
5. Start receiving data and processing it. Wait for the processing to be stopped.
To start receiving data, we must explicitly call start() on the StreamingContext, then call
awaitTermination to wait for the streaming computation to finish.
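Assuming the StreamingContext is named `ssc`, this final step is just:

```scala
// Nothing is consumed until start() is called; awaitTermination blocks
// the driver until the streaming computation is stopped or fails.
ssc.start()
ssc.awaitTermination()
```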
Now we can query the streaming data to ask questions like: which clusters had the highest number of pickups? (Output is shown in a Zeppelin notebook):
df.groupBy("cluster").count().show()
or
Which hours of the day and which cluster had the highest number of pickups?
%sql select cluster, dt, count(cluster) as count from uber group by dt, cluster
order by dt, cluster
Software
You can download the complete code, data, and instructions to run this example
from here (https://github.com/caroljmcdonald/mapr-sparkml-streaming-uber).
This example runs on MapR 5.2 with Spark 2.0.1. If you are running on the MapR
v5.2 Sandbox (/products/mapr-sandbox-hadoop), you need to upgrade Spark to
2.0.1 (MEP 2.0). For more information on upgrading, see here
(/documentation/v5.2.0/UpgradeGuide/UpgradingEcoPacks.html) and here
(/documentation/v5.2.0/Spark/Spark_IntegrateMapRStreams.html).
Summary
In this blog post, you learned how to use a Spark machine learning model in a Spark
Streaming application, and how to integrate Spark Streaming with MapR Streams to
consume and produce messages using the Kafka API.