
End to End Application for Monitoring Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase – Part 2: Kafka and Spark Streaming


Contributed by Carol McDonald (/blog/author/carol-mcdonald/)


Editor's Note: This is a 4-Part Series, see the previously published posts below:

Part 1 – Spark Machine Learning (https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/)

Part 3 – Real-Time Dashboard Using Vert.x (https://mapr.com/blog/monitoring-uber-with-spark-streaming-kafka-and-vertx/)

This post is the second part in a series where we will build a real-time example for
analysis and monitoring of Uber car GPS trip data (http://data.beta.nyc/dataset/uber-
trip-data-foiled-apr-sep-2014). If you have not already read the first part of this series
(/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-
kafka-api-part-1), you should read that first.

The first post discussed creating a machine learning model using Apache Spark’s K-
means algorithm to cluster Uber data based on location. This second post will discuss
using the saved K-means (http://spark.apache.org/docs/latest/ml-clustering.html#k-
means) model with streaming data to do real-time analysis of where and when Uber cars
are clustered.

Example Use Case: Real-Time Analysis of Geographically Clustered Vehicles/Items

The following figure depicts the architecture for the data pipeline:
1. Uber trip data is published to a MapR Streams topic using the Kafka API.
2. A Spark Streaming application subscribed to the first topic:
   1. Ingests a stream of Uber trip events.
   2. Identifies the location cluster corresponding to the latitude and longitude of the Uber trip.
   3. Adds the cluster location to the event and publishes the results in JSON format to another topic.
3. A Spark Streaming application subscribed to the second topic:
   1. Analyzes the Uber trip location clusters that are popular by date and time.

Example Use Case Data

The example data set is Uber trip data, which you can read more about in part 1 of this series (/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1). The incoming data is in CSV format; an example with the header is shown below:

date/time, latitude,longitude,base
2014-08-01 00:00:00,40.729,-73.9422,B02598

The enriched data records are in JSON format. An example line is shown below:

{"dt":"2014-08-01 00:00:00","lat":40.729,"lon":-73.9422,"base":"B02598","cluster":7}
Spark Kafka Consumer Producer Code

Parsing the Data Set Records

A Scala Uber case class defines the schema corresponding to the CSV records. The
parseUber function parses the comma separated values into the Uber case class.
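A minimal sketch of that case class and parser is shown below (the field names dt, lat, lon, and base are assumptions that follow the JSON records shown later in this post; the code in the linked project may differ slightly):

// Schema for one CSV trip record, e.g. "2014-08-01 00:00:00,40.729,-73.9422,B02598"
case class Uber(dt: String, lat: Double, lon: Double, base: String)

// Parse a comma-separated line into an Uber object
def parseUber(str: String): Uber = {
  val p = str.split(",")
  Uber(p(0), p(1).toDouble, p(2).toDouble, p(3))
}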

Loading the K-Means Model

The Spark KMeansModel (https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.ml.clustering.KMeansModel)
class is used to load the saved K-means model fitted on the historical Uber trip data.
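A minimal sketch of loading the model is shown below (the model path is a placeholder; use wherever the model was saved in Part 1):

import org.apache.spark.ml.clustering.KMeansModel

// Load the K-means model that was fitted and saved in Part 1 (placeholder path)
val model = KMeansModel.load("/user/user01/data/savemodel")

// Print the cluster centers (latitude/longitude vectors)
model.clusterCenters.foreach(println)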
Output of model clusterCenters:

Below, the cluster centers are displayed on a Google Map:

Spark Streaming Code

These are the basic steps for the Spark Streaming Consumer Producer code:
1. Configure Kafka Consumer Producer properties.
2. Initialize a Spark StreamingContext object. Using this context, create a DStream which reads messages from a Topic.
3. Apply transformations (which create new DStreams).
4. Write messages from the transformed DStream to a Topic.
5. Start receiving data and processing. Wait for the processing to be stopped.

We will go through each of these steps with the example application code.
1. Configure Kafka Consumer Producer properties

The first step is to set the KafkaConsumer and KafkaProducer configuration properties,
which will be used later to create a DStream for receiving/sending messages to topics.
You need to set the following parameters:

Key and value deserializers: for deserializing the message.
Auto offset reset: to start reading from the earliest or latest message.
Bootstrap servers: this can be set to a dummy host:port, since the broker address is not actually used by MapR Streams.

For more information on the configuration parameters, see the MapR Streams
documentation
(/documentation/v5.2.0/MapR_Streams/differences_in_configuration_parameters_for_prod
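As a rough sketch, the consumer and producer properties could be set up as maps like the ones below (the group id, serializer classes, and dummy broker address are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer

// Consumer properties: deserializers, offset reset, and a dummy bootstrap server
val kafkaParams = Map[String, Object](
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "uberGroup",             // placeholder consumer group id
  "auto.offset.reset" -> "earliest",     // start from the earliest message
  "bootstrap.servers" -> "maprdemo:9092" // dummy host:port; not used by MapR Streams
)

// Producer properties: serializers and the same dummy bootstrap server
val producerParams = Map[String, String](
  "key.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
  "bootstrap.servers" -> "maprdemo:9092"
)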

2. Initialize a Spark StreamingContext object.


ConsumerStrategies.Subscribe, as shown below, is used to set the topics and Kafka
configuration parameters. We use the KafkaUtils createDirectStream
(http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html) method
with a StreamingContext, the consumer and location strategies, to create an input
stream from a MapR Streams topic. This creates a DStream that represents the stream
of incoming data, where each message is a key value pair. We use the DStream map
transformation to create a DStream with the message values.
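A sketch of that setup, assuming the consumer properties from step 1 and a placeholder topic name and batch interval:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

// Streaming context built from the application's SparkConf, with a 5-second batch interval
val ssc = new StreamingContext(sparkConf, Seconds(5))

// Subscribe to the first topic with the Kafka consumer properties from step 1
val topics = Array("/user/user01/stream:ubers") // placeholder MapR Streams topic
val consumerStrategy = ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)

// Direct stream of (key, value) messages
val messages = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent, consumerStrategy)

// Keep just the CSV value string from each message
val valuesDStream = messages.map(record => record.value())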

3. Apply transformations (which create new DStreams)

We use the DStream foreachRDD method to apply processing to each RDD in this
DStream. We parse the message values into Uber objects, with the map operation on the
DStream. Then we convert the RDD to a DataFrame, which allows you to use
DataFrames and SQL operations on streaming data.
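Roughly, with the parseUber function and the DStream from the previous step (the val names are illustrative):

import org.apache.spark.sql.SparkSession

// Parse each CSV message value into an Uber object
val uberDStream = valuesDStream.map(parseUber(_))

// Process each batch RDD: convert it to a DataFrame for DataFrame and SQL operations
uberDStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    val df = rdd.toDF()
    df.show()
    // feature assembly, clustering, and publishing (sketched below) happen here
  }
}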
Here is example output from the df.show:

A VectorAssembler is used to transform and return a new DataFrame with the latitude
and longitude feature columns in a vector column.
Then the model is used to get the clusters from the features with the model transform
method, which returns a DataFrame with the cluster predictions.
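Sketched below, continuing inside the foreachRDD block above (the input column names follow the Uber case class; the model adds a prediction column holding the cluster id):

import org.apache.spark.ml.feature.VectorAssembler

// Put the latitude and longitude columns into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("lat", "lon"))
  .setOutputCol("features")
val df2 = assembler.transform(df)

// Apply the loaded K-means model; the "prediction" column holds the cluster id
val categories = model.transform(df2)
categories.show()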

The output of categories.show is below:

The DataFrame is then registered as a table so that it can be used in SQL statements.
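For example (the view name and the query are illustrative):

// Register the enriched DataFrame as a temporary view for SQL queries
categories.createOrReplaceTempView("uber")

// Example query: trip counts by hour and cluster
spark.sql("SELECT hour(uber.dt) as hr, cluster, count(cluster) as ct FROM uber group by hour(uber.dt), cluster").show()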
The output of the SQL query is shown below:

4. Write messages from the transformed DStream to a Topic


The Dataset result of the query is converted to JSON RDD Strings, then the RDD
sendToKafka method is used to send the JSON key-value messages to a topic (the key is
null in this case).
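A rough sketch is shown below. Note that sendToKafka is an RDD method provided by the MapR Spark/Kafka producer integration rather than by stock Spark; the output topic name and the producerConf argument are placeholders:

// Convert the enriched rows to JSON strings
val temp = categories.toJSON.rdd

// Publish the JSON values (with a null key) to the second topic
// (placeholder topic name; producerConf stands for the producer configuration from step 1)
temp.sendToKafka("/user/user01/stream:uberp", producerConf)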

Example message values (the output for temp.take(2) ) are shown below:

{"dt":"2014-08-01
00:00:00","lat":40.729,"lon":-73.9422,"base":"B02598","cluster":7}

{"dt":"2014-08-01
00:00:00","lat":40.7406,"lon":-73.9902,"base":"B02598","cluster":7}
5. Start receiving data and processing it. Wait for the processing to be stopped.

To start receiving data, we must explicitly call start() on the StreamingContext, then call
awaitTermination to wait for the streaming computation to finish.
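For example:

// Start the streaming computation and block until it is stopped or fails
ssc.start()
ssc.awaitTermination()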

Spark Kafka Consumer Code


Next, we will go over some of the Spark streaming code which consumes the JSON-
enriched messages.

We specify the schema with a Spark StructType (http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema):
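A sketch of that schema, with the field names taken from the sample JSON messages above (the types are assumptions):

import org.apache.spark.sql.types._

// Schema for the enriched JSON trip messages
val schema = StructType(Array(
  StructField("dt", TimestampType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("base", StringType, true),
  StructField("cluster", IntegerType, true)
))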
Below is the code for:

Creating a Direct Kafka Stream
Converting the JSON message values to Dataset[Row] using spark.read.json (http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets) with the schema
Creating two temporary views for subsequent SQL queries
Using ssc.remember to cache data for queries
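Putting those pieces together, a sketch of the consumer could look like the following (the topic name, batch interval, remember duration, and the second view name are assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(sparkConf, Seconds(5))
// Keep recent batches in memory so they remain available to interactive queries
ssc.remember(Minutes(60))

// Direct Kafka stream subscribed to the second (enriched) topic
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Array("/user/user01/stream:uberp"), kafkaParams))

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()

    // Convert the JSON message values to Dataset[Row] using the schema defined above
    val df = spark.read.schema(schema).json(rdd.map(record => record.value()))

    // Two temporary views for the SQL queries that follow
    df.createOrReplaceTempView("uber")
    df.groupBy("cluster").count().createOrReplaceTempView("uber_counts")
  }
}

ssc.start()
ssc.awaitTermination()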

Now we can query the streaming data to ask questions like: which hours had the highest
number of pickups? (Output is shown in a Zeppelin notebook):

spark.sql("SELECT hour(uber.dt) as hr,count(cluster) as ct FROM uber group By


hour(uber.dt)")
How many pickups occurred in each cluster?

df.groupBy("cluster").count().show()

or

spark.sql("select cluster, count(cluster) as count from uber group by cluster")

Which hours of the day and which cluster had the highest number of pickups?

spark.sql("SELECT hour(uber.dt) as hr,count(cluster) as ct FROM uber group By


hour(uber.dt)")
Display datetime and cluster counts for Uber trips:

%sql select cluster, dt, count(cluster) as count from uber group by dt, cluster order by dt, cluster

Software
You can download the complete code, data, and instructions to run this example
from here (https://github.com/caroljmcdonald/mapr-sparkml-streaming-uber).
This example runs on MapR 5.2 with Spark 2.0.1. If you are running on the MapR
v5.2 Sandbox (/products/mapr-sandbox-hadoop), you need to upgrade Spark to
2.0.1 (MEP 2.0). For more information on upgrading, see: here
(/documentation/v5.2.0/UpgradeGuide/UpgradingEcoPacks.html) and here
(/documentation/v5.2.0/Spark/Spark_IntegrateMapRStreams.html).

Summary
In this blog post, you learned how to use a Spark machine learning model in a Spark
Streaming application, and how to integrate Spark Streaming with MapR Streams to
consume and produce messages using the Kafka API.

References and More Information:


Integrate Spark with MapR Streams Documentation
(/documentation/v5.2.0/Spark/Spark_IntegrateMapRStreams.html)
Free Online training on MapR Streams, Spark at learn.mapr.com
(http://learn.mapr.com/)
Apache Spark Streaming Programming Guide
(http://spark.apache.org/docs/latest/streaming-programming-guide.html)
Real-Time Streaming Data Pipelines with Apache APIs: Kafka, Spark Streaming,
and HBase (/blog/real-time-streaming-data-pipelines-apache-apis-kafka-spark-
streaming-and-hbase)
Apache Kafka and MapR Streams: Terms, Techniques and New Designs

This blog post was published January 05, 2017.
