APACHE SPARK

CONTENTS

• WHY APACHE SPARK?
• ABOUT APACHE SPARK
• HOW TO INSTALL APACHE SPARK
• HOW APACHE SPARK WORKS
• RESILIENT DISTRIBUTED DATASET
• DATAFRAMES
• RDD PERSISTENCE
• SPARK SQL
• SPARK STREAMING

WRITTEN BY ASHWINI KUNTAMUKKALA, SOFTWARE ARCHITECT, SCISPIKE
UPDATED BY TIM SPANN, BIG DATA SOLUTIONS ENGINEER, HORTONWORKS
WHY APACHE SPARK?

Apache Spark has become the engine to enhance many of the capabilities of the ever-present Apache Hadoop environment. For Big Data, Apache Spark meets a lot of needs and runs natively on Apache Hadoop's YARN. By running Apache Spark in your Apache Hadoop environment, you gain all the security, governance, and scalability inherent to that platform. Apache Spark is also extremely well integrated with Apache Hive and gains access to all your Apache Hadoop tables utilizing integrated security.

Apache Spark has begun to really shine in the areas of streaming data processing and machine learning. With first-class support of Python as a development language, PySpark allows data scientists, engineers, and developers to develop and scale machine learning with ease. One of the features that has expanded this is the support for Apache Zeppelin notebooks to run Apache Spark jobs for exploration, data cleanup, and machine learning. Apache Spark also integrates with other important streaming tools in the Apache Hadoop space, namely Apache NiFi and Apache Kafka. I like to think of Apache Spark + Apache NiFi + Apache Kafka as the three amigos of Apache Big Data ingest and streaming. The latest version of Apache Spark is 2.2.

ABOUT APACHE SPARK

Apache Spark is an open-source, Hadoop-compatible, fast and expressive cluster-computing data processing engine. It was created at AMPLab in UC Berkeley as part of the Berkeley Data Analytics Stack (BDAS). It is a top-level Apache project. The figure below shows the various components of the current Apache Spark stack.

It has six major benefits:

1. Lightning speed of computation because data are loaded in distributed memory (RAM) over a cluster of machines. Data can be quickly transformed iteratively and cached on demand for subsequent usage.

2. Highly accessible through standard APIs built in Java, Scala, Python, R, and SQL (for interactive queries), and a rich set of machine learning libraries available out of the box.

3. Compatibility with existing Hadoop 2.x (YARN) ecosystems, so companies can leverage their existing infrastructure.

4. Convenient download and installation processes, plus a convenient shell (REPL: Read-Eval-Print-Loop) to interactively learn the APIs.

5. Enhanced productivity due to high-level constructs that keep the focus on the content of computation.

6. Multiple user notebook environments supported by Apache Zeppelin.

Also, Spark is implemented in Scala, which means the code is very succinct and fast and requires the JVM to run.

HOW TO INSTALL APACHE SPARK

The following table lists a few important links and prerequisites:

Current Release: 2.2.0 @ apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Downloads Page: spark.apache.org/downloads.html
JDK Version (Required): 1.8 or higher
Scala Version (Required): 2.11 or higher
Python (Optional): [2.7, 3.5)
Simple Build Tool (Required): scala-sbt.org
Development Version: github.com/apache/spark
Building Instructions: spark.apache.org/docs/latest/building-spark.html
Maven (Required): 3.3.9 or higher
Hadoop + Spark Installation: docs.hortonworks.com/HDPDocuments/Ambari-2.6.0.0/bk_ambari-installation/content/ch_Getting_Ready.html
Apache Spark can be configured to run standalone or on Hadoop 2 YARN. Apache Spark requires moderate skills in Java, Scala, or Python. Here, we will see how to install and run Apache Spark in the standalone configuration.

1. Install JDK 1.8+, Scala 2.11+, Python 3.5+, and Apache Maven.

2. Download the Apache Spark 2.2.0 release.

3. Untar and unzip spark-2.2.0.tgz in a specified directory.

4. Go to the directory and run the Maven build to build Apache Spark:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -DskipTests clean package

5. Launch the Apache Spark standalone REPL. For Scala, use:

./spark-shell

For Python, use:

./pyspark

6. Go to the Spark UI at http://localhost:4040.

This is a good quick start, but I recommend utilizing a Sandbox or an available Apache Zeppelin notebook to begin your exploration of Apache Spark.
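To confirm the install, you can run a small job from the shell started in step 5. The snippet below is a minimal Scala sketch; it relies only on the sc handle that spark-shell provides.

// Quick sanity check inside ./spark-shell.
val nums  = sc.parallelize(1 to 100)      // distribute the numbers 1..100
val evens = nums.filter(_ % 2 == 0)       // transformation: lazily selects even numbers
println(evens.count())                    // action: triggers the job and prints 50

While the job runs, it will also show up in the Spark UI at localhost:4040.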
HOW APACHE SPARK WORKS

The Apache Spark engine provides a way to process data in distributed memory over a cluster of machines. The figure below shows a logical diagram of how a typical Spark job processes information.

RESILIENT DISTRIBUTED DATASET

The core concept in Apache Spark is the resilient distributed dataset (RDD). It is an immutable distributed collection of data, which is partitioned across machines in a cluster. It facilitates two types of operations: transformations and actions. A transformation is an operation such as filter(), map(), or union() on an RDD that yields another RDD. An action is an operation such as count(), first(), take(n), or collect() that triggers a computation, returns a value back to the Driver program, or writes to a stable storage system like Apache Hadoop HDFS. Transformations are lazily evaluated in that they don't run until an action warrants it. The Apache Spark Driver remembers the transformations applied to an RDD, so if a partition is lost (say, a worker machine goes down), that partition can easily be reconstructed on some other machine in the cluster. That is why it is called "Resilient."
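To make the lazy evaluation and lineage concrete, here is a small Scala sketch; the input path and the ERROR filter are illustrative assumptions.

// Each transformation only records lineage; no data is read yet.
val logLines = sc.textFile("/tmp/sample.log")               // hypothetical input path
val errors   = logLines.filter(_.contains("ERROR"))         // still lazy
val tagged   = errors.map(line => (line.split(" ")(0), 1))  // still lazy

// The action below triggers the whole chain. If a partition is lost,
// Spark replays textFile/filter/map for just that partition's input split.
println(tagged.count())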
The following snippet shows how to load a text file in Python using the Spark 2 PySpark shell; some of the transformation examples below build on it.

%spark2.pyspark
guten = spark.read.text('/load/55973-0.txt')

COMMONLY USED TRANSFORMATIONS

filter(func)
Purpose: Create a new RDD by selecting the data elements for which func returns true.
Example:
shinto = guten.filter(guten.value.contains("Shinto"))

map(func)
Purpose: Return a new RDD by applying func to each data element.
Example:
val rdd = sc.parallelize(List(1,2,3,4,5))
val times2 = rdd.map(_*2)
times2.collect()
Result:
Array[Int] = Array(2, 4, 6, 8, 10)

flatMap(func)
Purpose: Similar to map, but func returns a sequence instead of a single value. For example, mapping a sentence into a sequence of words.
Example:
val rdd = sc.parallelize(List("Spark is awesome","It is fun"))
val fm = rdd.flatMap(str => str.split(" "))
fm.collect()
Result:
Array[String] = Array(Spark, is, awesome, It, is, fun)

reduceByKey(func, [numTasks])
Purpose: Aggregate the values of a key using a function. "numTasks" is an optional parameter to specify the number of reduce tasks.
Example:
val word1 = fm.map(word => (word,1))
val wrdCnt = word1.reduceByKey(_+_)
wrdCnt.collect()
Result:
Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))

groupByKey([numTasks])
Purpose: Convert (K,V) to (K, Iterable<V>).
Example:
val cntWrd = wrdCnt.map{ case (word, count) => (count, word) }
cntWrd.groupByKey().collect()
Result:
Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))


distinct([numTasks])
Purpose: Eliminate duplicates from the RDD.
Example:
fm.distinct().collect()
Result:
Array[String] = Array(is, It, awesome, Spark, fun)

COMMONLY USED SET OPERATIONS

union()
Purpose: Create a new RDD containing all elements from the source RDD and the argument RDD.
Example:
val rdd1 = sc.parallelize(List('A','B'))
val rdd2 = sc.parallelize(List('B','C'))
rdd1.union(rdd2).collect()
Result:
Array[Char] = Array(A, B, B, C)

cartesian()
Purpose: Create a new RDD with the cross product of all elements from the source RDD and the argument RDD.
Example:
rdd1.cartesian(rdd2).collect()
Result:
Array[(Char, Char)] = Array((A,B), (A,C), (B,B), (B,C))

subtract()
Purpose: Create a new RDD by removing the data elements in the source RDD that are in common with the argument RDD.
Example:
rdd1.subtract(rdd2).collect()
Result:
Array[Char] = Array(A)

join(RDD, [numTasks])
Purpose: When invoked on (K,V) and (K,W), this operation creates a new RDD of (K, (V,W)).
Example:
val personFruit = sc.parallelize(Seq(("Andy", "Apple"), ("Bob", "Banana"), ("Charlie", "Cherry"), ("Andy", "Apricot")))
val personSE = sc.parallelize(Seq(("Andy", "Google"), ("Bob", "Bing"), ("Charlie", "Yahoo"), ("Bob", "AltaVista")))
personFruit.join(personSE).collect()
Result:
Array[(String, (String, String))] = Array((Andy,(Apple,Google)), (Andy,(Apricot,Google)), (Charlie,(Cherry,Yahoo)), (Bob,(Banana,Bing)), (Bob,(Banana,AltaVista)))

cogroup(RDD, [numTasks])
Purpose: When invoked on (K,V) and (K,W), creates a new RDD of (K, (Iterable<V>, Iterable<W>)).
Example:
personFruit.cogroup(personSE).collect()
Result:
Array[(String, (Iterable[String], Iterable[String]))] = Array((Andy,(ArrayBuffer(Apple, Apricot),ArrayBuffer(Google))), (Charlie,(ArrayBuffer(Cherry),ArrayBuffer(Yahoo))), (Bob,(ArrayBuffer(Banana),ArrayBuffer(Bing, AltaVista))))

For a more detailed list of transformations, please refer to spark.apache.org/docs/latest/programming-guide.html#transformations.

COMMONLY USED ACTIONS

count()
Purpose: Get the number of data elements in the RDD.
Example:
val rdd = sc.parallelize(List('A','B','c'))
rdd.count()
Result:
Long = 3

collect()
Purpose: Get all the data elements in the RDD as an array.
Example:
val rdd = sc.parallelize(List('A','B','c'))
rdd.collect()
Result:
Array[Char] = Array(A, B, c)

reduce(func)
Purpose: Aggregate the data elements in the RDD using a function that takes two arguments and returns one.
Example:
val rdd = sc.parallelize(List(1,2,3,4))
rdd.reduce(_+_)
Result:
Int = 10

take(n)
Purpose: Fetch the first n data elements in the RDD, computed by the driver program.
Example:
val rdd = sc.parallelize(List(1,2,3,4))
rdd.take(2)
Result:
Array[Int] = Array(1, 2)


foreach(func)
Purpose: Execute a function for each data element in the RDD. Usually used to update an accumulator (discussed later) or to interact with external systems.
Example:
val rdd = sc.parallelize(List(1,2,3,4))
rdd.foreach(x => println("%s*10=%s".format(x, x*10)))
Result:
1*10=10 4*10=40 3*10=30 2*10=20

first()
Purpose: Retrieve the first data element in the RDD. Similar to take(1).
Example:
val rdd = sc.parallelize(List(1,2,3,4))
rdd.first()
Result:
Int = 1

saveAsTextFile(path)
Purpose: Write the content of the RDD to a text file, or a set of text files, on the local file system or HDFS.
Example:
val hamlet = sc.textFile("/users/akuntamukkala/temp/gutenburg.txt")
hamlet.filter(_.contains("Shakespeare")).saveAsTextFile("/users/akuntamukkala/temp/filtered")
Result:
akuntamukkala@localhost~/temp/filtered$ ls
_SUCCESS part-00000 part-00001

For a more detailed list of actions, please refer to spark.apache.org/docs/latest/programming-guide.html#actions.
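Putting several of the transformations and actions above together, the classic word count reads naturally in Scala; the input and output paths below are placeholders.

// Word count: transformations build the plan, actions run it.
val text   = sc.textFile("/tmp/input.txt")            // placeholder input path
val counts = text.flatMap(_.split("\\s+"))            // split lines into words
                 .filter(_.nonEmpty)
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)                   // still lazy at this point

counts.take(5).foreach(println)                       // action: peek at a few (word, count) pairs
counts.saveAsTextFile("/tmp/word-counts")             // action: write all results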

RDD PERSISTENCE

One of the key capabilities in Apache Spark is persisting/caching an RDD in cluster memory. This speeds up iterative computation. The following table shows the various options Spark provides:

MEMORY_ONLY (default level)
Stores the RDD in available cluster memory as deserialized Java objects. Some partitions may not be cached if there is not enough cluster memory; those partitions will be recalculated on the fly as needed.

MEMORY_AND_DISK
Stores the RDD as deserialized Java objects. If the RDD does not fit in cluster memory, the remaining partitions are stored on disk and read as needed.

MEMORY_ONLY_SER
Stores the RDD as serialized Java objects (one byte array per partition). This is more CPU intensive but saves memory, as it is more space efficient. Some partitions may not be cached; those will be recalculated on the fly as needed.

MEMORY_AND_DISK_SER
Same as above, except that disk is used when memory is not sufficient.

DISK_ONLY
Stores the RDD only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Same as the other levels, but partitions are replicated on two cluster nodes.

OFF_HEAP (experimental)
Works off of the JVM heap and must be enabled.

The above storage levels can be accessed through the persist() operation on an RDD. The cache() operation is a convenient way of specifying the MEMORY_ONLY option. The SER options do not work with Python.

For a more detailed list of persistence options, please refer to spark.apache.org/docs/latest/programming-guide.html#rdd-persistence.

Spark uses the Least Recently Used (LRU) algorithm to remove old, unused cached RDDs to reclaim memory. It also provides a convenient unpersist() operation to force removal of cached/persisted RDDs.
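As a minimal sketch of how persistence is used in practice (the path and storage level are illustrative), an RDD that feeds several actions is persisted once and released when done:

import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("/tmp/access.log")            // illustrative path
val errors = logs.filter(_.contains("ERROR"))

// Cache the filtered RDD so the two actions below don't re-read the file.
errors.persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count())                 // first action: computes and caches the partitions
errors.take(10).foreach(println)        // second action: served from the cache

errors.unpersist()                      // force removal of the cached partitions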


DATAFRAMES

printSchema()
Purpose: Print out the schema of the DataFrame.
Example:
df.printSchema()
Result:
root
 |-- clientIp: string (nullable = true)
 |-- clientIdentity: string (nullable = true)
 |-- user: string (nullable = true)
 |-- dateTime: string (nullable = true)
 |-- request: string (nullable = true)
 |-- statusCode: integer (nullable = true)
 |-- bytesSent: long (nullable = true)
 |-- referer: string (nullable = true)
 |-- userAgent: string (nullable = true)

collect()
Purpose: Return all the records as a list of rows.
Example:
df.collect()
Result:
Row(value=u'The Project Gutenberg EBook of Shinto: The ancient religion of Japan, by ')

columns
Purpose: Return all the column names as a list.
Example:
df.columns
Result:
['userAgent', 'referer', 'bytesSent']

count()
Purpose: Return the number of rows.
Example:
df.count()
Result:
2

createTempView()
Purpose: Create a local temporary view that can be used in Spark SQL.
Example:
df.createTempView("viewName")

A DataFrame is a distributed collection of data with named columns, built on the Dataset interface. You can learn more here: spark.apache.org/docs/latest/sql-programming-guide.html.
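The DataFrame operations listed above can be tried against a small in-memory DataFrame; the Scala sketch below uses made-up column names and rows.

import spark.implicits._                // spark is the SparkSession provided by the shell

val df = Seq(
  ("Mozilla/5.0", "example.com", 1024L),
  ("curl/7.54",   "example.org", 2048L)
).toDF("userAgent", "referer", "bytesSent")

df.printSchema()                        // three named columns with inferred types
println(df.columns.mkString(", "))      // userAgent, referer, bytesSent
println(df.count())                     // 2
df.createTempView("requests")           // now queryable: spark.sql("SELECT * FROM requests")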

SHARED VARIABLES

ACCUMULATORS
Accumulators are variables that can be incremented in distributed tasks.

exampleAccumulator = sparkContext.accumulator(1)
exampleAccumulator.add(5)

BROADCAST VARIABLES
Using the SparkContext, you can broadcast a read-only value to other tasks. You can set, destroy, and unpersist these values.

broadcastVariable = sparkContext.broadcast(500)
broadcastVariable.value
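The snippets above use a generic sparkContext handle; a runnable Scala sketch of both shared-variable types, using the Spark 2 accumulator API and illustrative data and paths, might look like this:

// Accumulator: incremented inside tasks, read back on the driver.
val badRecords = sc.longAccumulator("badRecords")
val lines = sc.textFile("/tmp/input.csv")              // illustrative path
lines.foreach { line =>
  if (line.split(",").length < 3) badRecords.add(1)    // count malformed rows
}
println(badRecords.value)

// Broadcast variable: read-only lookup data shipped once to each executor.
val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))
val resolved = sc.parallelize(Seq("US", "DE")).map(code => countryNames.value(code))
resolved.collect().foreach(println)
countryNames.destroy()                                 // release it when no longer needed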
SPARK SQL

Spark SQL provides a convenient way to run interactive queries over large data sets using the Apache Spark engine, returning DataFrames. Spark SQL provides two types of contexts, SQLContext and HiveContext, that extend SparkContext functionality. SQLContext provides access to a simple SQL parser, whereas HiveContext provides access to the HiveQL parser. HiveContext enables enterprises to leverage their existing Hive infrastructure.

Let's see a simple example in Scala:

val df = spark.read.csv("customers.txt")
df.createOrReplaceTempView("customers")
val dfS = spark.sql("select * from customers where gender='M'")
dfS.printSchema()
dfS.show()

Here's one in Python for Apache Hive:

spark = SparkSession.builder.appName("dzone1").config("spark.sql.warehouse.dir", "/mydata").enableHiveSupport().getOrCreate()
spark.sql("SELECT * FROM default.myHiveTable")

For more practical examples using SQL and HiveQL, please refer to the following link: spark.apache.org/docs/latest/sql-programming-guide.html.

SPARK STREAMING

Spark Streaming provides a scalable, fault-tolerant, efficient way of processing streaming data using Spark's simple programming model. It converts streaming data into "micro" batches, which enable Spark's batch programming model to be applied in streaming use cases. This unified programming model makes it easy to combine batch and interactive data processing with streaming.

The core abstraction in Spark Streaming is the Discretized Stream (DStream). A DStream is a sequence of RDDs. Each RDD contains data received in a configurable interval of time.

Spark Streaming also provides sophisticated window operators, which help with running efficient computation on a collection of RDDs in a rolling window of time. A DStream exposes an API containing operators (transformations and output operators) that are applied on the constituent RDDs. Let's try to understand this using a simple example:

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName("appName").setMaster("masterNode")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

The above snippet sets up the Spark Streaming context. Spark Streaming will create an RDD in the DStream containing text network streams retrieved every second.

There are many commonly used source data streams for Spark Streaming, including Apache Kafka, Apache HDFS, Twitter, Apache NiFi S2S, Amazon S3, and Amazon Kinesis.
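To see a DStream pipeline end to end, here is a minimal word-count sketch reading from a socket (the host, port, and batch interval are arbitrary choices); note that nothing runs until ssc.start() is called.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

// One RDD of text lines every 5 seconds, e.g. fed by `nc -lk 9999`.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()              // output operator: print a few pairs from each batch

ssc.start()                 // start receiving and processing
ssc.awaitTermination()      // block until the streaming job is stopped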


Commonly used DStream transformations:

map(func)
Purpose: Create a new DStream by applying this function to all constituent RDDs in the DStream.
Example:
lines.map(x => x.toInt*10).print()
nc -lk 9999 input:
12
34
Output:
120
340

flatMap(func)
Purpose: The same as map, but the mapping function can output zero or more items.
Example:
lines.flatMap(_.split(" ")).print()
nc -lk 9999 input:
Spark is fun
Output:
Spark
is
fun

count()
Purpose: Create a DStream of RDDs containing the count of the number of data elements.
Example:
lines.flatMap(_.split(" ")).count().print()
nc -lk 9999 input:
say hello to spark
Output:
4

transform(func)
Purpose: Create a new DStream by applying an RDD-to-RDD transformation to all RDDs in the DStream (see the combined batch and streaming example below).

reduce(func)
Purpose: Same as count, but the value is derived by applying the function.
Example:
lines.map(x => x.toInt).reduce(_+_).print()
nc -lk 9999 input:
1
3
5
7
Output:
16

countByValue()
Purpose: Count the number of occurrences of each distinct element.
Example:
lines.countByValue().print()
nc -lk 9999 input:
spark
spark
is
fun
fun
Output:
(is,1)
(spark,2)
(fun,2)

reduceByKey(func, [numTasks])
Purpose: Aggregate the values of each key using the given function.
Example:
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_+_)
wordCounts.print()
nc -lk 9999 input:
spark is fun
fun
fun
Output:
(is,1)
(spark,1)
(fun,3)

The following example shows how Apache Spark combines Spark batch with Spark Streaming. This is a powerful capability for an all-in-one technology stack. In this example, we read a file containing brand names and filter those streaming data sets that contain any of the brand names in the file.

val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val brands = sc.textFile("/tmp/names.txt")
lines.transform(rdd => {
  rdd.intersection(brands)
}).print()

brandNames.txt:
coke
nike
sprite
reebok

nc -lk 9999 input:
msft
apple
sprite
nike
ibm
Output:
sprite
nike

COMMON WINDOW OPERATIONS

window(windowLength, slideInterval)
Purpose: Return a new DStream computed from windowed batches of the source DStream.
Example:
val win = lines.window(Seconds(30), Seconds(10))
win.foreachRDD(rdd => {
  rdd.foreach(x => println(x + " "))
})
nc -lk 9999 input:
10 (0th second)
20 (10 seconds later)
30 (20 seconds later)
40 (30 seconds later)
Output:
10
10 20
20 10 30
20 30 40 (drops 10)

countByWindow(windowLength, slideInterval)
Purpose: Return a new sliding window count of the elements in a stream.
Example:
lines.countByWindow(Seconds(30), Seconds(10)).print()
nc -lk 9999 input:
10 (0th second)
20 (10 seconds later)
30 (20 seconds later)
40 (30 seconds later)
Output:
1
2
3
3

For additional transformation operators, please refer to spark.apache.org/docs/latest/streaming-programming-guide.html#transformations.
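One more commonly used window operator is reduceByKeyAndWindow, which keeps a rolling per-key aggregate; the sketch below assumes the same socket-based lines DStream used above.

// Rolling word counts over the last 30 seconds, recomputed every 10 seconds.
val pairs    = lines.flatMap(_.split(" ")).map(word => (word, 1))
val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b,
                                          Seconds(30), Seconds(10))
windowed.print()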


Spark Streaming has powerful output operators. We already saw foreachRDD() in the above example. For others, please refer to spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations.

Structured Streaming has been added to Apache Spark and allows for continuous, incremental execution of a structured query. A few input sources are supported, including files, Apache Kafka, and sockets. Structured Streaming supports windowing and other advanced streaming features. When streaming from files, it is recommended that you supply a schema as opposed to letting Apache Spark infer one for you. This is similar to most streaming systems, like Apache NiFi and Hortonworks Streaming Analytics Manager.

val sStream = spark.readStream.json("myJson")
sStream.isStreaming
sStream.printSchema

For more details on Structured Streaming, please refer to spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
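As a slightly fuller sketch of the Structured Streaming model, the word count below reads an unbounded table of lines from a socket source and continuously updates an aggregation; the source, host, and port are illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Unbounded DataFrame of lines arriving on the socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split into words and maintain a running count per word.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously write the updated counts to the console.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()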

ADDITIONAL RESOURCES

• docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_spark-component-guide/content/ch_introduction-spark.html
• hortonworks.com/tutorial/a-lap-around-apache-spark
• hortonworks.com/products/sandbox
• hortonworks.com/tutorial/hands-on-tour-of-apache-spark-in-5-minutes
• hortonworks.com/tutorial/sentiment-analysis-with-apache-spark
• spark.apache.org/downloads.html
• spark.apache.org/docs/latest
• spark.apache.org/docs/latest/quick-start.html
• zeppelin.apache.org
• jaceklaskowski.gitbooks.io/mastering-apache-spark

Written by Tim Spann
Tim Spann is a Big Data Solutions Engineer. He helps educate and disseminate performant open source solutions for Big Data initiatives to customers and the community. With over 15 years of experience in various technical leadership, architecture, sales engineering, and development roles, he is well-experienced in all facets of Big Data, cloud, IoT, and microservices. As part of his community efforts, he also runs the Future of Data Meetup in Princeton.

DZone, Inc.
150 Preston Executive Dr., Cary, NC 27513
888.678.0399 | 919.678.0300

DZone communities deliver over 6 million pages each month to more than 3.3 million software developers, architects, and decision makers. DZone offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code, and more. "DZone is a developer's dream," says PC Magazine.

Copyright © 2017 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.