201509
Course Chapters: Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Spark on a Cluster
Chapter Topics
Spark Shell vs. Spark Applications
The SparkContext
Python Example: WordCount
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: WordCount <file>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext()
    # Transformations are lazy: these lines only define the pipeline
    counts = sc.textFile(sys.argv[1]) \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda v1, v2: v1 + v2)
    # An action such as take() triggers the computation
    for pair in counts.take(5):
        print(pair)
    sc.stop()
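To make the data flow of the WordCount pipeline concrete, the sketch below mimics its three transformations in plain Python, with no Spark installation required. The helper name `word_count` and the sample lines are illustrative assumptions, not part of the course code, and this local version does not model the lazy, distributed execution of real RDDs.

```python
from collections import defaultdict


def word_count(lines):
    """Local stand-in for flatMap -> map -> reduceByKey (hypothetical helper)."""
    # flatMap: split every line into words and flatten the result
    words = [word for line in lines for word in line.split()]
    # map: pair each word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: combine the values of each key with v1 + v2
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    sample = ["the cat sat", "the dog sat"]  # made-up input
    print(word_count(sample))
```

In Spark the same three steps run partition by partition across the executors, and reduceByKey shuffles pairs so that equal keys meet on the same node.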
Scala Example: WordCount
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: WordCount <file>")
      System.exit(1)
    }
    // Settings such as the master are supplied by spark-submit
    val sc = new SparkContext()
    val counts = sc.textFile(args(0))
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(5).foreach(println) // action that triggers the job
    sc.stop()
  }
}
Building a Spark Application: Scala or Java
- Scala or Java Spark applications must be compiled and assembled into JAR files
  - JAR file will be passed to worker nodes
- Apache Maven is a popular build tool
  - For specific setting recommendations, see http://spark.apache.org/docs/latest/building-with-maven.html
- Build details will differ depending on
  - Version of Hadoop (HDFS)
  - Deployment platform (Spark Standalone, YARN, Mesos)
- Consider using an IDE
  - IntelliJ or Eclipse are two popular examples
  - Can run Spark locally in a debugger
Running a Spark Application
Spark Application Cluster Options
Supported Cluster Resource Managers
- Hadoop YARN
  - Included in CDH
  - Most common for production sites
  - Allows sharing cluster resources with other applications (e.g. MapReduce, Impala)
- Spark Standalone
  - Included with Spark
  - Easy to install and run
  - Limited configurability and scalability
  - Useful for testing, development, or small systems
- Apache Mesos
  - First platform supported by Spark
  - Now used less often
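Each resource manager above is selected through the master URL given to Spark. The sketch below shows the URL form for each one; the helper `master_url` and the host names are hypothetical, and the default ports (7077 for Spark Standalone, 5050 for Mesos) are the conventional defaults, which a given cluster may override. Spark releases contemporary with this course also accepted `yarn-client` and `yarn-cluster` as master values.

```python
def master_url(manager, host=None, port=None):
    """Return a --master value for a cluster manager (hypothetical helper)."""
    if manager == "yarn":
        # YARN locates the Resource Manager from the Hadoop configuration,
        # so the URL carries no host or port
        return "yarn"
    if manager == "standalone":
        return "spark://%s:%d" % (host, port or 7077)
    if manager == "mesos":
        return "mesos://%s:%d" % (host, port or 5050)
    if manager == "local":
        # Run in-process with one worker thread per available core
        return "local[*]"
    raise ValueError("unknown cluster manager: %r" % manager)


if __name__ == "__main__":
    for name in ("yarn", "standalone", "mesos", "local"):
        print(name, "->", master_url(name, host="masternode"))
```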
How Spark Runs on YARN: Client Mode (1)
[Diagram; labels: Resource Manager, Name Node, NodeManager, DataNode, Application Master, Executor]
How Spark Runs on YARN: Client Mode (2)
[Diagram; labels: Resource Manager, Name Node, NodeManager, DataNode, Application Master, Executor]
How Spark Runs on YARN: Client Mode (3)
[Diagram; labels: Resource Manager, Name Node, NodeManager, DataNode, Application Master, Executor]
How Spark Runs on YARN: Client Mode (4)
[Diagram; labels: Resource Manager, Name Node, NodeManager, DataNode, Application Master, Executor]
How Spark Runs on YARN: Cluster Mode (1)
[Diagram; labels: submit, Resource Manager, Name Node, NodeManager, DataNode, Application Master, Driver Program, SparkContext, Executor]
How Spark Runs on YARN: Cluster Mode (2)
[Diagram; labels: submit, Resource Manager, Name Node, NodeManager, DataNode, Application Master, Driver Program, SparkContext, Executor]
Dynamic Resource Allocation and Data Locality
Running a Spark Application Locally
Running a Spark Application on a Cluster
Starting the Spark Shell on a Cluster
Options when Submitting a Spark Application to a Cluster
The Spark Application Web UI
Accessing the Spark UI
Viewing Spark Job History (1)
Viewing Spark Job History (2)
Building and Running Scala Applications in the Homework Assignments
- Basic Maven projects are provided in the exercises/spark/projects directory
Project Directory Structure
    +countjpgs
      -pom.xml
      +src
        +main
          +scala
            +solution
      +target
        -countjpgs-1.0.jar
    $ mvn package
    $ spark-submit \
      --class solution.CountJPGs \
      target/countjpgs-1.0.jar \
      weblogs.*
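The spark-submit call above follows a fixed argument order: options first, then the application JAR, then the application's own arguments. The sketch below assembles that command line as a list. The helper `spark_submit_command` is a hypothetical name; `--class`, `--master`, and `--deploy-mode` are standard spark-submit options, and the JAR path, class name, and input argument are taken from the slide.

```python
import shlex


def spark_submit_command(jar, main_class, app_args, master=None, deploy_mode=None):
    """Build the argument list for a spark-submit call (hypothetical helper)."""
    cmd = ["spark-submit", "--class", main_class]
    if master is not None:
        cmd += ["--master", master]
    if deploy_mode is not None:
        cmd += ["--deploy-mode", deploy_mode]  # "client" or "cluster"
    # Options must come before the JAR; everything after it goes to the application
    cmd.append(jar)
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    cmd = spark_submit_command(
        "target/countjpgs-1.0.jar", "solution.CountJPGs", ["weblogs.*"]
    )
    # Shell-quoted form for display; pass the list itself to subprocess.run to execute
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when an argument contains spaces or wildcard characters.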
Homework: Write and Run a Spark Application
Spark Application Configuration
Declarative Configuration Options
- spark-submit script
  - e.g., spark-submit --driver-memory 500M
- Properties file
  - Tab- or space-separated list of properties and values
  - Load with spark-submit --properties-file filename
  - Example:
      spark.master     spark://masternode:7077
      spark.local.dir  /tmp
      spark.ui.port    4444
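To illustrate the properties file format described above, the sketch below reads tab- or space-separated name/value pairs into a dictionary. It is a simplified reading of the format for illustration only: the function name `parse_properties` is hypothetical, and Spark's own loader accepts more syntax than this sketch handles.

```python
def parse_properties(text):
    """Parse whitespace-separated 'name value' lines (simplified, hypothetical helper)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        # Split on the first run of spaces or tabs only, so values may contain spaces
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        props[key] = value
    return props


if __name__ == "__main__":
    example = (
        "spark.master\tspark://masternode:7077\n"
        "spark.local.dir /tmp\n"
        "spark.ui.port   4444\n"
    )
    for name, value in sorted(parse_properties(example).items()):
        print(name, "=", value)
```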
Setting Configuration Properties Programmatically
SparkConf Example (Python)
import sys
from pyspark import SparkContext
from pyspark import SparkConf

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: WordCount <file>", file=sys.stderr)
        sys.exit(-1)
    # Set the application name and the web UI port in code
    sconf = SparkConf() \
        .setAppName("Word Count") \
        .set("spark.ui.port", "4141")
    sc = SparkContext(conf=sconf)
    counts = sc.textFile(sys.argv[1]) \
        .flatMap(lambda line: line.split()) \
        .map(lambda w: (w, 1)) \
        .reduceByKey(lambda v1, v2: v1 + v2)
    # An action such as take() triggers the computation
    for pair in counts.take(5):
        print(pair)
    sc.stop()
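SparkConf Example (Scala)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: WordCount <file>")
      System.exit(1)
    }
    val sconf = new SparkConf().setAppName("Word Count")
      .set("spark.ui.port", "4141") // same settings as the Python example
    val sc = new SparkContext(sconf)
    val counts = sc.textFile(args(0)).flatMap(_.split("\\s+"))
      .map(word => (word, 1)).reduceByKey(_ + _)
    counts.take(5).foreach(println) // action that triggers the job
    sc.stop()
  }
}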
Viewing Spark Properties
Spark Logging
Spark Log Files (1)
Spark Log Files (2)
Configuring Spark Logging (1)
- Logging levels can be set for the cluster, for individual applications, or even for specific components or subsystems
- Default for machine: $SPARK_HOME/conf/log4j.properties
  - Start by copying log4j.properties.template
log4j.properties.template
    # Set everything to be logged to the console
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
Configuring Spark Logging (2)
my-working-directory/log4j.properties
    # Set everything to be logged to the console
    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
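The working-directory file differs from the template only in the level on the `log4j.rootCategory` line. As a minimal sketch of producing such a copy, the function below rewrites that level in the text of a properties file. The name `set_root_level` is hypothetical, and reading the template from disk and writing the result are left to the caller.

```python
import re

# Matches the level that follows "log4j.rootCategory=" at the start of a line
ROOT_LEVEL = re.compile(r"^(log4j\.rootCategory=)\w+", re.MULTILINE)


def set_root_level(properties_text, level):
    """Return the properties text with the root logging level replaced (hypothetical helper)."""
    # Keep the key and the appender list; swap only the level name
    return ROOT_LEVEL.sub(lambda m: m.group(1) + level, properties_text)


if __name__ == "__main__":
    template = (
        "# Set everything to be logged to the console\n"
        "log4j.rootCategory=INFO, console\n"
        "log4j.appender.console=org.apache.log4j.ConsoleAppender\n"
        "log4j.appender.console.target=System.err\n"
    )
    print(set_root_level(template, "DEBUG"))
```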
Essential Points (1)
Essential Points (2)
Homework: Configure Spark Applications