
Apache Spark: Going Beyond

MapReduce
Heinz Analytics Club Seminar
Abhinav Maurya
Carnegie Mellon University
www.andrew.cmu.edu/user/amaurya/docs/spark_talk

Outline

Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips

Spark Installation

https://spark.apache.org/downloads.html

Spark Installation
Extract the compressed folder spark-2.0.1-bin-hadoop2.7
From a terminal, go to spark-2.0.1-bin-hadoop2.7/bin
Run pyspark
Run rdd = sc.parallelize([1, 2, 3]); rdd.map(lambda x: x*x).collect()
Get the result [1, 4, 9]
It's that easy!

Spark Installation
The steps above may not work for Windows users
Two choices:
Build from source
Download a bigger zip (1.01 GB) from http://training.databricks.com/workshop/usb.zip

Outline

Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips

Elephant in the Room: MapReduce

[Diagram: a chain of MapReduce (MR) jobs, with data written to and read from HDFS between every step]

Hadoop does disk I/O for every MapReduce step
Slow like an elephant!

Why Apache Spark?


[Diagram: data loaded from HDFS/TXT flows through a chain of transformations (T), producing successive RDDs]
In memory!

Why Apache Spark?

[Chart: Logistic Regression in Hadoop and Spark; y-axis: execution time (s)]

Why Apache Spark?

[Chart: Searching error logs for keywords; x-axis: % of working set in cache (cache disabled, 25%, 50%, 75%, fully cached)]

Why Apache Spark?

Distributed cross-validation in Apache Spark for tuning hyperparameters in deep learning

Why Apache Spark?

[Diagram: the Spark stack — libraries (Spark SQL, MLlib, GraphX, SparkR) built on the Spark interface (Scala, Python, Java), which runs on the Spark execution engine]

Outline

Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips

RDD Computation Model


[Diagram: a chain of RDDs linked by transformations, ending in an action that produces a value]

RDD = Resilient Distributed Dataset
Transformations convert one RDD into another
No actual calculation
Actions force calculation of a result (a value)
Lazy evaluation
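A minimal sketch of lazy evaluation in the pyspark shell (assuming sc is the SparkContext, as above):
### transformations only record lineage; nothing runs yet
>>>rdd = sc.parallelize(range(1, 8))
>>>squared = rdd.map(lambda x: x*x)
>>>evens = squared.filter(lambda x: x % 2 == 0)
### an action forces the chain to execute and returns a value to the driver
>>>evens.collect()
???[4, 16, 36]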

Distributed Data and Computation


[Diagram: elements 1, 2, 3 partitioned into 2 partitions distributed across workers]

RDD = Resilient Distributed Dataset

RDD Computation Model


[Diagram: RDDs R0–R4 connected by operators into a lineage DAG]

Operators on RDDs form a directed acyclic graph (DAG)
If a partition on a dead worker is lost, it can be recomputed by retracing the operator DAG.
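As a quick sketch, an RDD's lineage can be inspected in the pyspark shell with the standard toDebugString() method (output format varies by Spark version):
>>>rdd = sc.parallelize(range(10), 2)
>>>mapped = rdd.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
>>>mapped.toDebugString()
### shows the chain of RDDs Spark would replay to rebuild a lost partition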

Python Stuff You Need To Know


>>>m = [0, 1, 2, 3, 4, 5]
>>>n = list(range(6))   ### list(...) needed in Python 3; in Python 2, range() already returns a list
>>>m == n
???True
>>>record = ('Brian', 23, 'CMU', 'Pittsburgh')
>>>record[1]
???23

Spark Context
Your link to the YARN/Mesos cluster
Helps you bring RDDs to life
>>>rdd = sc.textFile(filepath)
>>>rdd = sc.parallelize(my_list)

RDD Transformations: map


>>>rdd = sc.parallelize(range(1,8))
>>>result_rdd = rdd.map(lambda x: x % 3)
### elements 1..7 map to [1, 2, 0, 1, 2, 0, 1]

RDD Actions: reduce


>>>from operator import add
>>>rdd = sc.parallelize(range(1,8))
>>>result = rdd.reduce(add)
### reduce is an action: it combines the elements 1..7 and returns the value 28

RDD Transformations: reduceByKey


>>>from operator import add
>>>rdd = sc.parallelize([('Alice', 23), ('Bob', 17), ('Alice', 27)])
>>>result_rdd = rdd.reduceByKey(add)
### input pairs: (Alice, 23), (Alice, 27), (Bob, 17)
### result pairs: (Alice, 50), (Bob, 17)

More RDD Transformations


rdd.filter(f): Keep elements that satisfy f
rdd.filter(lambda x: x%2==0)
rdd.keyBy(f): Create a pair RDD by applying f to each element to get its key
rdd2 = rdd.keyBy(lambda x: len(x))
rdd3 = rdd2.keys().histogram(10)
rdd.cache(): Cache the RDD in memory for repeated use
rdd.flatMap(f): Apply f to each element and concatenate the resulting lists
rdd2 = rdd.flatMap(lambda x: x.split())
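A small sketch chaining these transformations in the pyspark shell (the sample sentences are made up for illustration):
>>>rdd = sc.parallelize(['spark is fast', 'hadoop writes to disk'])
>>>words = rdd.flatMap(lambda x: x.split())
>>>long_words = words.filter(lambda x: len(x) > 4).cache()
>>>long_words.keyBy(lambda x: len(x)).collect()
???[(5, 'spark'), (6, 'hadoop'), (6, 'writes')]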

RDD Actions
rdd.glom().collect(): Return the contents of each RDD partition as a list of lists
rdd = sc.parallelize([0,1,2], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.glom().collect(): [[0], [1], [4]]
rdd.collect(): Return RDD contents as a list
rdd = sc.parallelize([0,1,2], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.collect(): [0, 1, 4]
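A few more common RDD actions, sketched in the pyspark shell:
>>>rdd = sc.parallelize([3, 1, 4, 1, 5])
>>>rdd.count()
???5
>>>rdd.first()
???3
>>>rdd.take(2)
???[3, 1]
>>>rdd.sum()
???14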

Outline

Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips

Hello WordCount!

Apply to the PNAS titles dataset at http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/

Hello WordCount!
>>>import fileinput
>>>lines = []
>>>filepath = 'pnas_titles.txt'
>>>for line in fileinput.input(filepath):
...    lines.append(line)
<WordCount Code in Spark>

Hello WordCount!
>>>rdd_lines = sc.parallelize(lines, num_workers)   ### num_workers = desired number of partitions
>>>rdd_result = rdd_lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
>>>rdd_result = rdd_result.sortBy(lambda x: x[1], False).keys()
>>>rdd_result.collect()[0:5]
???['of', 'the', 'in', 'and', 'a']
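Equivalently, Spark can read the file directly, skipping the fileinput step; a minimal sketch assuming pnas_titles.txt is readable from the driver:
>>>rdd_lines = sc.textFile('pnas_titles.txt')
>>>rdd_counts = rdd_lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
>>>rdd_counts.takeOrdered(5, key=lambda x: -x[1])
### returns the 5 most frequent (word, count) pairs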

Outline

Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips

Movie Recommendation
Download movielens_ratings.txt from http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/
Data format: user::movie::rating::time
We need only the first three fields
Around 6000 users, 6000 movies, and a million ratings

Movie Recommendation by Matrix Factorization

[Diagram: matrix factorization — the (users x movies) ratings matrix is approximated by the product of a (users x factors) matrix and a (factors x movies) matrix]
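As a toy sketch of the idea (rank-2 factors with made-up numbers): a user's predicted rating for a movie is the dot product of that user's factor row and the movie's factor column:
>>>user_factors = [1.0, 0.5]    ### one row of the (users x factors) matrix
>>>movie_factors = [4.0, 1.0]   ### one column of the (factors x movies) matrix
>>>sum(u*m for u, m in zip(user_factors, movie_factors))
???4.5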

Movie Recommendation
### Import the required packages
>>>from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
>>>from pyspark import SparkContext, SparkConf
### Create the context (skip these two lines inside the pyspark shell, where sc already exists)
>>>conf = SparkConf().setAppName('movielensrecommendation').setMaster('local')
>>>sc = SparkContext(conf=conf)
### Load and parse the data
>>>data = sc.textFile('movielens_ratings.txt')

Movie Recommendation
>>>ratings = data.map(lambda l: l.split('::')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
### Build the recommendation model using Alternating Least Squares
>>>rank = 10
>>>numIterations = 20
>>>model = ALS.train(ratings, rank, numIterations)

Movie Recommendation
### Evaluate the model on the training data
>>>testdata = ratings.map(lambda p: (p[0], p[1]))
>>>predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
>>>ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
>>>MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).reduce(lambda x, y: x + y) / ratesAndPreds.count()
>>>print("Mean Squared Error = " + str(MSE))

Outline

Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips

Performing DataFrame Queries


### necessary imports
>>>from pyspark.sql import SQLContext, Row
>>>sqlContext = SQLContext(sc)
### create dataframe
>>>df = sqlContext.read.json("people.json")
### explore dataframe
>>>df.show()
>>>df.printSchema()
>>>df.select("name").show()
>>>df.select(df['name'], df['age'] + 1).show()
>>>df.filter(df['age'] > 21).show()
>>>df.groupBy("age").count().show()

Performing SQL Queries


### RDD of Rows
>>>lines = sc.textFile("people.txt")
>>>parts = lines.map(lambda l: l.split(","))
>>>people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
### Create DataFrame and register as a table with sqlContext
>>>schemaPeople = sqlContext.createDataFrame(people)
>>>schemaPeople.registerTempTable("people")
### execute SQL query
>>>teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
### print results (in Spark 2.x, go through .rdd to map over DataFrame rows)
>>>teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name)
>>>for teenName in teenNames.collect():
...    print(teenName)

Let's Try Something Real-World!


Download the Yelp businesses dataset from http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/
JSON data format
Load and register as a table
Find the number of businesses in Pittsburgh with at least 10 reviews and more than 4 review stars
Print any 10 of them (a possible solution is sketched below)
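One possible solution sketch; the field names city, review_count, and stars (and the file name) are assumptions about the Yelp JSON schema:
>>>yelp = sqlContext.read.json('yelp_businesses.json')
>>>yelp.registerTempTable('businesses')
>>>result = sqlContext.sql("SELECT name FROM businesses WHERE city = 'Pittsburgh' AND review_count >= 10 AND stars > 4")
>>>result.count()
>>>result.show(10)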

Outline

Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips

Advanced Tips
Run your program in Spark local mode and debug
The Spark UI is at http://localhost:4040
Identify slow-running tasks
See RDD sizes
Be robust to stragglers
Set spark.speculation = true in the Spark configuration file (a sketch follows below)
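A minimal sketch of enabling speculative execution, either in conf/spark-defaults.conf or when building the context in Python ('myapp' and local[*] are placeholder values):
### conf/spark-defaults.conf
spark.speculation    true
### or programmatically
>>>conf = SparkConf().setMaster('local[*]').setAppName('myapp').set('spark.speculation', 'true')
>>>sc = SparkContext(conf=conf)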

References

Spark Programming Guide
Spark API Documentation
Spark Summit Tutorials

Thanks!

Questions?
