Alexey Filanovskiy
Cloudera certified developer
Architecture
MapReduce and Hive
Processing Layer:
- Spark
- Impala
- Search
- Big Data SQL
- NoSQL Databases (Oracle NoSQL DB, HBase)
Architecture
Spark automatically elects one node of the cluster to run the Driver Program (the main coordinator).
The driver manages all processing distributed across the other nodes.
RDD
RDD. Definition
An RDD (Resilient Distributed Dataset) in Spark is simply an immutable distributed collection of
objects. Each RDD is split into multiple partitions, which may be computed on different nodes
of the cluster.
In other words, an RDD is the input for your Spark jobs.
Spark provides two ways to create RDDs:
- loading an external dataset
- parallelizing a collection in your driver program.
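As a rough plain-Scala analogy (no Spark required; the file contents below are illustrative), loading an external dataset corresponds to reading lines from a file, and parallelizing corresponds to starting from an in-memory collection:

```scala
import java.nio.file.Files
import scala.jdk.CollectionConverters._

// Analogy 1: sc.textFile(path) -- load an external dataset as lines.
val path = Files.createTempFile("weblogs", ".txt")
Files.write(path, java.util.Arrays.asList("get /adidas/shoe", "get /nike/shoe"))
val fromFile = Files.readAllLines(path).asScala.toList

// Analogy 2: sc.parallelize(coll) -- start from an in-memory collection.
val fromCollection = List(1, 2, 3, 4)
```

In real Spark both calls return an RDD spread over the cluster; here they return local collections, which is enough to show the two entry points.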
RDD. Terminology
[Diagram: an HDFS file's blocks (B1, B2, B3) become the RDD's partitions, the unit of parallelism; e.g. partitions holding (1 2), (3 4 5), (6 7 8 9), (10)]
RDD transformation
Key concepts:
- Transformations are operations on RDDs that return a new RDD
- Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action
Example:
scala> val weblog = sc.textFile("hdfs://localhost:8020/user/hive/warehouse/weblogs")
scala> val NewRDDwithFilter = weblog.filter(line => line.contains("adidas"))
scala> NewRDDwithFilter.count()
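The lazy-evaluation idea can be sketched with a plain-Scala LazyList (no Spark needed): declaring a transformation runs nothing, and only a terminal operation, the analogue of an action, forces the work. The counter here is purely illustrative:

```scala
var evaluations = 0

// "Transformation": declaring the mapped LazyList runs no work yet.
val base = LazyList(1, 2, 3, 4)
val mapped = base.map { x => evaluations += 1; x * 10 }
val lazyCount = evaluations     // captured before forcing: still 0

// "Action": forcing the result triggers the computation.
val total = mapped.sum
val forcedCount = evaluations   // now all four elements were evaluated
```

This mirrors why `NewRDDwithFilter` above does nothing until `count()` is called.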
scala> import org.apache.spark.storage.StorageLevel
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> input.persist(StorageLevel.MEMORY_ONLY)
scala> val result1 = input.map(x => x * x)
scala> val result2 = input.filter(x => x != 1)
scala> println(result1.collect().mkString(","))
1,4,9,16
scala> println(result2.collect().mkString(","))
2,3,4
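persist() matters because, without it, each action recomputes the RDD's lineage from scratch. A plain-Scala sketch of that behaviour, using a load counter and a hypothetical `loadBase()` source (illustrative, not Spark itself):

```scala
var reads = 0
def loadBase(): List[Int] = { reads += 1; List(1, 2, 3, 4) }

// Without caching: every "action" re-runs the whole load + transform.
def uncached = loadBase().map(x => x * x)
val r1 = uncached.sum                 // triggers one load
val r2 = uncached.max                 // triggers a second load
val readsWithoutCache = reads         // loaded twice

// With "persist": materialize once, then reuse the cached result.
val cached = loadBase().map(x => x * x)
val r3 = cached.sum
val r4 = cached.max
val readsTotal = reads                // only one additional load
```

In Spark the same effect is what `input.persist(StorageLevel.MEMORY_ONLY)` buys you when both `result1` and `result2` are later collected.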
2) flatMap() Return a new RDD by applying a function to each element and flattening the results; the slide's sample output was:
one,one,two,one,two,three,one,two,three,four
3) filter() Return an RDD consisting of only the elements that pass the condition passed to filter()
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> val result = input.filter(line => line != 1)
scala> println(result.collect().mkString(","))
2,3,4
Copyright 2014 Oracle and/or its affiliates. All rights reserved. |
map(x => x + 1)
This function adds 1 to each element, e.g. map(x => 3 + 1) => 4
Output: 4, 5, 6, 7

map(x => x + x)
This function adds each element to itself, e.g. map(x => 3 + 3) => 6
Output: 6, 8, 10, 12

map(x => x * x)
This function multiplies each element by itself, e.g. map(x => 3 * 3) => 9
Output: 9, 16, 25, 36
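All three map() walkthroughs can be checked with plain Scala collections, which share the same map semantics (no Spark needed); the input (3, 4, 5, 6) is taken from the examples above:

```scala
val input = List(3, 4, 5, 6)

val plusOne = input.map(x => x + 1)   // 4, 5, 6, 7
val doubled = input.map(x => x + x)   // 6, 8, 10, 12
val squared = input.map(x => x * x)   // 9, 16, 25, 36
```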
cartesian() pairs every element of one RDD with every element of another; for (1, 2, 3, 4) and (3, 4, 5, 6):
(1,3),(1,4),(1,5),(1,6),(2,3),(2,4),(2,5),(2,6),(3,3),(3,4),(3,5),(3,6),(4,3),(4,4),(4,5),(4,6)
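That pair listing is a Cartesian product; in plain Scala the same result comes from a for-comprehension (Spark exposes it as rdd1.cartesian(rdd2)):

```scala
val left = List(1, 2, 3, 4)
val right = List(3, 4, 5, 6)

// Every element of `left` paired with every element of `right`: 4 x 4 = 16 pairs.
val product = for (a <- left; b <- right) yield (a, b)
```

Note that cartesian() is expensive on large RDDs, since the result size is the product of the two input sizes.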
RDD actions
reduce((x, y) => x + y)
This walks from the first element to the last. Example for (3, 4, 5, 6):
Initially x = 3, y = 4: 3 + 4 = 7
Then x = 7, y = 5: 7 + 5 = 12
Then x = 12, y = 6: 12 + 6 = 18, the result
reduce((x, y) => x - y)
This walks from the first element to the last. Example for (3, 4, 5, 6):
Initially x = 3, y = 4: 3 - 4 = -1
Then x = -1, y = 5: -1 - 5 = -6
Then x = -6, y = 6: -6 - 6 = -12, the result
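Both walkthroughs match a left-to-right reduce over a plain Scala collection:

```scala
val input = List(3, 4, 5, 6)

// reduce((x, y) => x + y): ((3 + 4) + 5) + 6
val sum = input.reduce((x, y) => x + y)    // 18

// reduce((x, y) => x - y): ((3 - 4) - 5) - 6
val diff = input.reduce((x, y) => x - y)   // -12
```

One caveat: Spark's reduce() only guarantees a deterministic result when the function is commutative and associative. Subtraction is neither, so on a real cluster the second result can vary with partitioning; it is shown here only to trace the evaluation order.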
Pair RDD
Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs,
and they are very useful for group-by-key style operations.
Creating Pair RDD, example:
scala> val inputRDD = sc.parallelize(List("first string word some other", "second string hello"))
scala> val pairs = inputRDD.map(x => (x.split(" ")(0), x))
scala> println(pairs.collect().mkString(","))
(first,first string word some other),(second,second string hello)
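The pair-building step, and the group-by-key use it enables, can be sketched with plain Scala tuples (no Spark required; a third line is added here to make the grouping non-trivial):

```scala
val input = List("first string word some other", "second string hello", "first again")

// Key each line by its first word, as in the slide's map().
val pairs = input.map(x => (x.split(" ")(0), x))

// Group-by-key style operation: count the lines under each key.
val counts = pairs.groupBy(_._1).map { case (k, v) => (k, v.size) }
```

In Spark the grouping step would be a pair-RDD operation such as groupByKey or reduceByKey rather than Scala's groupBy.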
Parallel execution
Key concepts:
1) Every RDD has a fixed number of partitions that determine the degree of parallelism
- To see how many partitions a given RDD contains, run:
scala> bigRDD.partitions.size
res115: Int = 102
2) By default the number of partitions equals the number of HDFS blocks:
[cloudera@quickstart ~]$ hdfs fsck /user/hive/warehouse/weblogs/ | grep "Total blocks"
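How elements spread across a fixed number of partitions can be imitated with grouped() on a plain Scala collection (illustrative only; Spark's parallelize splits ranges in a similar spirit):

```scala
val data = (1 to 10).toList
val numPartitions = 4

// Round the chunk size up so at most numPartitions groups are produced.
val size = math.ceil(data.length.toDouble / numPartitions).toInt
val partitions = data.grouped(size).toList   // 4 chunks: (1,2,3), (4,5,6), (7,8,9), (10)
```

Each chunk plays the role of one partition, i.e. one unit of parallel work.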
Spark Partitioning