----------------------------------------------
----------------------------------------------
Java Syntax:
---------------
<data_type> <variable_name> = <value / exp> ;
Scala Syntax:
---------------
val <variable_name> : <data_type> = <value / exp>
----------------------------------------------
How to start Scala from the `command prompt`:
----------------------------------------------
Command: scala
----------------------------------------------
orienit@kalyan:~$ scala
Welcome to Scala version 2.11.7 (OpenJDK 64-Bit Server VM, Java 1.7.0_101).
Type in expressions to have them evaluated.
Type :help for more information.
scala>
----------------------------------------------
scala> val name : String = "kalyan"
name: String = kalyan
----------------------------------------------
Scala:
----------
In `Scala` everything is an `Object`;
there are no primitive datatypes like in Java.
Java:
----------
we have Objects
we have primitive datatypes
Examples:
-------------
int a = 1;
Integer a = 1;
Note:
1. In Java, `Objects` are serializable, not the primitive datatypes.
----------------------------------------------
`Scala` provides `Type Inference`.
----------------------------------------------
Examples for `Type Inference`:
scala> val id = 1
id: Int = 1
scala> val id = 1l
id: Long = 1
scala> val id = 1d
id: Double = 1.0
scala> val id = 1f
id: Float = 1.0
----------------------------------------------
Examples for `Operator Overloading`:
scala> val a = 10
a: Int = 10
scala> val b = 20
b: Int = 20
scala> val c = a + b
c: Int = 30
a + b ====> a.+(b)
scala> a.-(b)
res0: Int = -10
scala> a.*(b)
res1: Int = 200
scala> a./(b)
res2: Int = 0
scala> a.%(b)
res3: Int = 10
scala> a min b
res4: Int = 10
scala> a max b
res5: Int = 20
----------------------------------------------
Data Type Conversions:
---------------------------
scala> val id = 10
id: Int = 10
scala> id.to
toByte toDouble toInt toShort
toChar toFloat toLong toString
scala> id.toDouble
res6: Double = 10.0
scala> id.toLong
res7: Long = 10
scala> id.toString
res8: String = 10
scala> id.toChar
res9: Char =
(char code 10 is the newline character, so nothing visible is printed)
scala> id.toByte
res10: Byte = 10
----------------------------------------------
If, If-Else expressions in Scala
----------------------------------------------
if(exp1) {
  body1
}

if(exp2) {
  body1
} else {
  body2
}

if(exp1) {
  body1
} else if(exp2) {
  body2
} else {
  body3
}
Note:
1. Java, C, C++ support the ternary operator; Scala does not need it, because
`if / else` is itself an expression that returns a value (see the sketch below).
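A minimal sketch of using `if / else` where Java would use the ternary operator
(the variable names here are illustrative):

val a = 10
val b = 20
// Java: int max = (a > b) ? a : b;
val max = if (a > b) a else b   // max: Int = 20
----------------------------------------------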
Scala Syntax:
-------------------
val <variable_name> : Array[<data_type>] = Array[<data_type>](list of values)
----------------------------------------------
Java Examples:
----------------------
String[] names = {"kalyan", "venkat", "ravi"};
(or)
String[] names = new String[]{"kalyan", "venkat", "ravi"};
Scala Examples:
----------------------
val names : Array[String] = Array[String]("kalyan", "venkat", "ravi")
(or)
val names = Array("kalyan", "venkat", "ravi")
----------------------------------------------
scala> names(0)
res11: String = kalyan
scala> names(1)
res12: String = venkat
scala> names(2)
res13: String = ravi
----------------------------------------------
scala> val names : Array[String] = new Array[String](3)
names: Array[String] = Array(null, null, null)
scala> names(0) = "kalyan"
scala> names(1) = "venkat"
scala> names(2) = "ravi"
scala> names
res17: Array[String] = Array(kalyan, venkat, ravi)
----------------------------------------------
scala> val names = Array[String]("kalyan", "venkat", "ravi")
names: Array[String] = Array(kalyan, venkat, ravi)
----------------------------------------------
----------------------------------------------
scala> val ids = Array[Int](1,2,3,4,5,6)
ids: Array[Int] = Array(1, 2, 3, 4, 5, 6)
scala> for( id <- ids) println(id)
1
2
3
4
5
6
----------------------------------------------
Scala supports 2 types of collections:
1. Immutable (scala.collection.immutable)
2. Mutable (scala.collection.mutable)
----------------------------------------------
scala> scala.collection.immutable.
:: LongMap SortedMap
AbstractMap LongMapEntryIterator SortedSet
BitSet LongMapIterator Stack
DefaultMap LongMapKeyIterator Stream
HashMap LongMapUtils StreamIterator
HashSet LongMapValueIterator StreamView
IndexedSeq Map StreamViewLike
IntMap MapLike StringLike
IntMapEntryIterator MapProxy StringOps
IntMapIterator Nil Traversable
IntMapKeyIterator NumericRange TreeMap
IntMapUtils Page TreeSet
IntMapValueIterator PagedSeq TrieIterator
Iterable Queue Vector
LinearSeq Range VectorBuilder
List RedBlackTree VectorIterator
ListMap Seq VectorPointer
ListSerializeEnd Set WrappedString
ListSet SetProxy
----------------------------------------------
scala> scala.collection.mutable.
AVLIterator ListBuffer
AVLTree ListMap
AbstractBuffer LongMap
AbstractIterable Map
AbstractMap MapBuilder
AbstractSeq MapLike
AbstractSet MapProxy
AnyRefMap MultiMap
ArrayBuffer MutableList
ArrayBuilder Node
ArrayLike ObservableBuffer
ArrayOps ObservableMap
ArraySeq ObservableSet
ArrayStack OpenHashMap
BitSet PriorityQueue
Buffer PriorityQueueProxy
BufferLike Publisher
BufferProxy Queue
Builder QueueProxy
Cloneable ResizableArray
DefaultEntry RevertibleHistory
DefaultMapModel Seq
DoubleLinkedList SeqLike
DoubleLinkedListLike Set
FlatHashTable SetBuilder
GrowingBuilder SetLike
HashEntry SetProxy
HashMap SortedSet
HashSet Stack
HashTable StackProxy
History StringBuilder
ImmutableMapAdaptor Subscriber
ImmutableSetAdaptor SynchronizedBuffer
IndexedSeq SynchronizedMap
IndexedSeqLike SynchronizedPriorityQueue
IndexedSeqOptimized SynchronizedQueue
IndexedSeqView SynchronizedSet
Iterable SynchronizedStack
LazyBuilder Traversable
Leaf TreeSet
LinearSeq Undoable
LinkedEntry UnrolledBuffer
LinkedHashMap WeakHashMap
LinkedHashSet WrappedArray
LinkedList WrappedArrayBuilder
LinkedListLike
----------------------------------------------
1. Convert `Immutable Collection` to `Mutable Collection` using `toBuffer`
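A minimal sketch of that conversion (the names below are illustrative):

val nums = List(1, 2, 3)      // immutable collection
val buf = nums.toBuffer       // mutable Buffer
buf += 4                      // mutation is now allowed
val back = buf.toList         // back to an immutable List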
----------------------------------------------
Examples on Collections:
----------------------------
val ids = Array[Int](1,2,3,4,5,6)
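A few common operations on `ids` (results shown as comments; this is a sketch,
not captured REPL output):

ids.map(x => x * x)           // Array(1, 4, 9, 16, 25, 36)
ids.filter(x => x % 2 == 0)   // Array(2, 4, 6)
ids.sum                       // 21
ids.max                       // 6
ids.toBuffer                  // mutable Buffer(1, 2, 3, 4, 5, 6)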
----------------------------------------------
----------------------------------------------
----------------------------------------------
----------------------------------------------
Scala supports 3 types of functions:
----------------------------------------------
1. Anonymous functions
2. Named functions
3. Curried functions
1. Anonymous functions
----------------------------------------------
(a: Int, b: Int) => { a + b }
scala> val add = (a: Int, b: Int) => { a + b }
add: (Int, Int) => Int = <function2>
scala> add(10,20)
res29: Int = 30
2. Named functions
----------------------------------------------
def add(a: Int, b: Int) = { a + b }
scala> add(1,3)
res30: Int = 4
scala> add(20,10)
res31: Int = 30
3. Curried functions
----------------------------------------------
def add1(a: Int, b: Int) = { a + b }
def add2(a: Int)(b: Int) = { a + b }
----------------------------------------------
scala> add1(10,20)
res32: Int = 30
scala> add2(10,20)
<console>:14: error: too many arguments for method add2: (a: Int)(b: Int)Int
add2(10,20)
^
scala> add2(10)(20)
res34: Int = 30
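One benefit of currying is partial application; a small sketch reusing the
`add2` definition above:

val add10 = add2(10) _        // fix the first parameter list
add10(20)                     // Int = 30
add10(5)                      // Int = 15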
----------------------------------------------
----------------------------------------------
Spark
----------------------------------------------
RDD features:
---------------------
1. Immutability
2. Lazy Evaluation
3. Cacheable
4. Type Inference
RDD Operations:
---------------------
1. Transformations ( convert old_rdd into new_rdd )
2. Actions ( compute a result from the rdd data )
1. Transformations:
---------------------------
f1(x) = { x + 1}
f2(x) = { x * x}
2. Actions: (a Spark sketch of both follows below)
---------------------------
min(list) <- 1
max(list) <- 4
sum(list) <- 10
count(list) <- 4
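A small sketch of the same idea on an RDD (assumes a SparkContext `sc`, as in
spark-shell; the list values are illustrative):

val nums = sc.parallelize(List(1, 2, 3, 4))
val plusOne = nums.map(x => x + 1)   // transformation: old_rdd -> new_rdd (lazy)
val squares = nums.map(x => x * x)   // transformation (lazy)
nums.min()                           // action: 1
nums.max()                           // action: 4
nums.sum()                           // action: 10.0 (sum returns a Double)
nums.count()                         // action: 4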
----------------------------------------------
Spark Shell Start commands:
----------------------------------------------
scala => spark-shell
python => pyspark
R => SparkR
----------------------------------------------
----------------------------------------------
How to create RDD?
----------------------------------------------
We can create an RDD in Spark in 2 ways:
1. from collections (list, seq, ...)
2. from datasets (txt, csv, tsv, json, hbase, ...)
----------------------------------------------
How to create RDD from collections?
----------------------------------------------
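A minimal sketch using `sc.parallelize` (the collection and partition count are
illustrative):

val ids = List(1, 2, 3, 4, 5, 6)
val rdd = sc.parallelize(ids)           // default number of partitions
val rddWith4 = sc.parallelize(ids, 4)   // explicit number of partitions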
----------------------------------------------
How to create RDD from datasets?
----------------------------------------------
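A minimal sketch using `sc.textFile`, matching the example shown further below:

val path = "file:///home/orienit/work/input/demoinput"
val rdd = sc.textFile(path)             // one RDD element per line of the file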
----------------------------------------------
Examples on RDD:
----------------------------------------------
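The definitions of `rdd1` and `rdd2` are not shown in this excerpt; given the
partition counts and `collect()` output below, they were likely created along
these lines:

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6), 4)
val rdd2 = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)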
scala> rdd1.getNumPartitions
res0: Int = 4
scala> rdd2.getNumPartitions
res1: Int = 2
----------------------------------------------
NOTE:
1. `rdd.collect()` will display the RDD data in the console, similar to the PIG
`dump` command.
----------------------------------------------
scala> rdd1.collect()
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6)
scala> rdd2.collect()
res4: Array[Int] = Array(1, 2, 3, 4, 5, 6)
scala> rdd1.glom().collect()
res5: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4), Array(5, 6))
scala> rdd2.glom().collect()
res6: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))
----------------------------------------------
scala> rdd3.getNumPartitions
res7: Int = 3
scala> rdd3.glom().collect()
res8: Array[Array[Int]] = Array(Array(5), Array(1, 2, 6), Array(3, 4))
scala> rdd4.glom().collect()
res10: Array[Array[Int]] = Array(Array(5, 3, 4), Array(1, 2, 6))
----------------------------------------------
scala> rdd1.collect()
res11: Array[Int] = Array(1, 2, 3, 4, 5, 6)
scala> rdd11.collect()
res13: Array[Int] = Array(2, 3, 4, 5, 6, 7)
scala> rdd12.collect()
res14: Array[Int] = Array(2, 3, 4, 5, 6, 7)
----------------------------------------------
scala> val rdd11 = rdd1.map((x : Int) => { x + 1 })
rdd11: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at map at <console>:28
scala> rdd11.collect
res15: Array[Int] = Array(2, 3, 4, 5, 6, 7)
scala> rdd13.collect
res16: Array[Int] = Array(2, 3, 4, 5, 6, 7)
scala> rdd14.collect
res17: Array[Int] = Array(2, 3, 4, 5, 6, 7)
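`rdd12`, `rdd13` and `rdd14` are not defined in this excerpt; since they produce
the same result as `rdd11`, they were presumably equivalent ways of writing the
same map, e.g.:

val rdd12 = rdd1.map(x => x + 1)        // parameter type inferred
val rdd13 = rdd1.map(_ + 1)             // placeholder syntax
val rdd14 = rdd1.map((x: Int) => x + 1)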
----------------------------------------------
val path = "file:///home/orienit/work/input/demoinput"
val rdd = sc.textFile(path)
----------------------------------------------
scala> rdd.getNumPartitions
res18: Int = 2
scala> rdd.getNumPartitions
res19: Int = 1
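The two different partition counts suggest the RDD was re-created between these
calls; `textFile` accepts an optional minimum-partition argument, e.g.:

scala> val rdd = sc.textFile(path)       // default splits => 2 partitions here
scala> val rdd = sc.textFile(path, 1)    // minPartitions = 1 => 1 partition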
scala> rdd.collect()
res20: Array[String] = Array(I am going, to hyd, I am learning, hadoop course)
scala> rdd.collect().foreach(println)
I am going
to hyd
I am learning
hadoop course
----------------------------------------------
Word Count in Spark:
----------------------------------------------
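The full pipeline is not shown here; a typical word count that produces `sorted`
might look like this (the input and output paths are illustrative):

val input = "file:///home/orienit/work/input/demoinput"
val output = "file:///home/orienit/work/output/wordcount"
val lines = sc.textFile(input)
val words = lines.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)
val sorted = counts.sortByKey()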
sorted.saveAsTextFile(output)
----------------------------------------------
----------------------------------------------