
* change the master parameter to change how the computing is distributed (local vs. cluster); the rest of the code stays the same (see the sketch below)
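
A minimal PySpark sketch of this; the app name and the cluster URL are placeholders:

    from pyspark import SparkConf, SparkContext

    # "local[*]" uses all local cores; swapping in a cluster URL such as
    # "spark://host:7077" distributes the same job, and nothing else changes.
    conf = SparkConf().setMaster("local[*]").setAppName("notes-demo")
    sc = SparkContext(conf=conf)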

* spark has a web UI for monitoring job and stage status, served by the driver at http://localhost:4040 by default


* stop a SparkContext with sc.stop(); you need this when a context is already running, since only one can be active at a time (first sketch after this list)
* collect: an action that triggers evaluation; RDDs are lazy, so the file is only read and the RDD actually computed when collect (or another action) runs, and the results are returned to the driver
* cache: keeps the RDD in RAM only; persist: can use both RAM and disk (e.g. MEMORY_AND_DISK), spilling to disk whatever doesn't fit in memory (second sketch below)
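
A short restart sketch in PySpark; SparkContext.getOrCreate reuses a running context if one exists:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()   # grab the already-running context
    sc.stop()                         # must stop it before starting another

    sc = SparkContext(master="local[2]", appName="restarted")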
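
And a sketch of the cache/persist distinction, assuming an existing context sc:

    from pyspark import StorageLevel

    data = sc.parallelize(range(1_000_000))

    hot = data.map(lambda x: x * 2)
    hot.cache()   # shorthand for persist(StorageLevel.MEMORY_ONLY): RAM only

    big = data.map(lambda x: x + 1)
    big.persist(StorageLevel.MEMORY_AND_DISK)   # partitions that overflow RAM spill to disk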

* map --> shuffling here (the framework groups records by key) --> reduceByKey (word-count sketch below)
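
The classic word count follows exactly this pipeline; the input path is hypothetical:

    lines = sc.textFile("input.txt")

    counts = (lines.flatMap(lambda line: line.split())   # map side: split into words
                   .map(lambda word: (word, 1))          # emit (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))     # shuffle groups by key, then sums

    print(counts.take(5))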


* aggregate: seqOp runs within each partition, combOp combines the per-partition results between workers (first sketch after this list)
* reduceByKey vs groupByKey: reduceByKey reduces within each partition of the RDD before the shuffle stage, so far less data crosses the network (second sketch below)
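
A small aggregate sketch computing an average; the zero value carries a (sum, count) pair:

    nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

    total, count = nums.aggregate(
        (0, 0),
        lambda acc, x: (acc[0] + x, acc[1] + 1),   # seqOp: folds elements within one partition
        lambda a, b: (a[0] + b[0], a[1] + b[1]),   # combOp: merges results between workers
    )
    print(total / count)   # 3.0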
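
Both of the following compute the same per-key sums; the difference is how much data gets shuffled:

    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    # reduceByKey combines values within each partition before the shuffle,
    # so only partial sums cross the network.
    sums = pairs.reduceByKey(lambda a, b: a + b)

    # groupByKey ships every single value through the shuffle and only then
    # lets us reduce on the receiving side.
    grouped = pairs.groupByKey().mapValues(sum)

    print(sums.collect())      # [('a', 2), ('b', 1)] in some order
    print(grouped.collect())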

* accumulator may be prone to faults inside transformations: if some machine crashes and Spark reruns the task (or recomputes a stage), the update is applied again and the global value may be wrong; only updates made inside actions are applied exactly once (sketch below)
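
A toy sketch of the pitfall, assuming an existing context sc; here the double-counting comes from recomputation rather than a crash, but the mechanism is the same:

    acc = sc.accumulator(0)

    def tag(line):
        acc.add(1)    # update inside a transformation: reapplied on any rerun
        return line

    mapped = sc.parallelize(["x", "y", "z"]).map(tag)

    mapped.count()    # first action: acc.value == 3
    mapped.count()    # RDD not cached, so the map reruns: acc.value == 6
    print(acc.value)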
