Spark is an open source, scalable, massively parallel, in-memory execution engine for
analytics applications.
Think of it as an in-memory layer that sits above multiple data stores, where data can
be loaded into memory and analyzed in parallel across a cluster. Spark Core, the
foundation of Spark, provides the libraries for scheduling and basic I/O, and Spark
offers over 100 high-level operators that make it easy to build parallel apps.
Spark also includes prebuilt machine-learning and graph-analysis algorithms that are
written specifically to execute in parallel and in memory, and it supports interactive
SQL query processing and real-time streaming analytics. You can write analytics
applications in programming languages such as Java, Python, R, and Scala.
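To make this concrete, here is a minimal sketch of a Spark application in Scala using
the DataFrame API and interactive SQL. The input file, schema, and column names are
illustrative assumptions, not taken from the article.

    import org.apache.spark.sql.SparkSession

    object QuickExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("quick-example")
          .getOrCreate()
        import spark.implicits._

        // High-level operators (filter, groupBy, count) execute in parallel
        // across the cluster on in-memory data.
        val events = spark.read.json("events.json") // hypothetical input file
        events.filter($"status" === "ok")
          .groupBy($"country")
          .count()
          .show()

        // The same data can also be queried with interactive SQL.
        events.createOrReplaceTempView("events")
        spark.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()

        spark.stop()
      }
    }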
You can run Spark using its standalone cluster mode, on the cloud, on Hadoop YARN, on
Apache Mesos, or on Kubernetes, and it can access data in HDFS, Cassandra, HBase, Hive,
object stores, and any other Hadoop data source.
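As a sketch of that storage flexibility (all URIs and table names below are
placeholders), the same read API covers very different backends, spark-shell style:

    import org.apache.spark.sql.SparkSession

    // enableHiveSupport() is needed for the Hive query, and the connector
    // jars for each backend must be on the classpath.
    val spark = SparkSession.builder()
      .appName("data-sources")
      .enableHiveSupport()
      .getOrCreate()

    val fromHdfs = spark.read.parquet("hdfs://namenode:8020/warehouse/clicks")
    val fromObjectStore = spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/raw/clicks.csv") // S3-compatible object store
    val fromHive = spark.sql("SELECT * FROM default.clicks") // Hive table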
Spark can run on clusters managed by Kubernetes. This makes use of the native
Kubernetes scheduler support that has been added to Spark.
How does it work?
When you submit a Spark application to a Kubernetes cluster, Spark first creates the
driver inside a Kubernetes pod. The driver then creates the executors, which also run
within Kubernetes pods, connects to them, and executes the application code.
When the application completes, the executor pods terminate and are cleaned up,
but the driver pod persists logs and remains in a "completed" state in the Kubernetes
API until it is eventually garbage collected or manually cleaned up.
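A typical submission against this native integration looks like the following sketch,
in the style of the Spark documentation; the API-server address, container image, and
jar path are placeholders:

    bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:6443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=3 \
      --conf spark.kubernetes.container.image=<your-spark-image> \
      local:///path/to/examples.jar

Here spark-submit talks to the Kubernetes API server directly: the driver and executor
pods described above are created for you, with no standalone master or YARN
ResourceManager in between.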
This integration is certainly interesting, but the important question to consider is
why an organization should choose Kubernetes as its cluster manager rather than the
standalone scheduler that ships with Spark, or a production-grade cluster manager like
YARN.
4. Limitations of the standalone scheduler. You can easily run Spark on Kubernetes
by starting a Spark cluster in standalone mode, that is, by running the Spark master,
the workers, and thus the full Spark cluster on Kubernetes. This approach is entirely
feasible and works for many scenarios. It comes with its own benefits: it is a very
stable way to run Spark on Kubernetes, and you can leverage Kubernetes features such
as resource management, a variety of persistent storage options, and integrations for
logging and service mesh (Istio). But Spark is not aware that it is running on
Kubernetes, and Kubernetes is not aware that it is running Spark, so neither side can
schedule with the other in mind; see the sketch below.
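For contrast, a hypothetical submission to such a standalone cluster running on
Kubernetes (the Service name spark-master-svc and the jar location are assumptions)
looks like any other standalone submit:

    # Spark only sees a standalone master URL; that the master and workers
    # are pods behind a Kubernetes Service is invisible to it.
    bin/spark-submit \
      --master spark://spark-master-svc:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp \
      hdfs://namenode:8020/jars/my-app.jar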
In the native Kubernetes integration, SparkR, support for Kerberized HDFS clusters,
client mode, and popular interactive notebook execution environments are still being
worked on and are not yet available.
Other features still need improvement, such as storing executor logs and history-server
events on persistent volumes so that they can be referred to later.
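One possible workaround, sketched below with standard Spark properties (the volume
name, mount path, and claim name are assumptions), is to write event logs to a
PersistentVolumeClaim mounted into the driver pod, so a history server can read them
after the pods are gone:

    bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:6443 \
      --deploy-mode cluster \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=file:///spark-events \
      --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.events.mount.path=/spark-events \
      --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.events.options.claimName=spark-events-pvc \
      --class com.example.MyApp \
      local:///path/to/app.jar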
Summary
Apache Spark is an essential tool for data scientists, offering a robust platform for a
variety of applications ranging from large-scale data transformation to analytics to
machine learning.
Data scientists are adopting containers to improve their workflows by realizing benefits
such as packaging of dependencies and creating reproducible artifacts.