
Why run Spark on Kubernetes?


Rachit Arora
Sep 1, 2018 · 5 min read

Spark is an open source, scalable, massively parallel, in-memory execution engine for
analytics applications.

Think of it as an in-memory layer that sits above multiple data stores, where data can
be loaded into memory and analyzed in parallel across a cluster. Spark Core, the
foundation of Spark, provides the libraries for scheduling and basic I/O. Spark offers
hundreds of high-level operators that make it easy to build parallel apps.

Spark also includes prebuilt machine-learning and graph-analysis algorithms written
specifically to execute in parallel and in memory, and it supports interactive SQL query
processing and real-time streaming analytics. You can write analytics applications in
programming languages such as Java, Python, R and Scala.


You can run Spark using its standalone cluster mode, in the cloud, on Hadoop YARN, on
Apache Mesos, or on Kubernetes, and access data in HDFS, Cassandra, HBase, Hive, object
stores, and any Hadoop data source.

Spark can run on clusters managed by Kubernetes. This makes use of the native
Kubernetes scheduler support that has been added to Spark.

How does it work?

Spark Submit on k8s

spark-submit can be directly used to submit a Spark application to a Kubernetes cluster.


The submission mechanism works as follows:

Spark creates a Spark driver running within a Kubernetes pod.

The driver creates executors, which also run within Kubernetes pods, connects to
them, and executes application code.

When the application completes, the executor pods terminate and are cleaned up,
but the driver pod persists logs and remains in a “completed” state in the Kubernetes
API until it is eventually garbage collected or manually cleaned up.
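
For example, a cluster-mode submission to Kubernetes looks like the following. This is
a minimal sketch built from the standard spark-submit flags; the API server address and
image name are placeholders to adapt to your cluster, and the example jar path matches
the examples jar shipped inside the Spark 2.3 distribution:

```bash
# The k8s:// prefix on --master selects the Kubernetes scheduler backend;
# local:// points at a jar that is already inside the container image.
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

spark-submit talks directly to the Kubernetes API server, which is what creates the
driver pod described above.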

But why should we use Kubernetes as the cluster manager?



This integration is certainly very interesting, but the important question one should
consider is why an organization should choose Kubernetes as its cluster manager rather
than the standalone scheduler that ships with Spark, or a production-grade cluster
manager like YARN.

Let me attempt to answer that question with the following points.

1. Are you using a containerized data analytics pipeline? A typical company wants
its data pipelines unified: if a pipeline requires a queue, Spark, a database, and
visualization applications, and everything runs in containers, it makes sense to run
Spark in containers as well and use Kubernetes to manage the entire pipeline.

2. Resource sharing is better optimized. Instead of running each part of your pipeline
on dedicated hardware, it is far more efficient to run everything on a Kubernetes
cluster: resources are shared, because not all components of a pipeline are running
all the time.

3. Leveraging the Kubernetes ecosystem. Spark workloads can make direct use of
Kubernetes clusters for multi-tenancy and sharing through namespaces and quotas,
as well as administrative features such as pluggable authorization and logging. Best
of all, it requires no changes or new installations on your Kubernetes cluster: simply
create a container image and set up the right RBAC roles for your Spark application
and you are all set. When you use Kubernetes as your cluster manager, you get
seamless integration with logging and monitoring solutions. The community is
also exploring advanced use cases such as managing streaming workloads and
leveraging service meshes like Istio.
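
For example, the per-tenant setup boils down to a few kubectl commands. This is a
minimal sketch; the spark-jobs namespace and spark service-account names are made up
for the illustration:

```bash
# A dedicated namespace lets you attach quotas and policies to Spark jobs.
kubectl create namespace spark-jobs

# The service account the Spark driver pod will run as.
kubectl create serviceaccount spark --namespace=spark-jobs

# RBAC: allow the driver to create and watch executor pods.
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=spark-jobs:spark
```

The submission is then pointed at that account via the
spark.kubernetes.authenticate.driver.serviceAccountName configuration property.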

4. Limitations of the standalone scheduler. You can easily run Spark on Kubernetes
by starting a Spark cluster in standalone mode, i.e. running the Spark master, the
workers, and thus a full Spark cluster on Kubernetes. This approach is entirely
feasible and works for many scenarios. It comes with its own benefits: it is a very
stable way to run Spark on Kubernetes, and you can leverage all the cool features of
Kubernetes, such as resource management, a variety of persistent storage options, and
its integrations for logging and service meshes (Istio). But Spark is not aware that it
is running on Kubernetes, and Kubernetes is not aware that it is running Spark. This
means we need to repeat or duplicate some of the configuration: if we want executors
of 10 GB, we have to start the Kubernetes pods with that setting and specify the Spark
daemon memory accordingly. Note that you will need to give some extra memory to the
Spark daemon processes (the Spark master and workers) and cannot use the entire memory
of the pod for the driver or executors alone, as the sketch below shows. This wastes
resources, and it is easy to mess up the configuration if it is not done by experts :).
Another limitation of the standalone approach is that it is difficult to manage
elasticity, which standalone mode does not provide by default.
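
A hypothetical sketch of that duplication (the 11 GiB pod limit and 10 GB worker memory
are invented figures): the pod's size is declared in the Kubernetes manifest, while the
worker's usable memory must be kept in sync by hand through Spark's standalone
environment variables:

```bash
# Kubernetes side: the worker pod's manifest caps the container, e.g.
#   resources:
#     limits:
#       memory: "11Gi"

# Spark side: spark-env.sh on the worker repeats (most of) the same figure.
export SPARK_WORKER_MEMORY=10g   # memory the worker may hand out to executors
export SPARK_DAEMON_MEMORY=1g    # reserved for the worker JVM itself
```

If the two sides drift apart, the worker JVM can exceed the pod's limit and be
OOM-killed by Kubernetes, which is exactly the kind of misconfiguration warned about
above.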

5. YARN vs. Kubernetes. This is still not quite a fair comparison, as Kubernetes does
not yet support all the frameworks that use YARN as their resource manager/negotiator.
YARN is used for many production workloads and can be used to run any application.
Spark treats YARN as a container management system: it requests containers with
defined resources, and once Spark acquires the containers it builds RPC-based
communication with them to run the driver and executors. YARN does not handle the
service layer, whereas Spark needs to handle the service layer itself.
YARN can scale automatically by releasing and acquiring containers. One major
disadvantage of YARN containers is that they always run a JVM, whereas when you run
on Kubernetes you can use your own image, which can be very minimal.
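
The practical difference shows up at submission time. In the sketch below (standard
spark-submit options; the application class, jar, and image names are placeholders),
the YARN job always lands in a JVM inside a YARN container, while the Kubernetes job
runs in pods built from an image you control:

```bash
# YARN: executors run as JVMs inside YARN containers of the requested size.
spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 4g --executor-cores 2 \
  --class com.example.MyApp my-app.jar

# Kubernetes: the same resource request, but the executor pods use your
# own image, which can be stripped down to just Spark and its runtime.
spark-submit --master k8s://https://<apiserver>:6443 --deploy-mode cluster \
  --executor-memory 4g --executor-cores 2 \
  --conf spark.kubernetes.container.image=<your-minimal-image> \
  --class com.example.MyApp local:///opt/app/my-app.jar
```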

6. Kubernetes community support. If, as an organization, you need to choose between
container orchestrators, you can easily choose Kubernetes just for the community
support it has, apart from the fact that it can run on-prem as well as on the cloud
provider of your choice, so there is no cloud lock-in to suffer. Kubernetes is agnostic
of the container runtime, and it has a very vast feature list: support for running
clustered applications on containers, service load balancing, service upgrades without
stopping or disruption, and a well-defined storage story.

Why you may want to avoid using Kubernetes as the Spark cluster manager
This is still a beta feature and not ready for production yet. Many features, such as
dynamic resource allocation, in-cluster staging of dependencies, support for PySpark and
SparkR, support for Kerberized HDFS clusters, client mode, and popular interactive
notebook execution environments, are still being worked on and are not yet available.

Another area that needs improvement is storing executor logs and history server events
on a persistent volume so that they can be referred to later, as sketched below.
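
For instance, once persistent-volume support matures, the event log could be written to
a PersistentVolumeClaim mounted into the driver. This is a sketch only: the volume-mount
configuration properties shown were added in later Spark releases, and the claim name,
mount path, application class, and jar are placeholders:

```bash
# Write the event log to a mounted PVC so the history server can read it
# after the driver pod is gone.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///spark-logs \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.logs.mount.path=/spark-logs \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.logs.options.claimName=<your-pvc> \
  --class com.example.MyApp local:///opt/app/my-app.jar
```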

Summary
Apache Spark is an essential tool for data scientists, offering a robust platform for a
variety of applications ranging from large scale data transformation to analytics to
machine learning.

Data scientists are adopting containers to improve their workflows by realizing benefits
such as packaging of dependencies and creating reproducible artifacts.

Given that Kubernetes is the standard for managing containerized environments, it is a
natural fit to have support for Kubernetes APIs within Spark.

