
Evaluative summary of Databricks' value propositions

The basic task of knowing what is going on in the business drives platform adoption and technology buying decisions. Databricks is a managed platform for running Apache Spark that aims to provide a fast, general-purpose GUI for large-scale data analysis. It is a packaged implementation of Spark that reduces the complexity of setup and operation by providing dashboards and scheduled jobs; the client does not have to learn cluster management concepts or perform Spark cluster maintenance. It is a point-and-click interface for data analysts and BI professionals, with options to automate data jobs and integrate with private clusters on AWS. Its core components are:

1. Workspaces (file storage),

2. Libraries (Python and Java libs to extend functions),

3. Tables (structured data registered and queryable as SQL tables),

4. Clusters (managed Spark cluster instances),

5. Jobs (scheduled data workloads) and

6. Notebooks (analogous to Jupyter notebooks, Apache Zeppelin and R Notebooks; they execute Scala, Python and SQL code and show the results in the same document; see the sketch after this list)
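For illustration, a minimal sketch of what a Scala notebook cell might look like, assuming a cluster is already attached and a table named "trips" has been registered (the table and column names are hypothetical); `display()` is the Databricks notebook helper that renders results as an interactive table or chart, and `spark` is the session provided by the notebook:

// Notebook cell on an attached cluster; `spark` and `display` are provided by Databricks
import org.apache.spark.sql.functions.desc

// "trips" is a hypothetical table registered under the Tables component
val trips = spark.table("trips")

// Aggregate with the DataFrame API; the same query could also be written as a SQL cell
val tripsByCity = trips.groupBy("city").count().orderBy(desc("count"))

// Databricks-specific helper that renders the result in the notebook
display(tripsByCity)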

First impressions from the Databricks community edition: it feels like a merge of visualization suites such as inCites / Exploratory.io with live-code tools such as Jupyter / Zeppelin. It comes across as an investigative convenience tool that pulls together the functionality of Apache Spark and presents it in a web-based interface. Since Spark became a top-level Apache project in 2014, it has improved tremendously in data integration, ETL, machine learning and visualization. Data scientists can now use the Python APIs to run BI code, and visualization tools like QlikView / Tableau can connect directly to Spark SQL. The data scientist responsible for driving insights is most likely already proficient in all of the aforementioned tools. The question then arises: what value would Databricks add to this existing and rapidly evolving infrastructure?

Databricks presents itself as a convenience tool that anyone can be trained on, offering easy cluster management, ease of setup, collaboration, visualization and so on. Although the web-based interface saves time on visualizations, it certainly restricts customization in machine learning frameworks, especially deep learning, forcing power users to keep their queries and analyses within the bounds of the web-based system. As a data scientist, I have used similar systems before and would still prefer Python and R over a web-based tool for the heavy lifting and flexibility. Certain areas that will undergo massive change with the use of transfer learning (a deep learning technique) are real-time outlier and fraud detection and recommendations based on user feedback. The web-based system shows no support for transfer learning; this remains a vision in the company profile.

It is important to note that Databricks was founded by the creators of Apache Spark, which played a huge part in its success in seed funding rounds. Its popularity with venture capitalists notwithstanding, companies would be better off hiring another talented data scientist for the price of the annual subscription.

Appendix: Quick review of the latest offerings in Spark

Spark is paving new ways to give data scientists easier access to big data, which is reflected in its latest architecture and platform integrations. The most recent update is the introduction of the Dataset, a combination of RDDs and DataFrames: Datasets let users work with typed objects as with an RDD while querying them as with a DataFrame. Datasets are expected to be the way forward for Spark data structures (a short sketch follows the component list below). The current stack comprises:

1. Spark Core Engine (includes task distribution, scheduling and I/O)

2. Spark SQL (now using DataFrames and Datasets)

3. Spark Streaming (micro-batch processing, often used in lambda architectures for incremental aggregation)

4. MLlib (machine learning library, 9x faster than Mahout)

5. SparkR (interface for connecting to a Spark cluster from RStudio; distributes data as data frames across the nodes)

6. GraphX (graph processing jobs over nodes and edges)
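As a rough sketch of the Dataset idea mentioned above (typed like an RDD, queried like a DataFrame), assuming Spark 2.x; the record type, values and app name here are hypothetical:

import org.apache.spark.sql.SparkSession

// Hypothetical record type; Spark derives an Encoder for case classes
case class Trip(city: String, distanceKm: Double)

val spark = SparkSession.builder.appName("dataset-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Typed like an RDD: a Dataset[Trip] of plain case-class objects
val trips = Seq(Trip("Miami", 12.3), Trip("Tampa", 4.1), Trip("Miami", 2.8)).toDS()

// Queried like a DataFrame: relational operations go through the Catalyst optimizer
trips.filter($"distanceKm" > 5).groupBy("city").count().show()

// Or fall back to RDD-style typed lambdas on the same Dataset
trips.map(t => t.distanceKm * 0.621).show()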

Recent versions of Spark can be run from Jupyter notebooks, Apache Zeppelin and RStudio, and the command shell now natively supports Java, Scala, Python and R. Earlier Java-based, batch-oriented technologies such as MapReduce and its abstractions (Hive, Pig, Mahout, etc.) are being phased out because of their slow and tedious workflows.

Legend

RDD (Resilient Distributed Dataset): a container of records of varying data types, spread across the cluster

DataFrame: a restricted form of RDD that organizes records into named columns with a schema, rather than arbitrary data types
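A small Scala sketch of the distinction, assuming a SparkSession named spark (the data and app name are hypothetical): the RDD holds partitioned records of an arbitrary type, while the DataFrame adds named columns and a schema.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("legend-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// RDD: partitioned records of an arbitrary type, here simple (String, Int) tuples
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// DataFrame: the same records organized into named columns with a schema
val df = rdd.toDF("key", "value")
df.groupBy("key").sum("value").show()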

Author:

Saad Sadiq, PhD candidate, College of Engineering, University of Miami, Coral Gables, FL