
AMS 560.01 Big Data Systems, Algorithms and Networks - Spring 2018


Vishal Jasrotia, Rahul Rane, Noopur Maheshwari

Assignment 4, Report 2
Spark SQL: Relational Data Processing in Spark
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia
Databricks Inc., MIT CSAIL, AMPLab (UC Berkeley)

Spark SQL is a new Apache Spark module that lets Spark programmers leverage the benefits of relational processing and lets SQL users call complex analytics libraries in Spark. It offers tight integration between relational and procedural processing and a highly extensible optimizer, Catalyst, built in Scala, which makes it easy to add composable rules, control code generation, and define extension points. Spark SQL is an evolution of both SQL-on-Spark systems and of Spark itself, offering richer APIs and optimizations. Big data applications require powerful processing techniques: users want to perform ETL, machine learning, graph processing, and other advanced analytics that are challenging to express in purely relational systems. Procedural frameworks such as MapReduce can express these workloads, but optimization usually requires manual intervention. Spark SQL bridges the gap between the two models with a DataFrame API that supports relational operations while evaluating them lazily, and it relies on the extensible Catalyst optimizer to support the wide range of data sources and algorithms found in big data. The DataFrame API integrates relational and procedural code within Spark: distributed collections of native Java/Python objects can be treated as DataFrames, enabling relational operations inside existing Spark programs. Machine learning libraries and other analytical tools can likewise operate on these distributed collections in Spark.
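As a minimal, hedged sketch of this interoperability (the record type, field names, and values below are invented for illustration and are not from the paper), a collection of native Scala objects can be distributed as an RDD, viewed as a DataFrame, and queried relationally:

import org.apache.spark.sql.SparkSession

// Hypothetical record type; the fields are illustrative only.
case class User(name: String, age: Int)

object DataFrameFromObjects {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-from-objects")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A distributed collection of JVM objects (an RDD)...
    val users = spark.sparkContext.parallelize(Seq(User("ann", 34), User("bob", 19)))

    // ...viewed as a DataFrame, so relational operators apply directly to it.
    users.toDF().filter($"age" > 21).show()

    spark.stop()
  }
}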
Spark offers a functional programming API in which users manipulate Resilient Distributed Datasets (RDDs) using operations such as map, filter, and reduce. RDDs are fault tolerant and are evaluated lazily according to a lineage, or logical plan, that describes how to compute each dataset. Spark SQL also addresses the limitation of the earlier Shark system, which could only be used to query external data stored in the Hive catalog and was thus not useful for relational queries on data inside a Spark program. The goals of Spark SQL were to support relational processing both within Spark programs and on external data sources, to provide high performance, to make it easy to add new data sources, and to enable extensions for advanced analytics algorithms such as graph processing and machine learning.
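A brief sketch of this functional API (the input values are made up for illustration): the program below chains map, filter, and reduce over an RDD, and nothing is computed until the reduce action runs:

import org.apache.spark.sql.SparkSession

object RddFunctionalApi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-functional-api")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("spark", "sql", "catalyst", "rdd"))

    // map/filter build up a lineage (logical plan) lazily;
    // the reduce action triggers the actual computation.
    val totalLength = words
      .map(_.length)
      .filter(_ > 3)
      .reduce(_ + _)

    println(s"total length of long words: $totalLength")
    spark.stop()
  }
}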
Spark SQL runs as a library on top of Spark and exposes SQL interfaces that can be accessed through JDBC/ODBC, as well as through the DataFrame API. The programming interface consists of the DataFrame API, the data model, operations on DataFrames, and querying of native datasets. DataFrames keep track of their schema and support relational operations that lead to more optimized execution. They can be constructed from tables or from existing RDDs and are manipulated lazily through various operations. Spark SQL uses a nested data model supporting primitive types as well as complex types such as structs, arrays, maps, and unions. Users can perform all common relational operations, such as select, where, join, and groupBy, on DataFrames in Scala, Java, Python, or R. These operators build up an abstract syntax tree of the expression, which is then passed to Catalyst for optimization. This approach makes complex Spark programs easier to write and debug than standalone relational query languages. Spark SQL also supports in-memory caching, user-defined functions, and user-defined types, such as a vector type for machine learning algorithms.
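As a hedged Scala sketch (column names and data are invented), the operators below only build up a query plan; Catalyst optimizes it when an action such as show() is called, and a user-defined function participates in the same plan:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object DataFrameOperations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-relational-ops")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical employee data; names and salaries are illustrative.
    val employees = Seq(
      ("alice", "eng", 90000),
      ("bob",   "eng", 85000),
      ("carol", "ops", 70000)
    ).toDF("name", "dept", "salary")

    // Each operator extends an expression tree; nothing runs yet.
    val highPaidByDept = employees
      .where($"salary" > 80000)
      .groupBy($"dept")
      .count()

    highPaidByDept.explain() // prints the plan Catalyst produced
    highPaidByDept.show()    // the action that triggers execution

    // A user-defined function is optimized as part of the same plan.
    val initial = udf((name: String) => name.take(1).toUpperCase)
    employees.select(initial($"name").as("initial")).show()

    spark.stop()
  }
}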
An important part of Spark SQL is Catalyst, the optimizer, which supports both rule-based and cost-based optimization. At its core, Catalyst contains a general-purpose library for representing trees and applying rules to manipulate them. On top of this it implements the phases of query processing: analysis of a logical plan obtained from a SQL abstract syntax tree or from a DataFrame object, logical plan optimization (for example constant folding and predicate or column pruning), physical planning, and code generation that compiles parts of the query to Java bytecode. Catalyst also offers extension points for external data sources and user-defined types and functions.
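To give a feel for rule-based tree rewriting without depending on Catalyst's internal classes, here is a self-contained toy expression tree (not Catalyst's real API) with a constant-folding rule expressed, as in the paper, as a pattern-matching transform applied over the tree:

// Toy expression tree; Catalyst's real TreeNode/Rule classes are richer.
sealed trait Expr
case class Literal(value: Int)          extends Expr
case class Attribute(name: String)      extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ToyRuleEngine {
  // Apply a rewrite rule bottom-up over the whole tree.
  def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withChildren = e match {
      case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withChildren, (x: Expr) => x)
  }

  // Constant folding, in the spirit of the paper's example rule.
  val constantFold: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
    case Add(left, Literal(0))       => left
    case Add(Literal(0), right)      => right
  }

  def main(args: Array[String]): Unit = {
    val plan = Add(Attribute("x"), Add(Literal(1), Literal(2)))
    println(transform(plan)(constantFold)) // Add(Attribute(x),Literal(3))
  }
}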
Advanced features of Spark SQL include a schema inference algorithm for JSON and other semistructured data, which is very common in big data environments because it is easy to produce and to add fields to over time. Second, Spark SQL integrates with machine learning pipelines, supporting large-scale operations such as feature extraction, normalization, and dimensionality reduction over DataFrames. Last, it supports query federation, using schemas to combine heterogeneous sources whose data may reside on different machines in different locations.
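A small, hedged example of the schema inference feature (the JSON fields are invented; this assumes a Spark version whose json reader accepts an in-memory Dataset[String]):

import org.apache.spark.sql.SparkSession

object JsonSchemaInference {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-schema-inference")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A few semistructured records with nested and optional fields.
    val jsonLines = Seq(
      """{"user": {"name": "ann", "age": 34}, "tags": ["a", "b"]}""",
      """{"user": {"name": "bob"}, "tags": []}"""
    ).toDS()

    // Spark SQL scans the records and infers a nested schema
    // (structs, arrays, nullable fields) without user-supplied DDL.
    val df = spark.read.json(jsonLines)
    df.printSchema()
    df.select($"user.name", $"tags").show()

    spark.stop()
  }
}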
Spark SQL's performance is better than Shark's and competitive with Impala's, largely because code generation in Catalyst reduces CPU overhead. The DataFrame API performs about 2x better than equivalent hand-written Scala RDD code and almost 18x better than the equivalent Python code. A two-stage pipeline written with DataFrames also runs almost 2x faster than the same workload expressed as a separate SQL query followed by Spark code.
Research applications built on Spark SQL include a generalized online aggregation system, in which queries execute over increasing fractions of the dataset so that users can view the progress of a running query and see approximate results whose accuracy improves as more data is processed. A second application inspects overlapping genomic regions using joins with inequality conditions; a specialized overlap join added through Spark SQL's extension points runs much faster than the naive nested-loop join a SQL engine would otherwise use. Among related work, Shark is closest to Spark SQL in offering a combination of relational queries and advanced analytics, and extensible optimizers such as EXODUS share the same goals as Catalyst.
Spark SQL, a new module in Apache Spark, provides rich integration between relational and procedural processing and offers benefits such as automatic optimization while still letting users write complex programs. It supports a wide range of features for analytics, machine learning, predictive models, and complex querying. Catalyst makes it easy to add optimization rules, data sources, data types, and user-defined functions. Spark SQL makes it simpler to write efficient data pipelines while offering substantial improvements over previous SQL-on-Spark engines.
