A
Seminar Report
on
Smart Card ID
SUBMITTED BY:
Jatin Kumar (1503310094)
Under the Guidance of:
Mr. Zatin Gupta
CERTIFICATE
Certified that the seminar work entitled “Smart Card ID” is a bona fide work carried out in
the eighth semester by “Jatin Kumar” in partial fulfilment of the requirements for the award
of Bachelor of Technology in Computer Science Engineering from Raj Kumar Goel Institute of
Technology, Ghaziabad, during the academic year 2018-2019.
SIGNATURE
COMPUTER SCIENCE HEAD OF DEPARTMENT
SIGNATURE
SEMINAR COORDINATOR
ACKNOWLEDGEMENT
The seminar report on “Smart Card ID” is the outcome of the guidance, moral support, and
devotion bestowed on me throughout my work. I acknowledge and express my profound sense
of gratitude and thanks to everybody who has been a source of inspiration during the
preparation of this seminar. First and foremost, I offer my sincere thanks, with innate
humility, to Mr. Zatin Gupta, the guide of my seminar, for providing help whenever I
needed it. I must also express my affectionate gratitude to Raj Kumar Goel Institute of
Technology for providing such a stimulating atmosphere and a wonderful work environment.
Jatin Kumar
ABSTRACT
In today’s world, carrying a number of plastic smart cards to establish our identity has
become an integral part of our routine lives. Identity establishment requires pre-stored,
readily available data about oneself that an administrator can authenticate against the
claimant’s personal information. There is a distinct requirement for a technological
solution for a nationwide, multipurpose identity for every citizen. A number of options
have been exercised by various countries, and every option has its own pros and cons.
However, it has been observed that in most cases the smart-card solution has been
preferred by users and administrators alike. The use of smart cards is so prevalent that,
in any profession, the identity of an individual is hardly considered complete without it.
The principal aim of this paper is to discuss the viability of smart-card technology as an
identity solution and its ability to perform various functions with the strong access
control that makes smart cards more reliable than other technologies. It outlines an
overview of smart-card technology along with its key applications. Security concerns of
smart cards are discussed through an algorithm based on an integer-division proposition.
The possibility of upgrading it alongside evolving technology offers it universal
acceptability as a means of identification. The ability of an administrator to store the
desired amount of information and compute multiple operations to authenticate a citizen
drives its widening acceptability, and an endeavour has been made in this paper to
explain this through a proposed system flow chart.
INTRODUCTION
Today one carries the burden of a wallet full of cards to establish one’s identity: an
official ID card, canteen cards, library cards, a driving licence, and so on. A smart ID
card has the potential to replace all of these with a single card that serves the desired
purpose. A variety of smart cards are available today, built with progressive technologies,
where developers use different data structures and standards for programming. In this
paper, we discuss the viability of smart cards, with their continuously evolving
technology, as a solution to the requirement for a nationwide multipurpose smart ID for
every citizen. Our aim is to propose a viable technological solution for a single
multipurpose smart ID card that does away with an individual having to carry multiple
cards. It will assist governments across the globe in better administration with a
cost-effective solution: a single smart ID card with multiple applications. It will also
require management of a large database, with processing and scalable computing to home in
on the desired ID. Data centres handling such big data are helping to reduce the delay and
cost of data processing and to improve quality of service, including certain discrete
internet-based services.
Such a system ultimately rests on Big Data. Big Data is a term used to describe a
collection of data that is huge in size and yet growing exponentially with time. In short,
such data is so large and complex that none of the traditional data management tools can
store or process it efficiently.
The New York Stock Exchange, shown in figure 1.1, generates about one terabyte of new
trade data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the
social media site Facebook (figure 1.2) every day. This data is mainly generated through
photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes.
Such data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed
'structured' data. Over time, talent in computer science has achieved great success in
developing techniques for working with such data (where the format is well known in
advance) and in deriving value from it. However, we now foresee issues when the size of
such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Table 1.1
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition
to its huge size, unstructured data poses multiple challenges when it comes to processing
it to derive value. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc. Organizations today
have a wealth of data available to them but, unfortunately, do not know how to derive
value from it, since the data is in a raw, unstructured format.
FIGURE 1.4
Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data looks structured
in form, but it is not actually defined by, for example, a table definition in a
relational DBMS. An example of semi-structured data is data represented in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
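For illustration, such records can be parsed with Python's standard library; the wrapping <recs> root element below is added here only to make the fragment well-formed XML:

```python
import xml.etree.ElementTree as ET

# The four <rec> records above, wrapped in a single root element so
# that the fragment forms a well-formed XML document.
xml_data = """<recs>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
</recs>"""

root = ET.fromstring(xml_data)
# Each record carries its own schema in its tags, so fields can be
# extracted by name even though no table definition exists anywhere.
people = [
    {"name": r.findtext("name"), "sex": r.findtext("sex"),
     "age": int(r.findtext("age"))}
    for r in root.findall("rec")
]
```

This is precisely what makes the data "semi-structured": the tags give it enough structure to be machine-readable, yet no fixed schema is declared in advance.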
FIGURE 1.5
(i) Volume – The name Big Data itself is related to an enormous size. The size of data
plays a very crucial role in determining the value that can be derived from it. Whether
particular data can actually be considered Big Data or not also depends on its volume.
Hence, 'Volume' is one characteristic that needs to be considered when dealing with
Big Data.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. In earlier days, spreadsheets and databases were the only
sources of data considered by most applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis
applications. This variety of unstructured data poses certain issues for storing, mining,
and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the
data is generated and processed to meet demands determines the real potential in the data.
Velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, social media sites, sensors, mobile devices, etc. The flow of
data is massive and continuous.
(iv) Variability – This refers to the inconsistency that the data can show at times,
hampering the process of handling and managing the data effectively.
Access to social data from search engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.
Big Data technologies can be used for creating a staging area or landing zone for new data
before identifying what should be moved to the data warehouse. In addition, such
integration of Big Data technologies and a data warehouse helps an organization to
offload infrequently accessed data.
SUMMARY
Big Data is a term used to describe a collection of data that is huge in size and yet
growing exponentially with time.
Examples of Big Data generation include stock exchanges, social media sites, jet
engines, etc.
Big Data could be 1) Structured, 2) Unstructured, or 3) Semi-structured.
Volume, Variety, Velocity, and Variability are a few characteristics of Big Data.
Improved customer service, better operational efficiency, and better decision-making
are a few advantages of Big Data.
This chapter deals with the big data processing frameworks. Processing frameworks
compute over the data in the system, either by reading from non-volatile storage or as it is
ingested into the system. Computing over data is the process of extracting information
and insight from large quantities of individual data points.
1. Batch-only frameworks
a. Apache Hadoop
2. Stream-only frameworks
a. Apache Storm
b. Apache Samza
3. Hybrid frameworks
a. Apache Spark
b. Apache Flink
Processing frameworks and processing engines are responsible for computing over
data in a data system. While there is no authoritative definition setting apart "engines"
from "frameworks", it is sometimes useful to define the former as the actual component
responsible for operating on data and the latter as a set of components designed to do the
same.
For instance, Apache Hadoop can be considered a processing framework with MapReduce as its
default processing engine. Engines and frameworks can often be swapped out or used in
tandem. For instance, Apache Spark, another framework, can hook into Hadoop to replace
MapReduce. This interoperability between components is one reason that big data systems
have great flexibility.
While the systems which handle this stage of the data life cycle can be complex, the goals
on a broad level are very similar: operate over data in order to increase understanding,
surface patterns, and gain insight into complex interactions.
These processing frameworks are grouped by the state of the data they are designed to
handle. Some systems handle data in batches, while others process data in a continuous
stream as it flows into the system. Still others can handle data in either of these ways.
Batch processing has a long history within the data world. Batch processing involves
operating over a large, static dataset and returning the result at a later time when the
computation is complete.
Batch processing is well-suited for calculations where access to a complete set of records
is required. For instance, when calculating totals and averages, datasets must be treated
holistically instead of as a collection of individual records. These operations require that
state be maintained for the duration of the calculations.
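A minimal sketch (with invented values) of why such calculations are inherently batch-oriented:

```python
# Totals and averages need the complete dataset: the result is only
# correct once every record has been seen, so state must span the
# whole computation rather than a single record.
records = [12.0, 7.5, 3.25, 9.75, 17.5]  # illustrative values

total = sum(records)            # requires visiting every record
average = total / len(records)  # cannot be emitted per record
```

No prefix of the dataset yields the final answer, which is why these operations must treat the dataset holistically.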
Tasks that require very large volumes of data are often best handled by batch operations.
Whether the datasets are processed directly from permanent storage or loaded into
memory, batch systems are built with large quantities in mind and have the resources to
handle them. Because batch processing excels at handling large volumes of persistent
data, it frequently is used with historical data.
The trade-off for handling large quantities of data is longer computation time. Because of
this, batch processing is not appropriate in situations where processing time is especially
significant.
Apache Hadoop
Modern versions of Hadoop are composed of several components or layers that work
together to process batch data:
HDFS: HDFS is the distributed filesystem layer that coordinates storage and
replication across the cluster nodes. HDFS ensures that data remains available in
spite of inevitable host failures. It is used as the source of data, to store
intermediate processing results, and to persist the final calculated results.
YARN: YARN, which stands for Yet Another Resource Negotiator, is the cluster
coordinating component of the Hadoop stack. It is responsible for coordinating
and managing the underlying resources and scheduling jobs to be run. YARN
makes it possible to run much more diverse workloads on a Hadoop cluster than
was possible in earlier iterations by acting as an interface to the cluster resources.
MapReduce: MapReduce is Hadoop's native batch processing engine.
Because this methodology heavily leverages permanent storage, reading and writing
multiple times per task, it tends to be fairly slow. On the other hand, since disk space is
typically one of the most abundant server resources, it means that MapReduce can handle
enormous datasets. This also means that Hadoop's MapReduce can typically run on less
expensive hardware than some alternatives since it does not attempt to store everything in
memory. MapReduce has incredible scalability potential and has been used in production
on tens of thousands of nodes.
As a target for development, MapReduce is known for having a rather steep learning
curve. Other additions to the Hadoop ecosystem can reduce the impact of this to varying
degrees, but it can still be a factor in quickly implementing an idea on a Hadoop cluster.
Hadoop has an extensive ecosystem, with the Hadoop cluster itself frequently used as a
building block for other software. Many other processing frameworks and engines
have Hadoop integrations to utilize HDFS and the YARN resource manager.
Stream processing systems compute over data as it enters the system. This requires a
different processing model than the batch paradigm. Instead of defining operations to
apply to an entire dataset, stream processors define operations that will be applied to each
individual data item as it passes through the system.
The datasets in stream processing are considered "unbounded". This has a few important
implications:
The total dataset is only defined as the amount of data that has entered the system
so far.
The working dataset is perhaps more relevant, and is limited to a single item at a
time.
Processing is event-based and does not "end" until explicitly stopped. Results are
immediately available and will be continually updated as new data arrives.
Stream processing systems can handle a nearly unlimited amount of data, but they only
process one item (true stream processing) or very few items (micro-batch processing) at a
time, with minimal state being maintained between records. While most systems provide
methods of maintaining some state, stream processing is highly optimized for more
functional processing with few side effects.
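The per-item model described above can be sketched in plain Python; this is an illustrative sketch, not the API of any particular framework:

```python
def stream_processor(stream, operation):
    """Apply an operation to each item as it arrives: no item is
    buffered, and no state is carried between records."""
    for item in stream:
        yield operation(item)

# An unbounded source would normally feed this; a short list stands in.
incoming = iter([1, 2, 3, 4])
results = list(stream_processor(incoming, lambda x: x * x))
```

Each result is available as soon as its input arrives, which is the defining contrast with the batch model.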
Functional operations focus on discrete steps that have limited state or side-effects.
Performing the same operation on the same piece of data will produce the same output
independent of other factors. This kind of processing fits well with streams because state
between items is usually some combination of difficult, limited, and sometimes
undesirable. So while some type of state management is usually possible, these
frameworks are much simpler and more efficient in their absence.
This type of processing lends itself to certain types of workloads. Processing with near
real-time requirements is well served by the streaming model. Analytics, server or
application error logging, and other time-based metrics are a natural fit because reacting
to changes in these areas can be critical to business functions. Stream processing is a
good fit for data where you must respond to changes or spikes and where you're
interested in trends over time.
Apache Storm
Apache Storm is a stream processing framework that focuses on extremely low latency and is
perhaps the best option for workloads that require near-real-time processing. It can
handle very large quantities of data and deliver results with less latency than other
solutions.
The idea behind Storm is to define small, discrete operations using the above components
and then compose them into a topology. By default, Storm offers at-least-once processing
guarantees, meaning that it can guarantee that each message is processed at least once,
but there may be duplicates in some failure scenarios. Storm does not guarantee that
messages will be processed in order.
Storm users typically recommend using Core Storm whenever possible to avoid those
penalties. With that in mind, Trident's guarantee to process items exactly once is useful
in cases where the system cannot intelligently handle duplicate messages. Trident is also
the only choice within Storm when you need to maintain state between items, such as when
counting how many users click a link within an hour. Trident gives Storm flexibility,
even though it does not play to the framework's natural strengths.
Stream batches: These are micro-batches of stream data that are chunked in
order to provide batch processing semantics.
Operations: These are batch procedures that can be performed on the data.
Storm with Trident gives you the option to use micro-batches instead of pure stream
processing. While this gives users greater flexibility to shape the tool to an intended use,
it also tends to negate some of the software's biggest advantages over other solutions.
That being said, having a choice for the stream processing style is still helpful.
Core Storm does not offer ordering guarantees for messages. Core Storm offers
at-least-once processing guarantees, meaning that the processing of each message can be
guaranteed but duplicates may occur. Trident offers exactly-once guarantees and can offer
ordering between batches, but not within them.
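A small sketch of how an application can cope with at-least-once delivery: by deduplicating on a message id, redelivered duplicates have no effect, which is how exactly-once semantics appear to the application. The message ids and values here are invented for illustration:

```python
def consume(messages, seen_ids, store):
    """Consume messages that may contain redelivered duplicates
    (at-least-once delivery), recording each message id only once."""
    for msg_id, value in messages:
        if msg_id in seen_ids:      # duplicate from a retry: skip it
            continue
        seen_ids.add(msg_id)
        store[msg_id] = value

# Message 2 is delivered twice, as can happen after a failure.
delivered = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
seen, store = set(), {}
consume(delivered, seen, store)
```

The cost of this approach is exactly the state (the set of seen ids) that pure stream processing tries to avoid, which is why exactly-once guarantees carry a performance penalty.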
Apache Samza
Apache Samza is a stream processing framework that is tightly tied to the Apache Kafka
messaging system. While Kafka can be used by many stream processing systems, Samza
is designed specifically to take advantage of Kafka's unique architecture and guarantees.
It uses Kafka to provide fault tolerance, buffering, and state storage.
Samza uses YARN for resource negotiation. This means that by default, a Hadoop cluster
is required (at least HDFS and YARN), but it also means that Samza can rely on the rich
features built into YARN.
Samza relies on Kafka's semantics to define the way that streams are handled. Kafka uses
the following concepts when dealing with data:
Topics: Each stream of data entering a Kafka system is called a topic. A topic is
basically a stream of related information that consumers can subscribe to.
Partitions: In order to distribute a topic among nodes, Kafka divides the
incoming messages into partitions. The partition divisions are based on a key such
that each message with the same key is guaranteed to be sent to the same
partition. Partitions have guaranteed ordering.
Brokers: The individual nodes that make up a Kafka cluster are called brokers.
Producer: Any component writing to a Kafka topic is called a producer. The
producer provides the key that is used to partition a topic.
Consumers: Consumers are any component that reads from a Kafka topic.
Consumers are responsible for maintaining information about their own offset,
so that they are aware of which records have been processed if a failure occurs.
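The key-to-partition mapping can be sketched as follows; zlib.crc32 here merely stands in for Kafka's actual partitioner, since the only property that matters for the guarantee above is that equal keys always map to the same partition:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index. The hash is
    deterministic, so every message with the same key lands on the
    same partition, preserving per-key ordering."""
    return zlib.crc32(key.encode()) % num_partitions

p1 = partition_for("user-42", 4)   # hypothetical key and partition count
p2 = partition_for("user-42", 4)
# p1 == p2: the same key always routes to the same partition.
```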
Because Kafka represents an immutable log, Samza deals with immutable streams. This means
that any transformations create new streams that are consumed by other components without
affecting the initial stream.
Samza's reliance on a Kafka-like queuing system at first glance might seem restrictive.
However, it affords the system some unique guarantees and features not common in other
stream processing systems.
For example, Kafka already offers replicated storage of data that can be accessed with
low latency. It also provides a very easy and inexpensive multi-subscriber model to each
individual data partition. All output, including intermediate results, is also written to
Kafka and can be independently consumed by downstream stages.
In many ways, this tight reliance on Kafka mirrors the way that the MapReduce engine
frequently references HDFS. While referencing HDFS between each calculation leads to
some serious performance issues when batch processing, it solves a number of problems
when stream processing.
Samza's strong relationship to Kafka allows the processing steps themselves to be very
loosely tied together. An arbitrary number of subscribers can be added to the output of
any step without prior coordination. This can be very useful for organizations where
multiple teams might need to access similar data. Teams can all subscribe to the topic of
data entering the system, or can easily subscribe to topics created by other teams that
have undergone some processing. This can be done without adding additional stress on
load-sensitive infrastructure like databases.
Samza offers high-level abstractions that are in many ways easier to work with than the
primitives provided by systems like Storm. However, Samza only supports JVM languages at
this time, meaning that it does not have the same language flexibility as Storm.
As you will see, the way this is achieved varies significantly between Spark and Flink,
the two frameworks we will discuss. This is largely a function of how the two processing
paradigms are brought together and what assumptions are made about the relationship
between fixed and unfixed datasets.
While projects focused on one processing type may be a close fit for specific use-cases,
the hybrid frameworks attempt to offer a general solution for data processing. They not
only provide methods for processing over data, they have their own integrations,
libraries, and tooling for doing things like graph analysis, machine learning, and
interactive querying.
Apache Spark
Apache Spark is a next-generation batch processing framework with stream processing
capabilities. Built using many of the same principles as Hadoop's MapReduce engine, Spark
focuses primarily on speeding up batch processing workloads by offering full in-memory
computation and processing optimization.
Spark can be deployed as a standalone cluster (if paired with a capable storage layer) or
can hook into Hadoop as an alternative to the MapReduce engine.
Beyond the capabilities of the engine itself, Spark also has an ecosystem of
libraries that can be used for machine learning, interactive queries, etc. Spark
tasks are almost universally acknowledged to be easier to write than
MapReduce, which can have significant implications for productivity.
Adapting the batch methodology for stream processing involves buffering the
data as it enters the system. The buffer allows it to handle a high volume of
incoming data, increasing overall throughput, but waiting to flush the buffer also
leads to a significant increase in latency. This means that Spark Streaming might
not be appropriate for processing where low latency is imperative.
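The buffering trade-off described above can be sketched as a simple micro-batcher; this is an illustrative sketch, not Spark Streaming's actual implementation:

```python
def micro_batches(stream, batch_size):
    """Buffer incoming items into fixed-size batches. Throughput per
    batch rises, but the first item of each batch must wait until the
    batch is full before anything is emitted: the latency cost."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final, possibly partial, batch
        yield batch

batches = list(micro_batches(iter(range(7)), 3))
```

Each batch can then be handed to a batch engine, which is exactly how the buffering approach turns a stream workload into a sequence of small batch jobs.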
Since RAM is generally more expensive than disk space, Spark can cost more to
run than disk-based systems. However, the increased processing speed means that
tasks can complete much faster, which may completely offset the costs when
operating in an environment where you pay for resources hourly.
One other consequence of the in-memory design of Spark is that resource scarcity
can be an issue when deployed on shared clusters. In comparison to Hadoop's
MapReduce, Spark uses significantly more resources, which can interfere with
other tasks that might be trying to use the cluster at the time. In essence, Spark
might be a less considerate neighbor than other components that can operate on
the Hadoop stack.
Apache Flink
Apache Flink is a stream processing framework that can also handle batch tasks. It
considers batches to simply be data streams with finite boundaries, and thus treats batch
processing as a subset of stream processing. This stream-first approach to all processing
has a number of interesting side effects.
This stream-first approach has been called the Kappa architecture, in contrast to the
more widely known Lambda architecture (where batching is used as the primary
processing method with streams used to supplement and provide early but unrefined
results). Kappa architecture, where streams are used for everything, simplifies the model
and has only recently become possible as stream processing engines have grown more
sophisticated.
Streams are immutable, unbounded datasets that flow through the system
Operators are functions that operate on data streams to produce other streams
Sources are the entry point for streams entering the system
Sinks are the place where streams flow out of the Flink system. They might
represent a database or a connector to another system
Stream processing tasks take snapshots at set points during their computation to use for
recovery in case of problems. For storing state, Flink can work with a number of state
backends with varying levels of complexity and persistence.
Additionally, Flink's stream processing is able to understand the concept of "event time",
meaning the time that the event actually occurred, and can handle sessions as well. This
means that it can guarantee ordering and grouping in some interesting ways.
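A minimal illustration of event time versus arrival order: events that arrive out of order can be reordered by their embedded event-time stamps. The timestamps and values here are invented, and real engines like Flink do this with buffering and watermarks rather than a simple sort:

```python
# Arrival order differs from the order in which the events actually
# occurred; each event carries the time it happened ("event time").
arrived = [
    {"event_time": 3, "value": "c"},
    {"event_time": 1, "value": "a"},
    {"event_time": 2, "value": "b"},
]

# Reordering by the embedded stamp recovers the true sequence.
in_event_order = sorted(arrived, key=lambda e: e["event_time"])
values = [e["value"] for e in in_event_order]
```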
Flink offers some optimizations for batch workloads. For instance, since batch operations
are backed by persistent storage, Flink removes snapshotting from batch loads. Data is
still recoverable, but normal processing completes faster.
Another optimization involves breaking up batch tasks so that stages and components are
only involved when needed. This helps Flink play well with other users of the cluster.
Preemptive analysis of the tasks gives Flink the ability to also optimize by seeing
the entire set of operations, the size of the data set, and the requirements of steps
coming down the line.
Flink manages many things by itself. Somewhat unconventionally, it manages its own
memory instead of relying on the native Java garbage collection mechanisms for
performance reasons. Unlike Spark, Flink does not require manual optimization and
adjustment when the characteristics of the data it processes change. It handles data
partitioning and caching automatically as well.
Flink analyzes its work and optimizes tasks in a number of ways. Part of this analysis is
similar to what SQL query planners do within relational databases, mapping out the most
effective way to implement a given task. It is able to parallelize stages that can be
completed in parallel, while bringing data together for blocking tasks. For iterative
tasks, Flink attempts to do computation on the nodes where the data is stored, for
performance reasons. It can also do "delta iteration", or iteration on only the portions
of the data that have changed.
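Delta iteration can be sketched with a classic example: propagating minimum labels through a tiny graph while revisiting only the nodes whose labels changed in the previous round. The graph here is invented for illustration, and this single-process sketch only mimics the idea behind Flink's distributed implementation:

```python
def delta_min_labels(neighbors, labels):
    """Propagate the minimum label through a graph, re-examining only
    the nodes whose labels changed in the previous round (the delta)."""
    worklist = set(labels)             # every node starts out "dirty"
    while worklist:
        next_work = set()
        for node in worklist:
            for nb in neighbors[node]:
                if labels[node] < labels[nb]:
                    labels[nb] = labels[node]
                    next_work.add(nb)  # only changed nodes re-run
        worklist = next_work           # converged nodes are skipped
    return labels

# Chain 0-1-2 plus isolated node 3; labels converge to each
# connected component's minimum label.
graph = {0: [1], 1: [0, 2], 2: [1], 3: []}
final = delta_min_labels(graph, {0: 0, 1: 1, 2: 2, 3: 3})
```

Since node 3 never changes, it drops out of the worklist immediately, which is exactly the saving delta iteration offers over recomputing every element each round.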
In terms of user tooling, Flink offers a web-based scheduling view to easily manage tasks
and view the system. Users can also display the optimization plan for submitted tasks to
see how it will actually be implemented on the cluster. For analysis tasks, Flink offers
SQL-style querying, graph processing and machine learning libraries, and in-memory
computation.
Flink operates well with other components. It is written to be a good neighbor if used
within a Hadoop stack, taking up only the necessary resources at any given time. It
integrates with YARN, HDFS, and Kafka easily. Flink can run tasks written for other
processing frameworks like Hadoop and Storm with compatibility packages.
One of the largest drawbacks of Flink at the moment is that it is still a very young project.
Large scale deployments in the wild are still not as common as other processing
frameworks and there hasn't been much research into Flink's scaling limitations. With the
rapid development cycle and features like the compatibility packages, there may begin to
be more Flink deployments as organizations get the chance to experiment with it.
Spark is a general-purpose distributed data processing engine that is suitable for use in a
wide range of circumstances. On top of the Spark core data processing engine, there are
libraries for SQL, machine learning, graph computation, and stream processing, which
can be used together in an application. Programming languages supported by Spark
include: Java, Python, Scala, and R. Application developers and data scientists
incorporate Spark into their applications to rapidly query, analyze, and transform data at
scale. Tasks most frequently associated with Spark include ETL and SQL batch jobs
across large data sets, processing of streaming data from sensors, IoT, or financial
systems, and machine learning tasks.
FIGURE 3.1
History
In order to understand Spark, it helps to understand its history. Before Spark, there was
MapReduce, a resilient distributed processing framework, which enabled Google to index
the exploding volume of content on the web, across large clusters of commodity servers.
FIGURE 3.2
1. Distribute data: when a data file is uploaded into the cluster, it is split into
chunks, called data blocks, and distributed amongst the data nodes and replicated
across the cluster.
2. Distribute computation: users specify a map function that processes a key/value
pair to generate a set of intermediate key/value pairs and a reduce function that
merges all intermediate values associated with the same intermediate key.
Programs written in this functional style are automatically parallelized and
executed on a large cluster of commodity machines in the following way:
o The mapping process runs on each assigned data node, working only on its
block of data from a distributed file.
o The results from the mapping processes are sent to the reducers in a
process called "shuffle and sort": key/value pairs from the mappers are
sorted by key, partitioned by the number of reducers, and then sent
across the network and written to key sorted "sequence files" on the
reducer nodes.
o The reducer process executes on its assigned node and works only on its
subset of the data (its sequence file). The output from the reducer
process is written to an output file.
3. Tolerate faults: both data and computation can tolerate failures by failing over
to another node for data or processing.
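The map, shuffle-and-sort, and reduce steps above can be sketched in miniature as a single-process word count; the real framework distributes each phase across the nodes of the cluster:

```python
from itertools import groupby

# 1. Map: emit an intermediate (word, 1) pair for every word in each block.
def map_phase(blocks):
    return [(word, 1) for block in blocks for word in block.split()]

# 2. Shuffle and sort: sort the pairs by key and group them, as the
#    framework does between the map and reduce phases.
def shuffle_sort(pairs):
    pairs.sort(key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

# 3. Reduce: merge all intermediate values associated with the same key.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

blocks = ["big data big", "data pipeline"]   # two illustrative data blocks
counts = reduce_phase(shuffle_sort(map_phase(blocks)))
```

In the distributed setting, each block's map runs on the node holding that block, and the grouped pairs travel across the network to the reducer nodes.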
FIGURE 3.3
Some iterative algorithms, like PageRank, which Google used to rank websites in their
search engine results, require chaining multiple MapReduce jobs together, which causes a
lot of reading and writing to disk. When multiple MapReduce jobs are chained together,
for each MapReduce job, data is read from a distributed file block into a map process,
written to and read from a SequenceFile in between, and then written to an output file
from a reducer process.
FIGURE 3.4
A year after Google published a white paper describing the MapReduce framework
(2004), Doug Cutting and Mike Cafarella created Apache Hadoop.
Apache Spark™ began life in 2009 as a project within the AMPLab at the University of
California, Berkeley. Spark became an incubated project of the Apache Software
Foundation in 2013, and it was promoted early in 2014 to become one of the
Foundation’s top-level projects. Spark is currently one of the most active projects
managed by the Foundation, and the community that has grown up around the project
includes both prolific individual contributors and well-funded corporate backers, such as
Databricks, IBM, and China’s Huawei.
The goal of the Spark project was to keep the benefits of MapReduce’s scalable,
distributed, fault-tolerant processing framework, while making it more efficient and
easier to use. The advantages of Spark over MapReduce are:
FIGURE 3.5
Spark also has a local mode, where the driver and executors run as threads on your
computer instead of a cluster, which is useful for developing your applications from a
personal computer.
Stream processing: From log files to sensor data, application developers are increasingly
having to cope with "streams" of data. This data arrives in a steady stream, often from
multiple sources simultaneously. While it is certainly feasible to store these data streams
on disk and analyze them retrospectively, it can sometimes be sensible or important to
process and act upon the data as it arrives. Streams of data related to financial
transactions, for example, can be processed in real time to identify, and refuse,
potentially fraudulent transactions.
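A minimal sketch of this idea in plain Python (not Spark Streaming itself; micro_batches and flag_fraud are hypothetical names) groups arriving records into small batches and refuses over-limit transactions as each batch is processed:

```python
def micro_batches(stream, batch_size=3):
    """Group an incoming stream into small batches, roughly how a
    micro-batch engine turns a continuous stream into work units."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def flag_fraud(batch, limit=1000):
    # Act on data as it arrives: flag transactions over a limit.
    return [txn for txn in batch if txn["amount"] > limit]

stream = [{"id": i, "amount": a}
          for i, a in enumerate([50, 2000, 75, 30, 9000])]
flagged = [t for b in micro_batches(stream) for t in flag_fraud(b)]
```

The point is that each batch is acted on as soon as it is full, instead of storing the whole stream to disk and analyzing it retrospectively.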
Machine learning: As data volumes grow, machine learning approaches become more
feasible and increasingly accurate. Software can be trained to identify and act upon
triggers within well-understood data sets before applying the same solutions to new and
unknown data. Spark’s ability to store data in memory and rapidly run repeated queries
makes it a good choice for training machine learning algorithms. Running broadly similar
queries again and again, at scale, significantly reduces the time required to go through a
set of possible solutions in order to find the most efficient algorithms.
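The benefit of keeping a derived data set in memory across repeated queries can be illustrated with a toy Python class. This is a conceptual sketch, not Spark's API; CachedDataset is an invented name:

```python
class CachedDataset:
    """Toy illustration of why in-memory caching speeds up
    repeated queries over the same derived data set."""
    def __init__(self, records, transform):
        self.records = records
        self.transform = transform
        self._cache = None
        self.compute_count = 0  # how many times the transform ran

    def materialize(self):
        # Compute the derived data once, then serve it from memory.
        if self._cache is None:
            self.compute_count += 1
            self._cache = [self.transform(r) for r in self.records]
        return self._cache

data = CachedDataset(range(5), lambda x: x * x)
first = data.materialize()
second = data.materialize()  # served from memory, no recompute
```

Training loops that scan the same data set many times pay the expensive transformation only once, which is the essence of the speedup described above.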
Data integration: Data produced by different systems across a business is rarely clean or
consistent enough to simply and easily be combined for reporting or analysis. Extract,
transform, and load (ETL) processes are often used to pull data from different systems,
clean and standardize it, and then load it into a separate system for analysis. Spark and
Hadoop are increasingly being used to reduce the cost and time required for this ETL
process.
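A toy extract-transform-load step in plain Python shows the shape of such a pipeline; the two "systems", their record fields, and the transform helper are all invented for illustration:

```python
# Extract: records pulled from two systems with inconsistent
# field names, casing, and value types.
raw_crm = [{"name": " Alice ", "spend": "120.50"}]
raw_web = [{"NAME": "BOB", "SPEND": 80}]

def transform(record):
    # Normalize keys, trim and title-case names, coerce spend to float.
    rec = {k.lower(): v for k, v in record.items()}
    return {"name": rec["name"].strip().title(),
            "spend": float(rec["spend"])}

# Load: a single clean, analysis-ready collection.
warehouse = [transform(r) for r in raw_crm + raw_web]
```

In a real deployment the extract and load ends would be databases or files and the transform would run distributed across a cluster, but the clean-and-standardize step in the middle is the same idea.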
A wide range of technology vendors have been quick to support Spark, recognizing the
opportunity to extend their existing big data products into areas where Spark delivers real
value, such as interactive querying and machine learning. Well-known companies such as
IBM and Huawei have invested significant sums in the technology, and a growing
number of startups are building businesses that depend in whole or in part upon Spark.
For example, in 2013 the Berkeley team responsible for creating Spark founded
Databricks, which provides a hosted end-to-end data platform powered by Spark. The
company is well-funded, having received $247 million across four rounds of investment
in 2013, 2014, 2016 and 2017, and Databricks employees continue to play a prominent
role in improving and extending the open source code of the Apache Spark project.
The major Hadoop vendors, including MapR, Cloudera, and Hortonworks, have all
moved to support YARN-based Spark alongside their existing products, and each vendor
is working to add value for its customers. Elsewhere, IBM, Huawei, and others have all
made significant investments in Apache Spark, integrating it into their own products and
contributing enhancements and extensions back to the Apache project. Web-based
companies, like Chinese search engine Baidu, e-commerce operation Taobao, and social
networking company Tencent, all run Spark-based operations at scale, with Tencent’s
800 million active users reportedly generating over 700 TB of data per day for processing
on a cluster of more than 8,000 compute nodes.
There are many reasons to choose Spark, but the following three are key:
Simplicity: Spark’s capabilities are accessible via a set of rich APIs, all designed
specifically for interacting quickly and easily with data at scale. These APIs are well-
documented and structured in a way that makes it straightforward for data scientists and
application developers to quickly put Spark to work.
Speed: Spark is designed for speed, operating both in memory and on disk. Using Spark,
a team from Databricks tied for first place with a team from the University of California,
San Diego, in the 2014 Daytona GraySort benchmarking challenge
(https://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html).
The challenge involves processing a static data set; the Databricks team was able to
process 100 terabytes of data stored on solid-state drives in just 23 minutes, while the
previous winner had taken 72 minutes using Hadoop and a different cluster configuration.
Spark can perform even better when supporting interactive queries of data stored in
memory. In those situations, there are claims that Spark can be 100 times faster than
Hadoop’s MapReduce.
Much of Spark's power lies in its ability to combine very different techniques and
processes together into a single, coherent whole. Outside Spark, the discrete tasks of
selecting data, transforming that data in various ways, and analyzing the transformed
results might easily require a series of separate processing frameworks, such as Apache
Oozie. Spark, on the other hand, offers the ability to combine these together, crossing
boundaries between batch, streaming, and interactive workflows in ways that make the
user more productive.
Spark jobs perform multiple operations consecutively, in memory, spilling to
disk only when required by memory limitations. Spark simplifies the management of these
disparate processes, offering an integrated whole – a data pipeline that is easier to
configure, easier to run, and easier to maintain. In use cases such as ETL, these pipelines
can become extremely rich and complex, combining large numbers of inputs and a wide
range of processing steps into a unified whole that consistently delivers the desired result.
SUMMARY
1. This chapter introduced Apache Spark and its history and explored some of the areas
in which its particular set of capabilities shows the most promise.
Performance
There’s no lack of information on the Internet about how fast Spark is compared to
MapReduce. The problem with comparing the two is that they perform processing
differently, which is covered in the Data Processing section. The reason that Spark is so
fast is that it processes everything in memory, although it can also use disk for data that
doesn't all fit into memory.
Spark’s in-memory processing delivers near real-time analytics for data from marketing
campaigns, machine learning, Internet of Things sensors, log monitoring, security
analytics, and social media sites. MapReduce, by contrast, uses batch processing and was
never built for blinding speed. It was originally set up to continuously gather
information from websites, and there were no requirements for this data in or near real
time.
Ease of Use
Spark is well known for its performance, but it’s also somewhat well known for its ease
of use in that it comes with user-friendly APIs for Scala (its native language), Java,
Python, and Spark SQL. Spark SQL is very similar to SQL-92, so there's almost no
learning curve required in order to use it.
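To show the flavor of such SQL-92 style queries, here is a standard GROUP BY query run through Python's built-in sqlite3 module. The same SELECT statement would be recognizable to anyone writing Spark SQL; the events table and its rows are invented for illustration:

```python
import sqlite3

# An in-memory database standing in for a registered Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 7), ("alice", 2)])

# A plain SQL-92 aggregate query: total clicks per user.
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events "
    "GROUP BY user ORDER BY total DESC").fetchall()
```

Because the dialect stays this close to standard SQL, analysts who already know SQL can be productive in Spark SQL almost immediately.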
Spark also has an interactive mode so that developers and users alike can have immediate
feedback for queries and other actions. MapReduce has no interactive mode, but add-ons
such as Hive and Pig make working with MapReduce a little easier for adopters.
Costs
Both MapReduce and Spark are Apache projects, which means that they’re open source
and free software products. While there’s no cost for the software, there are costs
associated with running either platform in personnel and in hardware. Both products are
designed to run on commodity hardware, such as low-cost, so-called white-box server
systems.
MapReduce and Spark run on the same hardware, so where are the cost differences
between the two solutions? MapReduce uses standard amounts of memory because its
processing is disk-based, so a company will have to purchase faster disks and a lot of disk
space to run MapReduce. MapReduce also requires more systems to distribute the disk
I/O over multiple systems.
Spark requires a lot of memory, but can deal with a standard amount of disk that runs at
standard speeds. Some users have complained about temporary files and their cleanup.
Typically these temporary files are kept for seven days to speed up any processing on the
same data sets. Disk space is a relatively inexpensive commodity, and since Spark does
not use disk I/O for processing, the disk space used can be provided via SAN or NAS.
It is true, however, that Spark systems cost more because of the large amounts of RAM
required to run everything in memory. But what’s also true is that Spark’s technology
reduces the number of required systems. So, you have significantly fewer systems that
cost more. There’s probably a point at which Spark actually reduces costs per unit of
computation even with the additional RAM requirement.
To illustrate, “Spark has been shown to work well up to petabytes. It has been used to
sort 100 TB of data 3X faster than Hadoop MapReduce on one-tenth of the machines.”
This feat won Spark the 2014 Daytona GraySort Benchmark.
Compatibility
MapReduce and Spark are compatible with each other, and Spark shares all of
MapReduce's compatibility with data sources, file formats, and business intelligence
tools via JDBC and ODBC.
Spark also includes its own graph computation library, GraphX. GraphX allows users to
view the same data as graphs and as collections. Users can also transform and join graphs
with Resilient Distributed Datasets (RDDs), discussed in the Fault Tolerance section.
Fault Tolerance
For fault tolerance, MapReduce and Spark resolve the problem from two different
directions. MapReduce uses TaskTrackers that provide heartbeats to the JobTracker. If a
heartbeat is missed then the JobTracker reschedules all pending and in-progress
operations to another TaskTracker. This method is effective in providing fault tolerance;
however, it can significantly increase the completion times for operations that have even
a single failure.
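The reschedule-on-missed-heartbeat idea can be mimicked with a few lines of Python. This is a conceptual sketch, not Hadoop's actual scheduler, and all the names here are hypothetical:

```python
def run_with_reschedule(task, trackers):
    """Toy JobTracker: try the task on each tracker in turn,
    rescheduling to the next one when a tracker fails
    (standing in for a missed heartbeat)."""
    for tracker in trackers:
        try:
            return tracker(task)
        except RuntimeError:  # tracker is dead; reschedule
            continue
    raise RuntimeError("all trackers failed")

def dead_tracker(task):
    raise RuntimeError("no heartbeat")

def healthy_tracker(task):
    return task()

# The work completes despite the first tracker failing, but only
# after the time spent discovering the failure and rescheduling.
result = run_with_reschedule(lambda: sum(range(10)),
                             [dead_tracker, healthy_tracker])
```

The retry makes the job succeed, but, as noted above, every failure adds a full rescheduling round trip to the completion time.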
Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of
elements that can be operated on in parallel. RDDs can reference a dataset in an external
storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a
Hadoop InputFormat. Spark can create RDDs from any storage source supported by
Hadoop, including local filesystems or one of those listed previously.
Internally, each RDD is characterized by five main properties:
o A list of partitions
o A function for computing each split
o A list of dependencies on other RDDs
o Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is
hash-partitioned)
o Optionally, a list of preferred locations to compute each split on (e.g.
block locations for an HDFS file)
RDDs can be persisted in order to cache a dataset in memory across operations. This
allows future actions to be much faster, by as much as ten times. Spark's cache is fault-
tolerant in that if any partition of an RDD is lost, it will automatically be recomputed by
using the original transformations.
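Lineage-based recovery can be illustrated with a toy Python class: a partition remembers its parent data and its transformation, so a lost cached copy is simply recomputed. This is a conceptual sketch, not Spark's implementation, and Partition here is an invented name:

```python
class Partition:
    """Toy lineage model: a partition remembers how it was derived,
    so a lost in-memory copy can be recomputed on demand."""
    def __init__(self, parent_data, transformation):
        self.parent_data = parent_data
        self.transformation = transformation
        self.cached = None

    def compute(self):
        # Re-apply the original transformation to the parent data.
        return [self.transformation(x) for x in self.parent_data]

    def get(self):
        if self.cached is None:           # lost, or never cached
            self.cached = self.compute()  # recover via lineage
        return self.cached

part = Partition([1, 2, 3], lambda x: x * 10)
before_failure = part.get()
part.cached = None           # simulate losing the cached partition
after_failure = part.get()   # transparently recomputed
```

No replica of the cached data is kept; the recipe for rebuilding it is cheap to store, which is why this style of fault tolerance costs so little until a failure actually happens.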
Scalability
By definition, both MapReduce and Spark are scalable using the HDFS. So how big can a
Hadoop cluster grow?
Yahoo reportedly has a 42,000 node Hadoop cluster, so perhaps the sky really is the limit.
The largest known Spark cluster is 8,000 nodes, but as big data grows, it’s expected that
cluster sizes will increase to maintain throughput expectations.
Security
Hadoop supports Kerberos authentication, which is somewhat painful to manage.
However, third-party vendors have enabled organizations to leverage Active Directory
Kerberos and LDAP for authentication. Those same third-party vendors also offer
encryption for data in flight and data at rest.
Hadoop’s Distributed File System supports access control lists (ACLs) and a traditional
file permissions model. For user control in job submission, Hadoop provides Service
Level Authorization, which ensures that clients have the right permissions.
Spark’s security is a bit sparse in that it currently supports only authentication via shared
secret (password authentication). The security bonus that Spark can enjoy is that if you
run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally,
Spark can run on YARN, giving it the capability of using Kerberos authentication.
Summary
At first glance, it seems that using Spark would be the default choice for any big data
application. However, that’s not the case. MapReduce has made inroads into the big data
market for businesses that need huge datasets brought under control by commodity
systems. Spark’s speed, agility, and relative ease of use are perfect complements to
MapReduce’s low cost of operation.
The truth is that Spark and MapReduce have a symbiotic relationship with each other.
Hadoop provides features that Spark does not possess, such as a distributed file system
and Spark provides real-time, in-memory processing for those data sets that require it.
The perfect big data scenario is exactly as the designers intended—for Hadoop and Spark
to work together on the same team.
Conclusion
There are plenty of options for processing data within a Smart Card system.
For batch-only workloads that are not time-sensitive, Hadoop is a good choice that is
likely less expensive to implement than some other solutions.
For stream-only workloads, Storm has wide language support and can deliver very low
latency processing, but can deliver duplicates and cannot guarantee ordering in its default
configuration. Samza integrates tightly with YARN and Kafka in order to provide
flexibility, easy multi-team usage, and straightforward replication and state management.
For mixed workloads, Spark provides high speed batch processing and micro-batch
processing for streaming. It has wide support, integrated libraries and tooling, and
flexible integrations. Flink provides true stream processing with batch processing
support. It is heavily optimized, can run tasks written for other platforms, and provides
low latency processing, but is still in the early days of adoption.
The best fit for your situation will depend heavily upon the state of the data to process,
how time-bound your requirements are, and what kind of results you are interested in.
There are trade-offs between implementing an all-in-one solution and working with
tightly focused projects, and there are similar considerations when evaluating new and
innovative solutions over their mature and well-tested counterparts.
References
[2] https://mapr.com/blog/spark-101-what-it-what-it-does-and-why-it-matters/
[3] https://www.guru99.com/what-is-big-data.html#1
[4] https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared#conclusion