A
Seminar Report
on
Smart Card ID
SUBMITTED BY:
Jatin Kumar (1503310094)
Under the Guidance of:
Mr. Zatin Gupta
CERTIFICATE
Certified that the seminar work entitled “Smart Card ID” is a bona fide work carried out in
the eighth semester by “Jatin Kumar” in partial fulfilment of the requirements for the award
of Bachelor of Technology in Computer Science Engineering from Raj Kumar Goel Institute of
Technology, Ghaziabad, during the academic year 2018-2019.
SIGNATURE
COMPUTER SCIENCE HEAD OF DEPARTMENT
SIGNATURE
SEMINAR COORDINATOR
ACKNOWLEDGEMENT
The seminar report on “Smart Card ID” is the outcome of the guidance, moral support, and
devotion bestowed on me throughout my work. I acknowledge and express my profound sense
of gratitude and thanks to everybody who has been a source of inspiration during the
preparation of this seminar. First and foremost, I offer my sincere thanks, with innate
humility, to Mr. Zatin Gupta, the guide of my seminar, for providing help whenever I
needed it. I must also express my affectionate gratitude to Raj Kumar Goel Institute of
Technology for providing such a stimulating atmosphere and a wonderful work environment.
Jatin Kumar
ABSTRACT
In today’s world, carrying a number of plastic smart cards to establish our identity has
become an integral part of our routine lives. Identity establishment requires pre-stored,
readily available data about oneself that an administrator can authenticate against the
claimant’s personal information. There is a distinct requirement for a technological
solution for a nationwide, multipurpose identity for every citizen. A number of options
have been exercised by various countries, and every option has its own pros and cons.
However, it has been observed that in most cases the smart-card solution has been
preferred by users and administrators alike. The use of smart cards is so prevalent that,
in any profession, the identity of an individual is hardly considered complete without it.
The principal aim of this paper is to discuss the viability of smart-card technology as an
identity solution and its ability to perform various functions with the strong access
control that makes smart cards more reliable than other technologies. It outlines an
overview of smart-card technology along with its key applications. Security concerns of
smart cards are discussed through an algorithm based on an integer-division proposition.
The possibility of upgrading it alongside evolving technology offers it universal
acceptability as a means of identification. The ability of an administrator to store the
desired amount of information and compute multiple operations to authenticate a citizen
drives its widening acceptability, and an endeavour has been made in this paper to
explain this through a proposed system flow chart.
INTRODUCTION
Today one carries the burden of a wallet full of cards to establish one’s identity: an
official ID card, canteen cards, library cards, a driving licence, and so on. A smart ID
card has the potential to replace all of these with a single card that serves the desired
purpose. A variety of smart cards are available today, built with progressive technologies,
where developers use different data structures and standards for programming. In this
paper, we discuss the viability of smart cards, with their continuously evolving
technology, as a solution to the requirement for a nationwide multipurpose smart ID for
every citizen. Our aim is to propose a viable technological solution for a single
multipurpose smart ID card that does away with an individual having to carry multiple
cards. It will assist governments across the globe in better administration with a
cost-effective solution: a single smart ID card with multiple applications. It will also
require management of a large database, with processing and scalable computing to home in
on the desired ID. Data centres handling such big data are helping to reduce the delay and
cost of data processing and to improve quality of service, including certain discrete
internet-based services.
Such a system ultimately rests on Big Data. Big Data is a term used to describe a
collection of data that is huge in size and yet growing exponentially with time. In short,
such data is so large and complex that none of the traditional data management tools can
store or process it efficiently.
The New York Stock Exchange, shown in figure 1.1, generates about one terabyte of new
trade data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the
social media site Facebook (figure 1.2) every day. This data is mainly generated through
photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes.
Such data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed
'structured' data. Over time, talent in computer science has achieved great success in
developing techniques for working with such data (where the format is well known in
advance) and in deriving value from it. However, we now foresee issues when the size of
such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Table 1.1
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition
to its huge size, unstructured data poses multiple challenges when it comes to processing
it to derive value. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc. Organizations today
have a wealth of data available to them but, unfortunately, do not know how to derive
value from it, since the data is in a raw, unstructured format.
FIGURE 1.4
Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data looks structured
in form, but it is not actually defined by, for example, a table definition in a
relational DBMS. An example of semi-structured data is data represented in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
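For illustration, such records can be parsed with Python's standard library; the wrapping <recs> root element below is added here only to make the fragment well-formed XML:

```python
import xml.etree.ElementTree as ET

# The four <rec> records above, wrapped in a single root element so
# that the fragment forms a well-formed XML document.
xml_data = """<recs>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
</recs>"""

root = ET.fromstring(xml_data)
# Each record carries its own schema in its tags, so fields can be
# extracted by name even though no table definition exists anywhere.
people = [
    {"name": r.findtext("name"), "sex": r.findtext("sex"),
     "age": int(r.findtext("age"))}
    for r in root.findall("rec")
]
```

This is precisely what makes the data "semi-structured": the tags give it enough structure to be machine-readable, yet no fixed schema is declared in advance.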
FIGURE 1.5
(i) Volume – The name Big Data itself is related to an enormous size. The size of data
plays a very crucial role in determining the value that can be derived from it. Whether
particular data can actually be considered Big Data or not also depends on its volume.
Hence, 'Volume' is one characteristic that needs to be considered when dealing with
Big Data.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. In earlier days, spreadsheets and databases were the only
sources of data considered by most applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis
applications. This variety of unstructured data poses certain issues for storing, mining,
and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of data generation. How fast the
data is generated and processed to meet demands determines the real potential in the data.
Velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, social media sites, sensors, mobile devices, etc. The flow of
data is massive and continuous.
(iv) Variability – This refers to the inconsistency that the data can show at times,
hampering the process of handling and managing the data effectively.
Access to social data from search engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.
Big Data technologies can be used for creating a staging area or landing zone for new data
before identifying what should be moved to the data warehouse. In addition, such
integration of Big Data technologies and a data warehouse helps an organization to
offload infrequently accessed data.
SUMMARY
Big Data is a term used to describe a collection of data that is huge in size and yet
growing exponentially with time.
Examples of Big Data generation include stock exchanges, social media sites, jet
engines, etc.
Big Data could be 1) Structured, 2) Unstructured, or 3) Semi-structured.
Volume, Variety, Velocity, and Variability are a few characteristics of Big Data.
Improved customer service, better operational efficiency, and better decision-making
are a few advantages of Big Data.
This chapter deals with the big data processing frameworks. Processing frameworks
compute over the data in the system, either by reading from non-volatile storage or as it is
ingested into the system. Computing over data is the process of extracting information
and insight from large quantities of individual data points.
1. Batch-only frameworks
a. Apache Hadoop
2. Stream-only frameworks
a. Apache Storm
b. Apache Samza
3. Hybrid frameworks
a. Apache Spark
b. Apache Flink
Processing frameworks and processing engines are responsible for computing over
data in a data system. While there is no authoritative definition setting apart "engines"
from "frameworks", it is sometimes useful to define the former as the actual component
responsible for operating on data and the latter as a set of components designed to do the
same.
For instance, Apache Hadoop can be considered a processing framework with MapReduce as its
default processing engine. Engines and frameworks can often be swapped out or used in
tandem. For instance, Apache Spark, another framework, can hook into Hadoop to replace
MapReduce. This interoperability between components is one reason that big data systems
have great flexibility.
While the systems which handle this stage of the data life cycle can be complex, the goals
on a broad level are very similar: operate over data in order to increase understanding,
surface patterns, and gain insight into complex interactions.
These processing frameworks are grouped by the state of the data they are designed to
handle. Some systems handle data in batches, while others process data in a continuous
stream as it flows into the system. Still others can handle data in either of these ways.
Batch processing has a long history within the data world. Batch processing involves
operating over a large, static dataset and returning the result at a later time when the
computation is complete.
Batch processing is well-suited for calculations where access to a complete set of records
is required. For instance, when calculating totals and averages, datasets must be treated
holistically instead of as a collection of individual records. These operations require that
state be maintained for the duration of the calculations.
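A minimal sketch (with invented values) of why such calculations are inherently batch-oriented:

```python
# Totals and averages need the complete dataset: the result is only
# correct once every record has been seen, so state must span the
# whole computation rather than a single record.
records = [12.0, 7.5, 3.25, 9.75, 17.5]  # illustrative values

total = sum(records)            # requires visiting every record
average = total / len(records)  # cannot be emitted per record
```

No prefix of the dataset yields the final answer, which is why these operations must treat the dataset holistically.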
Tasks that require very large volumes of data are often best handled by batch operations.
Whether the datasets are processed directly from permanent storage or loaded into
memory, batch systems are built with large quantities in mind and have the resources to
handle them. Because batch processing excels at handling large volumes of persistent
data, it frequently is used with historical data.
The trade-off for handling large quantities of data is longer computation time. Because of
this, batch processing is not appropriate in situations where processing time is especially
significant.
Apache Hadoop
Modern versions of Hadoop are composed of several components or layers that work
together to process batch data:
HDFS: HDFS is the distributed filesystem layer that coordinates storage and
replication across the cluster nodes. HDFS ensures that data remains available in
spite of inevitable host failures. It is used as the source of data, to store
intermediate processing results, and to persist the final calculated results.
YARN: YARN, which stands for Yet Another Resource Negotiator, is the cluster
coordinating component of the Hadoop stack. It is responsible for coordinating
and managing the underlying resources and scheduling jobs to be run. YARN
makes it possible to run much more diverse workloads on a Hadoop cluster than
was possible in earlier iterations by acting as an interface to the cluster resources.
MapReduce: MapReduce is Hadoop's native batch processing engine.
Because this methodology heavily leverages permanent storage, reading and writing
multiple times per task, it tends to be fairly slow. On the other hand, since disk space is
typically one of the most abundant server resources, it means that MapReduce can handle
enormous datasets. This also means that Hadoop's MapReduce can typically run on less
expensive hardware than some alternatives since it does not attempt to store everything in
memory. MapReduce has incredible scalability potential and has been used in production
on tens of thousands of nodes.
As a target for development, MapReduce is known for having a rather steep learning
curve. Other additions to the Hadoop ecosystem can reduce the impact of this to varying
degrees, but it can still be a factor in quickly implementing an idea on a Hadoop cluster.
Hadoop has an extensive ecosystem, with the Hadoop cluster itself frequently used as a
building block for other software. Many other processing frameworks and engines
have Hadoop integrations to utilize HDFS and the YARN resource manager.
Stream processing systems compute over data as it enters the system. This requires a
different processing model than the batch paradigm. Instead of defining operations to
apply to an entire dataset, stream processors define operations that will be applied to each
individual data item as it passes through the system.
The datasets in stream processing are considered "unbounded". This has a few important
implications:
The total dataset is only defined as the amount of data that has entered the system
so far.
The working dataset is perhaps more relevant, and is limited to a single item at a
time.
Processing is event-based and does not "end" until explicitly stopped. Results are
immediately available and will be continually updated as new data arrives.
Stream processing systems can handle a nearly unlimited amount of data, but they only
process one item (true stream processing) or very few items (micro-batch processing) at a
time, with minimal state being maintained between records. While most systems provide
methods of maintaining some state, stream processing is highly optimized for more
functional processing with few side effects.
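The per-item model described above can be sketched in plain Python; this is an illustrative sketch, not the API of any particular framework:

```python
def stream_processor(stream, operation):
    """Apply an operation to each item as it arrives: no item is
    buffered, and no state is carried between records."""
    for item in stream:
        yield operation(item)

# An unbounded source would normally feed this; a short list stands in.
incoming = iter([1, 2, 3, 4])
results = list(stream_processor(incoming, lambda x: x * x))
```

Each result is available as soon as its input arrives, which is the defining contrast with the batch model.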
Functional operations focus on discrete steps that have limited state or side-effects.
Performing the same operation on the same piece of data will produce the same output
independent of other factors. This kind of processing fits well with streams because state
between items is usually some combination of difficult, limited, and sometimes
undesirable. So while some type of state management is usually possible, these
frameworks are much simpler and more efficient in their absence.
This type of processing lends itself to certain types of workloads. Processing with near
real-time requirements is well served by the streaming model. Analytics, server or
application error logging, and other time-based metrics are a natural fit because reacting
to changes in these areas can be critical to business functions. Stream processing is a
good fit for data where you must respond to changes or spikes and where you're
interested in trends over time.
Apache Storm
Apache Storm is a stream processing framework that focuses on extremely low latency and is
perhaps the best option for workloads that require near-real-time processing. It can
handle very large quantities of data and deliver results with less latency than other
solutions.
The idea behind Storm is to define small, discrete operations using the above components
and then compose them into a topology. By default, Storm offers at-least-once processing
guarantees, meaning that it can guarantee that each message is processed at least once,
but there may be duplicates in some failure scenarios. Storm does not guarantee that
messages will be processed in order.
Storm users typically recommend using Core Storm whenever possible to avoid those
penalties. With that in mind, Trident's guarantee to process items exactly once is useful
in cases where the system cannot intelligently handle duplicate messages. Trident is also
the only choice within Storm when you need to maintain state between items, such as when
counting how many users click a link within an hour. Trident gives Storm flexibility,
even though it does not play to the framework's natural strengths.
Stream batches: These are micro-batches of stream data that are chunked in
order to provide batch processing semantics.
Operations: These are batch procedures that can be performed on the data.
Storm with Trident gives you the option to use micro-batches instead of pure stream
processing. While this gives users greater flexibility to shape the tool to an intended use,
it also tends to negate some of the software's biggest advantages over other solutions.
That being said, having a choice for the stream processing style is still helpful.
Core Storm does not offer ordering guarantees for messages. Core Storm offers
at-least-once processing guarantees, meaning that the processing of each message can be
guaranteed but duplicates may occur. Trident offers exactly-once guarantees and can offer
ordering between batches, but not within them.
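A small sketch of how an application can cope with at-least-once delivery: by deduplicating on a message id, redelivered duplicates have no effect, which is how exactly-once semantics appear to the application. The message ids and values here are invented for illustration:

```python
def consume(messages, seen_ids, store):
    """Consume messages that may contain redelivered duplicates
    (at-least-once delivery), recording each message id only once."""
    for msg_id, value in messages:
        if msg_id in seen_ids:      # duplicate from a retry: skip it
            continue
        seen_ids.add(msg_id)
        store[msg_id] = value

# Message 2 is delivered twice, as can happen after a failure.
delivered = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
seen, store = set(), {}
consume(delivered, seen, store)
```

The cost of this approach is exactly the state (the set of seen ids) that pure stream processing tries to avoid, which is why exactly-once guarantees carry a performance penalty.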
Apache Samza
Apache Samza is a stream processing framework that is tightly tied to the Apache Kafka
messaging system. While Kafka can be used by many stream processing systems, Samza
is designed specifically to take advantage of Kafka's unique architecture and guarantees.
It uses Kafka to provide fault tolerance, buffering, and state storage.
Samza uses YARN for resource negotiation. This means that by default, a Hadoop cluster
is required (at least HDFS and YARN), but it also means that Samza can rely on the rich
features built into YARN.
Samza relies on Kafka's semantics to define the way that streams are handled. Kafka uses
the following concepts when dealing with data:
Topics: Each stream of data entering a Kafka system is called a topic. A topic is
basically a stream of related information that consumers can subscribe to.
Partitions: In order to distribute a topic among nodes, Kafka divides the
incoming messages into partitions. The partition divisions are based on a key such
that each message with the same key is guaranteed to be sent to the same
partition. Partitions have guaranteed ordering.
Brokers: The individual nodes that make up a Kafka cluster are called brokers.
Producer: Any component writing to a Kafka topic is called a producer. The
producer provides the key that is used to partition a topic.
Consumers: Consumers are any component that reads from a Kafka topic.
Consumers are responsible for maintaining information about their own offset,
so that they are aware of which records have been processed if a failure occurs.
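The key-to-partition mapping can be sketched as follows; zlib.crc32 here merely stands in for Kafka's actual partitioner, since the only property that matters for the guarantee above is that equal keys always map to the same partition:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index. The hash is
    deterministic, so every message with the same key lands on the
    same partition, preserving per-key ordering."""
    return zlib.crc32(key.encode()) % num_partitions

p1 = partition_for("user-42", 4)   # hypothetical key and partition count
p2 = partition_for("user-42", 4)
# p1 == p2: the same key always routes to the same partition.
```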
Because Kafka represents an immutable log, Samza deals with immutable streams. This means
that any transformations create new streams that are consumed by other components without
affecting the initial stream.
Samza's reliance on a Kafka-like queuing system at first glance might seem restrictive.
However, it affords the system some unique guarantees and features not common in other
stream processing systems.
For example, Kafka already offers replicated storage of data that can be accessed with
low latency. It also provides a very easy and inexpensive multi-subscriber model to each
individual data partition. All output, including intermediate results, is also written to
Kafka and can be independently consumed by downstream stages.
In many ways, this tight reliance on Kafka mirrors the way that the MapReduce engine
frequently references HDFS. While referencing HDFS between each calculation leads to
some serious performance issues when batch processing, it solves a number of problems
when stream processing.
Samza's strong relationship to Kafka allows the processing steps themselves to be very
loosely tied together. An arbitrary number of subscribers can be added to the output of
any step without prior coordination. This can be very useful for organizations where
multiple teams might need to access similar data. Teams can all subscribe to the topic of
data entering the system, or can easily subscribe to topics created by other teams that
have undergone some processing. This can be done without adding additional stress on
load-sensitive infrastructure like databases.
Samza offers high-level abstractions that are in many ways easier to work with than the
primitives provided by systems like Storm. However, Samza only supports JVM languages at
this time, meaning that it does not have the same language flexibility as Storm.
As you will see, the way this is achieved varies significantly between Spark and Flink,
the two frameworks we will discuss. This is largely a function of how the two processing
paradigms are brought together and what assumptions are made about the relationship
between fixed and unfixed datasets.
While projects focused on one processing type may be a close fit for specific use-cases,
the hybrid frameworks attempt to offer a general solution for data processing. They not
only provide methods for processing over data, they have their own integrations,
libraries, and tooling for doing things like graph analysis, machine learning, and
interactive querying.
Apache Spark
Apache Spark is a next-generation batch processing framework with stream processing
capabilities. Built using many of the same principles as Hadoop's MapReduce engine, Spark
focuses primarily on speeding up batch processing workloads by offering full in-memory
computation and processing optimization.
Spark can be deployed as a standalone cluster (if paired with a capable storage layer) or
can hook into Hadoop as an alternative to the MapReduce engine.
Beyond the capabilities of the engine itself, Spark also has an ecosystem of
libraries that can be used for machine learning, interactive queries, etc. Spark
tasks are almost universally acknowledged to be easier to write than
MapReduce, which can have significant implications for productivity.
Adapting the batch methodology for stream processing involves buffering the
data as it enters the system. The buffer allows it to handle a high volume of
incoming data, increasing overall throughput, but waiting to flush the buffer also
leads to a significant increase in latency. This means that Spark Streaming might
not be appropriate for processing where low latency is imperative.
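The buffering trade-off described above can be sketched as a simple micro-batcher; this is an illustrative sketch, not Spark Streaming's actual implementation:

```python
def micro_batches(stream, batch_size):
    """Buffer incoming items into fixed-size batches. Throughput per
    batch rises, but the first item of each batch must wait until the
    batch is full before anything is emitted: the latency cost."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final, possibly partial, batch
        yield batch

batches = list(micro_batches(iter(range(7)), 3))
```

Each batch can then be handed to a batch engine, which is exactly how the buffering approach turns a stream workload into a sequence of small batch jobs.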
Since RAM is generally more expensive than disk space, Spark can cost more to
run than disk-based systems. However, the increased processing speed means that
tasks can complete much faster, which may completely offset the costs when
operating in an environment where you pay for resources hourly.
One other consequence of the in-memory design of Spark is that resource scarcity
can be an issue when deployed on shared clusters. In comparison to Hadoop's
MapReduce, Spark uses significantly more resources, which can interfere with
other tasks that might be trying to use the cluster at the time. In essence, Spark
might be a less considerate neighbor than other components that can operate on
the Hadoop stack.
Apache Flink
Apache Flink is a stream processing framework that can also handle batch tasks. It
considers batches to simply be data streams with finite boundaries, and thus treats batch
processing as a subset of stream processing. This stream-first approach to all processing
has a number of interesting side effects.
This stream-first approach has been called the Kappa architecture, in contrast to the
more widely known Lambda architecture (where batching is used as the primary
processing method with streams used to supplement and provide early but unrefined
results). Kappa architecture, where streams are used for everything, simplifies the model
and has only recently become possible as stream processing engines have grown more
sophisticated.
Streams are immutable, unbounded datasets that flow through the system
Operators are functions that operate on data streams to produce other streams
Sources are the entry point for streams entering the system
Sinks are the place where streams flow out of the Flink system. They might
represent a database or a connector to another system
Stream processing tasks take snapshots at set points during their computation to use for
recovery in case of problems. For storing state, Flink can work with a number of state
backends with varying levels of complexity and persistence.
Additionally, Flink's stream processing is able to understand the concept of "event time",
meaning the time that the event actually occurred, and can handle sessions as well. This
means that it can guarantee ordering and grouping in some interesting ways.
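A minimal illustration of event time versus arrival order: events that arrive out of order can be reordered by their embedded event-time stamps. The timestamps and values here are invented, and real engines like Flink do this with buffering and watermarks rather than a simple sort:

```python
# Arrival order differs from the order in which the events actually
# occurred; each event carries the time it happened ("event time").
arrived = [
    {"event_time": 3, "value": "c"},
    {"event_time": 1, "value": "a"},
    {"event_time": 2, "value": "b"},
]

# Reordering by the embedded stamp recovers the true sequence.
in_event_order = sorted(arrived, key=lambda e: e["event_time"])
values = [e["value"] for e in in_event_order]
```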
Flink offers some optimizations for batch workloads. For instance, since batch operations
are backed by persistent storage, Flink removes snapshotting from batch loads. Data is
still recoverable, but normal processing completes faster.
Another optimization involves breaking up batch tasks so that stages and components are
only involved when needed. This helps Flink play well with other users of the cluster.
Preemptive analysis of the tasks gives Flink the ability to also optimize by seeing
the entire set of operations, the size of the data set, and the requirements of steps
coming down the line.
Flink manages many things by itself. Somewhat unconventionally, it manages its own
memory instead of relying on the native Java garbage collection mechanisms for
performance reasons. Unlike Spark, Flink does not require manual optimization and
adjustment when the characteristics of the data it processes change. It handles data
partitioning and caching automatically as well.
Flink analyzes its work and optimizes tasks in a number of ways. Part of this analysis is
similar to what SQL query planners do within relational databases, mapping out the most
effective way to implement a given task. It is able to parallelize stages that can be
completed in parallel, while bringing data together for blocking tasks. For iterative
tasks, Flink attempts to do computation on the nodes where the data is stored, for
performance reasons. It can also do "delta iteration", or iteration on only the portions
of the data that have changed.
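Delta iteration can be sketched with a classic example: propagating minimum labels through a tiny graph while revisiting only the nodes whose labels changed in the previous round. The graph here is invented for illustration, and this single-process sketch only mimics the idea behind Flink's distributed implementation:

```python
def delta_min_labels(neighbors, labels):
    """Propagate the minimum label through a graph, re-examining only
    the nodes whose labels changed in the previous round (the delta)."""
    worklist = set(labels)             # every node starts out "dirty"
    while worklist:
        next_work = set()
        for node in worklist:
            for nb in neighbors[node]:
                if labels[node] < labels[nb]:
                    labels[nb] = labels[node]
                    next_work.add(nb)  # only changed nodes re-run
        worklist = next_work           # converged nodes are skipped
    return labels

# Chain 0-1-2 plus isolated node 3; labels converge to each
# connected component's minimum label.
graph = {0: [1], 1: [0, 2], 2: [1], 3: []}
final = delta_min_labels(graph, {0: 0, 1: 1, 2: 2, 3: 3})
```

Since node 3 never changes, it drops out of the worklist immediately, which is exactly the saving delta iteration offers over recomputing every element each round.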
In terms of user tooling, Flink offers a web-based scheduling view to easily manage tasks
and view the system. Users can also display the optimization plan for submitted tasks to
see how it will actually be implemented on the cluster. For analysis tasks, Flink offers
SQL-style querying, graph processing and machine learning libraries, and in-memory
computation.
Flink operates well with other components. It is written to be a good neighbor if used
within a Hadoop stack, taking up only the necessary resources at any given time. It
integrates with YARN, HDFS, and Kafka easily. Flink can run tasks written for other
processing frameworks like Hadoop and Storm with compatibility packages.
One of the largest drawbacks of Flink at the moment is that it is still a very young project.
Large scale deployments in the wild are still not as common as other processing
frameworks and there hasn't been much research into Flink's scaling limitations. With the
rapid development cycle and features like the compatibility packages, there may begin to
be more Flink deployments as organizations get the chance to experiment with it.
Spark is a general-purpose distributed data processing engine that is suitable for use in a
wide range of circumstances. On top of the Spark core data processing engine, there are
libraries for SQL, machine learning, graph computation, and stream processing, which
can be used together in an application. Programming languages supported by Spark
include: Java, Python, Scala, and R. Application developers and data scientists
incorporate Spark into their applications to rapidly query, analyze, and transform data at
scale. Tasks most frequently associated with Spark include ETL and SQL batch jobs
across large data sets, processing of streaming data from sensors, IoT, or financial
systems, and machine learning tasks.
FIGURE 3.1
History
In order to understand Spark, it helps to understand its history. Before Spark, there was
MapReduce, a resilient distributed processing framework, which enabled Google to index
the exploding volume of content on the web, across large clusters of commodity servers.
FIGURE 3.2
1. Distribute data: when a data file is uploaded into the cluster, it is split into
chunks, called data blocks, and distributed amongst the data nodes and replicated
across the cluster.
2. Distribute computation: users specify a map function that processes a key/value
pair to generate a set of intermediate key/value pairs and a reduce function that
merges all intermediate values associated with the same intermediate key.
Programs written in this functional style are automatically parallelized and
executed on a large cluster of commodity machines in the following way:
o The mapping process runs on each assigned data node, working only on its
block of data from a distributed file.
o The results from the mapping processes are sent to the reducers in a
process called "shuffle and sort": key/value pairs from the mappers are
sorted by key, partitioned by the number of reducers, and then sent
across the network and written to key sorted "sequence files" on the
reducer nodes.
o The reducer process executes on its assigned node and works only on its
subset of the data (its sequence file). The output from the reducer
process is written to an output file.
3. Tolerate faults: both data and computation can tolerate failures by failing over
to another node for data or processing.
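The map, shuffle-and-sort, and reduce steps above can be sketched in miniature as a single-process word count; the real framework distributes each phase across the nodes of the cluster:

```python
from itertools import groupby

# 1. Map: emit an intermediate (word, 1) pair for every word in each block.
def map_phase(blocks):
    return [(word, 1) for block in blocks for word in block.split()]

# 2. Shuffle and sort: sort the pairs by key and group them, as the
#    framework does between the map and reduce phases.
def shuffle_sort(pairs):
    pairs.sort(key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

# 3. Reduce: merge all intermediate values associated with the same key.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

blocks = ["big data big", "data pipeline"]   # two illustrative data blocks
counts = reduce_phase(shuffle_sort(map_phase(blocks)))
```

In the distributed setting, each block's map runs on the node holding that block, and the grouped pairs travel across the network to the reducer nodes.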
FIGURE 3.3
Some iterative algorithms, like PageRank, which Google used to rank websites in their
search engine results, require chaining multiple MapReduce jobs together, which causes a
lot of reading and writing to disk. When multiple MapReduce jobs are chained together,
for each MapReduce job, data is read from a distributed file block into a map process,
written to and read from a SequenceFile in between, and then written to an output file
from a reducer process.
FIGURE 3.4
A year after Google published a white paper describing the MapReduce framework
(2004), Doug Cutting and Mike Cafarella created Apache Hadoop.
Apache Spark™ began life in 2009 as a project within the AMPLab at the University of
California, Berkeley. Spark became an incubated project of the Apache Software
Foundation in 2013, and it was promoted early in 2014 to become one of the
Foundation’s top-level projects. Spark is currently one of the most active projects
managed by the Foundation, and the community that has grown up around the project
includes both prolific individual contributors and well-funded corporate backers, such as
Databricks, IBM, and China’s Huawei.
The goal of the Spark project was to keep the benefits of MapReduce’s scalable,
distributed, fault-tolerant processing framework, while making it more efficient and
easier to use. The advantages of Spark over MapReduce are:
FIGURE 3.5
Spark also has a local mode, where the driver and executors run as threads on your
computer instead of a cluster, which is useful for developing your applications from a
personal computer.
Stream processing: From log files to sensor data, application developers are increasingly
having to cope with "streams" of data. This data arrives in a steady stream, often from
multiple sources simultaneously. While it is certainly feasible to store these data streams
on disk and analyze them retrospectively, it can sometimes be sensible or important to
process and act upon the data as it arrives. Streams of data related to financial
transactions, for example, can be processed in real time to identify, and refuse,
potentially fraudulent transactions.
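A minimal sketch of this idea in plain Python (not Spark Streaming itself; micro_batches and flag_fraud are hypothetical names) groups arriving records into small batches and refuses over-limit transactions as each batch is processed:

```python
def micro_batches(stream, batch_size=3):
    """Group an incoming stream into small batches, roughly how a
    micro-batch engine turns a continuous stream into work units."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def flag_fraud(batch, limit=1000):
    # Act on data as it arrives: flag transactions over a limit.
    return [txn for txn in batch if txn["amount"] > limit]

stream = [{"id": i, "amount": a}
          for i, a in enumerate([50, 2000, 75, 30, 9000])]
flagged = [t for b in micro_batches(stream) for t in flag_fraud(b)]
```

The point is that each batch is acted on as soon as it is full, instead of storing the whole stream to disk and analyzing it retrospectively.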
Machine learning: As data volumes grow, machine learning approaches become more
feasible and increasingly accurate. Software can be trained to identify and act upon
triggers within well-understood data sets before applying the same solutions to new and
unknown data. Spark’s ability to store data in memory and rapidly run repeated queries
makes it a good choice for training machine learning algorithms. Running broadly similar
queries again and again, at scale, significantly reduces the time required to go through a
set of possible solutions in order to find the most efficient algorithms.
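The benefit of keeping a derived data set in memory across repeated queries can be illustrated with a toy Python class. This is a conceptual sketch, not Spark's API; CachedDataset is an invented name:

```python
class CachedDataset:
    """Toy illustration of why in-memory caching speeds up
    repeated queries over the same derived data set."""
    def __init__(self, records, transform):
        self.records = records
        self.transform = transform
        self._cache = None
        self.compute_count = 0  # how many times the transform ran

    def materialize(self):
        # Compute the derived data once, then serve it from memory.
        if self._cache is None:
            self.compute_count += 1
            self._cache = [self.transform(r) for r in self.records]
        return self._cache

data = CachedDataset(range(5), lambda x: x * x)
first = data.materialize()
second = data.materialize()  # served from memory, no recompute
```

Training loops that scan the same data set many times pay the expensive transformation only once, which is the essence of the speedup described above.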
Data integration: Data produced by different systems across a business is rarely clean or
consistent enough to simply and easily be combined for reporting or analysis. Extract,
transform, and load (ETL) processes are often used to pull data from different systems,
clean and standardize it, and then load it into a separate system for analysis. Spark and
Hadoop are increasingly being used to reduce the cost and time required for this ETL
process.
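A toy extract-transform-load step in plain Python shows the shape of such a pipeline; the two "systems", their record fields, and the transform helper are all invented for illustration:

```python
# Extract: records pulled from two systems with inconsistent
# field names, casing, and value types.
raw_crm = [{"name": " Alice ", "spend": "120.50"}]
raw_web = [{"NAME": "BOB", "SPEND": 80}]

def transform(record):
    # Normalize keys, trim and title-case names, coerce spend to float.
    rec = {k.lower(): v for k, v in record.items()}
    return {"name": rec["name"].strip().title(),
            "spend": float(rec["spend"])}

# Load: a single clean, analysis-ready collection.
warehouse = [transform(r) for r in raw_crm + raw_web]
```

In a real deployment the extract and load ends would be databases or files and the transform would run distributed across a cluster, but the clean-and-standardize step in the middle is the same idea.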
A wide range of technology vendors have been quick to support Spark, recognizing the
opportunity to extend their existing big data products into areas where Spark delivers real
value, such as interactive querying and machine learning. Well-known companies such as
IBM and Huawei have invested significant sums in the technology, and a growing
number of startups are building businesses that depend in whole or in part upon Spark.
For example, in 2013 the Berkeley team responsible for creating Spark founded
Databricks, which provides a hosted end-to-end data platform powered by Spark. The
company is well-funded, having received $247 million across four rounds of investment
in 2013, 2014, 2016 and 2017, and Databricks employees continue to play a prominent
role in improving and extending the open source code of the Apache Spark project.
The major Hadoop vendors, including MapR, Cloudera, and Hortonworks, have all
moved to support YARN-based Spark alongside their existing products, and each vendor
is working to add value for its customers. Elsewhere, IBM, Huawei, and others have all
made significant investments in Apache Spark, integrating it into their own products and
contributing enhancements and extensions back to the Apache project. Web-based
companies, like Chinese search engine Baidu, e-commerce operation Taobao, and social
networking company Tencent, all run Spark-based operations at scale, with Tencent’s
800 million active users reportedly generating over 700 TB of data per day for processing
on a cluster of more than 8,000 compute nodes.
There are many reasons to choose Spark, but the following three are key:
Simplicity: Spark’s capabilities are accessible via a set of rich APIs, all designed
specifically for interacting quickly and easily with data at scale. These APIs are well-
documented and structured in a way that makes it straightforward for data scientists and
application developers to quickly put Spark to work.
Speed: Spark is designed for speed, operating both in memory and on disk. Using Spark,
a team from Databricks tied for first place with a team from the University of California,
San Diego, in the 2014 Daytona GraySort benchmarking challenge
(https://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html).
The challenge involves processing a static data set; the Databricks team was able to
process 100 terabytes of data stored on solid-state drives in just 23 minutes, while the
previous winner had taken 72 minutes using Hadoop and a different cluster configuration.
Spark can perform even better when supporting interactive queries of data stored in
memory. In those situations, there are claims that Spark can be 100 times faster than
Hadoop’s MapReduce.
Much of Spark's power lies in its ability to combine very different techniques and
processes together into a single, coherent whole. Outside Spark, the discrete tasks of
selecting data, transforming that data in various ways, and analyzing the transformed
results might easily require a series of separate processing frameworks, such as Apache
Oozie. Spark, on the other hand, offers the ability to combine these together, crossing
boundaries between batch, streaming, and interactive workflows in ways that make the
user more productive.
Spark jobs perform multiple operations consecutively, in memory, spilling to
disk only when required by memory limitations. Spark simplifies the management of these
disparate processes, offering an integrated whole – a data pipeline that is easier to
configure, easier to run, and easier to maintain. In use cases such as ETL, these pipelines
can become extremely rich and complex, combining large numbers of inputs and a wide
range of processing steps into a unified whole that consistently delivers the desired result.
SUMMARY
1. This chapter introduced Apache Spark and its history and explored some of the areas
in which its particular set of capabilities shows the most promise.
Performance
There’s no lack of information on the Internet about how fast Spark is compared to
MapReduce. The problem with comparing the two is that they perform processing
differently, which is covered in the Data Processing section. The reason that Spark is so
fast is that it processes everything in memory, although it can also use disk for data that
doesn't all fit into memory.
Spark’s in-memory processing delivers near real-time analytics for data from marketing
campaigns, machine learning, Internet of Things sensors, log monitoring, security
analytics, and social media sites. MapReduce, by contrast, uses batch processing and was
never built for blinding speed. It was originally set up to continuously gather
information from websites, and there were no requirements for this data in or near real
time.
Ease of Use
Spark is well known for its performance, but it’s also somewhat well known for its ease
of use in that it comes with user-friendly APIs for Scala (its native language), Java,
Python, and Spark SQL. Spark SQL is very similar to SQL-92, so there's almost no
learning curve required in order to use it.
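To show the flavor of such SQL-92 style queries, here is a standard GROUP BY query run through Python's built-in sqlite3 module. The same SELECT statement would be recognizable to anyone writing Spark SQL; the events table and its rows are invented for illustration:

```python
import sqlite3

# An in-memory database standing in for a registered Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 7), ("alice", 2)])

# A plain SQL-92 aggregate query: total clicks per user.
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events "
    "GROUP BY user ORDER BY total DESC").fetchall()
```

Because the dialect stays this close to standard SQL, analysts who already know SQL can be productive in Spark SQL almost immediately.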
Spark also has an interactive mode so that developers and users alike can have immediate
feedback for queries and other actions. MapReduce has no interactive mode, but add-ons
such as Hive and Pig make working with MapReduce a little easier for adopters.
Costs
Both MapReduce and Spark are Apache projects, which means that they’re open source
and free software products. While there’s no cost for the software, there are costs
associated with running either platform in personnel and in hardware. Both products are
designed to run on commodity hardware, such as low-cost, so-called white-box server
systems.
MapReduce and Spark run on the same hardware, so where are the cost differences
between the two solutions? MapReduce uses standard amounts of memory because its
processing is disk-based, so a company will have to purchase faster disks and a lot of disk
space to run MapReduce. MapReduce also requires more systems to distribute the disk
I/O over multiple systems.
Spark requires a lot of memory, but can deal with a standard amount of disk that runs at
standard speeds. Some users have complained about temporary files and their cleanup.
Typically these temporary files are kept for seven days to speed up any processing on the
same data sets. Disk space is a relatively inexpensive commodity, and since Spark does
not use disk I/O for processing, the disk space used can be provided via SAN or NAS.
It is true, however, that Spark systems cost more because of the large amounts of RAM
required to run everything in memory. But what’s also true is that Spark’s technology
reduces the number of required systems. So, you have significantly fewer systems that
cost more. There’s probably a point at which Spark actually reduces costs per unit of
computation even with the additional RAM requirement.
To illustrate, “Spark has been shown to work well up to petabytes. It has been used to
sort 100 TB of data 3X faster than Hadoop MapReduce on one-tenth of the machines.”
This feat won Spark the 2014 Daytona GraySort Benchmark.
Compatibility
MapReduce and Spark are compatible with each other, and Spark shares all of
MapReduce's compatibility with data sources, file formats, and business intelligence
tools via JDBC and ODBC.
Spark also includes its own graph computation library, GraphX. GraphX allows users to
view the same data as graphs and as collections. Users can also transform and join graphs
with Resilient Distributed Datasets (RDDs), discussed in the Fault Tolerance section.
Fault Tolerance
For fault tolerance, MapReduce and Spark resolve the problem from two different
directions. MapReduce uses TaskTrackers that provide heartbeats to the JobTracker. If a
heartbeat is missed then the JobTracker reschedules all pending and in-progress
operations to another TaskTracker. This method is effective in providing fault tolerance;
however, it can significantly increase the completion times for operations that have even
a single failure.
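The reschedule-on-missed-heartbeat idea can be mimicked with a few lines of Python. This is a conceptual sketch, not Hadoop's actual scheduler, and all the names here are hypothetical:

```python
def run_with_reschedule(task, trackers):
    """Toy JobTracker: try the task on each tracker in turn,
    rescheduling to the next one when a tracker fails
    (standing in for a missed heartbeat)."""
    for tracker in trackers:
        try:
            return tracker(task)
        except RuntimeError:  # tracker is dead; reschedule
            continue
    raise RuntimeError("all trackers failed")

def dead_tracker(task):
    raise RuntimeError("no heartbeat")

def healthy_tracker(task):
    return task()

# The work completes despite the first tracker failing, but only
# after the time spent discovering the failure and rescheduling.
result = run_with_reschedule(lambda: sum(range(10)),
                             [dead_tracker, healthy_tracker])
```

The retry makes the job succeed, but, as noted above, every failure adds a full rescheduling round trip to the completion time.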
Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of
elements that can be operated on in parallel. RDDs can reference a dataset in an external
storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a
Hadoop InputFormat. Spark can create RDDs from any storage source supported by
Hadoop, including local filesystems or one of those listed previously.
Internally, each RDD is characterized by five main properties:
o A list of partitions
o A function for computing each split
o A list of dependencies on other RDDs
o Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is
hash-partitioned)
o Optionally, a list of preferred locations to compute each split on (e.g.
block locations for an HDFS file)
RDDs can be persisted in order to cache a dataset in memory across operations. This
allows future actions to be much faster, by as much as ten times. Spark's cache is fault-
tolerant in that if any partition of an RDD is lost, it will automatically be recomputed by
using the original transformations.
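Lineage-based recovery can be illustrated with a toy Python class: a partition remembers its parent data and its transformation, so a lost cached copy is simply recomputed. This is a conceptual sketch, not Spark's implementation, and Partition here is an invented name:

```python
class Partition:
    """Toy lineage model: a partition remembers how it was derived,
    so a lost in-memory copy can be recomputed on demand."""
    def __init__(self, parent_data, transformation):
        self.parent_data = parent_data
        self.transformation = transformation
        self.cached = None

    def compute(self):
        # Re-apply the original transformation to the parent data.
        return [self.transformation(x) for x in self.parent_data]

    def get(self):
        if self.cached is None:           # lost, or never cached
            self.cached = self.compute()  # recover via lineage
        return self.cached

part = Partition([1, 2, 3], lambda x: x * 10)
before_failure = part.get()
part.cached = None           # simulate losing the cached partition
after_failure = part.get()   # transparently recomputed
```

No replica of the cached data is kept; the recipe for rebuilding it is cheap to store, which is why this style of fault tolerance costs so little until a failure actually happens.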
Scalability
By definition, both MapReduce and Spark are scalable using the HDFS. So how big can a
Hadoop cluster grow?
Yahoo reportedly has a 42,000 node Hadoop cluster, so perhaps the sky really is the limit.
The largest known Spark cluster is 8,000 nodes, but as big data grows, it’s expected that
cluster sizes will increase to maintain throughput expectations.
Security
Hadoop supports Kerberos authentication, which is somewhat painful to manage.
However, third-party vendors have enabled organizations to leverage Active Directory
Kerberos and LDAP for authentication. Those same third-party vendors also offer
encryption for data in flight and data at rest.
Hadoop’s Distributed File System supports access control lists (ACLs) and a traditional
file permissions model. For user control in job submission, Hadoop provides Service
Level Authorization, which ensures that clients have the right permissions.
Spark’s security is a bit sparse in that it currently supports only authentication via shared
secret (password authentication). The security bonus that Spark can enjoy is that if you
run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally,
Spark can run on YARN, giving it the capability of using Kerberos authentication.
Summary
At first glance, it seems that using Spark would be the default choice for any big data
application. However, that’s not the case. MapReduce has made inroads into the big data
market for businesses that need huge datasets brought under control by commodity
systems. Spark’s speed, agility, and relative ease of use are perfect complements to
MapReduce’s low cost of operation.
The truth is that Spark and MapReduce have a symbiotic relationship with each other.
Hadoop provides features that Spark does not possess, such as a distributed file system
and Spark provides real-time, in-memory processing for those data sets that require it.
The perfect big data scenario is exactly as the designers intended—for Hadoop and Spark
to work together on the same team.
Conclusion
There are plenty of options for processing data within a Smart Card system.
For batch-only workloads that are not time-sensitive, Hadoop is a good choice that is
likely less expensive to implement than some other solutions.
For stream-only workloads, Storm has wide language support and can deliver very low
latency processing, but can deliver duplicates and cannot guarantee ordering in its default
configuration. Samza integrates tightly with YARN and Kafka in order to provide
flexibility, easy multi-team usage, and straightforward replication and state management.
For mixed workloads, Spark provides high speed batch processing and micro-batch
processing for streaming. It has wide support, integrated libraries and tooling, and
flexible integrations. Flink provides true stream processing with batch processing
support. It is heavily optimized, can run tasks written for other platforms, and provides
low latency processing, but is still in the early days of adoption.
The best fit for your situation will depend heavily upon the state of the data to process,
how time-bound your requirements are, and what kind of results you are interested in.
There are trade-offs between implementing an all-in-one solution and working with
tightly focused projects, and there are similar considerations when evaluating new and
innovative solutions over their mature and well-tested counterparts.
References
[2] https://mapr.com/blog/spark-101-what-it-what-it-does-and-why-it-matters/
[3] https://www.guru99.com/what-is-big-data.html#1
[4] https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared#conclusion