
UNIT I

INTRODUCTION TO BIG DATA


Big Data – Definition, Characteristic Features – Big Data Applications - Big Data vs. Traditional
Data - Risks of Big Data - Structure of Big Data - Challenges of Conventional Systems - Web
Data – Evolution of Analytic Scalability - Evolution of Analytic Processes, Tools and methods -
Analysis vs. Reporting - Modern Data Analytic Tools.

1.1 DEFINITION:

Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Big data
challenges include capturing data, data storage, data analysis, search, sharing, transfer,
visualization, querying, updating and information privacy.

1.2 CHARACTERISTICS OF BIG DATA:

(i) Volume

The name 'Big Data' itself is related to a size which is enormous. The size of data plays a very crucial role in determining its value. Whether a particular data set can actually be considered Big Data or not also depends on its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.

(ii) Variety

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.

(iii) Velocity

The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential of the data. Big data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors and mobile devices. The flow of data is massive and continuous.

(iv) Variability

Variability refers to the inconsistency that the data can show at times, which hampers the ability to handle and manage the data effectively.

1.3 BIG DATA APPLICATIONS:

Some of the industries propelled by big data analytics are:

 Public Sector Services.


 Healthcare contributions.
 Learning Services.
 Insurance Services.
 Industrialized and Natural Resources.
 Transportation Services.
 Banking Sectors and Fraud Detection

Real-Time Big Data Applications:

a) Procurement with Big data

With big data, demand can be forecast accurately under different conditions.

b) Big data in Product development

Big data helps determine which products should be developed to increase sales.

c) Big data in manufacturing sector

Big data can be used to identify machinery and process variations that may be indicators
of quality problems.

d) Big data for product distribution

The available data can be analysed to ensure that products are distributed to the right markets.

e) Big data in Marketing field

Big data helps identify better marketing strategies that can increase sales.

f) Price Management using Big data

Price management plays a key role in maintaining market position, and big data helps businesses track the market trends needed for it.

g) Merchandising

Big data plays a major role in merchandising and sales for the retail market as well.

h) Big data in Sales

Big data helps increase sales for the business. It also helps optimize the assignment of sales resources and accounts, the product mix and other operations.

i) Store Operations using Big Data

Different tools can be used to monitor store operations and reduce manual work. Big data helps adjust inventory levels on the basis of predicted buying patterns, demographics, weather, key events and other factors.

j) Big data in Human Resources

Big data has changed recruitment and other HR operations. It can also reveal the characteristics and behaviours of successful and effective employees, along with other employee insights, to manage talent better.

k) Big data in Banking

Big data has given banks such as Citibank the opportunity to see the big picture while balancing the sensitive nature of the data: delivering value to clients while prioritizing the privacy and protection of their information. It has been fully adopted by many companies to drive business growth and enhance the services they provide to customers; tax authorities such as the Income Tax department have benefited from big data in a similar way.

l) Big data in Finance sector

Financial services have widely adopted big data analytics to inform better investment decisions with consistent returns. For financial services, big data has swung from a passing fad to large-scale deployments.

m) Big data in Telecom

The report "Global Big Data Analytics Market in Telecom Industry 2014-2018" found that the use of data analytics tools in the telecom sector was expected to grow at a compound annual growth rate of 28.28 percent over that period. Mobile telecom operators harness big data with combined Actuate and Hadoop solutions.

n) Big data in retail sector

Retailers harness big data to offer consumers personalized shopping experiences. Analysing how a customer came to make a purchase, the path to purchase, is one way big data technology is making a mark in retail. Some 66 percent of retailers have made financial gains in customer relationship management through big data.

o) Big data in HealthCare

Big data is used for analysing data in the electronic medical record (EMR) system with the goal of reducing costs and improving patient care. This data includes unstructured data from physician notes, pathology reports and similar sources. Big data and healthcare analytics have the power to predict, prevent and cure diseases.

p) Big data in Media and Entertainment

Big data is changing the media and entertainment industry, giving users and viewers a
much more personalized and enriched experience. Big data is used for increasing revenues,
understanding real-time customer sentiment, increasing marketing effectiveness and ratings and
viewership.

q) Big Data in tourism

Big data is transforming the global tourism industry. People know more about the world
than ever before. People have much more detailed itineraries these days with the help of Big
data.

r) Big data in Airlines

Big Data and Analytics give wings to the Aviation Industry. An airline now knows where
a plane is headed, where a passenger is sitting, and what a passenger is viewing on the IFE or
connectivity system.

s) Big data in Social Media

Big data is a driving factor behind every marketing decision made by social media
companies and it is driving personalization to the extreme.

1.4 BIG DATA VS TRADITIONAL DATA:

The major differences between traditional data and big data are discussed below.

Data architecture

Traditional data uses a centralized database architecture in which large and complex problems are solved by a single computer system. Centralized architecture is costly and ineffective for processing large amounts of data. Big data is based on a distributed database architecture in which a large block of data is divided into several smaller pieces, and the solution to a problem is computed by several different computers in a network. The computers communicate with each other in order to find the solution. The distributed database provides better computing at a lower price and also improves performance compared to the centralized database system, because centralized architecture is based on mainframes, which are not as economical as the microprocessors used in a distributed database system. The distributed database also has more computational power than the centralized database system used to manage traditional data.
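To make the divide-and-combine idea concrete, here is a minimal Python sketch; it is an illustration only, not a real distributed database. Local worker processes stand in for the separate computers, and the data, chunk size and worker count are arbitrary choices for illustration.

```python
# Toy illustration of distributed processing: a large block of data is
# split into smaller chunks, each chunk is handled by a separate worker
# process (a stand-in for a node), and the partial results are combined.
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for the work one node would do on its share of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))              # the "large block" of data
    chunk_size = 250_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=4) as pool:            # four "nodes" working in parallel
        partial_results = pool.map(process_chunk, chunks)

    total = sum(partial_results)               # combine the partial answers
    print(total)
```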

Types of data

Traditional database systems are based on structured data, i.e. traditional data is stored in fixed formats or fields in a file. Examples of structured data include relational database systems (RDBMS) and spreadsheets, which only answer questions about what happened. A traditional database therefore provides insight into a problem only at a small scale. To enhance the ability of an organization to gain more insight into the data, and also to learn about metadata, unstructured data is used. Big data uses semi-structured and unstructured data and improves the variety of the data gathered from different sources such as customers, audiences or subscribers. After collection, big data transforms it into knowledge-based information.

Volume of data

A traditional database system can store only a small amount of data, ranging from gigabytes to terabytes. Big data, however, helps to store and process very large amounts of data, consisting of hundreds of terabytes or petabytes and beyond. Storing such massive amounts of data reduces the overall cost of storage and helps in providing business intelligence.

Data schema

Big data uses a dynamic schema for data storage. Both unstructured and structured information can be stored, and any schema can be used, since the schema is applied only after a query is generated. Big data is stored in raw form and the schema is applied only when the data is to be read. This process is beneficial for preserving the information present in the data. The traditional database, by contrast, is based on a fixed, static schema: the schema is applied and validated during the write operation, and the stored structure cannot easily be changed once the data is saved.
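The schema-on-read idea can be illustrated with a small, hedged Python sketch. The records and field names below are made up; the point is simply that data is stored exactly as it arrives and a schema is applied only at query time.

```python
# Hedged sketch of schema-on-read: records are stored as raw JSON text
# with no fixed schema, and a schema (field names and types) is applied
# only at the moment the data is read for a particular query.
import json

raw_store = [                                  # data kept exactly as it arrived
    '{"user": "alice", "age": "34", "city": "Chennai"}',
    '{"user": "bob", "clicks": 17}',           # different fields are allowed
]

def read_with_schema(raw_records, schema):
    """Apply a schema (field name -> type) while reading, ignoring missing fields."""
    for record in raw_records:
        doc = json.loads(record)
        yield {field: cast(doc[field]) for field, cast in schema.items() if field in doc}

# The schema is chosen by the query, not fixed at write time.
for row in read_with_schema(raw_store, {"user": str, "age": int}):
    print(row)
```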

Data relationship

In the traditional database system, relationships between data items can be explored easily because the amount of information stored is small. Big data, however, contains massive or voluminous data, which increases the difficulty of figuring out the relationships between data items.

Scaling

Scaling refers to the demand for the resources and servers required to carry out the computation. Big data is based on a scale-out architecture, under which distributed approaches to computing are employed across more than one server, so the computational load is shared among many machines rather than resting on a single application-based system. Achieving scalability in the traditional database, however, is very difficult because the traditional database runs on a single server and requires expensive servers to scale up.

Higher cost of traditional data

A traditional database system requires complex and expensive hardware and software in order to manage large amounts of data. Moving the data from one system to another also requires additional hardware and software resources, which increases the cost significantly. In the case of big data, the massive amount of data is segregated between various systems, so the amount of data handled by each system decreases. Big data is therefore comparatively simple to use, relying on commodity hardware and open source software to process the data.

Accuracy and confidence

Under the traditional database system it is very expensive to store massive amounts of data, so not all the data can be stored. This decreases the amount of data available for analysis, which in turn lowers the accuracy and confidence of the results. In big data systems, the cost of storing voluminous data is lower, so the data can be stored in full and points of correlation can be identified, providing highly accurate results.

1.5 RISKS OF BIG DATA:

Data Security

This risk is obvious and often uppermost in our minds when we are considering the
logistics of data collection and analysis. Data theft is a rampant and growing area of crime – and
attacks are getting bigger and more damaging.

The bigger your data, the bigger the target it presents to criminals with the tools to steal
and sell it. In the case of Target, hackers stole credit and debit card information of 40 million
customers, as well as personal identifying information such as email and geographical addresses
of up to 110 million people. In March, a federal judge approved a settlement in which Target
would pay $10 million into a settlement fund, from which payments of up to $10,000 would be
made to everyone affected by the breach.

Data Privacy

Closely related to the issue of security is privacy. But in addition to ensuring that
people’s personal data are safe from criminals, you need to be sure that the sensitive information
you are storing and collecting isn’t going to be divulged through less malevolent but equally
damaging misuse by yourself or by people to whom you have delegated responsibility for
analyzing and reporting on it.

Failing to follow applicable data protection laws can lead to expensive lawsuits
and even prison, depending on what sort of data you are using and the jurisdiction you are in.
Last year, private hire and car sharing service Uber stirred up controversy when one of its
executives was caught using the service’s “God mode” to track the movements of BuzzFeed
journalist Johana Bhuiyan.

Costs

Data collection, aggregation, storage, analysis, and reporting all cost money. On top of
this, there will be compliance costs – to avoid falling foul of the issues I raised in the previous
point. These costs can be mitigated by careful budgeting during the planning stages, but getting it
wrong at that point can lead to spiralling costs, potentially negating any value added to your
bottom line by your data-driven initiative. This is why “starting with strategy” is so vital. A well-
developed strategy will clearly set out what you intend to achieve and the benefits that can be
gained so they can be balanced against the resources allocated to the project. One bank that I
worked with was worried about the costs of storing and maintaining all the data it was collecting
to the point that it was considering pulling the plug on one particular analytics project, as the
costs looked likely to exceed any potential savings. By identifying and eliminating irrelevant
data from the project, the bank was able to bring costs back under control and achieve its
objectives.

Bad Analytics

Aka “getting it wrong.” Misinterpreting the patterns shown by your data and drawing
causal links where there is in fact merely random coincidence is an obvious pitfall. Sales data
may show a rise following a major sporting event, prompting you to draw a link between sports
fans and your products or services, when in fact the rise is based on there being more people in
town, and the rise would be equally dramatic after a large live music event.

In addition, care must be taken to avoid confirmation bias – easily imposed when an analyst
comes to a project with predetermined ideas about what they are looking for and is blinded to
insights from the data that go against these preconceived notions. The only way to protect against
this is to ensure that you are implementing all best practice procedures from top to bottom
throughout your project.

Google’s Flu Trends project serves as a good example. Designed to produce accurate maps of flu
outbreaks based on the searches being made by Google users, at first it provided compelling
results. But as time went on, its predictions began to diverge increasingly from reality. It turned
out that the algorithms behind the project just weren’t accurate enough to pick up anomalies such
as the 2009 H1N1 pandemic, vastly reducing the value that could be gained from them.

Bad Data

I’ve come across many data projects that start off on the wrong foot by collecting irrelevant, out of date, or erroneous data. This usually comes down to insufficient time being spent on designing the project strategy. The big data gold rush has led to a “collect everything and think about analyzing it later” approach at many organizations. This not only adds to the growing cost of storing the data and ensuring compliance, it leads to large amounts of data that can become outdated very quickly.

The real danger here is falling behind your competition. If you are not analyzing the right
data, you won’t be drawing the right insights that will provide value. Meanwhile, your
competitors most likely will be running their own data projects. And if they are getting it right,
they’ll take the lead. A healthcare client I recently worked with created a 217-page report for
senior management. A lot of the data in the report would have been useful, but it was drowned
out by irrelevant background noise. Working with them, I was able to show them how to cut the
report down to 20 pages, mostly infographics, which clearly showed the relevant data while
omitting a lot of the noise.

That’s just a simple checklist of the risks that every big data project needs to account for
before one cent is spent on infrastructure or data collecting. Businesses of all sizes should engage
wholeheartedly with big data projects. If they don’t, they run the serious risk of being left
behind. But they also should be aware of the risks and enter into big data projects with their eyes
wide open.

1.6 STRUCTURE OF BIG DATA:

Figure: Big Data structures, models and their linkage at different processing stages.

1.7 CHALLENGES OF CONVENTIONAL SYSTEMS:

In the past, the term 'analytics' has been used in the business intelligence world to describe tools that provide insight into data through fast, consistent, interactive access to a wide variety of possible views of information. Data mining has been used in enterprises to keep pace with the critical monitoring and analysis of mountains of data. The main challenge of the traditional approach is how to unearth all the hidden information in this vast amount of data.
 Traditional analytics analyzes only a known data terrain, and only data that is already well understood. It cannot work on unstructured data efficiently.
 Traditional analytics is built on top of the relational data model: relationships between the subjects of interest are created inside the system and the analysis is done based on them. This approach is not adequate for big data analytics.
 Traditional analytics is batch oriented: we need to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained.
 Parallelism in a traditional analytics system is achieved through costly hardware such as MPP (Massively Parallel Processing) systems.
 Inadequate support for aggregated summaries of data.

Apart from these, the challenges can be categorized as follows.

Data challenges
- Volume, velocity, veracity, variety
- Data discovery and comprehensiveness
- Scalability

Process challenges
- Capturing data
- Aligning data from different sources
- Transforming data into a form suitable for analysis
- Modeling data (mathematically, through simulation)
- Understanding the output, visualizing results and display issues on mobile devices

Management challenges
- Security
- Privacy
- Governance
- Ethical issues

Traditional/RDBMS challenges
- Designed to handle only well-structured data
- Traditional storage vendor solutions are very expensive
- Shared block-level storage is too slow
- Data is read in 8 KB or 16 KB block sizes
- Schema-on-write requires data to be validated before it can be written to disk
- Software licenses are too expensive
- Getting data from disk and loading it into memory must be handled by the application

1.8 WEB DATA:

In the world of Big Data, there's a lot of talk about unstructured data -- after all, "variety"
is one of the three Vs. Often these discussions dwell on log file data, sensor output or media
content. But what about data on the Web itself -- not data from Web APIs, but data on Web
pages that were designed more for eyeballing than machine-driven query and storage? How can
this data be read, especially at scale? Recently, I had a chat with the CTO and Founder of
Kapow Software, Stefan Andreasen, who showed me how the company's Katalyst product tames
data-rich Web sites not designed for machine-readability.

Scraping the Web:

If you're a programmer, you know that Web pages are simply visualizations of HTML
markup -- in effect every visible Web page is really just a rendering of a big string of text. And
because of that, the data you may want out of a Web page can usually be extracted by looking for
occurrences of certain text immediately preceding and following that data, and taking what's in
between.

Code that performs data extraction through this sort of string manipulation is sometimes said to be performing Web "scraping." The term pays homage to "screen scraping," a similar, though much older, technique used to extract data from mainframe terminal screen text. Web scraping has significant relevance to Big Data. Even in cases where the bulk of a Big Data set comes from flat files or databases, augmenting that with up-to-date reference data from the Web can be very attractive, if not outright required.
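A minimal sketch of this delimiter-based extraction in Python is shown below; the HTML snippet and the delimiter strings are made up for illustration, and real pages would of course need more robust handling.

```python
# Minimal sketch of the "text between delimiters" idea described above.
# The HTML and the delimiter strings are illustrative assumptions only.
html = '<div class="price">Fare: <span>$432</span> per ticket</div>'

def scrape_between(text, before, after):
    """Return the substring found between two known text fragments, or None."""
    start = text.find(before)
    if start == -1:
        return None
    start += len(before)
    end = text.find(after, start)
    return text[start:end] if end != -1 else None

print(scrape_between(html, "<span>", "</span>"))   # -> $432
```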

Unlocking Important Data:

But not all data is available through downloads, feeds or APIs. This is especially true of government data, various Open Data initiatives notwithstanding. Agencies like the US Patent and Trademark Office (USPTO) and the Securities and Exchange Commission (SEC) have tons of data available online, but API access may require subscriptions from third parties.

Similarly, there's lots of commercial data available online that may not be neatly packaged in
code-friendly formats either. Consider airline and hotel frequent flyer/loyalty program
promotions. You can log into your account and read about them, but just try getting a list of all
such promotions that may apply to a specific property or geographic area, and keeping the list
up-to-date. If you're an industry analyst wanting to perform ad hoc analytical queries across
such offers, you may be really stuck.

Downside Risk:

So it's Web scraping to the rescue, right? Not exactly, because Web scraping code can be brittle. If the layout of a data-containing Web page changes -- even by just a little -- the text patterns being searched may be rendered incorrect, and a mission critical process may completely break down. Fixing the broken code may involve manual inspection of the page's new markup, then updating the delimiting text fragments, which would hopefully be stored in a database, but might even be in the code itself.

Such an approach is neither reliable, nor scalable. Writing the code is expensive and updating it
is too. What is really needed for this kind of work is a scripting engine which determines the
URLs it needs to visit, the data it needs to extract and the processing it must subsequently
perform on the data. What's more, allowing the data desired for extraction, and the delimiters
around it, to be identified visually, would allow for far faster authoring and updating than would
manual inspection of HTML markup.

An engine like this has really been needed for years, but the rise of Big Data has increased the urgency, because this data is no longer needed just for simple and quick updates. In the era of Big Data, we need to collect lots of this data and analyze it.

Making it Real:

Kapow Software's Katalyst product meets the spec, and then some. It provides all the
wish list items above: visual and interactive declaration of desired URLs, data to extract and
delimiting entities in the page. So far, so good. But Katalyst doesn't just build a black box that
grabs the data for you. Instead, it actually exposes an API around its extraction processes, thus
enabling other code and other tools to extract the data directly.

That's great for public Web sites that you wish to extract data from, but it's also good for adding
an API to your own internal Web applications without having to write any code. In effect,
Katalyst builds data services around existing Web sites and Web applications, does so without requiring any coding, and makes any breaking layout changes in those products minimally disruptive.

Maybe the nicest thing about Katalyst is that it's designed with data extraction and analysis in
mind, and it provides a manageability layer atop all of its data integration processes, making it
perfect for Big Data applications where repeatability, manageability, maintainability and
scalability are all essential.

Web Data is BI, and Big Data:

Katalyst isn't just a tweaky programmer's toolkit. It's a real, live data integration tool. Maybe that's why Informatica, a big name in BI which just put out its 9.5 release, announced a strategic partnership with Kapow Software. As a result, Informatica PowerExchange for Kapow Katalyst will be made available as part of Informatica 9.5. Version 9.5 is the Big Data release of Informatica, with the ability to treat Hadoop as a standard data source and destination. Integrating with this version of Informatica makes the utility of Katalyst in Big Data applications not merely a provable idea, but a product reality.

1.9 ANALYSIS Vs REPORTING:

There are five differences between reporting and analysis:

1. Purpose

Reporting has helped companies monitor their data since before digital technology boomed. Various organizations depend on the information it brings to their business, as reporting extracts that information and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link cross-channel data, provide comparisons, and make information easier to understand (think of dashboards, charts and graphs, which are reporting tools, not analysis outputs), analysis interprets this information and provides recommendations on actions.

2. Tasks

Because reporting and analysis are divided by a very fine line, it is easy to label a task as analysis when all it really does is reporting. Hence, ensure that your analytics team maintains a healthy balance of both.

Here’s a great differentiator to keep in mind when deciding whether what you’re doing is reporting or analysis:

Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It is very similar to the activities mentioned above, such as turning data into charts and graphs and linking data across multiple channels.

Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big data, predicting is possible as well.

3. Outputs

Reporting and analysis have a push-and-pull relationship with their users through their outputs. Reporting has a push approach: it pushes information to users, and outputs come in the form of canned reports, dashboards, and alerts.

Analysis has a pull approach, where a data analyst pulls information to probe further and to answer business questions. Outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations consist of insights, recommended actions, and a forecast of their impact on the company, all in a language that’s easy to understand at the level of the user who will be reading and deciding on it.

This is important for organizations to truly realize the value of data: a standard report is not the same as meaningful analytics.

4. Delivery

Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It’s not surprising that data entry services are among the first things outsourced, since outsourcing companies are perceived as data reporting experts.

Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps towards accomplishing a specific goal. This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations that help leaders and business executives make decisions about their businesses.

5. Value

This isn’t about identifying which one brings more value; rather, it’s about understanding that both are indispensable when looking at the big picture. Together they should help businesses grow, expand, move forward, and make more profit or increase their value.

1.10 MODERN DATA ANALYTIC TOOLS:

Following are some of the prominent big data analytics tools and techniques that are used by
analytics developers.

Cassandra:

This is one of the most applauded and widely used big data tools because it offers effective management of large and intricate amounts of data. It is a database that offers high availability and scalability without compromising the performance of commodity hardware and cloud infrastructure. Cassandra's advantages include fault tolerance, decentralization, durability, performance, professional support, elasticity, and scalability, which is why it is loved by analytics developers. Companies using the Cassandra big data analytics tool include eBay and Netflix.
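As a rough illustration, querying Cassandra from Python might look like the following sketch. It assumes the DataStax cassandra-driver package and an illustrative keyspace, table and contact point; it is not taken from any particular deployment.

```python
# Hedged sketch using the DataStax Python driver (pip install cassandra-driver).
# The contact point, keyspace and table names are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])               # connect to a local Cassandra node
session = cluster.connect("shop")              # assumed keyspace

rows = session.execute(
    "SELECT user_id, total FROM orders WHERE user_id = %s", ("u123",)
)
for row in rows:
    print(row.user_id, row.total)

cluster.shutdown()
```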

Hadoop:

This is a striking product from Apache which has been used by many eminent companies. Hadoop is basically an open-source software framework, written in Java, for working with very large data sets. It is designed to scale up from a single server to hundreds of machines. The most prominent feature of this software library is its superior processing of voluminous data sets. Many companies choose Hadoop because of its great processing capabilities, and its developers provide regular updates and improvements to the product.
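Hadoop jobs are written in Java, but the Hadoop Streaming interface lets the mapper and reducer be any program that reads standard input and writes standard output. The following hedged Python sketch of the classic word count assumes the two functions would be submitted as separate scripts to the hadoop-streaming jar; the file names are assumptions.

```python
# Hedged sketch of Hadoop Streaming word count in Python. Hadoop Streaming
# pipes input lines to the mapper on stdin and collects "key<TAB>value"
# lines from stdout; the sorted mapper output is then fed to the reducer.

import sys

# --- mapper.py: emit "word<TAB>1" for every word in the input ---
def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

# --- reducer.py: sum the counts for each word (input arrives sorted by key) ---
def reducer():
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    # In a real job, mapper and reducer live in separate files passed to
    # the hadoop-streaming jar; calling mapper() here is just for local testing.
    mapper()
```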

13
Knime:

This is an open source big data analytics tool. Knime is a leading analytics platform which provides an open solution for data-driven innovation. With the help of this tool, you can discover the hidden potential of your data, mine for fresh insights, and predict new futures by analysing the data. With nearly 1,000 modules, hundreds of ready-to-run examples, a complete range of integrated tools, and a wealth of advanced algorithms available, the Knime analytics platform is an excellent toolbox for any data scientist who wants to accomplish the job in a hassle-free way. The tool supports many types of data, such as XML, JSON, images, documents, and more, and it also offers advanced predictive and machine learning algorithms.

OpenRefine:

Are you stuck with large and voluminous data sets? Then this tool is ideal for you: it helps you explore huge and messy data sets easily. Basically, OpenRefine helps to organize data in a database that would otherwise be a mess and muddle. The tool helps you clean data and transform it from one format into another, and it can also be used to link and extend your datasets with web services and other external data. OpenRefine was earlier known as Google Refine, but Google stopped supporting the project in 2012 and it was rebranded as OpenRefine.

R language:

R is an open source programming language which helps organizations manage and analyse large amounts of data effectively and aptly. The language was initially written by Ross Ihaka and Robert Gentleman, and it has received immense appreciation from mathematicians, statisticians, data scientists and data miners working in data analytics. R is packed with a host of data analysis tools which make the analysis of data simpler for users. With R, businesses don't need to develop customized tools and can easily get rid of time-consuming code. R is a prime data analysis environment which consists of innumerable algorithms designed for data retrieval, processing, analysis and high-end statistical graphics representations.

Plotly:

As a successful big data analytics tool, Plotly can be used to create rich, dynamic visualizations even when the organization has inadequate time or skills for meeting big data needs. With the help of this tool, you can create stunning and informative graphics very effortlessly. Basically, Plotly is used for composing, editing, and sharing interactive data visualizations via the web.
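A minimal hedged sketch with the Plotly Python library is shown below; the sample data and output file name are made up, and the result is a self-contained interactive HTML page that can be shared via the web.

```python
# Hedged sketch with Plotly Express (pip install plotly); the data is illustrative.
import plotly.express as px

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 160, 150]

fig = px.line(x=months, y=revenue, title="Monthly revenue",
              labels={"x": "month", "y": "revenue"})
fig.write_html("revenue.html")   # interactive chart that can be shared via the web
```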

Bokeh:

This tool has many resemblances with Plotly, and it is very effective and useful if you want to create easy and informative visualizations. Bokeh is a Python interactive visualization library which helps you create astounding and meaningful visual presentations of data in web browsers. Thus, this tool is widely used by experienced big data analysts to create interactive data applications, dashboards, and plots quickly and easily. Many data analytics experts consider Bokeh to be one of the most progressive and effective visual data representation tools.
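For comparison, a minimal Bokeh sketch might look like the following; the data points and output file name are illustrative assumptions.

```python
# Hedged Bokeh sketch (pip install bokeh); output_file + show render the
# interactive plot in a web browser.
from bokeh.plotting import figure, output_file, show

output_file("traffic.html")
p = figure(title="Site traffic", x_axis_label="hour", y_axis_label="visits")
p.line([0, 1, 2, 3, 4], [120, 90, 150, 300, 260], line_width=2)
show(p)
```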

Neo4j:

Neo4j is one of the leading big data analytics tools, taking the big data business to the next level. Neo4j is a graph database management system developed by Neo4j Inc. The tool helps to work with the connections between data items. These connections drive modern intelligent applications, and Neo4j turns them into competitive advantage. As per the DB-Engines ranking, Neo4j is the most popular graph database.
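As a rough sketch, querying Neo4j from Python with the official driver and a Cypher statement could look like the code below; the connection URI, credentials and graph model (Customer and Product nodes joined by BOUGHT relationships) are assumptions for illustration.

```python
# Hedged sketch with the official Neo4j Python driver (pip install neo4j).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Cypher query that follows the connections (relationships) between nodes.
    result = session.run(
        "MATCH (c:Customer)-[:BOUGHT]->(p:Product) "
        "RETURN c.name AS customer, p.name AS product LIMIT 5"
    )
    for record in result:
        print(record["customer"], "->", record["product"])

driver.close()
```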

RapidMiner:

This is certainly one of the favourite tools of data specialists. Like Knime, it is an open source data science platform which operates through visual programming. The tool can manipulate, analyse and model data and integrate it into business processes. RapidMiner helps data science teams become more productive by providing an open source platform for data preparation, model deployment, and machine learning. Its unified data science platform accelerates the building of complete analytical workflows: from data preparation to machine learning to model deployment, everything can be done in a single environment. This enhances efficiency and reduces the time needed for data science projects.

Wolfram Alpha:

If you want to do something new with your data, this could be an ideal tool for you: it will give you every minute detail of your data. This famous tool was developed by Wolfram Alpha LLC, a subsidiary of Wolfram Research. If you want to do advanced research in financial, historical, social, and other professional areas, then you should use this platform. For example, if you type Microsoft, you will receive miscellaneous information including input interpretation, fundamentals, financials, latest trades, price, performance comparisons, data return analysis, and much more relevant information.

Orange:

Orange is an open source data visualization and data analysis tool which can be used by both novices and experts in the field of data analytics. The tool provides interactive workflows with a large toolbox, with which you can create interactive workflows to analyse and visualize data. Orange is crammed with many different visualizations, from scatter plots, bar charts and trees to dendrograms, networks and heat maps; you can find everything in this tool.

NodeXL:

This is a data visualization and analysis software tool for relationships and networks, and it offers exact calculations to users. It is a free, open-source network analysis and visualization software tool with a wide range of applications. The tool is considered one of the best and latest statistical tools for data analysis, providing advanced network metrics, automation, access to social media network data importers, and many more features.

Storm:

Storm has established itself as one of the popular data analytics tools because of its superior streaming data processing capabilities in real time. You can even integrate this tool with other tools such as Apache Slider in order to manage and secure your data. Storm can be used by an organization in many cases such as data monetization, cybersecurity analytics, threat detection, operational dashboards, and real-time customer management. All these functions can enhance business growth and provide many opportunities for improving the business.
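The following sketch is conceptual only and does not use Storm's actual API; it simply illustrates the style of processing Storm is built for, handling each event as it arrives and keeping running aggregates instead of waiting for a batch job. The event stream and alert rule are made-up examples.

```python
# Conceptual sketch only -- NOT Storm's API. It illustrates per-event,
# real-time style processing: update state as each event arrives and
# react immediately, rather than processing data in nightly batches.
from collections import Counter

def event_stream():
    # Stand-in for a live feed (sensors, clickstream, security logs, ...).
    for event in ["login", "purchase", "login", "error", "login", "purchase"]:
        yield event

counts = Counter()
for event in event_stream():
    counts[event] += 1                      # running aggregate, updated per event
    if event == "error":
        print("ALERT: error observed")      # immediate reaction, e.g. threat detection

print(dict(counts))
```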

The list above should give you enough information about some of the best data analytics tools likely to dominate in the coming years. If you want to establish your business firmly, then enhance your knowledge of these data analytics tools.

