
10 factors to consider when selecting a visualization tool for Hadoop data


So you have just started on your first project on Hadoop. Having heard about big data and
Hadoop for months, you figured out your first use case that can be implemented using
Hadoop. Data has been extracted into a small cluster and you are now ready to 'leverage data'. This is
when you will face the question of selecting a data visualization tool.
In the past you might have faced the same question when you started your BI initiative.
But the dynamics of selecting a data visualization tool for Hadoop are quite different. Yes,
learnings from selecting a BI tool can be reused for selecting a visualization tool for Hadoop,
but there is much more to consider.
Having helped customers implement visualization tools for various types of structured,
unstructured and streaming data on Hadoop, here are the criteria I would summarize:

1. Budget available - Hadoop data visualization tools come in four categories. First, there
are enterprise BI tools like SAS, Cognos, Microstrategy and QlikView that have good Hadoop
compatibility. Then there are Hadoop-specific visualization tools like HUNK, Datameer and
Platfora that are built specifically for visualizing Hadoop data. Thirdly, there are open source
tools like Pentaho, BIRT and Jaspersoft that were early adopters of Hadoop and have probably
invested more in Hadoop compatibility than some of the bigger vendors. Finally, there are
charting libraries like RShiny, D3.js and Highcharts that are mostly open source or low cost
and offer good visualization, but require scripting and coding. As you move from the first
category to the fourth, software license costs keep reducing, but ease of development and
self-service capabilities reduce as well. There are some exceptions to this general trend,
though.
2. Your existing BI tool - Most probably your company is already using some BI tool or the
other. You may have SAS, Microstrategy, IBM Cognos or OBIEE in your company. Most
of these tools have made tremendous investments in enhancing their compatibility with the
Hadoop ecosystem: they have connectors for Hadoop and NoSQL databases, and graphical
tooling is available. It may be easiest for end users to work with something they are already
using. Think of using your existing BI tool for Hadoop data visualization unless it has
obvious drawbacks.
3. Hadoop distribution used - If you are using a Hadoop distribution from, say, Cloudera or
Hortonworks, you can safely select tools that are certified by these distributors. For example,
Tableau, Microstrategy, Pentaho and QlikView are all certified by Cloudera and have proven
connectors to the Cloudera distribution of Hadoop. Similarly, most of these tools are also
partners of Hortonworks. If your big data platform is IBM BigInsights, then going for Cognos
makes sense since both are IBM products and compatibility will not be an issue. It is always
advisable to check whether the tool you are selecting for visualization is certified for the
Hadoop distribution being used.
4. Nature of data - If the data you want to analyze is tabular, columnar data, then most of the
tools can provide suitable visualization facilities. However, if the data is, say, log data,
special-purpose charting libraries like timeplot may be a better option. Similarly, for social
media data, tools like Zoomdata provide better visualization capabilities.
5. End user profile - Who are your end users? Are they data scientists? Then a visualization
tool with very high-end visualization patterns will be required. If operational business users
(such as sales managers or finance managers) are the end users, then speed of delivery and
the cost of the tool (since the number of users may be very high) matter more than advanced
visualization.
6. Programming skills available - If you have good Java and JavaScript skills available in
house, going for scripting-based tools makes sense. Likewise, if you are an R shop with good
R programming capabilities, RShiny can be a good alternative. Standard BI tools such as
Microstrategy and Pentaho, on the other hand, allow writing SQL on top of Hadoop data (a
short sketch of this appears after the list). Tools like Datameer are schema-free, drag-and-drop
tools. In short, each tool comes with its own set of programming skill requirements, and you
need to make sure these requirements are compatible with the skills available in house.
7. Operating system - This is a basic checkbox when selecting a visualization tool. We
come across customers who use Linux platforms only; using Windows-based tools like
QlikView, Tableau or Microsoft BI is not possible in that case. Also, if you are planning an
implementation in the cloud, make sure your cloud provider can supply the OS required by the
visualization tool.
8. Visualization features required - Traditional BI tools that have added Hadoop capabilities
are more mature than the new entrants in providing the visualization patterns commonly
required. For example, multiple Y axes, support for HTML5 and animation, and user-friendly
drill-down are features that are very mature in traditional BI tools but still evolving in the new
entrants, open source BI tools and some charting libraries (see the dual-axis sketch after this
list). It is advisable to compare your visualization needs with the capabilities offered by the tools.
9. Data volume - Data volume and the streaming nature of the data are important considerations,
especially if you are thinking of an in-memory visualization tool. If your Hadoop data store
holds terabytes of data, data is being added in real time, and you plan to use an in-memory
visualization tool, then you need a mechanism to reduce the volume and feed data
continuously from Hadoop to the visualization tool (see the pre-aggregation sketch after this
list). This is possible, but not very simple. Be aware of the impact of real-time, high-volume
data on an in-memory architecture.
10. Industry experience - It is always advisable to rely on the dominant players in your
industry vertical. SAS, for example, has been used by banks to analyze big data for customer
intelligence and risk management. In cases like this, the availability of the underlying algorithms
and visualization patterns makes the big data project implementation much easier.
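
To make factor 6 a little more concrete, here is a minimal sketch of writing plain SQL over
Hadoop data from a script. It assumes a HiveServer2 endpoint plus the PyHive, pandas and
matplotlib packages; the host, database, table and column names are hypothetical.

    from pyhive import hive
    import pandas as pd
    import matplotlib.pyplot as plt

    # Connect to a HiveServer2 endpoint (hypothetical host and database).
    conn = hive.Connection(host="hadoop-edge-node", port=10000, database="sales")

    # Plain SQL over Hadoop data -- the same skill set a Microstrategy or
    # Pentaho developer would use, just exercised from a script.
    df = pd.read_sql(
        "SELECT region, SUM(revenue) AS revenue FROM orders GROUP BY region",
        conn,
    )

    # A basic chart of the result.
    df.plot.bar(x="region", y="revenue", legend=False)
    plt.ylabel("Revenue")
    plt.tight_layout()
    plt.show()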
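
For factor 8, here is a quick illustration of what "multiple Y axes" looks like when you build
it yourself with a charting library instead of getting it out of the box from a BI tool. The
figures are made up, and matplotlib is used purely as an example library.

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    revenue = [120, 135, 150, 160]        # hypothetical figures, in thousands
    margin_pct = [21.0, 22.5, 22.1, 23.4]

    fig, ax_revenue = plt.subplots()
    ax_revenue.plot(months, revenue, marker="o", color="tab:blue")
    ax_revenue.set_ylabel("Revenue (thousands)")

    # Second Y axis sharing the same X axis -- the "multiple Y axes" pattern.
    ax_margin = ax_revenue.twinx()
    ax_margin.plot(months, margin_pct, marker="s", color="tab:red")
    ax_margin.set_ylabel("Margin (%)")

    plt.title("Revenue and margin on separate Y axes")
    plt.tight_layout()
    plt.show()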
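
And for factor 9, a minimal sketch of one way to reduce volume before feeding an in-memory
visualization tool: pre-aggregate the raw events on the cluster and hand the tool only a small
summary. This assumes PySpark and Parquet data on HDFS; the paths and column names are
hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("viz-feed").getOrCreate()

    # Terabytes of raw event rows stored on the cluster (hypothetical path).
    events = spark.read.parquet("hdfs:///data/events/")

    # Collapse the raw rows into hourly counts and averages per event type.
    summary = (
        events
        .withColumn("hour", F.date_trunc("hour", F.col("event_time")))
        .groupBy("hour", "event_type")
        .agg(
            F.count("*").alias("events"),
            F.avg("latency_ms").alias("avg_latency_ms"),
        )
    )

    # A few thousand summary rows instead of billions of raw ones; this is
    # what the in-memory visualization tool actually loads.
    summary.write.mode("overwrite").parquet("hdfs:///data/summaries/hourly_events/")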

All of these factors need to be thought through carefully. Some of them, like operating system,
seem like no-brainers, but I have seen companies make an oversight and select a visualization
tool that later needed to be changed. After shortlisting visualization alternatives based on
these factors, you are ready for the next step in the journey of building a visualization
platform for big data: initiating a Proof of Concept. More about learnings from a data
visualization Proof of Concept in the next blog.

Big Data with QlikView


Deepak Tibhe
BI Consultant at Bristlecone

QlikView offers two approaches to Big Data analytics
Because Big Data is a relative term and the use cases and infrastructure in every organization
are different, QlikView offers two approaches to handling Big Data that put the power in the
hands of our customers to best manage the inherent tradeoffs between user performance and
data volume, variety, and velocity. It is worth mentioning that QlikView has the flexibility to be
developed in small cycles, so there is an opportunity to start developing before requirements
are fully pinned down. Although QlikView is well suited to this style of development, I
wouldn't suggest skipping requirements completely.

100% in-memory for maximum user performance:


With the patented data engine in QlikView's memory, data is compressed by roughly a factor
of 10, which means that a single server with 256 GB of RAM can store up to 2 TB of
uncompressed data. That equals billions of data rows, with response times that are only
possible with in-memory architectures.
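
As a rough sanity check of that sizing claim (a back-of-the-envelope sketch only, using the
factor-of-10 compression quoted above):

    # 256 GB of RAM with roughly 10x compression bounds the uncompressed
    # source data at about 2.5 TB; the "up to 2 TB" figure quoted above sits
    # under that ceiling, leaving working headroom.
    ram_gb = 256
    compression_factor = 10
    print(ram_gb * compression_factor / 1024, "TB")   # -> 2.5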

Other QlikView functions like document chaining and binary load accelerate the process of
exploring very large data sets. This is the path taken by many QlikView customers when
analyzing terabytes of data stored in data warehouses or Hadoop clusters and similar
archiving systems.

QlikView Direct Discovery for truly massive datasets


Did your company already invest in Big Data infrastructure? Then check out QlikView
Direct Discovery:
For companies that have already invested in a large data warehouse or other Big Data
infrastructure and don't want to load all data into QlikView's in-memory engine,
QlikView Direct Discovery is a hybrid approach that leverages both in-memory data and data
that is dynamically queried from an external source.
QlikView Direct Discovery offers three big things:
1. Query data directly from the Big Data repositories
2. Cache query results in memory for faster recall
3. Maintain associations among all data, wherever it is stored

This hybrid approach gives business users the possibility to benefit from Big Data with no
programming skills, and the ability to add context and insight while drilling down to granular
details.

When to Choose 100% In-Memory vs. Direct Discovery

Choose 100% In-Memory when:
- All the necessary (i.e., relevant and contextual) data can fit in server memory (roughly 200 GB compressed).
- Users require only aggregated or summary data (e.g., hourly or daily averages), or record-level detail over a limited time period.
- Query performance of the external source is not satisfactory, or the number of queries expected from concurrent users would negatively impact the Big Data repository.

Choose Direct Discovery when:
- Data cannot fit in memory and document chaining is not a viable solution due to the analytical requirements.
- Users require access to record-level detail stored in a large fact table that will not fit in memory.
- Network bandwidth limitations mean that it would take too long to copy raw data to a QlikView server; Direct Discovery queries return aggregated data and hence require less bandwidth.

Use case: demo app


Big data analyzed and visualized by Qlik and ParStream in the Amazon cloud

QlikView analyzes 50 billion rows' worth of retail data in seconds. In this powerful demo,
both QlikView and ParStream are running in Amazon's cloud. You'll see how QlikView is
able to connect to the ParStream database in real time using Direct Discovery.
Please refer to the link below for the demo app:
https://www.youtube.com/watch?v=xSFbCtV7__8

"Big Data"
The term "Big Data" has been thrown around for several years, and yet it continues to have a
very vague definition. In fact, there are no two big data installations and configurations alike
insert snowflake paradigm here. Its no surprise, given the unique nature of big data, it
cannot be forced into an abstract model. These type of data systems evolve organically, and
morph based on the ever changing business requirements.

If we accept that no two big data systems are alike, how can one deliver analytics from
those systems with a singular approach?

Well, we can't; in fact, it would be quite limiting to do so. Why?

Picking one and only one method of analysis prevents the basic question "What problem is
the business user trying to solve?" from being answered. So what do I mean by picking one
version of analysis?

The market breaks it down into the following narrow paths:

Simple SQL on Hadoop/Spark/etc.

Some form of caching of SQL on Hadoop/Spark/etc

ETL into database then analysis

These solutions have their place, but picking only one greatly limits a user's ability to
succeed, especially when the limits of that solution are reached.

So how does Qlik differentiate itself from the narrow approaches and tools that exist in
the market?

The simple answer is variety. Qlik is in a unique position, offering a set of techniques and
strategies that allows the widest range of capabilities within a big data ecosystem.

Below are some of the approaches Qlik addresses the big data community with:

In-Memory Analytics: Get the data you need and accelerate it, which provides a
great solution for concepts such as data lakes. Qlik creates a "Synch and Drink"
strategy for big data. Fast and powerful, but it does not retrieve all the data, which
might be fine given the requirements. Think of it as a water tower for your data lake:
do you really need 1 petabyte of log data, or maybe just the errors and anomalies over
the last 30 days? (A sketch of this kind of extract appears after this list.)

Direct/Live Query: Sometimes you do need all the data, or a large set that isn't
realistic to fit into memory, or latency is a concern; then use Qlik in live query mode.
The catch with this strategy is that you are completely dependent on the source system
for speed. This scenario works best when an accelerator (Teradata, Jethro, atScale,
Impala, etc.) is used as a performance booster. Qlik uses our Direct Discovery
capability to enable this scenario.

On-Demand App Generation: This is a "shopping cart" approach that allows users
to select from a cart of content curated from the big data system. By guiding users
to make selections, this technique reduces the raw volume of data returned from the
system to just what they need. It also allows IT to place controls, security and
limiters in front of those choices, so mistakes (such as trying to return all records
from a multi-petabyte system) can be avoided.

API - App on Demand: This is an API evolution of the shopping cart method above,
embedded within a process or environment of another interface or mashup. This
technique allows Qlik apps to be created temporarily (i.e., as session apps) or
permanently, based on inputs from another starting point. This is an ideal solution for
big data partners or OEMs who would like to build Qlik integration directly into their tools.
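
As referenced in the In-Memory Analytics point above, here is a minimal sketch of the
"water tower" idea: instead of accelerating a petabyte of logs, build a small extract of just
the recent errors on the Hadoop side and accelerate that. It assumes a HiveServer2 endpoint
and the PyHive package; the host, table and column names are hypothetical.

    from pyhive import hive

    # Connect to the cluster (hypothetical host and database).
    conn = hive.Connection(host="hadoop-edge-node", port=10000, database="logs")
    cursor = conn.cursor()

    # Keep only the errors and anomalies from the last 30 days -- the small
    # "water tower" that the in-memory engine then accelerates.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS recent_errors AS
        SELECT *
        FROM raw_events
        WHERE log_level = 'ERROR'
          AND event_date >= date_sub(current_date, 30)
    """)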

In summary, to prevent limited interactions with whatever big data system you use, you
need options. Qlik is uniquely positioned in this area due to the power of the QIX engine and
our ELT + Acceleration + Visualization three-in-one architecture. Since no two big data
systems are alike, Qlik offers the most flexible solutions in the market to adapt to any
data scenario, big or small.
