
Big Data - Hadoop




Table of Contents
1. Process of installing the Cloudera Virtual Machine on your system
1.1. Virtual Machine
1.2. CentOS ("Community Enterprise Operating System")
1.3. Cloudera
1.4. Usage
2. Installation of the Cloudera Virtual Machine (VM) into VirtualBox
2.1. Software Requirements
2.2. Process of Installation
2.2.1. VirtualBox Installation
2.2.2. Downloading the Cloudera Virtual Machine
2.2.3. Starting the Cloudera Virtual Machine
2.2.4. Installing VirtualBox Guest Additions
3. Big Data
3.1. Examples of Big Data
3.2. Hadoop for research users: a quick timeline of how things have progressed
3.3. The Hadoop ecosystem
4. Cloudera supports the following Big Data features
5. Defining the HBase database
5.1. RDBMS (Relational Database Management System)
5.2. Major differences between RDBMS and HBase
5.3. HBase Architecture
6. Working on HBase using the Cloudera Enterprise application
6.1. TwitBase Application (Twitter Prototype)
6.1.1. Steps to create a database in HBase
6.2. Creating a sample Java program to access the HBase database
6.2.1. Apache Maven
6.2.2. Repository
6.3. Let us start with the normal Eclipse IDE
6.4. Choose the Maven project to create
6.4.1. The POM
6.4.2. What are groupId and artifactId
7. Creating the Java program to add, get and delete users from the database
7.1. Creating the User class
7.2. Creating the UserDAO Java code
7.3. Creating the UsersTool class (Main class of the TwitBase Application)
7.4. The pom.xml file
7.5. Run the Project
7.5.1. Running the Project without Arguments
7.5.2. Running the Project to add user details to the HBase database
7.5.3. Running the project to display the list of users in the HBase users table
7.5.4. Running the Project to get the user details
8. Downloads




















1. Process of installing the Cloudera Virtual Machine on your system:
1.1. Virtual Machine:
A virtual machine lets you install additional operating systems (OS) on your PC. An OS installed in a virtual machine acts as a self-contained operating environment that behaves as if it were a separate computer.
For example, Java applets run in a Java virtual machine (VM) that has no access to the host operating system.
This design has two advantages:
System independence: A Java application will run the same in any Java VM, regardless of the hardware and software underlying the system.
Security: Because the VM has no contact with the operating system, there is little possibility of a Java program damaging other files or applications. The second advantage, however, has a downside: because programs running in a VM are separated from the operating system, they cannot take advantage of special operating system features.
There are different virtualization products, but the most popular are VMware Workstation, VMware Fusion and Oracle VM VirtualBox. Each has its own strengths and features depending on the platform you are installing on; VMware Fusion, for example, supports Mac but not Windows.
1.2. CentOS ("Community Enterprise Operating System"):
We are working with Big Data concepts. Big Data refers to collections of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications. The Linux/UNIX file system is used to process such data in order to provide high security and the concurrency needed to serve multiple users.
CentOS is a Linux distribution. It has numerous advantages over some of the other clone projects, including an active and growing user community, quickly rebuilt, tested, and QA'ed errata packages, an extensive mirror network, etc.
1.3. Cloudera:
Cloudera is a well-known Big Data brand. Cloudera was built by Doug Cutting and others who started the open-source Big Data movement by developing and contributing to Hadoop, HBase and related projects.
Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email, just about anything you can think of, regardless of its native format. Even when different types of data have been stored in unrelated systems, you can dump it all into your Hadoop cluster with no prior need for a schema. In other words, you don't need to know how you intend to query your data before you store it.
Cloudera Enterprise treats its customers like family. As part of a large community, customers have the ability to influence the product roadmap, drive product improvements or enhancements, ensure a smoothly operating Hadoop environment, and communicate with both Cloudera and the community at large.
1.4. Usage:
In order to understand the Big Data concepts behind Hadoop, HBase, Hive, Flume, Mahout, Fuse, ZooKeeper, etc., we use the Cloudera Virtual Machine: a prebuilt package that contains the Cloudera software stack on a CentOS operating system, installed inside a virtual machine. By installing the Cloudera VM we can perform Big Data operations as a Cloudera Manager administrator.
2. Installation of the Cloudera Virtual Machine (VM) into VirtualBox:
Your computer must meet the following specifications to support the Cloudera VM:
A 64-bit Windows operating system running on a 64-bit processor.
o You can check your PC's specification by right-clicking on My Computer.
The PC should contain a minimum of 2 GB of RAM (at least 1 GB for the Cloudera VM).



2.1. Software Requirements:
VirtualBox
A Cloudera VM image that supports VirtualBox.
2.2. Process of Installation:
VirtualBox installation.
Virtualizing the PC with VirtualBox so that the host OS can run a guest OS in a virtual machine.
Installing the Cloudera VM into VirtualBox.
2.2.1. VirtualBox Installation:
VirtualBox is open source software (free license) and can be downloaded from the links below:
https://www.virtualbox.org/wiki/Downloads - choose the build for your operating system
or
http://www.filehippo.com/download_virtualbox/download/3bef5d939af083fce266f156dfbe295c/ - Windows build

Install VirtualBox from the downloaded file: VirtualBox-4.3.4-91027-Win
1. By default it installs on your current drive; you can change the location where the VirtualBox files are saved.
2. After installation you will see the pop-up box shown below:

3. Check whether your PC supports virtualization, and enable virtualization in your PC's BIOS/UEFI setup. This can usually be done by pressing the F2 or F9 function key (depending on your motherboard) while the system is restarting. If virtualization is disabled, the virtual machine cannot install your guest OS.


4. Choose New machine in VirtualBox:


5. Choose the name, type and version of the OS you want to install on the virtual machine.
(As the type and version of the Cloudera VM are not listed by VirtualBox, choose the Other option.)



6. Choose the amount of RAM to allocate to the Cloudera VM. If your total RAM size is 2 GB, allocate a minimum of 1 GB to the Cloudera VM. The processing speed of the Cloudera VM depends on the amount of RAM allocated to it.

Note: I had 4 GB of RAM, so I chose 2 GB for the VM; the rest is used to process my host operating system's tasks. To increase the processing speed of the Cloudera VM you could instead allocate 3 GB to the guest OS and 1 GB to the host OS.
7. To install an OS on the virtual machine, you either choose installation media for the OS or use an existing virtual machine file.
2.2.2. Downloading the Cloudera Virtual Machine:
The Cloudera Virtual Machine can be downloaded from the following link:
Link: http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo
8. Choose the Cloudera Virtual Machine image built for VirtualBox and download the file.
9. After downloading the Cloudera VM for VirtualBox, select this downloaded file in VirtualBox to create the Cloudera virtual machine.



10. After successful installation, you will see the following screen:

Note: You are allowed to change the properties of the sandbox even after installation. Double-clicking the system opens the properties box where you can view and modify settings if required.
2.2.3. Starting the Cloudera Virtual Machine
11. Click the sandbox to start the Cloudera installation process. After successful installation you will see the following screen:



12. Open Cloudera Manager by selecting the Administration Manager in the browser:
By default the login username and password are both set to cloudera.
13. After logging in as administrator, you will see the Cloudera Manager console, which lists the tasks the admin can perform on the clusters.




2.2.4. Installing VirtualBox Guest Additions:
After successful installation of the Cloudera Virtual Machine, we have to install the Guest Additions to make a link between the host OS and the guest OS. This link lets the guest OS share useful features of the host OS, such as shared folders and devices supported by the host OS (CD-ROM, USB devices, drivers, etc.).
Steps to install the VirtualBox Guest Additions:
1. Click Devices in the menu toolbar and select Insert Guest Additions CD image.




2. An installation box appears when you click Insert Guest Additions CD image. Cloudera automatically mounts the Guest Additions image on the CD-ROM. Pressing OK in the box starts the autorun file; let the files run automatically.

The default root password of the Cloudera VM is cloudera. It takes around 3 minutes to install the Guest Additions. After successful installation, create a shared folder between the host OS and the guest OS. Files can then be exchanged between the guest and host OS through this shared folder. The image below shows the process of creating the shared folder.
A shared folder is created by selecting Shared Folders from the Devices menu in the toolbar. The shared folder of the host OS should be linked to the machine folder on the guest OS.
Note: You should create a folder on your host OS to link it to the shared folder of the guest OS. This folder is named sharedfolder on my host OS.




3. Create a folder on your guest OS to link it with the shared folder of the host OS.
The command below mounts the shared folder so that files are visible to both the host OS and the guest OS.
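A sketch of the typical commands, assuming the share was named sharedfolder in the VirtualBox Shared Folders dialog and should appear under /mnt/sharedfolder inside the guest:

$ sudo mkdir -p /mnt/sharedfolder
$ sudo mount -t vboxsf sharedfolder /mnt/sharedfolder   # vboxsf is the file system type provided by the Guest Additions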

4. After this succeeds, you can share files between the host OS and the guest OS using the mounted folder on the guest OS and the sharedfolder folder on the host OS.
Before we start with the concepts of Big Data, let us go through some basic definitions. They give a short introduction to each of the Cloudera topics.
3. Big Data:
Big Data, as the name suggests, refers to huge volumes of data that vary in volume, variety and velocity.
Where,
Volume describes the amount of data generated by organizations.
Variety describes structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more.
Velocity describes the frequency at which data is generated, captured and shared.

3.1. Examples of Big Data:
Google processes 20 PB a day.
Facebook has 2.5 PB of user data and adds 15 TB/day.
eBay has 6.5 PB of user data and adds 50 TB/day.
The Wayback Machine has 3 PB and adds 100 TB/month.
3.2. Hadoop for research users: a quick timeline of how things have progressed:
2004: Initial versions of what is now known as HDFS (Hadoop Distributed File System) and MapReduce implemented by Doug Cutting & Mike Cafarella.
Dec 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
Jan 2006: Doug Cutting joins Yahoo!
Feb 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
May 2006: Yahoo sets up a Hadoop research cluster of 300 nodes; sort benchmark run on 500 nodes in 42 hours.
August 2006: Research cluster reaches 600 nodes.
Nov 2006: Sort benchmark run on
20 nodes: 1.8 hours
100 nodes: 3.3 hours
500 nodes: 5.2 hours
900 nodes: 7.8 hours
Jan 2007: Research cluster reaches 900 nodes.
April 2007: Research clusters divided into two clusters of 1,000 nodes.
April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
Oct 2008: Loading 10 terabytes of data per day onto the research clusters.
March 2009: 17 clusters with a total of 24,000 nodes.
April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
Let us take an example of one minute of Internet activity.
From statistical analysis, in one minute:
Google receives over 2,000,000 search queries.
Facebook receives 34,722 likes.
Apple receives 47,000 app downloads.
370,000 minutes of calls are made on Skype.
98,000 tweets are posted.
Consumers spend $272,070 on web shopping.
13,000 hours of music are streamed on Pandora.
6,600 pictures are uploaded to Flickr.
1,500 new blog posts are published.
600 new YouTube videos are uploaded.

3.3. The Hadoop ecosystem:
Let us describe the basic developments that brought us the Hadoop technology:

1. Large data on the web.
2. Nutch built to crawl this web data.
3. This large volume of data had to be saved: HDFS introduced.
4. How to use this data? Reports are needed that show data usage statistics.
5. MapReduce framework built for coding & running analytics.
6. Unstructured data (weblogs, click streams, Apache logs, server logs): Fuse, Flume and Scribe.
7. Sqoop and HIHO for loading RDBMS data into HDFS.
8. High-level interfaces required over low-level MapReduce programming: Hive, Pig, JAQL.
9. BI tools with advanced UI reporting.
10. Workflow tools over MapReduce processes and high-level languages: Oozie.
11. Monitor & manage Hadoop, run jobs/Hive, view HDFS at a high level: Hue, Karmasphere, the Eclipse plugin, Ganglia.
12. Support frameworks: Avro (serialization), ZooKeeper (coordination).
13. More high-level interfaces/uses: Mahout, Elastic MapReduce.
14. OLTP also possible with HBase.
Cloudera is an enterprise application built to operate all the necessary functions required to process Big Data.



4. Cloudera supports the following Big Data features:

Let us describe the basic requirement of each phase one by one:
Hadoop: Hadoop is the Apache open source software framework for working with Big Data. Hadoop was created by Doug Cutting & Mike Cafarella in the early 2000s. It was originally developed to support distribution for the Nutch search engine project.
Nutch is an open source search engine implemented in Java.
Nutch is divided into two pieces: the crawler and the searcher.
The crawler fetches pages and turns them into an inverted index.
This inverted index is used by the searcher to resolve users' queries.
Flume: A real-time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. You will want to get started with FlumeNG, which improves on the original Flume.
HBase: A super-scalable key-value store. It works very much like a persistent hash map. Despite the name, HBase is not a relational database.
HDFS: The Hadoop Distributed File System stores large amounts of information. If you want 4,000+ computers to work on your data, then you'd better spread your data across those 4,000+ computers. HDFS does this for you. Data is written to HDFS once and then read several times.
Hive: If you like SQL, you will be delighted to hear that you can write SQL and Hive will convert it to a MapReduce job. But you don't get a full ANSI-SQL environment.
Hue: Gives you a browser-based graphical interface to do your Hive work.
MapReduce: This is the programming model for Hadoop. There are two phases, called Map and Reduce, with a sort-and-shuffle step between them. The JobTracker manages the 4,000+ components of your MapReduce job. The TaskTrackers take orders from the JobTracker. As Hadoop is developed in Java, it is natural to code MapReduce jobs in Java; for those from a non-Java background there is a tool called Hadoop Streaming.
Hadoop Streaming: A utility to enable MapReduce code in any language, such as C, Perl, Python, C++, Bash, etc.
Pig: A higher-level programming environment for MapReduce coding. The Pig language is called Pig Latin. It helps deliver incredible price-performance and high availability.
Oozie: Manages Hadoop workflows. This does not replace your scheduler or BPM (Business Process Management) tooling. It provides if-then-else branching and control within your Hadoop jobs.
Sqoop: The name is derived from SQL-to-Hadoop. It provides bidirectional data transfer between Hadoop and your favorite relational database.
Mahout: Machine learning for Hadoop. Used for predictive analytics and other advanced analysis.
ZooKeeper: Used to manage synchronization for the cluster. It provides a set of packages that let your MapReduce code or database retrieval code coordinate with the cluster and perform the necessary actions.

There are a number of technologies to start with in the Big Data Hadoop stack. Let us start from the basics by creating a database, making some alterations, updates and additions to it, and then accessing the database using basic Java code to retrieve data and display it based on our requirements.
The technology that supports creating the database is HBase, whereas HDFS stores the data underlying the HBase tables and distributes it over the nodes of your storage cluster.
5. Defining the HBase database:
HBase stores a piece of data in a table based on a 4D coordinate system. The figure below shows how data is stored in an HBase table.





If the table above does not mean much to you yet, don't worry. Let me first explain the traditional way of storing data in a database and then come back to the HBase storage structure.

5.1. RDBMS (Relational Database Management System):

In a relational database, a piece of data is stored in a table in the form of a 2D coordinate system (rows and columns). A set of data is manipulated at a time rather than a record at a time. SQL is used to manipulate relational databases. The figure below shows a basic RDBMS table.




But Hadoop does not make the RDBMS obsolete. An RDBMS has its own strengths compared with Hadoop.

For a majority of small- to medium-volume applications, there is no substitute for the ease of use, flexibility, maturity, and powerful feature set of available open source RDBMS solutions like MySQL and PostgreSQL. However, if you need to scale up in terms of dataset size, read/write concurrency, or both, you'll soon find that the conveniences of an RDBMS come at an enormous performance penalty and make distribution inherently difficult. The scaling of an RDBMS usually involves loosening ACID restrictions, forgetting conventional DBA wisdom, and on the way losing most of the desirable properties that made relational databases so convenient in the first place.

Note: ACID stands for Atomicity, Consistency, Isolation and Durability, the guarantees a database system provides for its transactions.



5.2. Major differences between RDBMS and HBase


HBase stores data in the form of cells.
A cell is addressed by the following coordinates:
[rowkey, column family, column qualifier, version].

Rowkey: Rows in an HBase table are identified uniquely by their rowkey. Rowkeys don't have a data type and are always treated as a byte[].
Column family: Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and aren't easily modified. Every row in a table has the same column families, although a row need not store data in all its families. Column family names are Strings and composed of characters that are safe for use in a file system path.
Column qualifier: Data within a column family is addressed via its column qualifier, or column. Column qualifiers need not be specified in advance and need not be consistent between rows. Like rowkeys, column qualifiers don't have a data type and are always treated as a byte[].
Version: Values within a cell are versioned. Versions are identified by their timestamp, a long. When a version isn't specified, the current timestamp is used as the basis for the operation. The number of cell value versions retained by HBase is configured per column family. The default number of cell versions is three.
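For example, using the sample user data that appears later in section 6.1.1, two cells of the users table and their coordinates could be pictured like this:

rowkey        column family   column qualifier   version (timestamp)   cell value
'TheRealMT'   'info'          'name'             1323483954406         'Sir Arthur Conan Doyle'
'TheRealMT'   'info'          'email'            1323483954406         'art@TheQueensMen.co.uk'

The full coordinate ['TheRealMT', 'info', 'email', 1323483954406] therefore addresses exactly one cell value, 'art@TheQueensMen.co.uk'.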

Pictorial representation of how the data is stored in the User table is shown below:

RDBMS                          HBase
Row oriented                   Column-family oriented
Multi-row ACID                 Single row only
SQL                            get/put/scan/etc.
Queries on arbitrary columns   Rowkey only
Terabytes                      ~1 PB
1,000s of queries/second       Millions of queries/second







5.3. HBase Architecture:
So far we have discussed the basic structure of HBase and how it represents data and stores it in the database. The hidden factor is how HBase stores the actual data. We will discuss the HBase data storage process step by step.



The figure above shows the HBase architecture. As the figure shows, HBase contains a number of components such as the Client, ZooKeeper, HMaster, HRegionServer, HRegion, Store, MemStore, HFile, DataNodes and DFS.
Basically, HBase handles data through two kinds of file types: one is used for the write-ahead log and the other for actual data storage. These files are handled by the HRegionServers.
HMaster:
When you start HBase, the HMaster is responsible for assigning the regions to each HRegionServer. This includes the special -ROOT- and .META. tables.
The HMaster server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes.
The master is not involved in the read/write path.
Even if the master is down, the cluster can respond to read/write requests.
The master is stateless, i.e. all the data & state info is stored in HDFS & ZooKeeper.
HRegionServer:
Table: stores HBase tables.
Region: the regions for the tables.
Store: one store per column family for each region of the table.
MemStore: temporarily holds the data for each store of a region before it is permanently written to an HFile.
StoreFile: the store files for each store for each region of the table.
Block: blocks within a store file within a store for each region of the table.
HLog: these files are used for recovery.
Regions:
These are horizontal partitions of the storage system.
Every region holds a subset of the table's rows.
A table starts on a single region.
A region splits into two roughly equal-sized regions as the original region grows bigger, and so on.
HBase Tables and Regions:
HBase tables are made up of roughly equal-sized regions.
Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop.
ZooKeeper:
The client contacts ZooKeeper to bootstrap the connection to the HBase cluster.
It maintains the key ranges for regions and the region server addresses.
It guarantees consistency of data across clients.
HBase client:
The HBase client is responsible for finding the RegionServers that are serving the particular row range of interest.
It does this by querying the .META. and -ROOT- catalog tables (whose locations are found via ZooKeeper).
After locating the required region, the client directly contacts the RegionServer serving that region.


I intended to explain how each component interacts to serve the client when fetching the required data, but you can find excellent theory as well as practical material at the link below. After reading & understanding each component, please go to the link to see how each component is involved when users put data into the database or get data from the database. It covers all the necessary topics, including Hadoop and HFile.
Link: http://wiki.toadforcloud.com/index.php/HBase_Storage#KeyValues

6. Working on HBase using the Cloudera Enterprise application:

In order to work with the HBase database we need to start the HBase service from the Cloudera administrator console. The figure below shows the HBase service being started in Cloudera.




After starting the service, you can perform HBase database operations.

In order to understand the basic concepts of HBase we will build an application from scratch. We will build an application named TwitBase, a simplified clone of the social network Twitter, implemented entirely in HBase. We won't cover all the features of Twitter, and this isn't intended to be a production-ready system. Instead, think of TwitBase as an early Twitter prototype.

6.1. TwitBase Application (Twitter Prototype):
We are working with a sample Twitter-like application that we name TwitBase. The basic requirement for the TwitBase application is users. The basic TwitBase application is developed in Java.
Before we start developing the application it is necessary to create a database table to store the user info. We use the HBase shell to create the users table in the HBase database.

6.1.1. Steps to create a database in HBase:

1. Open the HBase shell by typing the command below at the command prompt.
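On the Cloudera VM the HBase binaries are already on the PATH, so the shell can be started from a terminal as shown below (the exact prompt may differ):

$ hbase shell
hbase(main):001:0>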




At its core, TwitBase stores three simple data elements: users, twits, and relationships. Users are the
center of TwitBase. They log into the application, maintain a profile, and interact with other users by
posting twits. Twits are short messages written publicly by the users of TwitBase. Twits are the
primary mode of interaction between users. Users have conversations by twitting between themselves.
Relationships are the glue for all this interaction. A relationship connects one user to another, making it
easy to read twits from other users.


2. Create a table for users: The command below creates a database table named users with a single column family, info.
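In the HBase shell the statement looks like the following; you can verify the result afterwards with list or describe 'users':

hbase(main):001:0> create 'users', 'info'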




3. Adding data to the table from the command line:

The commands that help to add and read data in an HBase table are put, get and scan.
1. Writing data: The put command is used to add data to the HBase table.

hbase(main):005:0> put 'users', 'TheRealMT', 'info:name', 'Sir Arthur Conan Doyle'
0 row(s) in 0.0130 seconds
hbase(main):006:0> put 'users', 'TheRealMT', 'info:email', 'art@TheQueensMen.co.uk'
0 row(s) in 0.0080 seconds

2. Reading data: The get and scan commands are used to retrieve data from the HBase table.

hbase(main):007:0> get 'users', 'TheRealMT'
COLUMN                CELL
 info:name            timestamp=1323483954406, value=Sir Arthur Conan Doyle
 info:email           timestamp=1323483954406, value=art@TheQueensMen.co.uk
2 row(s) in 0.0250 seconds

hbase(main):008:0> scan 'users'
ROW                   COLUMN+CELL
 TheRealMT            column=info:name, timestamp=1323483954406, value=Sir Arthur Conan Doyle
 TheRealMT            column=info:email, timestamp=1323483954406, value=art@TheQueensMen.co.uk
 SirDoyle             column=info:name, timestamp=1323442435345, value=Fyodor Dostoyevsky
 SirDoyle             column=info:email, timestamp=1323442435345, value=aubrey@sea.com
5 row(s) in 0.0240 seconds

There are different methods to put, get and scan data in the HBase table. The link below shows some of the methods:
http://hadoopbigdatas.blogspot.in/2013/03/hbase-shell-and-commands.html



Hopefully you have now gone through some of the basic HBase operations such as Put, Get, Scan and Delete.

Put: used to add data to the HBase table.
Get: used to read data from the HBase table.
Scan: used to read a range of rows from the HBase table.
Delete: used to delete data from the HBase table.
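The shell commands for the first three appear in section 6.1.1; for completeness, a sketch of the delete syntax using the same sample row is shown below. The first command removes a single cell, the second removes the whole row.

hbase(main):009:0> delete 'users', 'TheRealMT', 'info:email'
hbase(main):010:0> deleteall 'users', 'TheRealMT'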

6.2. Creating a sample Java program to access the HBase database:
The basic Twitter-like application is developed in Java. We have now created a basic users table to hold the TwitBase users in the HBase database. After successful creation of the table, we need to write Java code that allows the application to add users to the database when they register. Here we are not creating any registration forms to take details from the users; instead, we add users to our database through command line arguments to the application.

The best way to write the Java code is by using an IDE. Currently the two most popular IDEs for Java development are 1. Eclipse and 2. NetBeans. Due to some performance issues with the NetBeans IDE, we use the Eclipse Juno IDE for our application. It is already installed on the Cloudera Virtual Machine, so we don't have to install it separately.

6.2.1. Apache Maven:
Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information. A Maven application is built up using dependencies resolved from a Maven repository.
Maven is a build automation tool used primarily for Java projects. Maven addresses two aspects of building software:
First: it describes how the software is built.
Second: it describes its dependencies.

An XML file describes the software project being built, its dependencies on other external modules and components, the build order, directories, and required plug-ins. It comes with pre-defined targets for performing certain well-defined tasks such as compilation of code and its packaging. Maven dynamically downloads Java libraries and Maven plug-ins from one or more repositories, such as the Maven 2 Central Repository, and stores them in a local cache. This local cache of downloaded artifacts can also be updated with artifacts created by local projects.
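As a point of reference, the same quickstart project that we create through the Eclipse wizard in section 6.4 could also be generated from a terminal; the groupId below is only a placeholder:

$ mvn archetype:generate -DgroupId=com.example.twitbase \
      -DartifactId=TwitBase \
      -DarchetypeArtifactId=maven-archetype-quickstart \
      -DinteractiveMode=false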

6.2.2. Repository:
Repositories are storage areas for the information that is created in a business process. Every repository has a name and an associated type. Usually the name of a repository is the same as the name of the business items it contains. For example, a repository for invoices is called Invoices.
Use repositories when you have several activities (tasks, processes, or services within a process) that need to access or share the same information. Rather than passing information along flows from one activity to another, you can instead place the information in a common place which several activities can then access.
By default our Cloudera Virtual Machine contains a built-in Maven repository. You can find the local repository in the .m2 directory and list the artifacts it holds using the commands below. This local repository caches artifacts downloaded from the Central Repository.
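For example, the cached artifacts can be listed from a terminal on the VM; ~/.m2/repository is Maven's default local repository location:

$ ls ~/.m2
$ ls ~/.m2/repository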




By default the Maven repository configuration (i.e. the Central Repository alone) does not cover the Big Data technologies such as Hadoop, Hive, HBase, Flume, ZooKeeper, etc. The repository required for implementing the Big Data technology is provided by the Apache releases repository, and you can add this repository locally to Maven.
In our case, we added the Apache releases repository to the Maven configuration, as the Maven 2 Central Repository did not provide the functionality we needed for the Big Data/Hadoop technologies. The link below lists the artifacts provided by this repository (we mostly need org.apache.hadoop & org.apache.hbase):

Link: https://repository.apache.org/content/repositories/releases/.




6.3. Let us start with the normal Eclipse IDE:
A fresh IDE does not contain any projects, and by default the Eclipse IDE is configured only with the Central Repository, which does not hold what we need for the Big Data technologies. So we need to add the Apache repository locally to the development environment.



To view the list of repositories known to the Eclipse IDE:
In the Eclipse Juno IDE select Window > Show View > Other > Maven Repositories. This shows the list of Maven repositories Eclipse is using.


By default these Maven repositories do not cover the Hadoop technologies, so we need to add the appropriate repository to our project to work with HBase, Hive, Flume, etc.
Let us start a new Maven project in the Eclipse IDE:



6.4. Choose the Maven project to create:
Eclipse supports different archetype catalogues for building a Maven application. Please choose maven-archetype-quickstart.

maven-archetype-quickstart is an archetype which contains a sample Maven project:

project
|-- pom.xml
`-- src
    |-- main
    |   `-- java
    |       `-- App.java
    `-- test
        `-- java
            `-- AppTest.java
6.4.1. The POM:
Maven projects, dependencies, builds, artifacts: all of these are objects to be modeled and described. These objects are described by an XML file called a Project Object Model. The POM tells Maven what sort of project it is dealing with and how to modify default behaviour to generate output from source. In the same way that a Java web application has a web.xml that describes, configures, and customizes the application, a Maven project is defined by the presence of a pom.xml.
Press Next to select the project's groupId and artifactId.

6.4.2. What are groupId and artifactId:
In a Maven repository, individual projects are identified by their groupId and artifactId.
groupId: identifies the project uniquely across all projects, so we need to follow a naming schema. It has to follow the package name rules, e.g. org.apache.maven, org.apache.commons.
That means it should be based on a domain name you control, and you can create as many subgroups as you want, since your project will be identified uniquely by its groupId.
artifactId: the name of your project, describing its requirement and scope, e.g. Twitter for a Twitter application.
Select the artifactId and groupId as shown in the figure above.
After setting up the project, you will find the file structure of the project as shown below:



Here TwitBase is the project name. It holds two source folders: one for the Java class files and the other for the test files.
Dependency management is one of the key features of Maven. It helps add the dependencies your application needs based on your requirements.

For example, if you are using the HBase class HTablePool, which is provided by the org.apache.hadoop.hbase.client package, the Maven repository will add this dependency to your application automatically once you declare it in the pom.xml file. Maven manages dependencies for a single project, but when you start dealing with multi-module projects and applications that consist of tens or hundreds of modules, this is where Maven can help you a great deal in maintaining a high degree of control and stability.
The pom.xml file generated for the above application looks like:


Add the Hadoop-supporting repository to the project. It can be added through the pom.xml file. Please add the lines below to the pom.xml file.
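A sketch of what that addition could look like, placed inside the <project> element of pom.xml (the id and name are just labels; the URL is the Apache releases repository from section 6.2.2):

<repositories>
  <repository>
    <id>apache-releases</id>
    <name>Apache Releases</name>
    <url>https://repository.apache.org/content/repositories/releases/</url>
  </repository>
</repositories>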

Adding this repository to Eclipse means you can easily use the required packages during program development.
There are 3 different types of repositories used in project development. They are:


1. Global repository: A global repository is a top-level repository that you create in the Project Tree view. It can be used by multiple processes. Many business processes are not directly connected but do share common information, and the global repository provides the mechanism to model the sharing of information across processes.
2. Local repository: A local repository is owned by a process and can only be used by elements within that process. The repository exists only while the process exists. Business items are created during the process, stored in the repository, and used in another part of the process.
3. Project repository: Created for a particular project. As our project is created to support the Hadoop technology, we have to use a repository that supports it when writing the program. In our project the Apache releases repository is the project repository supporting HBase, Hive, Flume, ZooKeeper, etc.
Before we start creating the application it is necessary to set up the repository as above so that the project can access the HBase system.

7. Creating the Java program to add, get and delete users from the database.
7.1. Creating the User class:
This class holds the properties of a user. A unique User object is created for each user to carry out operations in the TwitBase application.
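The full listing is linked in the Downloads section (section 8). A minimal sketch of what this class could look like is shown below; the package name is a placeholder of my own choosing, not the one in the original listing.

package twitbase.model;   // placeholder package name

public class User {

    public String user;    // the username, also used as the HBase rowkey
    public String name;
    public String email;

    public User(String user, String name, String email) {
        this.user = user;
        this.name = name;
        this.email = email;
    }

    @Override
    public String toString() {
        return String.format("<User: %s, %s, %s>", user, name, email);
    }
}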





7.2. Creating the UserDAO Java code:
This class performs the operations needed to access the database; it contains the methods to add, get and scan users in the users table.
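The complete class is available from the Downloads section. The sketch below is an outline under stated assumptions rather than the original listing: it assumes the HTablePool-based client API of the older HBase releases bundled with the Cloudera VM, and the package names are placeholders.

package twitbase.hbase;   // placeholder package name

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

import twitbase.model.User;

public class UserDAO {

    public static final byte[] TABLE_NAME = Bytes.toBytes("users");
    public static final byte[] INFO_FAM   = Bytes.toBytes("info");
    private static final byte[] NAME_COL  = Bytes.toBytes("name");
    private static final byte[] EMAIL_COL = Bytes.toBytes("email");

    private final HTablePool pool;

    public UserDAO(HTablePool pool) {
        this.pool = pool;
    }

    // Adds one user: rowkey = username, two cells in the 'info' column family.
    public void addUser(String user, String name, String email) throws IOException {
        HTableInterface users = pool.getTable(TABLE_NAME);
        Put p = new Put(Bytes.toBytes(user));
        p.add(INFO_FAM, NAME_COL, Bytes.toBytes(name));
        p.add(INFO_FAM, EMAIL_COL, Bytes.toBytes(email));
        users.put(p);
        users.close();                      // returns the table to the pool
    }

    // Reads a single row back by its rowkey.
    public User getUser(String user) throws IOException {
        HTableInterface users = pool.getTable(TABLE_NAME);
        Get g = new Get(Bytes.toBytes(user));
        g.addFamily(INFO_FAM);
        Result r = users.get(g);
        users.close();
        if (r.isEmpty()) {
            return null;                    // no such user
        }
        return new User(user,
                Bytes.toString(r.getValue(INFO_FAM, NAME_COL)),
                Bytes.toString(r.getValue(INFO_FAM, EMAIL_COL)));
    }

    // Scans the whole table and returns every user.
    public List<User> getUsers() throws IOException {
        HTableInterface users = pool.getTable(TABLE_NAME);
        ResultScanner results = users.getScanner(new Scan());
        List<User> ret = new ArrayList<User>();
        for (Result r : results) {
            ret.add(new User(Bytes.toString(r.getRow()),
                    Bytes.toString(r.getValue(INFO_FAM, NAME_COL)),
                    Bytes.toString(r.getValue(INFO_FAM, EMAIL_COL))));
        }
        results.close();
        users.close();
        return ret;
    }

    // Deletes an entire row.
    public void deleteUser(String user) throws IOException {
        HTableInterface users = pool.getTable(TABLE_NAME);
        users.delete(new Delete(Bytes.toBytes(user)));
        users.close();
    }
}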











7.3. Creating the UsersTool class (Main class of the TwitBase application):
The last piece of this puzzle is a main() method. This method contains the logic that decides which function should be called based on the user's request. Let's make a UsersTool, shown in the next listing, to simplify interaction with the users table in HBase.
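A sketch of such a main class is shown below; the usage/argument format (add, get, list) is an assumption based on how the project is run in section 7.5, and the package names are placeholders.

package twitbase.cli;   // placeholder package name

import org.apache.hadoop.hbase.client.HTablePool;

import twitbase.hbase.UserDAO;
import twitbase.model.User;

public class UsersTool {

    // Printed when the tool is run without arguments (see section 7.5.1).
    public static final String usage =
        "UsersTool action ...\n" +
        "  add user name email : add a new user\n" +
        "  get user            : retrieve a single user\n" +
        "  list                : list all users\n";

    public static void main(String[] args) throws Exception {
        if (args.length == 0 || "help".equals(args[0])) {
            System.out.println(usage);
            return;
        }

        HTablePool pool = new HTablePool();   // picks up hbase-site.xml from the classpath
        UserDAO dao = new UserDAO(pool);

        if ("add".equals(args[0]) && args.length == 4) {
            dao.addUser(args[1], args[2], args[3]);
            System.out.println("Successfully added user " + args[1]);
        } else if ("get".equals(args[0]) && args.length == 2) {
            System.out.println(dao.getUser(args[1]));
        } else if ("list".equals(args[0])) {
            for (User u : dao.getUsers()) {
                System.out.println(u);
            }
        } else {
            System.out.println(usage);
        }

        pool.closeTablePool("users");         // release pooled connections
    }
}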






The three Java classes above should be created in their respective packages.

While developing the Java code it is necessary to add the corresponding dependencies to run the application. For example, as we are using HTableInterface, this interface is available in the org.apache.hadoop.hbase.client package of the HBase artifact.

In Maven applications we add these jar files as dependencies in the pom.xml file. The build checks whether a dependency is available for each class used in the code. If a class cannot be resolved, check the pom.xml for the corresponding dependency; the IDE will also suggest candidate dependencies for the required class, so you can choose the right one from the list. Without the right dependency the class will not provide the required functionality.
7.4. The pom.xml file:
In our case, the pom.xml file has the following dependencies.
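The exact pom.xml is linked in the Downloads section. A sketch of the dependency block could look like the following; the version numbers are assumptions and should be matched to the Hadoop and HBase versions shipped with your Cloudera VM:

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.0.4</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <version>0.92.1</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
  </dependency>
</dependencies>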





7.5. Run the Project:
After successfully setting up the project, run it. First run the project without giving any input arguments to check whether it raises any errors (if it does, there might be an error in your code or in the dependency setup). Please make sure it runs without any errors.

7.5.1. Running the Project without Arguments:
Run the project in the Eclipse IDE without arguments: right-click on the UsersTool class > Run As > Java Application. The UsersTool class is the main class of the TwitBase project.





The output is shown below. As we have not passed any input arguments to the program, it displays the static usage text by default.


7.5.2. Running the Project to add user details to the HBase database:
Add user details to the database by passing command line arguments to the main class. Command line arguments can be set in the Eclipse IDE using the main class's run configuration window.

Right-click on the UsersTool class > Run As > Run Configurations. It will show you the configuration window of the main class as shown below:
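For example, assuming the add/get/list argument format sketched in section 7.3, the Program arguments field could contain:

add TheRealMT "Sir Arthur Conan Doyle" art@TheQueensMen.co.uk

and, for sections 7.5.3 and 7.5.4 respectively:

list
get TheRealMT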






The output of the project is shown below:



7.5.3. Running the project to display the list of users in the HBase users table:
Run the application with the argument list. It will display the list of users in the database:



7.5.4. Running the Project to get the user details:
Run the application with the arguments get TheRealMT. It will display the user details:



Before proceeding further with the project, please make sure the above code runs properly.

TO DO: documentation still needs to be added for adding twits, log files, linking users, MapReduce code, etc.

8. Downloads:
The Java code and configuration files for this section can be downloaded from the links below.
Pom.xml: https://www.dropbox.com/s/a71qql5y4xzynma/pom.xml?n=257653477


Users Class: https://www.dropbox.com/s/saib3uwmuql95sk/User.java?n=257653477
UsersDAO Class: https://www.dropbox.com/s/tyue314s22x54ln/UsersDAO.java?n=257653477
UsersTool Class: https://www.dropbox.com/s/xu4367bhmx99d8v/UsersTool.java?n=257653477
