
Data Collection Framework for Social Network and Hadoop-HBase Performance Evaluation

Dileep V K, Anuj Kumar, Dr. Udayakumar Shenoy

Computer Science and Engineering Department, NMAMIT, Nitte, India.

dileepvk@gmail.com

ukshenoy@gmail.com

Formcept

Bangalore, India

anuj.kumar@formcept.com

ABSTRACT

The data collection framework presented here focuses on the social network Twitter. It collects data by searching Twitter for a specified keyword and stores the results in HBase, a distributed database built on top of the Hadoop distributed file system. Each result consists of the tweet message, the author's image, the date the message was created, and the user id; each search yields hundreds of tweets. The collected data is then analyzed, and the analysis is shown using graphs. HBase is the open source counterpart of BigTable, the distributed storage system developed by Google for the management of large volumes of structured data. Like most non-SQL database systems, HBase is written in Java. The purpose of the current work is to evaluate the performance of the HBase implementation in comparison with an SQL database. The tests evaluate the performance of random writes and random reads of rows, sequential writes and sequential reads, how these are affected by increasing the number of column families, and the use of MapReduce functions.

Key words: Hadoop, HBase, BigTable, MapReduce, DataNode, NameNode, Twitter, ActiveMQ

1. INTRODUCTION

Many modern applications include a database server serving multiple web servers that are accessed by many clients. In this situation, many consider upgrading the hardware without considering the database server itself. One can argue that SQL technology has reached its maximum point of scalability. Applications that use databases are generally designed for complex environments managing large amounts of information, and as a general rule ease of use is the main criterion. The classic database model is the relational one, in which data is stored in tables. When the situation requires a very large data set, relational databases lose their power; distributed relational databases are therefore more and more often substituted by non-SQL alternatives. There is a need for a new approach in which a large increase in performance comes at insignificant cost and provides good scalability. Apache Hadoop is a top-level Apache project that includes open source implementations of a distributed file system and MapReduce that were inspired by Google's GFS and MapReduce projects. The Hadoop ecosystem also includes projects like Apache HBase, which is inspired by Google's BigTable; Apache Hive, a data warehouse built on top of Hadoop; and Apache ZooKeeper, a coordination service for distributed systems. Here, Hadoop has been used in conjunction with HBase for storage and analysis of large data sets. These workloads typically read and write large amounts of data from disk sequentially, so there has been less emphasis on making Hadoop performant for random-access workloads by providing low-latency access to HDFS. Administration of MySQL clusters requires a relatively high management overhead, and they typically use more expensive hardware. Given our high confidence in the reliability and scalability of HDFS, we began to explore Hadoop and HBase for such applications and decided to use HBase for this project. HBase in turn leverages HDFS for scalable and fault-tolerant storage and ZooKeeper for distributed consensus. In the following sections we present these applications in more detail and explain why we decided to use Hadoop and HBase as the common foundation technologies. Finally, we discuss ongoing and future work on the project.

2. RELATED WORK

The first step of the project is setting up Hadoop. A dedicated Hadoop user account is used for running Hadoop. Our goal is a single-node setup of Hadoop that uses Hadoop's Distributed File System (HDFS). The first step in starting up the Hadoop installation is formatting the Hadoop file system, which is implemented on top of the local file system of the "cluster". This needs to be done only once, when the Hadoop cluster is first set up.
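For a single-node setup of this vintage (Hadoop 0.20.x, as used alongside HBase 0.20.2 elsewhere in this work), the format and start-up steps reduce to two commands; the paths below assume the stock tarball layout and are illustrative only:

```shell
# One-time format of the NameNode's on-disk metadata
# (destroys any existing HDFS data; run only on first setup)
bin/hadoop namenode -format

# Start the NameNode and DataNode daemons
bin/start-dfs.sh
```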

HBase uses the same configuration system as Hadoop. When running in distributed mode, after making an edit to an HBase configuration file, make sure to copy the content of the conf directory to all nodes of the cluster. In distributed mode, also replace the hadoop jar found in the HBase lib directory with the hadoop jar running on the cluster to avoid version mismatch issues; make sure to replace the jar in HBase everywhere on the cluster. Distributed modes require an instance of the Hadoop Distributed File System (HDFS). A pseudo-distributed mode is simply a distributed mode run on a single host. Use this configuration for testing and prototyping on HBase.
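A pseudo-distributed setup of this kind can be sketched in hbase-site.xml as follows; the host and port in hbase.rootdir are placeholders for whatever the local HDFS instance actually uses:

```xml
<configuration>
  <!-- Point HBase at the HDFS instance (placeholder host/port). -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <!-- Pseudo-distributed: distributed mode on a single host. -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
```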

2.1 HDFS CLIENT CONFIGURATION

To finish the HDFS client configuration on your Hadoop cluster do one of the following:

I. Add a pointer to your HADOOP_CONF_DIR to the HBASE_CLASSPATH environment variable in hbase-env.sh.

II. Add a copy of hdfs-site.xml (or hadoop-site.xml) or, better, a symlink, under ${HBASE_HOME}/conf, or

III. If only a small set of HDFS client configurations is needed, add them directly to hbase-site.xml.
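Option I above amounts to a single line in hbase-env.sh; the Hadoop conf path shown here is a placeholder for your installation:

```shell
# Make HBase pick up the cluster's HDFS client settings
# (replace the path with your actual HADOOP_CONF_DIR)
export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/usr/local/hadoop/conf
```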

2.2 RUNNING AND CONFIRMING YOUR INSTALLATION

Make sure HDFS is running first; verify that it started properly by testing the put and get of files into the Hadoop filesystem. HBase starts up ZooKeeper as part of its own start process. Shutdown can take longer if the cluster is comprised of many machines. When shutting down a distributed deployment, wait until HBase has shut down completely before stopping the Hadoop daemons.

2.3 MESSAGE QUEUE IMPLEMENTATION USING ACTIVEMQ

ActiveMQ is an open source, Java Message Service (JMS)-compliant, message-oriented middleware (MOM) from the Apache Software Foundation that provides high availability, performance, scalability, reliability, and security for enterprise messaging. The generic term "Destination" refers to both Queues and Topics; consumers and producers share only the name of a Destination. A message is deleted from storage once a service-level acknowledgement is received from the consumer. Three practical points are worth noting: remember to start the connection, otherwise the consumer will never receive a message; remember to close the connection in order to free resources, since the program will not end while the connection is open; and the receive call blocks, so a consumer will wait forever if there is no message in the Queue.
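The consumer-side points above (start the connection, close it, bound the blocking receive) can be sketched against the standard JMS API. This is an illustrative sketch, not our framework's actual consumer: the broker URL and queue name are placeholders, and it assumes the ActiveMQ client jar is on the classpath and a broker is running.

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class QueueConsumerSketch {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker URL
        Connection connection = factory.createConnection();
        connection.start(); // without start() the consumer never receives anything
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Destination queue = session.createQueue("tweets"); // placeholder queue name
            MessageConsumer consumer = session.createConsumer(queue);
            // receive() with no argument blocks forever on an empty queue;
            // a timeout bounds the wait instead.
            Message msg = consumer.receive(5000);
            if (msg instanceof TextMessage) {
                System.out.println(((TextMessage) msg).getText());
            }
        } finally {
            connection.close(); // an unclosed connection keeps the program alive
        }
    }
}
```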

2.4 TWITTER API PROGRAMMING USING JAVA

Ensure search parameters are properly URL encoded. A query is constructed against http://search.twitter.com/search.json?q= followed by the encoded query. For example: http://search.twitter.com/search.json?q=%40twitterapi

This returns tweets that match a specified query. The Search API provides an option to retrieve "popular tweets" in addition to real-time search results; see the result_type parameter below for more information.

Resource URL: http://search.twitter.com/search.format

Parameters:

q (required): Search query. Should be URL encoded. Queries will be limited by complexity.

count: Indicates the number of previous statuses to consider for delivery before transitioning to live stream delivery. On unfiltered streams, all considered statuses are delivered, so the number requested is the number returned. On filtered streams, the number requested is the number of statuses that are applied to the filter predicate, and not the number of statuses returned.

result_type: Specifies what type of search results you would prefer to receive. Valid values include:
mixed: include both popular and real-time results in the response.
recent: return only the most recent results in the response.
popular: return only the most popular results in the response.
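The URL-encoding step above can be sketched with the standard library alone. Note that java.net.URLEncoder performs form encoding (spaces become '+'), which is what this query-string position expects; reserved characters such as '@' become percent escapes like %40:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchQuery {
    static final String BASE = "http://search.twitter.com/search.json?q=";

    // Percent-encode the raw query so reserved characters survive transport.
    static String buildUrl(String rawQuery) {
        return BASE + URLEncoder.encode(rawQuery, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '@' is encoded as %40, matching the example in the text.
        System.out.println(buildUrl("@twitterapi"));
    }
}
```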

3. SYSTEM MODEL

3.1 DATA COLLECTION FRAMEWORK

This project deals with a data collection framework that includes storage of unstructured data by creating a structured framework. The framework takes up the challenge of managing the incoming flow of unstructured content, irrespective of its source. The solution is built on top of a distributed file system and message queues; it handles the incoming flow of data, the storage of data, and retrieval on demand. This data collection framework focuses only on the social network Twitter. It collects data from Twitter by searching for a specified keyword, and the resulting data is stored in HBase, a distributed database. Each result consists of the tweet message, the author's image, the date the message was created, and the user id; each search yields hundreds of tweets. The controlled movement of data is handled by a message queue implemented with ActiveMQ. HBase is a distributed database built on top of the Hadoop distributed file system. The collected data is analyzed, and the analysis is shown using graphs. The last part of the project runs performance tests with one region server. HBase performed adequately for the most part under the various test conditions; it works well in most situations, especially if it is not pushed to its limits, but it is still a work in progress. Finally, the data can be retrieved from the database.

3.2 HADOOP FRAMEWORK

Hadoop's approach to this non-relational world is very simple, providing reliable shared storage and an analysis system. Storage is supplied by HDFS and analysis by MapReduce; the main advantage of this architecture is the reduction of cost. Hadoop handles data management by keeping each block of data replicated three times. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. Hadoop runs a job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. MapReduce also provides a job tracker and a number of task trackers. The job tracker works as a master, coordinating all the jobs run on the system and establishing a schedule for tasks to run on task trackers. Task trackers are the slaves that run tasks and send progress reports to the job tracker. HDFS comes in handy when the dataset outgrows the storage capacity of a single physical machine and several machines are needed to process the task.

Blocks. The basic unit of measurement for HDFS is the block size, the minimum amount of data that it can read or write. HDFS blocks are much larger than those of ordinary file systems: 64 MB by default.

Namenodes and Datanodes. An HDFS cluster has two types of nodes that operate in a master-slave configuration: a namenode (the master) and a number of datanodes (slaves). The namenode is in charge of the filesystem's namespace and maintains the metadata for all the files and directories in the tree. The user accesses the filesystem by communicating with the namenode and the datanodes. In fact, if the machine running the namenode were to go down, all the files on the filesystem would be lost, since there would be no way of finding out how to reconstruct the files from the blocks on the datanodes.

HBase. Data is stored in a database in a structured manner, while a distributed storage system similar to the one proposed by Google through BigTable can store large amounts of semi-structured data without having to redesign the entire scheme.
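The map/reduce split described above can be illustrated with a toy in-memory word count; this is a simulation of the programming model in plain Java, not the Hadoop API (map emits (word, 1) pairs, reduce sums the values grouped under each key):

```java
import java.util.*;

public class WordCount {
    // Map phase: emit one (word, 1) pair per word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // Reduce phase: sum the values grouped under each key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"hbase on hadoop", "hadoop hdfs"}) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs)); // {hadoop=2, hbase=1, hdfs=1, on=1}
    }
}
```

In real Hadoop the pairs emitted by many parallel map tasks are shuffled across the network before reduce tasks run; here both phases share one process.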
In this paper, we try to assess the performance of an open source implementation of BigTable, named HBase, developed in the Java programming language. HBase is a distributed column-oriented database built on top of HDFS, designed from the ground up to scale simply by adding nodes. Applications store data in labeled tables. Tables are made of rows and columns. Table cells are versioned; a version is essentially a timestamp assigned by HBase at the time information is inserted into a cell. Table row keys are byte arrays, so theoretically anything can serve as a row key, from strings to binary representations of longs or even serialized data structures. Table rows are sorted by row key, the table's primary key, and all table accesses are via the primary key. HBase uses column families in place of relational indexing: row columns are grouped into column families, and all column family members share a common prefix. Tables are automatically partitioned horizontally by HBase into regions; each region consists of a subset of a table's rows. The HBase master is responsible for bootstrapping an initial installation, for assigning regions to registered regionservers, and for recovering from regionserver failures. Start and stop scripts are like those in Hadoop, using the same SSH-based mechanism for running remote commands. The conf/hbase-site.xml and conf/hbase-env.sh files keep the cluster site configuration, having the same format as their equivalents in Hadoop. HBase also has special catalog tables named -ROOT- and .META., within which it maintains the current list, state, recent history, and location of all regions. The -ROOT- table holds the list of .META. table regions, and the .META. table holds the list of all user-space regions.
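The lexicographic row-key ordering described above can be sketched with a sorted map; this is a simulation of the data-model property, not the HBase client API, and the "user|date" row-key scheme is a hypothetical example:

```java
import java.util.*;

public class RowKeySort {
    // Rows land sorted by row key regardless of insertion order,
    // mirroring HBase's lexicographic primary-key ordering.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("user3|20120115", "tweet C");
        table.put("user1|20120110", "tweet A");
        table.put("user1|20120112", "tweet B");
        return table;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        System.out.println(table.firstKey()); // user1|20120110
        // With keys sorted this way, a range scan over one user's rows
        // is a cheap contiguous prefix scan.
        System.out.println(table.subMap("user1", "user2").size()); // 2
    }
}
```

Choosing a row key whose prefix groups the rows you scan together is the main schema-design lever this ordering gives you.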

4. WHY HADOOP AND HBASE

The requirements for the storage system can be summarized as follows:

1. Elasticity: We need to be able to add incremental capacity to our storage systems with minimal overhead and no downtime. In some cases we may want to add capacity rapidly, and the system should automatically balance load and utilization across the new hardware.

2. High write throughput: Most of the applications store tremendous amounts of data and require high aggregate write throughput.

3. Efficient and low-latency strong consistency semantics within a data center: There are important applications, like Messages, that require strong consistency within a data center. This requirement often arises directly from user expectations. We also knew that Messages was easy to federate, so that a particular user could be served entirely out of a single data center, making strong consistency within a single data center feasible.

4. Efficient random reads from disk: In spite of the widespread use of application-level caches, a lot of accesses miss the cache and hit the back-end storage system.

5. High availability and disaster recovery: We need to provide a service with very high uptime to users, covering both planned and unplanned events.

6. Fault isolation: In the warehouse usage of Hadoop, individual disk failures affect only a small part of the data, and the system quickly recovers from such faults.

7. Range scans: Several applications require efficient retrieval of a set of rows in a particular range: for example, the last 100 messages for a given user, or the hourly impression counts over the last 24 hours for a given advertiser.

HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column orientation gives extreme flexibility in storing data, and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data and large indices, and must maintain the flexibility to scale out quickly.

5. RESULTS

1) Data Collection Framework:

The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets; it can also return results as RSS feeds. The Search API is not a complete index of all Tweets, but instead an index of recent Tweets: at the moment that index includes between six and nine days of Tweets, and the Search API cannot be used to find Tweets older than about a week. Queries can be limited due to complexity. Search does not support authentication, meaning all queries are made anonymously, and search is focused on relevance rather than completeness, so some Tweets and users may be missing from search results. The Data Collection Framework, by contrast, is an API for running searches against an index of recent Tweets based on keywords; here the Search API is consumed as JSON feeds. Using the framework we can keep an index that includes all the Tweets related to a particular keyword, which means we can find Tweets older than a month. Queries are flexible enough to retrieve various attributes related to the Tweets. Our framework records details such as the message, the timestamp of the Tweet, and the author's identification (image and user id). All the details are stored in HBase, from which they can be retrieved later, and further analysis can be done using these data.

2) Hadoop-HBase Performance Evaluation:

HBase vs. RDBMS

HBase Twitter Message Example:

– Table Twitter with family message
– Row is RowKey with Columns:
  • message:text stores the tweet message
  • message:date stores the timestamp of the tweet
  • message:userID stores the id of the user
– If processing raw data for hyperlinks and images, add families links and images:
  • links:<RowKey> column for each hyperlink
  • images:<RowKey> column for each image

RDBMS Twitter Message Example:

– Table Twitter with columns RowKey, text, date, userID
– Table links with columns RowKey and link
– Table images with columns RowKey and image

• How will this scale?
– 10M documents with an average of 10 links and 10 images each
– 210M total rows versus 10M total rows
– Index bloat with the links/images tables
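The row counts above follow from simple arithmetic: the RDBMS schema needs one row per document plus one row per link and per image, while in HBase the links and images fold into extra columns of the same document rows. A quick check:

```java
public class ScaleCheck {
    // RDBMS: one row per document, plus one row per link and per image.
    static long rdbmsRows(long docs, long linksPerDoc, long imagesPerDoc) {
        return docs + docs * linksPerDoc + docs * imagesPerDoc;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;
        long rdbms = rdbmsRows(docs, 10, 10);
        // HBase: links and images become columns, so the row count stays at docs.
        System.out.println(rdbms / 1_000_000 + "M vs " + docs / 1_000_000 + "M");
    }
}
```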

3) Column Test

# of columns     1000   10000   1,000,000
Writes/sec         55      42       Crash
Reads/sec          15      41       Crash

Test Description: The BigTable paper claimed that BigTable can handle an unbounded number of columns. This test was designed to test that claim within HBase. The test worked by creating a table with a single column family and then writing one 1000-byte value to that column family for the specified number of columns. Test Analysis: The table shows that HBase does not scale well as the number of columns grows very large. Write performance suffers somewhat, but read performance suffers a lot. This is probably because, as the number of columns increases, reads have a higher chance of having to fetch the row from disk instead of from memory. Writing 1,000,000 columns exposed an HBase bug that caused the test program to crash.

4) Sort Test

# of rows        10000   100000   1000000
Lexicographic      485      430       334
Reverse            450      471       354
Random             462      421       334

Test Description: HBase stores rows sorted lexicographically by row key. The motivation of this test is to determine whether performance changes if row keys are written in reverse lexicographic order or randomly. Test Analysis: The results show that there is no significant performance loss when rows are inserted in reverse lexicographic order or random order. In fact, the reverse test performed slightly better with 100,000 rows.

5) Interspersed Read/Write Test

# of writes     10000   100000   1000000
Reads/sec          95       42        24

Test Description: The goal of this test is to determine the performance hit when reading from very large tables. The test works by writing a specified number of rows, then reading back 5000 of them randomly and averaging the elapsed time. Test Analysis: Reads slow down as the number of rows written increases, which means that HBase will not scale that well on very large amounts of data per table. There are several possible reasons. First, if the table is very large, an arbitrary read will have to hit the disk instead of memory. Second, scanning data structures becomes more costly as their size increases.

6) Testing by Number of Column Families

In this test, we studied the speed of reading, writing, and updating rows in a table with multiple column families, and we tried to identify the maximum number of column families that can be used in an HBase table. HBase performance was evaluated using tables of 10, 50, 100, 500, 700 and 1,000 column families. For each of the above cases the following steps were taken:

Step 1. A table called "test" was created with the given number of column families, a single column, and a single row. For this single row, a randomly generated 1000-byte value was inserted into each column family (sequential insert).
Step 2. Using the table created at Step 1, 5000 sequential reads were made.
Step 3. Using the table created at Step 1, 5000 random reads were made.
Step 4. Using the table created at Step 1, sequential updates of the rows in the table were made.

# of column families                      10    50   100   500    700
Setup time for one col. family (ms)       12    24    45   133    100
Sequential inserts (col. families/sec)   181    41    77    51     32
Sequential reads (col. families/sec)     800    23   142     6   Crash (Eclipse)
Random reads (col. families/sec)         800    23   140     6   Crash (Eclipse)
Sequential updates (col. families/sec)   136    50    76    56   Crash (Eclipse)

The table shows that one of the main problems is the time needed to add column families to the table: the time needed to set up a column family increases greatly. Reading records from the table is equally slow sequentially and randomly. When reading data from a table with 10 column families the operation is quick, but as the number of column families increases, speed decreases more and more.

Google specialists argue that column families can only be used in limited numbers, not exceeding a few hundred. It is true that the number can go up to 500, but performance decreases considerably. At more than 500 column families the table could be built but could not be used, and in an attempt to build a table with 1,000 column families the answer was a fast one: "Connection refused".

6. CONCLUSION

Our tests were performed on an Ubuntu platform. Hadoop together with HBase, being column-oriented, is often compared with relational databases. As we pointed out, HBase is a distributed, column-oriented database system. HBase builds on Hadoop, offering random-access reads and writes over data storage based on the HDFS architecture. It was built from scratch following a few important principles: a very large number of rows (on the order of billions), a large number of columns (on the order of millions), the ability to partition horizontally, and the ability to replicate easily over a large number of nodes in the system. General relational databases have a fixed structure of rows and columns, the ACID properties, and a powerful SQL engine behind them. The accent falls on strong consistency, on referential integrity, on abstraction from the physical layer, and on complex queries through the SQL language. With them we can easily create secondary indexes, write complex inner and outer joins, use functions such as Sum, Count, Sort, and Group, and track data across multiple tables, rows, and columns. Among the advantages of using a Hadoop and HBase platform we find that data is processed in parallel, so execution time decreases. The data is replicated, so there is always a backup, and the problems of managing space on each machine are passed to HDFS. Another advantage is that one or more column families can be added or deleted at any time. A disadvantage might be that HBase does not support joins between tables. This is not a major drawback, because all related information can be kept together in a single table and thus be more easily accessible; in this way the need for joins is eliminated. Extracting data from HBase proved to be quite speedy, regardless of the number of entries in the table. Moreover, HBase provides the possibility of conducting a table scan, and the HBase + MapReduce test proved to be more effective than sequential/random reads because the table scan option is implemented within HBase. Another benefit of HBase is its use of ZooKeeper, which relieves the master node of various tasks such as checking the availability of cluster servers and sending replies to client applications about the root table.

FUTURE WORK

We are working on the analysis of data stored in HBase and on visualising the results using a visualisation tool. Hadoop NameNodes can crash, which would not cause data loss but would shut down the cluster for a little while; one way to mitigate this is to use the distributed mode of the Hadoop installation. The other big area for Hadoop improvement is modularity, pluggability, and coexistence, on both the storage and application execution tiers. The use of Hadoop and HBase at Facebook is just getting started, and we expect to make several iterations on this suite of technologies and continue to optimize for our applications. As we try to use HBase for more applications, we have discussed adding support for the maintenance of secondary indices and summary views in HBase. Finally, as we try to use Hadoop and HBase for applications built to serve the same data in an active-active manner across different data centers, we are exploring approaches to deal with multi-data-center replication and conflict resolution. Our data collection framework will be useful for other social networks besides Twitter.

ACKNOWLEDGMENTS

The current state of the Hadoop realtime infrastructure has been the result of ongoing work over the last couple of years. Last but not least, thanks are also due to the users of our infrastructure, who have patiently dealt with periods of instability during its evolution and have provided valuable feedback that enabled us to make continued improvements to this infrastructure.

REFERENCES

[1] Apache Hadoop. Available at http://hadoop.apache.org

[2] Apache HDFS. Available at http://hadoop.apache.org/hdfs

[3] Apache HBase. Available at http://hbase.apache.org

[4] The Google File System. Available at http://labs.google.com/papers/gfs-sosp2003.pdf

[5] Tom White, Hadoop: The Definitive Guide.

[6] Dorin Carstoiu, Elena Lepadatu, Mihai Gaspar, HBase - non SQL Database, Performances Evaluation.

[7] BigTable: A Distributed Storage System for Structured Data. Available at http://labs.google.com/papers/bigtable-osdi06.pdf

[8] Lars George, HBase: The Definitive Guide.

[9] Bruce Snyder, Dejan Bosanac, Rob Davies, ActiveMQ in Action.

[10] Apache Hadoop Goes Realtime at Facebook. Available at borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf

[11] D. Carstoiu, A. Cernian, A. Olteanu, Hadoop HBase-0.20.2 Performance Evaluation. "Politehnica" University of Bucharest.

[12] https://dev.twitter.com/docs/api/1/get/search