Using Hadoop with Couchbase

Martin Brown
VP of Technical Publications, Couchbase

Skill Level: Intermediate
Date: 11 Sep 2012

Hadoop is great at processing large quantities of data and distilling that information down into a smaller set that you can query. However, that processing can take a long time. By integrating Hadoop with Couchbase Server, you can perform live querying and reporting on the processed information while continuing to use Hadoop for the large data set and the heavy processing work. Couchbase Server also uses a MapReduce querying system, which makes it easy to migrate and integrate your indexing and querying logic so that you can extract and manipulate the information effectively.

Hadoop and data processing

Hadoop combines a number of key features that together make it very useful for processing large quantities of data down into smaller, usable chunks. The primary component is the HDFS file system, which allows information to be distributed across a cluster. The information stored in this distributed format can also be processed individually on each cluster node, through a system called MapReduce. The MapReduce process converts the information stored in HDFS into smaller, processed, more manageable chunks.

Because Hadoop works across multiple nodes, it can process vast quantities of input information and simplify it into more usable blocks of information. This processing is handled through a simple MapReduce system. MapReduce is a way of taking incoming information, which may or may not be in a structured format, and converting it into a structure that can be more easily used, queried, and processed. For example, a typical use is to process log information from hundreds of different applications so that you can identify specific problems, counts, or other events.
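That log-processing use case can be pictured with a tiny, framework-free sketch of a map step. The log format, regular expression, and function name here are illustrative assumptions for this article, not part of Hadoop's API:

```python
import re

# Hypothetical Apache-style access-log pattern; the article doesn't fix a format.
LOG_LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "GET (\S+)[^"]*" (\d{3})')

def map_log(line):
    """Emit a (page, status) pair for error responses, or None otherwise."""
    m = LOG_LINE.match(line)
    if m and m.group(2).startswith(("4", "5")):
        return (m.group(1), int(m.group(2)))
    return None

sample = '127.0.0.1 - - [11/Sep/2012:10:00:00 +0000] "GET /index.html HTTP/1.1" 404'
print(map_log(sample))  # ('/index.html', 404)
```

Run over millions of log lines, a map step like this discards everything except the error records you care about, which is exactly the kind of reduction described next.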
By using the MapReduce format, you can start to measure and look for trends, translating what would otherwise be a very significant quantity of information into a smaller data set. When looking at the logs of a web server, for example, you might want to look at the errors that occur within a specific range on specific pages. You can write a MapReduce function to identify specific errors on individual pages and generate that information in the output. Using this method, you can reduce many lines of information from the log files into a much smaller collection of records containing only the error information.

Understanding MapReduce

MapReduce works in two phases. The map process takes the incoming information and maps it into a standardized format. For some information types, this mapping can be direct and explicit. For example, if you are processing input data such as a web log, you extract a single column of data from the text of the log. For other data, the mapping might be more complex. When processing textual information, such as research papers, you might extract phrases or more complex blocks of data.

The reduce phase is used to collate and summarize the data. The reduction can take place in a number of different ways, but the typical process is to perform a basic count, sum, or other statistic based on the individual data from the map phase.

Consider a simple example, such as the word count used as the sample MapReduce job in Hadoop: the map phase breaks apart raw text to identify individual words and generates a block of output data for each word. The reduce function then takes these blocks of mapped information and reduces them down, incrementing the count for each unique word seen.
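The two phases of the word-count example can be sketched without any Hadoop machinery at all; the helper names below are illustrative, not part of the Hadoop API:

```python
def map_phase(text):
    # Map: break raw text into words and emit a (word, 1) pair for each one.
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Reduce: collate the mapped pairs, incrementing the count per unique word.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

pairs = map_phase("the quick brown fox jumps over the lazy dog")
print(reduce_phase(pairs)["the"])  # 2
```

Hadoop's contribution is not this logic, which is trivial, but running it in parallel across blocks of data distributed over many nodes.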
Given a single text file of 100 words, the map process would generate 100 blocks of data, but the reduce phase would summarize this down to a count of each unique word: say, 56 words, each with the number of times it appeared. With web logs, the map would take the input data and generate a record for each error within the log file, containing the date, time, and page that caused the problem.

Within Hadoop, the MapReduce phases take place on the individual nodes on which the individual blocks of source information are stored. This is what enables Hadoop to work with such large data sets: multiple nodes work on the data simultaneously. With 100 nodes, for example, you could process 100 log files simultaneously and simplify many gigabytes (or terabytes) of information much more quickly than you could through a single node.

Hadoop limitations

One of the major limitations of the core Hadoop product is that there is no way to store and query information as you would in a database. Data is added into the HDFS system, but you cannot ask Hadoop to return a list of all the data matching a specific data set. The primary reason for this is that Hadoop doesn't store, structure, or understand the structure of the data held within HDFS. This is why the MapReduce system is required to parse and process the information into a more structured format.

However, we can combine the processing power of Hadoop with a more traditional database so that we can query the data that Hadoop has generated through its own MapReduce system. There are many possible solutions, including many traditional SQL databases, but we can keep the MapReduce theme, which is very effective for large data sets, by using Couchbase Server. The basic structure of the data sharing between the systems is shown in Figure 1.

Figure 1.
Basic structure of the data sharing between the systems

Installing Hadoop

If you haven't installed Hadoop already, the easiest way is to use one of the Cloudera installations. For compatibility between Hadoop, Sqoop, and Couchbase, the best solution is to use the CDH3 installation (see Resources). For this, you will need Ubuntu 10.10 to 11.10. Later Ubuntu releases introduce an incompatibility because they no longer support a package required by the Cloudera Hadoop installation.

Before installation, make sure you have a Java virtual machine installed, and ensure that you've configured the correct home directory of your JDK in the JAVA_HOME variable. Note that you must have a full Java Development Kit available, not just a Java Runtime Environment (JRE), because Sqoop compiles code to export and import data between Couchbase Server and Hadoop.

To install CDH3 on Ubuntu and similar systems, do the following:

1. Download the CDH3 configuration package. This adds the configuration for the CDH3 source files to the apt repository.

2. Update your repository cache: $ apt-get update

3. Install the main Hadoop package: $ apt-get install hadoop-0.20

4. Install the Hadoop components (see Listing 1).

Listing 1. Installing the Hadoop components

$ for comp in namenode datanode secondarynamenode jobtracker tasktracker; do
      apt-get install hadoop-0.20-$comp
  done

5. Edit the configuration files to ensure you've set up the core components.

6. Edit /etc/hadoop/conf/core-site.xml to read as shown in Listing 2.

Listing 2. Edited /etc/hadoop/conf/core-site.xml file

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

This configures the default HDFS location for storing data. Edit /etc/hadoop/conf/hdfs-site.xml (see Listing 3).

Listing 3.
Edited /etc/hadoop/conf/hdfs-site.xml file

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

This sets the replication level for the stored data; a value of 1 (a single copy) is suitable for a single-node installation. Edit /etc/hadoop/conf/mapred-site.xml (see Listing 4).

Listing 4. Edited /etc/hadoop/conf/mapred-site.xml file

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

This enables the job tracker for MapReduce.

7. Finally, edit the Hadoop environment to point to your JDK installation directory within /usr/lib/hadoop/conf/hadoop-env.sh. There is a commented-out line for the JAVA_HOME variable; uncomment it and set it to your JDK location. For example: export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk

8. Now start up Hadoop on your system. The easiest way is to use the start-all.sh script: $ /usr/lib/hadoop/bin/start-all.sh

Assuming everything is configured correctly, you should now have a running Hadoop system.

Couchbase Server overview

Couchbase Server is a clustered, document-based database system that uses a caching layer to provide very fast access to your data by keeping the majority of it in RAM. The system uses multiple nodes and a caching layer with automatic sharding across the cluster. This elasticity means you can grow and shrink the cluster to take advantage of more RAM or disk I/O and improve performance.

All data in Couchbase Server is eventually persisted to disk, but writes and updates initially operate through the caching layer. This is what provides the high performance, and it is what we can exploit when processing Hadoop data to get live information and query the contents. In its basic form, Couchbase Server remains a basic document and key/value-based store.
You can only retrieve information from the cluster if you know the document ID. In Couchbase Server 2.0, you can store documents in JSON format and then use the view system to create a view on the stored JSON documents. A view is a MapReduce combination that is executed over the documents stored in the database. The output from a view is an index, matching the structure you've defined through the MapReduce functions. The existence of the index provides you with the ability to query the underlying document data.

We can use this functionality to take the processed data from Hadoop, store it within Couchbase Server, and then use it as our basis for querying that data. Conveniently, Couchbase Server uses a MapReduce system for processing the documents and creating the indexes. This provides some level of compatibility and consistency with the methods for processing the data.

Installing Couchbase Server

Installing Couchbase Server is easy. Download the Couchbase Server 2.0 release for your platform from the Couchbase website (see Resources), and install the package using dpkg or RPM (depending on your platform). After installation, Couchbase Server starts automatically. To configure it, open a web browser and point it to localhost:8091 on your machine (or access it remotely using the machine's IP address). Follow the on-screen configuration instructions. You can use most of the default settings as provided during the installation, but the most important settings are the location of the data files for the data written into the database and the amount of RAM you allocate to Couchbase Server.

Getting Couchbase Server to talk to the Hadoop connector

Couchbase Server uses the Sqoop connector to communicate with your Hadoop cluster. Sqoop provides a connection to transfer data in bulk between Hadoop and Couchbase Server.
Technically, Sqoop is an application designed to convert information between structured databases and Hadoop. The name Sqoop is actually derived from SQL and Hadoop.

Installing Sqoop

If you are using the CDH3 installation, you can install Sqoop using your package manager: $ sudo apt-get install sqoop

This installs Sqoop in /usr/lib/sqoop.

Note: A bug in Sqoop means that it will sometimes try to transfer the wrong data sets. The fix is part of Sqoop version 1.4.2. If you experience problems, try version 1.4.2 or later.

Installing the Couchbase Hadoop Connector

The Couchbase Hadoop Connector is a collection of Java jar files that support the connectivity between Sqoop and Couchbase. Download the Hadoop connector from the Couchbase website (see Resources). The file is packaged as a zip file. Unzip it, and then run the install.sh script inside, supplying the location of the Sqoop system. For example: $ sudo bash install.sh /usr/lib/sqoop

That installs all the necessary library and configuration files. Now we can start exchanging information between the two systems.

Importing data from Couchbase Server to Hadoop

Although it is not the scenario we will deal with directly here, it's worth noting that we can export data from Couchbase Server into Hadoop. This could be useful if you had loaded a large quantity of data into Couchbase Server and wanted to take advantage of Hadoop to process and simplify it. To do this, you can load the entire data set from Couchbase Server into a Hadoop file within HDFS using: $ sqoop import --connect http://192.168.0.71:8091/pools --table cbdata

The URL provided here is the location of the Couchbase Server bucket pool. The table specified here is actually the name of the directory within HDFS where the data will be stored. The data itself is stored as a key/value dump of information from Couchbase Server.
In Couchbase Server 2.0, this means that the data is written out using the unique document ID, with the JSON value of the record as the content.

Writing JSON data in Hadoop MapReduce

For exchanging information between Hadoop and Couchbase Server, we need to speak a common language: in this case, JSON (see Listing 5).

Listing 5. Outputting JSON within Hadoop MapReduce

package org.mcslp;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

import com.google.gson.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, Text> {

        class wordRecord {
            private String word;
            private int count;

            wordRecord() {
            }
        }

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }

            wordRecord word = new wordRecord();
            word.word = key.toString();
            word.count = sum;

            Gson json = new Gson();
            output.collect(key, new Text(json.toJson(word)));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

The code is a modification of the word-count sample provided with the Hadoop distribution. This version uses the Google Gson library to write JSON information from the reduce phase of the processing. For convenience, a new class (wordRecord) is used, which Gson converts into a JSON record; this is the format we require on a document-by-document basis for Couchbase Server to process and parse the contents. Note that because the map phase emits IntWritable values while the reduce phase emits Text, the map output value class is set explicitly.

Note also that we do not define a Combiner class for Hadoop. This prevents Hadoop from trying to re-reduce the information, which with the current code would fail, because our reduce takes in a word and a single count and outputs a JSON value. For a secondary reduce/combine stage, we would need to parse the JSON input or define a new Combiner class that outputs the JSON version of the information. This simplifies the definition slightly.

To use this within Hadoop, first copy the Google Gson library into the Hadoop library directory (/usr/lib/hadoop/lib). Then restart Hadoop to ensure that the library is correctly picked up.

Next, compile your code into a directory:

$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar:./google-gson-2.2.1/gson-2.2.1.jar -d wordcount_classes WordCount.java

Now create a jar of your classes:

$ jar -cvf wordcount.jar -C wordcount_classes/ .
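Each line the reducer emits is a small, self-contained JSON document. As a quick illustration of the record shape (Python's json module standing in for Gson here; the field names come from the wordRecord class in Listing 5):

```python
import json

def word_record(word, count):
    # One JSON document per unique word, mirroring the wordRecord class.
    return json.dumps({"word": word, "count": count})

print(word_record("breeze", 2))  # {"word": "breeze", "count": 2}
```

It is this one-document-per-word output that makes the result directly loadable into Couchbase Server in the next step.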
With this completed, you can copy a number of text files into a directory and then use the jar to process them into a count of the individual words, with a JSON record containing each word and its count. For example, to process this data on some Project Gutenberg texts:

$ hadoop jar wordcount.jar org.mcslp.WordCount /user/mc/gutenberg /user/mc/gutenberg-output

This generates a list of words in the output directory, counted by the MapReduce function within Hadoop.

Exporting the data from Hadoop to Couchbase Server

To get the data back out of Hadoop and into Couchbase Server, we use Sqoop to export the data:

$ sqoop export --connect http://10.2.1.55:8091/pools --table ignored --export-dir gutenberg-output

The --table argument in this example is ignored, but --export-dir is the name of the directory where the information to be exported is located.

Writing MapReduce in Couchbase Server

Within Hadoop, MapReduce functions are written in Java. Within Couchbase Server, the MapReduce functionality is written in JavaScript. As an interpreted language, this means that you do not need to compile the view, and it allows you to edit and refine the MapReduce structure.

To create a view within Couchbase Server, open the admin console (on http://localhost:8091), and then click the View button. Views are collected within a design document. You can create multiple views in a single design document, and you can create multiple design documents. To improve the overall performance of the server, the system also supports a development view that can be edited and a production view that cannot. The production view cannot be edited because doing so would invalidate the view index and require the index to be rebuilt.

Click the Create Development View button and name your design document and view. Within Couchbase Server, there are the same two functions: map and reduce.
The map function is used to map the input data (the JSON documents) to a table. The reduce function is then used to summarize and reduce the mapped output. Reduce functions are optional and are not required for the index functionality, so we'll ignore them for the purposes of this article.

For the map function, the format is shown in Listing 6.

Listing 6. Format of the map function

function(doc) {
}

The argument doc is each stored JSON document. Because the storage format for Couchbase Server is a JSON document and the view is written in JavaScript, we can access a JSON field called count using doc.count.

To emit information from the map function, you call the emit() function. The emit() function takes two arguments: the first is the key, which is used to select and query information; the second is the corresponding value. Thus, we can create a map function that outputs the word and the count using the code in Listing 7.

Listing 7. map function that outputs the word and the count

function (doc) {
  if (doc.word) {
    emit(doc.word, doc.count);
  }
}

This outputs a row of data for each input document, containing the document ID (actually our word), the word as the key, and the count of occurrences of that word in the source text. You can see the raw JSON output in Listing 8.

Listing 8.
The raw JSON output

{"total_rows":113,"rows":[
{"id":"acceptance","key":"acceptance","value":2},
{"id":"accompagner","key":"accompagner","value":1},
{"id":"achieve","key":"achieve","value":1},
{"id":"adulteration","key":"adulteration","value":1},
{"id":"arsenic","key":"arsenic","value":2},
{"id":"attainder","key":"attainder","value":1},
{"id":"beerpull","key":"beerpull","value":2},
{"id":"beware","key":"beware","value":5},
{"id":"breeze","key":"breeze","value":2},
{"id":"brighteyed","key":"brighteyed","value":1}
]
}

In the output, id is the document ID, key is the key you specified in the emit statement, and value is the value specified in the emit statement.

Getting live data

Now that we have processed the information in Hadoop, imported it into Couchbase Server, and created a view on that data within Couchbase Server, we can begin to query the information that we have processed and stored. Views are accessed using a REST-like API or, if you are using one of the Couchbase Server SDKs, through the corresponding view-querying functions. Querying is possible by three main selections:

Individual key. Show the information matching a specific key, such as 'unkind'.

List of keys. Supply an array of key values; this returns all records where the view key matches one of the supplied values. For example, ['unkind','kind'] would return records matching either word.

Range of keys. Specify a start and an end key.

For example, to find the count for a specific word, you use the key argument to the view:

http://192.168.0.71:8092/words/_design/dev_words/_view/byword?connection_timeout=60000&limit=10&skip=0&key=%22breeze%22

Couchbase Server naturally outputs the results of a MapReduce in UTF-8 order, sorted by the specified key. This means that you can get a range of values by specifying the start value and end value.
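The %22 sequences in these URLs are just URL-encoded JSON string quotes: view keys are JSON values, so a string key must carry its own quotation marks. A small sketch of building such a query string (the host, bucket, and view names come from the example above; the helper function itself is illustrative):

```python
from urllib.parse import urlencode

def view_query_url(host, bucket, ddoc, view, **params):
    # View keys are JSON values, so a string key must include its own quotes,
    # which urlencode then escapes as %22.
    base = "http://%s:8092/%s/_design/%s/_view/%s" % (host, bucket, ddoc, view)
    return base + "?" + urlencode(params)

url = view_query_url("192.168.0.71", "words", "dev_words", "byword",
                     limit=10, skip=0, key='"breeze"')
print(url)
```

The same helper works for range queries by passing startkey and endkey parameters instead of key.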
For example, to get all the words between 'breeze' and 'kind':

http://192.168.0.71:8092/words/_design/dev_words/_view/byword?connection_timeout=60000&limit=10&skip=0&startkey=%22breeze%22&endkey=%22kind%22

The querying is simple but very powerful, especially when you realize that you can combine it with the flexible view system to generate data in the format you want.

Conclusion

Hadoop on its own provides a powerful processing platform, but no method for actually extracting useful information from the data that is processed. By connecting Hadoop to another system, you can query and extract information. Because Hadoop uses MapReduce for processing, you can take advantage of your MapReduce knowledge through the MapReduce system in Couchbase Server to provide your querying platform. Using this method, you process in Hadoop, export from Hadoop into Couchbase Server as JSON documents, and then use MapReduce in Couchbase Server to query the processed information.

Resources

Learn

- Apache Hadoop Project
- Apache Hadoop Distributed File System
- HadoopDB Project website
- Hadoop MapReduce tutorial: Learn more from this tutorial on Apache.org.
- Using MapReduce and load balancing on the cloud (Kirpal A. Venkatesh et al., developerWorks, July 2010): Learn how to implement the Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing to improve the performance of both single- and multiple-node systems.
- CDH3 Installation - Cloudera Support: Find information on installing Hadoop using CDH3.
- Big Data Glossary, Pete Warden, O'Reilly Media, ISBN: 1449314597, 2011.
- Hadoop: The Definitive Guide, Tom White, O'Reilly Media, ISBN: 1449389732, 2010.
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, Azza Abouzeid et al., Proceedings of the VLDB Endowment, 2(1), 2009: This paper explores the feasibility of building a hybrid system that takes the best features from both technologies.
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004.
- SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions, Eric Friedman et al., Proceedings of the VLDB Endowment, 2(2), 2009: This paper describes the motivation for this new approach to UDFs as well as the implementation within AsterData Systems' nCluster database.
- MapReduce and parallel DBMSs: friends or foes?, Michael Stonebraker et al., Commun. ACM, 53(1), 2010.
- A Survey of Large Scale Data Management Approaches in Cloud Environments, Sherif Sakr et al., Journal of IEEE Communications Surveys and Tutorials, 13(3), 2011: This paper gives a comprehensive survey of numerous approaches and mechanisms for deploying data-intensive applications in the cloud, which are gaining a lot of momentum in both the research and industrial communities.
- James Phillips looks ahead in 2012: Tune into this podcast to hear what's going on at CouchDB.
- Aaron Miller and Nitin Borwankar on CouchDB and the CouchOne mobile platform: Tune into this podcast to learn more about this full-stack application environment, written in Erlang, and ported to Android.
- developerWorks Business analytics: Find more analytic technical resources for developers.
- developerWorks Open source: Find extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM products.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts.

Get products and technologies

- Couchbase Server: Download this clustered, document-based database system.
- Couchbase Server Hadoop Connector: Download this collection of Java jars that support the connectivity between Sqoop and Couchbase.
- Hadoop 0.20.1, Hadoop MapReduce, and Hadoop HDFS: Download all from Apache.org.
- Evaluation software: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2, Lotus, Rational, Tivoli, and WebSphere.

Discuss

- developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
- developerWorks profile: Create your profile today and set up a watchlist.

About the author

Martin Brown

A professional writer for over 15 years, Martin 'MC' Brown is the author of and contributor to over 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS, and more. He is a former LAMP Technologies Editor for LinuxWorld magazine and a regular contributor to ServerWatch.com, LinuxPlanet, ComputerWorld, and IBM developerWorks. He draws on a rich and varied background as a founder member of a leading UK ISP, systems manager and IT consultant for an advertising agency and Internet solutions group, technical specialist for an intercontinental ISP network, database designer and programmer, and self-confessed compulsive consumer of computing hardware and software.
MC is currently the VP of Technical Publications and Education for Couchbase and is responsible for all published documentation, training programs and content, and the Couchbase Techzone.

Copyright IBM Corporation 2012 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)