
A Start to Hadoop

By: Ayush Mittal, Krupa Varughese, Parag Sahu

Major Focus
What is Hadoop?
Why Hadoop?
What is MapReduce?
Phases in MapReduce
What is HDFS?
Setting up Hadoop in pseudo-distributed mode
A simple pattern-matcher program

The elementary item: DATA


What is data? In the computer era, data is the information that flows from one machine to another. It comes in two broad types:
Structured: identifiable because of its organized form. Example: a database, where information is stored in rows and columns.
Unstructured: has no predefined data model; generally text-heavy (containing dates, numbers, logs), irregular, and ambiguous.

What is the problem with unstructured DATA?


It hides important insights.
Its storage and updating are time-consuming.

Big Data
It consists of datasets that grow so large that they become awkward to work with using on-hand database management tools. Sizes run to terabytes, exabytes, and zettabytes. Difficulties include capture, storage, search, sharing, analytics, and visualization. Examples include web logs, RFID, sensor networks, social data (the "social data revolution"), Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical data, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

Here comes Hadoop !!!


The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene. In 2004, the Nutch developers set about writing an open-source distributed filesystem, the Nutch Distributed Filesystem (NDFS). That same year, Google introduced MapReduce to the world. Early in 2005 the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run on MapReduce and NDFS. In 2006 Doug Cutting joined Yahoo!, which went on to build its site index on a 10,000-core Hadoop cluster. Hadoop has since sorted 1 terabyte of data in 62 seconds.

Hadoop ecosystem

Core: a set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro: a data serialization system for efficient, cross-language RPC and persistent data storage. (At the time of this writing, Avro had only just been created as a new subproject, and no other Hadoop subprojects were using it yet.)

MapReduce is a distributed data-processing model introduced by Google to support distributed computing on large data sets across clusters of computers. Hadoop can run MapReduce programs written in various languages.

HDFS
Reliable storage: Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS can store huge amounts of information and scale up incrementally, and it can survive the failure of significant parts of the storage infrastructure without losing data. Clusters can be built with inexpensive computers; if one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines. HDFS manages storage on the cluster by breaking incoming files into pieces, called blocks, and storing each block redundantly across the pool of servers. If the NameNode fails, however, the whole filesystem becomes unavailable.

HDFS example
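As a rough sketch of the idea (the file name and paths below are made-up examples; 64 MB blocks and 3-way replication are the usual HDFS defaults):

    $ hadoop fs -put /home/hduser/weblogs.txt /user/hduser/weblogs.txt
    $ hadoop fsck /user/hduser/weblogs.txt -files -blocks -locations

The fsck report shows how the file was broken into blocks and which DataNodes hold each replica.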

ZooKeeper: a distributed consensus engine. It provides well-defined concurrent-access semantics, such as distributed locking / mutual exclusion and message boards / mailboxes.

Pig: a data-flow oriented language, Pig Latin. Data types include sets, associative arrays, and tuples. It is a high-level language for routing data that allows easy integration of Java for complex tasks. Developed at Yahoo!.
Hive: an SQL-based data-warehousing application. Its feature set is similar to Pig's, but the language is more strictly SQL-esque, supporting SELECT, JOIN, GROUP BY, etc.

HBase: a column-store database based on the design of Google BigTable. It provides interactive access to information and holds extremely large datasets (multi-TB), with a constrained access model ((key, value) lookup) and limited transactions (single-row only).

Chukwa: a distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports.

Fuse-dfs: allows mounting of HDFS volumes via the Linux FUSE filesystem. This does not mean HDFS can serve as a general-purpose file system, but it does allow easy integration with other systems for data import/export.

Properties of the Hadoop system


Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services.
Robust: because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple: Hadoop allows users to quickly write efficient parallel code.

Comparing SQL databases and Hadoop


SCALE-OUT INSTEAD OF SCALE-UP: Scaling commercial relational databases is expensive. Their design is more friendly to scaling up: to run a bigger database, you buy a bigger machine. In fact, it's not unusual to see server vendors market their expensive high-end machines as "database-class servers." Unfortunately, at some point there won't be a big enough machine available for the larger data sets.
KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES: A fundamental tenet of relational databases is that data resides in tables having a relational structure defined by a schema. Although the relational model has great formal properties, many modern applications deal with data types that don't fit well into this model. Text documents, images, and XML files are popular examples.

FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL): SQL is fundamentally a high-level declarative language. You query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and code.
OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS: Hadoop is designed for offline processing and analysis of large-scale data. It does not work for random reading and writing of a few records, which is the type of load in online transaction processing. In fact, as of this writing (and for the foreseeable future), Hadoop is best used as a write-once, read-many-times type of data store. In this respect it is similar to data warehouses in the SQL world.

Building blocks of Hadoop


NameNode
DataNode
Secondary NameNode
JobTracker
TaskTracker

Namenode

Datanodes

MapReduce
A MapReduce program processes data by manipulating (key/value) pairs in the general form

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)


With a combiner, the general form becomes:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
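As a concrete instantiation of these signatures, a word-count job (used here only as an illustration, not a program from these slides) would look roughly like:
map: (LongWritable byteOffset, Text line) → list(Text word, IntWritable 1)
combine: (Text word, list(IntWritable counts)) → list(Text word, IntWritable partialSum)
reduce: (Text word, list(IntWritable partialSums)) → list(Text word, IntWritable total)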

MapReduce

The Mapper maps input key/value pairs to a set of intermediate key/value pairs. The MapReduce framework spawns one map task for each InputSplit generated by the job's InputFormat (input splits are a logical division of the records, 64 MB by default, and can be customized). The mapping class needs to extend the Mapper base class.
map: (K1, V1) → list(K2, V2)

Mapper

Inputsplit
Each map task processes a single split. Each split is divided into records, and the map processes each record (a key/value pair) in turn.

Mapper Example
// Emits candidate friend pairs from one input line of the form:
// <person> <friend1> <friend2> ...
// A pair tagged 0 is a direct friendship; a pair tagged 1 appears together on someone's friend list.
static class FriendMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String data = value.toString();
        String[] friends = data.split(" ");

        // Pair the line's owner (friends[0]) with each listed friend and tag it 0 (direct friends).
        String friendpair = friends[0];
        for (int i = 1; i < friends.length; i++) {
            // Put the lexically larger name first so (A,B) and (B,A) form the same key.
            if (friendpair.compareTo(friends[i]) > 0) {
                friendpair = friendpair + "," + friends[i];
            } else {
                friendpair = friends[i] + "," + friendpair;
            }
            context.write(new Text(friendpair), new IntWritable(0));
            friendpair = friends[0];
        }

        // Pair every two names on the line (including the owner) and tag it 1:
        // for two non-friends, each such 1 is one common friend they share.
        for (int j = 0; j < friends.length; j++) {
            friendpair = friends[j];
            for (int i = j + 1; i < friends.length; i++) {
                if (friendpair.compareTo(friends[i]) > 0) {
                    friendpair = friendpair + "," + friends[i];
                } else {
                    friendpair = friends[i] + "," + friendpair;
                }
                context.write(new Text(friendpair), new IntWritable(1));
                friendpair = friends[j];
            }
        }
    }
}

Reducer
The Reducer reduces the set of intermediate values that share a key to a smaller set of values. A Reducer has three primary phases: shuffle, sort, and reduce.
Shuffle: the framework fetches the relevant partition of the output of all the mappers.
Sort: the framework sorts the Reducer inputs by key (lexicographically).
Reduce: the reduce(key, values, context) method is called for each <key, (list of values)> pair in the grouped inputs.

Reducer Example
// Keeps only pairs that are NOT already direct friends (no 0 marker) and
// outputs how many common friends they share (the number of 1 markers).
static class FriendReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        boolean mark = false;   // becomes true if the pair is already a direct friendship
        for (IntWritable v : values) {
            if (v.get() == 0) {
                mark = true;
            }
            count++;
        }
        if (!mark) {
            context.write(key, new IntWritable(count));
        }
    }
}

Combiner: local reduce
Running combiners makes the map output more compact, so there is less data to write to local disk and to transfer to the reducers. If a combiner is used, the map's key/value pairs are not immediately written to the output; instead they are collected in lists, one list per key. When a certain number of key/value pairs has been buffered, the buffer is flushed by passing all the values of each key to the combiner's reduce method and writing out the key/value pairs of the combine operation as if they had been created by the original map operation.
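A minimal word-count sketch of wiring in a combiner (word count is used here purely as an illustration, not a program from these slides; summing is associative and commutative, so the reducer can safely double as the combiner via job.setCombinerClass. The FriendReducer above could not be reused that way, since it inspects the 0/1 markers and counts values rather than summing partial results):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountWithCombiner {

        static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String word : value.toString().split(" ")) {
                    if (word.length() > 0) {
                        context.write(new Text(word), ONE);   // emit (word, 1) for every word
                    }
                }
            }
        }

        static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();   // summing partial counts is associative, so this class
                }                     // can serve as both the combiner and the reducer
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration());
            job.setJarByClass(WordCountWithCombiner.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(WordMapper.class);
            job.setCombinerClass(SumReducer.class);   // local reduce on each map node
            job.setReducerClass(SumReducer.class);    // final reduce across the cluster
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }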

A quick view of Hadoop

Setting up Hadoop in pseudo-distributed mode


Prerequisites: Linux with JDK 1.6 or later, and the Hadoop tarball.

Configuration used: Linux RHEL 5.5 (running in VMware Player), JDK 1.6.

Log in as root (username: root, password: root@123).


Add a group:
$ groupadd hadoop
Create a user (hduser) and add it to the group:
$ useradd -G hadoop hduser
Change the password of hduser:
$ passwd hduser
Add hduser to the sudoers list; this gives hduser the privilege to run commands as root using sudo:
$ visudo
Scroll down to the line that grants root its privileges and add a matching entry for hduser, as shown below.
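A typical entry, assuming the standard sudoers layout (copy the rights shown for root; the exact spacing in your file may differ):

    root    ALL=(ALL)       ALL
    hduser  ALL=(ALL)       ALL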

Save the file using esc+:wq!

Run the ifconfig command to get the IP address, then add an entry for localhost (and the machine's hostname) to the hosts file:
$ ifconfig
$ vi /etc/hosts
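The file might then contain entries like these (the 192.168.1.10 address and the hostname rhel55 are placeholders; use the address reported by ifconfig and your actual hostname):

    127.0.0.1       localhost
    192.168.1.10    rhel55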

Changing the user:
$ su hduser

Generate a public/private SSH key pair; SSH is used by the Hadoop scripts to log in to the local machine automatically.
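A minimal sketch of the key-generation step (an empty passphrase, -P "", is assumed so the Hadoop scripts can log in without prompting):

    $ ssh-keygen -t rsa -P ""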

When it asks where to save the key, save it in your home directory with the file name id_rsa ($HOME/.ssh/id_rsa, which is /home/hduser/.ssh/id_rsa here).

Add the public key to the authorized keys file:
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
This command creates the file authorized_keys (if it does not exist) and appends your public key to it.

Restrict the permissions of /home/hduser/.ssh and /home/hduser/.ssh/authorized_keys. This is done so that SSH uses the key (with its empty passphrase) instead of prompting for a password; SSH ignores keys whose permissions are too open:
$ chmod 0700 /home/hduser/.ssh
$ chmod 0600 /home/hduser/.ssh/authorized_keys
Test whether SSH now logs in with the key:
$ ssh localhost
It should connect without asking for a password.
Extract the Hadoop tar in /usr/local/:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop      // changing the name of the folder
$ sudo chown -R hduser hadoop       // granting ownership of the folder to hduser

Configuring the environment variables:
$ vi /home/hduser/.bashrc
Add the following to the file.
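A minimal sketch of the usual entries (the JAVA_HOME path is an assumption; point it at your actual JDK installation):

    # Hadoop environment (assumed typical values)
    export HADOOP_HOME=/usr/local/hadoop
    export JAVA_HOME=/usr/java/jdk1.6.0    # adjust to your JDK location
    export PATH=$PATH:$HADOOP_HOME/bin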

Save the file using esc+:wq!

Configuring the Hadoop server. Files to be edited:
hadoop-env.sh
core-site.xml
mapred-site.xml
hdfs-site.xml
Change to the conf directory:
$ cd /usr/local/hadoop/conf
Edit the file hadoop-env.sh:
$ vi hadoop-env.sh
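The usual change here is to point Hadoop at the JDK (the path is an assumption; use your actual JAVA_HOME):

    # in hadoop-env.sh
    export JAVA_HOME=/usr/java/jdk1.6.0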

Save all the files using esc+:wq!
Create a temp directory for Hadoop:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
Edit the core-site.xml file:
$ vi core-site.xml
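A typical pseudo-distributed core-site.xml (port 54310 is a commonly used value and is an assumption; keep the tmp directory consistent with the one created above):

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
      </property>
    </configuration>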

Editing the mapred file:
$ vi mapred-site.xml
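A typical entry (port 54311 is a conventional value in this kind of setup and is an assumption):

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
      </property>
    </configuration>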

Editing the hdfs file:
$ vi hdfs-site.xml
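In pseudo-distributed mode there is only one DataNode, so replication is usually dropped to 1:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>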

Formatting the NameNode:
$ /usr/local/hadoop/bin/hadoop namenode -format

Starting the cluster $/usr/local/hadoop/bin/start-all.sh
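To check that all daemons came up, the JDK's jps tool can be used (assuming jps is on the PATH):

    $ jps

It should list NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker, each preceded by its process id.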

Stopping the cluster:
$ /usr/local/hadoop/bin/stop-all.sh

Program to find a pattern in a text file.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PatternFinder {

    // Emits (word, 1) for every word in the line that contains the pattern
    // passed in through the job configuration.
    static class PatternMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value, Context context) {
            String line = value.toString();
            String[] lineReadArray = line.split(" ");
            Configuration conf = context.getConfiguration();
            String pattern = conf.get("pattern");           // same key as set in main()
            String pat = "[\\w]*" + pattern + "[\\w]*";     // match the pattern anywhere inside a word
            for (String val : lineReadArray) {
                if (Pattern.matches(pat, val)) {
                    try {
                        context.write(new Text(val), new IntWritable(1));
                    } catch (IOException e) {
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

    // Sums the counts for each matching word.
    static class PatternReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> value, Context context) {
            int sum = 0;
            for (IntWritable i : value) {
                sum = sum + i.get();
            }
            try {
                context.write(key, new IntWritable(sum));
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: PatternFinder <input path> <output path> <pattern>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        conf.set("pattern", args[2]);                       // read back in PatternMapper
        Job job = new Job(conf);
        job.setJarByClass(PatternFinder.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(PatternMapper.class);
        job.setReducerClass(PatternReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Create a jar of the class files and store it in /home/krupa/inputjar/.
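One possible way to build it (the javac classpath and the jar name patternfinder.jar are assumptions; the Hadoop core jar name varies by release):

    $ mkdir -p classes /home/krupa/inputjar
    $ javac -classpath /usr/local/hadoop/hadoop-0.20.2-core.jar -d classes PatternFinder.java
    $ jar -cvf /home/krupa/inputjar/patternfinder.jar -C classes .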


Copying a test file to HDFS.
Start the cluster if it is not already started:
$ /usr/local/hadoop/bin/start-all.sh
Create an input directory:
$ /usr/local/hadoop/bin/hadoop fs -mkdir input
Copy a file to the input directory:
$ /usr/local/hadoop/bin/hadoop fs -put /home/krupa/abc.txt input
Run the jar (the third argument is the pattern to search for):
$ $HADOOP_HOME/bin/hadoop jar /home/krupa/inputjar/patternfinder.jar PatternFinder input output <pattern>
Copy the output back to the local filesystem:
$ $HADOOP_HOME/bin/hadoop fs -copyToLocal /user/krupa/output/part-r-00000 /home/krupa/output/out.txt
