Major Focus
What is Hadoop? Why Hadoop? What is MapReduce? Phases in MapReduce. What is HDFS? Setting up Hadoop in pseudo-distributed mode. A simple pattern-matcher program.
Big Data
Big data consists of datasets that grow so large that they become awkward to work with using on-hand database management tools. Sizes run to terabytes, exabytes, and zettabytes. Difficulties include capture, storage, search, sharing, analytics, and visualization. Examples include web logs; RFID; sensor networks; social data (due to the social data revolution); Internet search indexing; call detail records; astronomy, atmospheric science, genomics, biogeochemical, and military-surveillance data; medical records; photography archives; video archives; and large-scale e-commerce.
History of hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene. After Google published its distributed filesystem (GFS) paper in 2003, the Nutch developers set about writing an open-source implementation, the Nutch Distributed Filesystem (NDFS), in 2004. Also in 2004, Google introduced MapReduce to the world. Early in 2005 the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. In 2006 Doug Cutting joined Yahoo!, which built its site index using a 10,000-core Hadoop cluster. Hadoop can now sort 1 terabyte of data in 62 seconds.
Hadoop ecosystem
Core: a set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures). Avro: a data serialization system for efficient, cross-language RPC and persistent data storage. (At the time of this writing, Avro had only just been created as a new subproject, and no other Hadoop subprojects were using it yet.)
MapReduce is a distributed data-processing model introduced by Google to support distributed computing on large datasets across clusters of computers. Hadoop can run MapReduce programs written in various languages.
HDFS
Reliable storage: Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS can store huge amounts of information and scale up incrementally, and it can survive the failure of significant parts of the storage infrastructure without losing data. Clusters can be built with inexpensive computers; if one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines. HDFS manages storage on the cluster by breaking incoming files into pieces, called blocks, and storing each block redundantly across the pool of servers. If the namenode fails, however, the whole filesystem becomes unavailable.
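To make the block mechanics concrete, here is a plain-Java illustration (not Hadoop code) of splitting a file into fixed-size blocks. The 4-byte block size and the replication factor are toy values chosen for readability; real HDFS blocks default to 64 MB.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: how HDFS conceptually breaks a file into blocks and
// stores each block on several datanodes. Not Hadoop API code.
public class BlockSplitDemo {
    static List<byte[]> splitIntoBlocks(byte[] file, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < file.length; off += blockSize) {
            int len = Math.min(blockSize, file.length - off);
            byte[] block = new byte[len];
            System.arraycopy(file, off, block, 0, len);
            blocks.add(block);
        }
        return blocks;
    }

    public static void main(String[] args) {
        byte[] file = "hello hdfs".getBytes();      // 10 bytes
        List<byte[]> blocks = splitIntoBlocks(file, 4);
        System.out.println(blocks.size());          // prints 3 (4 + 4 + 2 bytes)
        // With a replication factor of 3, each block would live on three
        // different datanodes; the namenode only keeps the block-to-node map,
        // which is why losing the namenode makes the filesystem unavailable.
        int replication = 3;
        System.out.println(blocks.size() * replication); // prints 9
    }
}
```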
HDFS example
Zookeeper: a distributed consensus engine. It provides well-defined concurrent-access semantics: distributed locking / mutual exclusion, and message boards / mailboxes.
Pig: a data-flow-oriented language, Pig Latin. Data types include sets, associative arrays, and tuples. It is a high-level language for routing data that allows easy integration of Java for complex tasks. Developed at Yahoo!. Hive: an SQL-based data-warehousing app. Its feature set is similar to Pig's, but the language is more strictly SQL-esque; it supports SELECT, JOIN, GROUP BY, etc.
HBase: a column-store database based on the design of Google Bigtable. It provides interactive access to information and holds extremely large datasets (multi-TB), with a constrained access model ((key, value) lookup) and limited transactions (single-row only).
Chukwa A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports
Fuse-dfs: allows mounting HDFS volumes via the Linux FUSE filesystem. This does not mean HDFS can be used as a general-purpose file system, but it does allow easy integration with other systems for data import/export.
FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL): SQL is fundamentally a high-level declarative language. You query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and code. OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS: Hadoop is designed for offline processing and analysis of large-scale data. It doesn't work for random reading and writing of a few records, which is the type of load generated by online transaction processing. In fact, as of this writing (and for the foreseeable future), Hadoop is best used as a write-once, read-many-times data store. In this respect it is similar to data warehouses in the SQL world.
Namenode
Datanodes
MapReduce
A MapReduce program processes data by manipulating (key/value) pairs in the general form map: (K1, V1) -> list(K2, V2); reduce: (K2, list(V2)) -> list(K3, V3).
The Mapper maps input key/value pairs to a set of intermediate key/value pairs. The MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job (input splits are a logical division of your records, 64 MB by default, and can be customized). A mapper class needs to extend the abstract class Mapper. map: (K1, V1) -> list(K2, V2)
Mapper
InputSplit
Each map task processes a single split. Each split is divided into records, and the map processes each record, a key-value pair, in turn.
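As an illustration of how a split becomes records, the sketch below mimics what TextInputFormat does for text files: each line becomes one record whose key is the line's starting byte offset and whose value is the line, mirroring the (LongWritable, Text) pairs the mapper receives. This is plain Java, not Hadoop code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn one split's text into (byte offset, line) records,
// the way TextInputFormat presents records to a mapper.
public class RecordDemo {
    static Map<Long, String> toRecords(String split) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : split.split("\n", -1)) {
            if (!line.isEmpty()) records.put(offset, line);
            offset += line.length() + 1;   // +1 for the newline character
        }
        return records;
    }

    public static void main(String[] args) {
        Map<Long, String> r = toRecords("A B C\nD E\n");
        System.out.println(r);   // prints {0=A B C, 6=D E}
    }
}
```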
Mapper Example
static class FriendMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String data = value.toString();
        String[] friends = data.split(" ");
        // The first token is a person; the rest are that person's friends.
        // Emit each direct friendship with a 0, ordering the two names so
        // that "a,b" and "b,a" become the same key.
        String friendpair = friends[0];
        for (int i = 1; i < friends.length; i++) {
            if (friendpair.compareTo(friends[i]) > 0) {
                friendpair = friendpair + "," + friends[i];
            } else {
                friendpair = friends[i] + "," + friendpair;
            }
            context.write(new Text(friendpair), new IntWritable(0));
            friendpair = friends[0];
        }
        // Emit every pair of names on this line with a 1; each 1 the reducer
        // sees for a pair is one friend they share. Pairs that also received
        // a 0 (direct friendships) are discarded by the reducer.
        for (int j = 0; j < friends.length; j++) {
            friendpair = friends[j];
            for (int i = j + 1; i < friends.length; i++) {
                if (friendpair.compareTo(friends[i]) > 0) {
                    friendpair = friendpair + "," + friends[i];
                } else {
                    friendpair = friends[i] + "," + friendpair;
                }
                context.write(new Text(friendpair), new IntWritable(1));
                friendpair = friends[j];
            }
        }
    }
}
Reducer
The Reducer reduces the set of intermediate values that share a key to a smaller set of values. A reducer has three primary phases: shuffle, sort, and reduce. Shuffle: the framework fetches the relevant partition of the output of all the mappers. Sort: the framework sorts the reducer's inputs by key, lexicographically. Reduce: the reduce(WritableComparable, Iterator, Context) method is called for each <key, (list of values)> pair in the grouped inputs.
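The shuffle/sort/reduce flow can be simulated in plain Java with no Hadoop classes: collect the mappers' (key, value) pairs, group them by key in sorted order, then call a reduce step once per key with the full list of values. This only mirrors the control flow described above, not the real framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java sketch of the reducer side: shuffle + sort = group by key in
// lexicographic order; reduce = one call per key with all of its values.
public class ShuffleSortDemo {
    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>(); // sorts keys
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("b", 1), Map.entry("a", 1), Map.entry("b", 1));
        // "reduce": sum the values for each grouped key
        for (Map.Entry<String, List<Integer>> e : shuffleAndSort(mapOutput).entrySet()) {
            int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(e.getKey() + "\t" + sum);   // prints a 1, then b 2
        }
    }
}
```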
Reducer Example
static class FriendReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        boolean mark = false;
        for (IntWritable v : values) {
            if (v.get() == 0) {
                mark = true;   // a 0 marks this pair as a direct friendship
            }
            count++;
        }
        // Emit only pairs that are not direct friends; for those, every
        // value is a 1, so count is the number of friends they share.
        if (!mark) {
            context.write(key, new IntWritable(count));
        }
    }
}
Combiner-local reduce
Running combiners makes map output more compact, so there is less data to write to local disk and to transfer to the reducers. If a combiner is used, the map's key-value pairs are not immediately written to the output; instead they are collected in lists, one list per key. When a certain number of key-value pairs has been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the resulting key-value pairs as if they had been created by the original map operation.
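A minimal sketch of the saving, using word counts as the values being combined. This is plain Java, not the Hadoop Combiner API: it shows only why local aggregation shrinks the number of pairs that must be written and transferred.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a combiner's effect: aggregate (word, 1) pairs locally on the
// map side before anything is spilled to disk or sent to the reducer.
public class CombinerDemo {
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new HashMap<>();
        for (String w : mapOutputKeys) {
            combined.merge(w, 1, Integer::sum);   // local "reduce" per key
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutputKeys = List.of("a", "b", "a", "a", "b");
        // Without a combiner, 5 (key, 1) pairs would go to the reducer;
        // with one, only 2 pairs survive: (a, 3) and (b, 2).
        System.out.println(combine(mapOutputKeys).size());   // prints 2
    }
}
```

This only works because counting is associative and commutative; a combiner must not change the job's result, which is why Hadoop may run it zero, one, or many times.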
Run the ifconfig command to get the IP address and add localhost to the hosts file:
$ ifconfig
$ vi /etc/hosts
Generating the SSH public/private key pair (with ssh-keygen). SSH is used by the Hadoop scripts to log in to the system automatically.
When it asks where to save the key, save it in your home directory with the file name id_rsa ($HOME/.ssh/id_rsa, which is /home/hduser/.ssh/id_rsa here).
Adding the public key to the authorized-keys file:
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
This command creates the file authorized_keys (if necessary) and appends your public key to it.
Restricting the permissions of /home/hduser/.ssh and /home/hduser/.ssh/authorized_keys. This is done so that key-based login (with the passphrase, which is empty here) is used instead of a password:
$ chmod 0700 /home/hduser/.ssh
$ chmod 0600 /home/hduser/.ssh/authorized_keys
Testing whether SSH uses the key instead of a password:
$ ssh localhost
It should log in without asking for a password.
Extract the Hadoop tar in /usr/local/:
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop    // renaming the folder
$ sudo chown -R hduser hadoop     // granting ownership of the folder to hduser
Configuring the environment variables:
$ vi /home/hduser/.bashrc
Add the following to the file.
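The lines the original screenshot showed are missing here; a typical .bashrc addition for this layout would look like the sketch below. The paths are assumptions based on the install steps above, and JAVA_HOME must point at your own JDK.

```shell
# Hadoop environment (paths are assumptions from the steps above)
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun   # adjust to your JDK's path
export PATH=$PATH:$HADOOP_HOME/bin
```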
Configuring the Hadoop server. Files to be edited: hadoop-env.sh, core-site.xml, mapred-site.xml, hdfs-site.xml. Change the directory to conf:
$ cd /usr/local/hadoop/conf
Edit the file hadoop-env.sh:
$ vi hadoop-env.sh
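The edit the screenshot showed is missing; the change hadoop-env.sh usually needs in this setup is uncommenting and setting JAVA_HOME. The path below is an example, not a requirement; use your JDK's location.

```shell
# in /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun   # example path; use your own JDK
```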
Save each file using esc :wq!. Creating a temp directory for Hadoop:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
Editing the core-site.xml file:
$ vi core-site.xml
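The XML the screenshot showed is missing; a typical core-site.xml for this pseudo-distributed setup looks like the sketch below. The temp directory matches the one created above; the port number 54310 is a common convention for this version, not something Hadoop mandates.

```xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>Base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>URI of the default (HDFS) filesystem.</description>
  </property>
</configuration>
```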
Program to find a pattern in a text file.
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PatternFinder {
static class PatternMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) {
        String line = value.toString();
        String[] lineReadArray = line.split(" ");
        Configuration conf = context.getConfiguration();
        String pattern = conf.get("pattern");   // set in main() from args[2]
        String pat = "[\\w]*" + pattern + "[\\w]*";
        for (String val : lineReadArray) {
            if (Pattern.matches(pat, val)) {
                try {
                    context.write(new Text(val), new IntWritable(1));
                }
                catch (IOException e) {
                    e.printStackTrace();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
}
}

static class PatternReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> value, Context context) {
        int sum = 0;
        for (IntWritable i : value) {
            sum = sum + i.get();
        }
        try {
            context.write(key, new IntWritable(sum));
        } catch (IOException e) {
            e.printStackTrace();
        }
        catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

public static void main(String[] args) throws Exception {
    if (args.length != 3) {
        System.err.println("Usage: PatternFinder <input path> <output path> <pattern>");
        System.exit(-1);
    }
    Configuration conf = new Configuration();
    conf.set("pattern", args[2]);   // read back by the mapper
    Job job = new Job(conf);
    job.setJarByClass(PatternFinder.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(PatternMapper.class);
    job.setReducerClass(PatternReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}