A stack of nodes forms a rack; a group of racks forms a cluster. All nodes are connected via a high-speed network to enable fast exchange of information.
Why Hadoop?
A way to continuously index massive amounts of data. The initial work was done by a little company called Google, which needed a way to index the entire World Wide Web every day!
HDFS: Backup
In addition, some versions of Hadoop have other nodes serving as backup for the Name Node.
Hadoop MapReduce
MapReduce sees the computational task as consisting of two phases:
the Mapper phase and the Reduce phase.
MapReduce: Example
Consider a file that aggregates the blog posts related to big data published in the last 24 hours. The file has been stored in a cluster using HDFS under the name BigData.txt. The file is split into file blocks, and 3 copies of each block are distributed across the nodes of a 50-node cluster.
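The block-and-replica layout above can be sketched as follows. This is an illustrative model only, not the HDFS API: the 64 MB block size is a common HDFS default, and the random replica placement stands in for HDFS's rack-aware placement policy.

```python
# Illustrative sketch (not the HDFS API): split a file into fixed-size
# blocks and assign each block to 3 distinct nodes of a 50-node cluster.
import random

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, a common HDFS default
REPLICATION = 3
NUM_NODES = 50

def place_blocks(file_size_bytes):
    """Return one entry per block: a list of 3 distinct node ids."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    return [random.sample(range(NUM_NODES), REPLICATION)
            for _ in range(num_blocks)]

# A hypothetical 300 MB BigData.txt needs ceil(300/64) = 5 blocks.
placement = place_blocks(300 * 1024 * 1024)
print(len(placement))  # 5
```

In real HDFS the Name Node records this block-to-node mapping, while the Data Nodes hold the actual block contents.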
MapReduce: Example(2)
Analysis: count the number of times Hadoop, Big Data, and Green Plum appear.
Mapper function: Input: the address of a file block. Output: the number of times each of these terms appears in the block. Each participating node receives a pointer to the Mapper function and the address of a file block located on that node.
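A minimal sketch of such a Mapper function. In real Hadoop, mappers are Java classes (or external scripts via Hadoop Streaming) that receive their input split from the framework; here the file block is simply a string, for illustration.

```python
# Sketch of the Mapper described above: scan one file block and emit
# a (term, count) key-value pair for each term we are tracking.
def mapper(block_text):
    """Return a list of (term, count) pairs for one file block."""
    terms = ["Hadoop", "Big Data", "Green Plum"]
    return [(term, block_text.count(term)) for term in terms]

pairs = mapper("Hadoop and Big Data ... Hadoop again, plus Green Plum.")
print(pairs)  # [('Hadoop', 2), ('Big Data', 1), ('Green Plum', 1)]
```

Every node runs this same function on its local block, so the scan happens in parallel across the cluster.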
MapReduce: Example(3)
The output is a list of key-value pairs. The key is Hadoop, Big Data, or Green Plum; the value is the number of times the name appears in the block. After executing the Mapper function, each node produces its own list of key-value pairs.
MapReduce: Example(4)
MapReduce: Example(5)
We can then use the Reduce phase to sum the results obtained by each node, reducing the output of the Map function to a single list of key-value pairs. A node is selected to execute the Reduce function.
MapReduce: Example(6)
After sorting, the key-value pairs with the same key are grouped together. The Reduce function then adds up the values for each key, producing the final result.
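The sort, group, and sum steps above can be sketched as one small function. This is a single-process illustration of the idea, not Hadoop's distributed shuffle: the per-node key-value lists are hypothetical mapper outputs, concatenated as if collected at the reducing node.

```python
# Sketch of the sort/group and Reduce steps: sort the key-value pairs
# by key, group pairs with the same key, then sum each group's values.
from itertools import groupby
from operator import itemgetter

def reduce_counts(all_pairs):
    """all_pairs: the key-value lists from every node, concatenated."""
    all_pairs.sort(key=itemgetter(0))             # sort by key
    return {key: sum(v for _, v in group)         # sum values per key
            for key, group in groupby(all_pairs, key=itemgetter(0))}

# Hypothetical mapper outputs from two of the nodes:
node1 = [("Hadoop", 3), ("Big Data", 1), ("Green Plum", 0)]
node2 = [("Hadoop", 2), ("Big Data", 4), ("Green Plum", 1)]
print(reduce_counts(node1 + node2))
# {'Big Data': 5, 'Green Plum': 1, 'Hadoop': 5}
```

Note that `groupby` only merges adjacent equal keys, which is why the sort must happen first; Hadoop's shuffle phase performs the same sort-then-group before handing each key's values to the reducer.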
Advantages of Hadoop
Leverages the cluster architecture to organize big data across hundreds or thousands of nodes. Tasks can be executed in parallel across nodes. Data scientists have total freedom in the type of analytics performed on big data. Moving the computation to the data removes the network as a potential bottleneck.
Conclusion
Big Data was the term du jour for 2011. In 2012, we will move away from this term, as ANY data will be of paramount importance for enterprises.
Thank You