MAP/REDUCE, HADOOP & PIG

Data mining applied to the enterprise


DEFINITIONS
Data mining is the process of extracting
patterns from data. It is commonly used in a wide
range of profiling practices, such as marketing,
surveillance, fraud detection and scientific
discovery.
A framework is a re-usable design for a
software system (or subsystem). A software
framework may include support programs, code
libraries, a scripting language, or other software
to help develop and glue together the different
components of a software project. Various parts
of the framework may be exposed through an
API.
MAP/REDUCE
Framework for processing huge datasets on certain kinds
of distributable problems using a large number of
computers.
MapReduce provides
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Status monitoring
Use cases
Document clustering
Machine learning
Inverted index construction
Was used to completely regenerate Google's index of the
World Wide Web
MAP/REDUCE
"Map" step: The master node takes the input,
chops it up into smaller sub-problems, and
distributes those to worker nodes. A worker node
may do this again in turn, leading to a
multi-level tree structure.

"Reduce" step: The master node then takes the
answers to all the sub-problems and combines
them to get the output: the answer to
the problem it was originally trying to solve.

Both steps are defined with respect to data
structured in (key, value) pairs.
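
The two steps can be sketched in a few lines of plain Python. This is a single-process illustration of the model, not Hadoop code: the word-count task, the names `map_fn`, `reduce_fn` and `map_reduce`, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit one (word, 1) pair per word in the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    # Group every intermediate value under its key (the "shuffle").
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each key is reduced independently; on a cluster these calls
    # would be spread over the worker nodes.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

# Hypothetical input: (document id, text) pairs.
docs = [("d1", "pig hadoop pig"), ("d2", "hadoop hive")]
result = map_reduce(docs, map_fn, reduce_fn)
print(result)
```

Because every key is reduced on its own, the reduce calls do not depend on each other, which is what makes the problem distributable.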
MAP/REDUCE DATA FLOW SECTIONS

Input reader

Map function

Partition function

Compare function

Reduce function

Output writer
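
To make the six sections concrete, here is a toy single-process pipeline in Python with one function per section. It is only a sketch under assumptions: the function names, the word-count task, the byte-sum partitioner and the tab-separated output format are invented for illustration and are not Hadoop's API.

```python
from functools import cmp_to_key
from itertools import groupby

def input_reader(lines):
    """Input reader: turn raw input into (key, value) records."""
    for offset, line in enumerate(lines):
        yield offset, line

def map_fn(_key, line):
    """Map function: emit (word, 1) for every word."""
    for word in line.split():
        yield word, 1

def partition_fn(key, num_reducers):
    """Partition function: choose the reducer that owns a key."""
    return sum(key.encode()) % num_reducers  # deterministic toy hash

def compare_fn(a, b):
    """Compare function: order the records of one partition by key."""
    return (a[0] > b[0]) - (a[0] < b[0])

def reduce_fn(key, values):
    """Reduce function: fold all values of one key into a result."""
    return key, sum(values)

def output_writer(records):
    """Output writer: format the final records."""
    return [f"{key}\t{value}" for key, value in records]

def run(lines, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for key, value in input_reader(lines):
        for k, v in map_fn(key, value):
            partitions[partition_fn(k, num_reducers)].append((k, v))
    output = []
    for part in partitions:
        # Sorting brings equal keys together so groupby can collect them.
        part.sort(key=cmp_to_key(compare_fn))
        reduced = [reduce_fn(k, [v for _, v in grp])
                   for k, grp in groupby(part, key=lambda kv: kv[0])]
        output.extend(output_writer(reduced))
    return output

lines_out = run(["a b a", "b c"])
print(lines_out)
```

The order of the output lines depends on how keys fall into partitions, so only the order within one partition is guaranteed.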
MAP/REDUCE -> HADOOP

Google calls it     Hadoop equivalent
MapReduce           Hadoop MapReduce
GFS                 HDFS
Sawzall             Hive, Pig
BigTable            HBase
Chubby              ZooKeeper
HADOOP
Java Map/Reduce implementation

Framework that schedules tasks, monitors them,
and re-executes the failed ones.

Single master JobTracker

Several slave TaskTrackers, one per node

Hadoop DFS (HDFS) for storage (not strictly required)

Add-ons: Hive (developed at Facebook), Pig (developed at Yahoo!)



HADOOP EXAMPLE
A program that takes web server access log files
and counts the number of hits in each minute slot
over a week

Split the job into its two phases, Map & Reduce:
Map phase: input is the access log files
Reduce phase: input is the key set + an iterator over each key's subset of values
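
The shape of such a job can be sketched in plain Python. This is not the Java code of the example; it assumes Apache-style log lines with a timestamp like `[10/Oct/2010:13:55:36 +0200]`, and the names `map_phase`, `reduce_phase`, `count_hits` and the sample log lines are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed timestamp layout: [dd/Mon/yyyy:HH:MM:SS zone]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [^\]]*\]")

def map_phase(line):
    """Map: emit (minute slot, 1) for every log line with a timestamp."""
    match = TIMESTAMP.search(line)
    if match:
        yield match.group(1), 1

def reduce_phase(minute, counts):
    """Reduce: receives one key and an iterator over its values."""
    return minute, sum(counts)

def count_hits(lines):
    grouped = defaultdict(list)
    for line in lines:
        for minute, one in map_phase(line):
            grouped[minute].append(one)
    return dict(reduce_phase(minute, iter(counts))
                for minute, counts in grouped.items())

# Hypothetical log lines, not taken from a real server.
sample_log = [
    '10.0.0.1 - - [10/Oct/2010:13:55:36 +0200] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Oct/2010:13:55:59 +0200] "GET /a HTTP/1.1" 200 128',
    '10.0.0.1 - - [10/Oct/2010:13:56:02 +0200] "GET /b HTTP/1.1" 404 0',
]
hits = count_hits(sample_log)
print(hits)
```

Truncating the timestamp to the minute in the map phase is what turns "hits per minute" into an ordinary sum per key.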
HADOOP EXAMPLE - MAP PHASE / REDUCE PHASE / MAIN CODE
[Code slides: map phase, reduce phase and main code of the example]
PROBLEMS
Hadoop Map/Reduce is very powerful, but:

Requires a Java programmer

User has to reinvent the wheel every time a common
operation is needed (join, filter, etc.)

Harder to write, harder to maintain

Optimization is left to the user
PIG
Platform for analyzing large data sets

High-level language + infrastructure (compiler)

Pig Latin
Data flow language rather than procedural or
declarative

Ease of programming

Optimization opportunities

Extensibility
PIG - ADVANTAGES
Increases productivity. In one test:
10 lines of Pig Latin did the work of about 200 lines of Java.
What took 4 hours to write in Java took 15 minutes
in Pig Latin.

Opens the system to non-Java programmers.

Provides common operations like join, group,
filter, sort.
PIG - HOW IT WORKS
PIG EXAMPLE

Start a terminal and run:

$ cd /usr/share/cloudera/pig/
$ bin/pig -x local

You should see a prompt like:

grunt>
PIG EXAMPLE - AGGREGATION
Let's count the number of times each user
appears in a given data set.
log = LOAD 'excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';

Results:
002BB5A52580A8ED 18
005BD9CD3AC6BB38 18
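
For comparison, the same aggregation written by hand in plain Python might look like the sketch below. It assumes the log is tab-separated with the fields user, timestamp and query; the function name `count_by_user` and the sample records are hypothetical and are not taken from excite-small.log.

```python
from collections import Counter

def count_by_user(lines):
    """Count how many records each user has (GROUP ... BY user + COUNT)."""
    counts = Counter()
    for line in lines:
        # Assumed layout: user <TAB> timestamp <TAB> query
        user, timestamp, query = line.rstrip("\n").split("\t", 2)
        counts[user] += 1
    return counts

# Hypothetical records for illustration only.
sample = [
    "u1\t970916105432\thadoop",
    "u2\t970916105500\tpig latin",
    "u1\t970916105601\thive",
]
user_counts = count_by_user(sample)
for user, n in sorted(user_counts.items()):
    print(user, n)
```

This version runs on one machine only; the Pig script expresses the same logic and leaves the distribution over the cluster to Pig.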
PIG
Supports several functions
Aggregation
Grouping
Filtering
Ordering
Joins & Anti-Joins
Cogrouping (grouping generalization)
Several data types:
Scalar: int, long, double, chararray, bytearray
Complex: Maps, Tuples, Bags
PIG - COMMANDS
Pig Command      What it does
load             Read data from file system.
store            Write data to file system.
foreach          Apply expression to each record and output one or more records.
filter           Apply predicate and remove records that do not return true.
group/cogroup    Collect records with the same key from one or more inputs.
join             Join two or more inputs based on a key.
order            Sort records based on a key.
distinct         Remove duplicate records.
union            Merge two data sets.
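
To give a feel for what some of these commands do to the data, here is a rough plain-Python analogue on small in-memory lists. It is only an illustration: the relations `users`, `queries` and `more_queries` and their contents are invented, and Pig runs these operations over files on a cluster rather than over Python lists.

```python
# Hypothetical relations: (user id, name) and (user id, query).
users = [("u1", "alice"), ("u2", "bob")]
queries = [("u1", "hadoop"), ("u1", "pig"), ("u3", "hive"), ("u1", "pig")]
more_queries = [("u4", "zookeeper")]

# filter: keep only records for which the predicate is true.
filtered = [q for q in queries if q[1] != "hive"]

# distinct: remove duplicate records.
distinct = sorted(set(queries))

# join: combine records from two inputs that share a key.
joined = [(uid, name, query)
          for uid, name in users
          for quid, query in queries
          if uid == quid]

# order: sort records based on a key (here the query text).
ordered = sorted(queries, key=lambda record: record[1])

# union: merge two data sets.
merged = queries + more_queries

print(joined)
```

Note that the join drops `u2` and `u3`, since neither appears in both inputs.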


POSSIBLE APPLICATIONS AT VLEX
Faster and improved, parallelized document
indexing

Targeted advertisement

Recommendation system

Trending topics

Better search tools (Search assist)


QUESTIONS?
