
Hadoop Usage At

Yahoo!
Milind Bhandarkar
(milindb@yahoo-inc.com)
About Me

Parallel Programming since 1989

High-Performance Scientific Computing, 1989 - 2005

Data-Intensive Computing, 2005 - ...

Hadoop Solutions Architect @ Yahoo!

Contributor to Hadoop since 2006

Training, Consulting, Capacity Planning


History

2004-2005: Hadoop prototyped in Apache Lucene

January 2006: Hadoop becomes subproject of Lucene

January 2008: Hadoop becomes top-level Apache project

Latest Release: 0.21

Stable Release: 0.20.x (+ Strong Authentication)


Hadoop Ecosystem

HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
Hadoop at Yahoo!

Behind Every Click!

38,000+ Servers

Largest cluster is 4000+ servers

1+ Million Jobs per month

170+ PB of Storage

10+ TB Compressed Data Added Per Day

1000+ Users
!"
$"
%"
&"
'"
("
)"
*"
+"
"
*'"
*""
+'"
+""
'"
"
*""& *""% *""$ *""! *"+"
,-./0
)$1 2345346
+%" 78 29-4/:3
+;< ;-=9>?0 @-A6
!
"
#
$
%
&
'
(
%

#
*

+
,
-
.
,
-
%

/
,
0
&
1
2
0
,
%

3,%,&-4"
+45,'4,
678&40
9&5:2
/-#($4;#'
B/.--C 2345346
B/.--C 29-4/:3 D78E
Hadoop Growth
Hadoop Clusters

Hadoop Dev, QA, Benchmarking (10%)

Sandbox, Release Validation (10%)

Science, Ad-Hoc Usage (50%)

Production (30%)
Sample Applications

Science + Big Data + Insight = Personal Relevance = Value

Log processing: Analytics, Reporting, Buzz

User Modeling

Content Optimization, Spam filters

Computational Advertising
Major Data Sources

Content (Web Pages, Blogs, News Articles, Media)

Search Queries

Advertisements (Display, Search)

User
Web Graph Analysis

100+ Billion Web Pages

1+ PB Content

2 Trillion links

300+ TB of compressed output

Before Hadoop: 1 Month

With Hadoop: 1 Week


Search Assist

Related concepts occur closer together

3 years of query logs, sessionized per user

10+ Terabytes of Natural Language Text Corpus

Build Dictionaries with Hadoop & Push to Serving

Before Hadoop: 4 Weeks

With Hadoop: 30 minutes
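A minimal sketch of the dictionary build as a Hadoop Map-Reduce job; the class names, the input record format ("user \t query1;query2;..."), and the count threshold are illustrative assumptions, not the production code. The mapper emits every pair of queries seen in one session, the reducer sums the counts, and frequent pairs become "related concept" dictionary candidates.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input value: one sessionized record, e.g. "user42 \t car insurance;geico;esurance".
class CooccurrenceMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    String[] queries = record.toString().split("\t")[1].split(";");
    // Emit each unordered pair of queries that co-occur in the session.
    for (int i = 0; i < queries.length; i++)
      for (int j = i + 1; j < queries.length; j++)
        ctx.write(new Text(queries[i] + "|" + queries[j]), ONE);
  }
}

// Sum the pair counts; high-count pairs go into the serving dictionary.
class CooccurrenceReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  protected void reduce(Text pair, Iterable<LongWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable c : counts) sum += c.get();
    if (sum >= 10) ctx.write(pair, new LongWritable(sum)); // threshold is arbitrary
  }
}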


Mail Spam Filtering

Challenge: Scale

450+ Million mailboxes

5+ Billion deliveries per day

25+ Billion connections

Challenge: User feedback is often late, noisy, inconsistent
Data Factory

40+ Billion Events Per Day

Parse & Transform Event Streams

Join Clicks & Views

Filter out Robots

Aggregate, Sort, Partition

Data Quality Checks
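The "Join Clicks & Views" step is a classic reduce-side join. A hedged sketch, assuming each event line carries a shared impression id ("impressionId \t VIEW|CLICK \t payload"); the record format and field names are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Key every event by its impression id so a view and its clicks
// meet in the same reduce group.
class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t", 2);
    ctx.write(new Text(f[0]), new Text(f[1])); // impressionId -> "TYPE \t payload"
  }
}

class ClickViewJoinReducer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text impression, Iterable<Text> events, Context ctx)
      throws IOException, InterruptedException {
    String view = null;
    int clicks = 0;
    for (Text e : events) {
      String s = e.toString();
      if (s.startsWith("VIEW")) view = s;
      else clicks++;
    }
    // Emit the view annotated with its click count; clicks with no
    // matching view (often robot or malformed traffic) are dropped here.
    if (view != null)
      ctx.write(impression, new Text(view + "\tclicks=" + clicks));
  }
}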


User Modeling

Objective: Determine user interests by mining user activities

Large dimensionality of possible user activities

Typical user has sparse activity vector

Event attributes change over time


User Activities

Attribute | Possible Values | Typical Values Per User
Pages | 1+ Million | 10-100
Queries | 100+ Million | 10s
Ads | 100+ Thousand | 10s
User-Modeling Pipeline

Sessionization (sketched after this list)

Feature and Target Generation

Model Training

Offline Scoring & Evaluation

Batch Scoring & Upload to serving
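A minimal sketch of the sessionization step, assuming a prior map phase keys each event by user id with a zero-padded millisecond timestamp, and using the conventional 30-minute inactivity gap; a production job would likely use secondary sort rather than sorting in memory, but the sparse activity vectors noted above keep per-user data small.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (userId, "timestamp \t event"); output: one line per session.
class SessionizeReducer extends Reducer<Text, Text, Text, Text> {
  private static final long GAP_MS = 30 * 60 * 1000; // 30-minute session gap
  protected void reduce(Text user, Iterable<Text> events, Context ctx)
      throws IOException, InterruptedException {
    List<String> evts = new ArrayList<String>();
    for (Text e : events) evts.add(e.toString());
    Collections.sort(evts); // zero-padded timestamps: lexical == chronological
    StringBuilder session = new StringBuilder();
    long prev = -1;
    for (String e : evts) {
      long ts = Long.parseLong(e.split("\t")[0]);
      if (prev >= 0 && ts - prev > GAP_MS) { // inactivity gap closes the session
        ctx.write(user, new Text(session.toString()));
        session.setLength(0);
      }
      session.append(e).append(';');
      prev = ts;
    }
    if (session.length() > 0) ctx.write(user, new Text(session.toString()));
  }
}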


Data Acquisition

User | Time | Event | Source
U0 | T0 | Visited Yahoo! Autos | Web Server Logs
U0 | T1 | Searched for Car Insurance | Search Logs
U0 | T2 | Browsed stock quotes | Web Server Logs
U0 | T3 | Saw ad for discount brokerage, did not click | Ad Logs
U0 | T4 | Checked Yahoo! Mail | Web Server Logs
U0 | T5 | Clicked Ad for Auto Insurance | Ad Logs, Click Logs
Normalization

User | Time | Event | Tag
U0 | T0 | View | Category: Autos, Tag: Mercedes Benz
U0 | T1 | Query | Category: Insurance, Tag: Auto
U0 | T2 | View | Category: Finance, Tag: YHOO
U0 | T3 | View-Click | Category: Finance, Tag: Brokerage
U0 | T4 | Browse | Irrelevant Event, Dropped
U0 | T5 | View+Click | Category: Insurance, Tag: Auto
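Normalization like the table above can run as a map-only job. A minimal sketch, assuming tab-separated raw events and a source-to-category taxonomy table; the taxonomy entries and field layout are illustrative assumptions.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Raw events become "user \t time \t eventType \t category \t tag";
// events whose source has no mapping (e.g. a mail check) are dropped.
class NormalizeMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> taxonomy = new HashMap<String, String>();
  protected void setup(Context ctx) {
    // In practice this table would be loaded from the distributed cache.
    taxonomy.put("autos.yahoo.com", "Autos");
    taxonomy.put("finance.yahoo.com", "Finance");
  }
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t"); // user, time, type, source, tag
    String category = taxonomy.get(f[3]);
    if (category == null) return; // irrelevant event, dropped
    ctx.write(new Text(f[0]),
              new Text(f[1] + "\t" + f[2] + "\t" + category + "\t" + f[4]));
  }
}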
Features & Targets
Time
Query View Click
T
Feature
Window
Target
Window
Targets

User-Actions of Interest

Clicks on Ads & Content

Site & Page visits

Conversion Events

Purchases, Quote requests

Sign-up for membership, etc.


Features

Summary of user activities over a time-window

Aggregates, moving averages, rates over various time-windows

Incrementally updated
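One way to keep such aggregates incrementally updated (an illustrative technique, not necessarily the talk's method) is an exponentially decayed counter per user and category: each daily batch decays the stored value and adds the day's count, so only one day of raw events is touched per run.

// Illustrative sketch only; the decay constant is made up.
class DecayedActivityCounter {
  private static final double DAILY_DECAY = 0.9; // tunes the effective time-window
  private double value;                          // decayed activity level

  // Called once per daily batch with that day's raw event count
  // for one (user, category) pair.
  void update(long todaysCount) {
    value = DAILY_DECAY * value + todaysCount;
  }

  double get() { return value; }
}

A decay of 0.9 gives an effective window of roughly ten days; running several constants in parallel yields the "rates over various time-windows" above.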
Joining Targets & Features

Target rates very low: 0.01% ~ 1%

First, construct targets

Filter user activity without targets

Join feature vector with targets
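A hedged sketch of this step as a reduce-side join on user id, assuming mappers tag each record with "T\t" for a target or "F\t" for a feature vector; the tags and record layout are assumptions. Filtering out users without targets is what makes the 0.01% ~ 1% target rates tractable.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class TargetFeatureJoinReducer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text user, Iterable<Text> records, Context ctx)
      throws IOException, InterruptedException {
    List<String> targets = new ArrayList<String>();
    String features = null;
    for (Text r : records) {
      String s = r.toString();
      if (s.startsWith("T\t")) targets.add(s.substring(2));
      else features = s.substring(2);
    }
    if (targets.isEmpty() || features == null) return; // no training example
    for (String t : targets)
      ctx.write(user, new Text(t + "\t" + features)); // one labeled example per target
  }
}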


Model Training

Regressions

Boosted Decision Trees

Naive Bayes

Support Vector Machines

Maximum Entropy modeling

Conditional Random Fields


Model Training

Some algorithms are difficult/inefficient to implement in Map-Reduce

Require fine-grain iterations

Different models in parallel

Model for each target response in parallel
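One way to realize "a model for each target response in parallel" (a sketch under assumptions, not necessarily the talk's approach) is to make the models, rather than the learning algorithm, the unit of parallelism: partition training examples by target id so each reduce group trains one independent model. A Naive-Bayes-style count model is shown because it fits the streaming reduce interface; fine-grain iterative learners would run outside Map-Reduce.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (targetId, "label \t feature1,feature2,..."); output: per-target
// feature counts, i.e. one independent model per reduce group.
class PerTargetTrainer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text targetId, Iterable<Text> examples, Context ctx)
      throws IOException, InterruptedException {
    Map<String, long[]> counts = new HashMap<String, long[]>(); // feature -> [pos, neg]
    for (Text ex : examples) {
      String[] f = ex.toString().split("\t");
      int slot = f[0].equals("1") ? 0 : 1;
      for (String feature : f[1].split(",")) {
        long[] c = counts.get(feature);
        if (c == null) counts.put(feature, c = new long[2]);
        c[slot]++;
      }
    }
    for (Map.Entry<String, long[]> e : counts.entrySet())
      ctx.write(targetId,
          new Text(e.getKey() + "\t" + e.getValue()[0] + "\t" + e.getValue()[1]));
  }
}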


Offline Scoring & Evaluation

Apply model weights to features

Pleasantly parallel

Janino Embedded Compiler

Sort by scores and compute metrics

Evaluate metrics
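For a linear model, "apply model weights to features" is a map-only dot product, which is why the step is pleasantly parallel. A minimal sketch with hard-coded illustrative weights and an assumed "user \t feat=value,feat=value,..." input; per the slide, the real pipeline compiles model-specific scoring code at runtime with Janino rather than hand-writing it like this.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class ScoringMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, Double> weights = new HashMap<String, Double>();
  protected void setup(Context ctx) {
    // Model weights would be loaded from the distributed cache;
    // these values are made up for illustration.
    weights.put("Insurance:Auto", 1.7);
    weights.put("Finance:Brokerage", 0.4);
  }
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t");
    double dot = 0.0;
    for (String fv : f[1].split(",")) {
      String[] p = fv.split("=");
      Double w = weights.get(p[0]);
      if (w != null) dot += w * Double.parseDouble(p[1]);
    }
    double score = 1.0 / (1.0 + Math.exp(-dot)); // logistic link
    ctx.write(new Text(f[0]), new Text(Double.toString(score)));
  }
}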
Batch Scoring

Apply models to features from all user activity

Upload scores to serving systems


User Modeling Pipeline

Component | Data Volume | Time
Data Acquisition | 1 TB | 2-3 Hours
Feature & Target Generation | 1 TB * Feature Window Size | 4-6 Hours
Model Training | 50-100 GB | 1-2 Hours for 100s of Models
Scoring | 500 GB | 1 Hour
Acknowledgements

Vijay K Narayanan

Vishwanath Ramarao

Nitin Motgi

And Numerous Hadoop Application Developers at Yahoo!
Questions?
