
Hadoop Usage At

Yahoo!
Milind Bhandarkar
(milindb@yahoo-inc.com)
About Me

Parallel Programming since 1989

High-Performance Scientific Computing, 1989 - 2005

Data-Intensive Computing, 2005 - ...

Hadoop Solutions Architect @ Yahoo!

Contributor to Hadoop since 2006

Training, Consulting, Capacity Planning


History

2004-2005: Hadoop prototyped in Apache Lucene

January 2006: Hadoop becomes subproject of Lucene

January 2008: Hadoop becomes top-level Apache project

Latest Release: 0.21

Stable Release: 0.20.x (+ Strong Authentication)


Hadoop Ecosystem

HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
Hadoop at Yahoo!

Behind Every Click!

38,000+ Servers

Largest cluster is 4000+ servers

1+ Million Jobs per month

170+ PB of Storage

10+ TB Compressed Data Added Per Day

1000+ Users
!"
$"
%"
&"
'"
("
)"
*"
+"
"
*'"
*""
+'"
+""
'"
"
*""& *""% *""$ *""! *"+"
,-./0
)$1 2345346
+%" 78 29-4/:3
+;< ;-=9>?0 @-A6
!
"
#
$
%
&
'
(
%

#
*

+
,
-
.
,
-
%

/
,
0
&
1
2
0
,
%

3,%,&-4"
+45,'4,
678&40
9&5:2
/-#($4;#'
B/.--C 2345346
B/.--C 29-4/:3 D78E
Hadoop Growth
Hadoop Clusters

Hadoop Dev, QA, Benchmarking (10%)

Sandbox, Release Validation (10%)

Science, Ad-Hoc Usage (50%)

Production (30%)
Sample Applications

Science + Big Data + Insight = Personal Relevance = Value

Log processing: Analytics, Reporting, Buzz

User Modeling

Content Optimization, Spam filters

Computational Advertising
Major Data Sources

Content (Web Pages, Blogs, News Articles, Media)

Search Queries

Advertisements (Display, Search)

User
Web Graph Analysis

100+ Billion Web Pages

1+ PB Content

2 Trillion links

300+ TB of compressed output

Before Hadoop: 1 Month

With Hadoop: 1 Week


Search Assist

Related concepts occur closer together

3 years of query logs, sessionized per user

10+ Terabytes of Natural Language Text Corpus

Build Dictionaries with Hadoop & Push to Serving

Before Hadoop: 4 Weeks

With Hadoop: 30 minutes
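A minimal sketch of the dictionary build as a Hadoop Map-Reduce job; the class names, the input record format ("user \t query1;query2;..."), and the count threshold are illustrative assumptions, not the production code. The mapper emits every pair of queries seen in one session, the reducer sums the counts, and frequent pairs become "related concept" dictionary candidates.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input value: one sessionized record, e.g. "user42 \t car insurance;geico;esurance".
class CooccurrenceMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    String[] queries = record.toString().split("\t")[1].split(";");
    // Emit each unordered pair of queries that co-occur in the session.
    for (int i = 0; i < queries.length; i++)
      for (int j = i + 1; j < queries.length; j++)
        ctx.write(new Text(queries[i] + "|" + queries[j]), ONE);
  }
}

// Sum the pair counts; high-count pairs go into the serving dictionary.
class CooccurrenceReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  protected void reduce(Text pair, Iterable<LongWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable c : counts) sum += c.get();
    if (sum >= 10) ctx.write(pair, new LongWritable(sum)); // threshold is arbitrary
  }
}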


Mail Spam Filtering

Challenge: Scale

450+ Million mailboxes

5+ Billion deliveries per day

25+ Billion connections

Challenge: User feedback is often late, noisy, inconsistent
Data Factory

40+ Billion Events Per Day

Parse & Transform Event Streams

Join Clicks & Views

Filter out Robots

Aggregate, Sort, Partition

Data Quality Checks
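The "Join Clicks & Views" step is a classic reduce-side join. A hedged sketch, assuming each event line carries a shared impression id ("impressionId \t VIEW|CLICK \t payload"); the record format and field names are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Key every event by its impression id so a view and its clicks
// meet in the same reduce group.
class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t", 2);
    ctx.write(new Text(f[0]), new Text(f[1])); // impressionId -> "TYPE \t payload"
  }
}

class ClickViewJoinReducer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text impression, Iterable<Text> events, Context ctx)
      throws IOException, InterruptedException {
    String view = null;
    int clicks = 0;
    for (Text e : events) {
      String s = e.toString();
      if (s.startsWith("VIEW")) view = s;
      else clicks++;
    }
    // Emit the view annotated with its click count; clicks with no
    // matching view (often robot or malformed traffic) are dropped here.
    if (view != null)
      ctx.write(impression, new Text(view + "\tclicks=" + clicks));
  }
}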


User Modeling

Objective: Determine user interests by mining user activities

Large dimensionality of possible user activities

Typical user has sparse activity vector

Event attributes change over time


User Activities

Attribute | Possible Values | Typical Values Per User
Pages | 1+ Million | 10-100
Queries | 100+ Million | 10s
Ads | 100+ Thousand | 10s
User-Modeling Pipeline

Sessionization (sketched after this list)

Feature and Target Generation

Model Training

Offline Scoring & Evaluation

Batch Scoring & Upload to serving
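A minimal sketch of the sessionization step, assuming a prior map phase keys each event by user id with a zero-padded millisecond timestamp, and using the conventional 30-minute inactivity gap; a production job would likely use secondary sort rather than sorting in memory, but the sparse activity vectors noted above keep per-user data small.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (userId, "timestamp \t event"); output: one line per session.
class SessionizeReducer extends Reducer<Text, Text, Text, Text> {
  private static final long GAP_MS = 30 * 60 * 1000; // 30-minute session gap
  protected void reduce(Text user, Iterable<Text> events, Context ctx)
      throws IOException, InterruptedException {
    List<String> evts = new ArrayList<String>();
    for (Text e : events) evts.add(e.toString());
    Collections.sort(evts); // zero-padded timestamps: lexical == chronological
    StringBuilder session = new StringBuilder();
    long prev = -1;
    for (String e : evts) {
      long ts = Long.parseLong(e.split("\t")[0]);
      if (prev >= 0 && ts - prev > GAP_MS) { // inactivity gap closes the session
        ctx.write(user, new Text(session.toString()));
        session.setLength(0);
      }
      session.append(e).append(';');
      prev = ts;
    }
    if (session.length() > 0) ctx.write(user, new Text(session.toString()));
  }
}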


Data Acquisition

User | Time | Event | Source
U0 | T0 | Visited Yahoo! Autos | Web Server Logs
U0 | T1 | Searched for Car Insurance | Search Logs
U0 | T2 | Browsed stock quotes | Web Server Logs
U0 | T3 | Saw ad for discount brokerage, did not click | Ad Logs
U0 | T4 | Checked Yahoo! Mail | Web Server Logs
U0 | T5 | Clicked Ad for Auto Insurance | Ad Logs, Click Logs
Normalization

User | Time | Event | Tag
U0 | T0 | View | Category: Autos, Tag: Mercedes Benz
U0 | T1 | Query | Category: Insurance, Tag: Auto
U0 | T2 | View | Category: Finance, Tag: YHOO
U0 | T3 | View-Click | Category: Finance, Tag: Brokerage
U0 | T4 | Browse | Irrelevant Event, Dropped
U0 | T5 | View+Click | Category: Insurance, Tag: Auto
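Normalization like the table above can run as a map-only job. A minimal sketch, assuming tab-separated raw events and a source-to-category taxonomy table; the taxonomy entries and field layout are illustrative assumptions.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Raw events become "user \t time \t eventType \t category \t tag";
// events whose source has no mapping (e.g. a mail check) are dropped.
class NormalizeMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> taxonomy = new HashMap<String, String>();
  protected void setup(Context ctx) {
    // In practice this table would be loaded from the distributed cache.
    taxonomy.put("autos.yahoo.com", "Autos");
    taxonomy.put("finance.yahoo.com", "Finance");
  }
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t"); // user, time, type, source, tag
    String category = taxonomy.get(f[3]);
    if (category == null) return; // irrelevant event, dropped
    ctx.write(new Text(f[0]),
              new Text(f[1] + "\t" + f[2] + "\t" + category + "\t" + f[4]));
  }
}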
Features & Targets
Time
Query View Click
T
Feature
Window
Target
Window
Targets

User-Actions of Interest

Clicks on Ads & Content

Site & Page visits

Conversion Events

Purchases, Quote requests

Sign-up for membership, etc.


Features

Summary of user activities over a time-window

Aggregates, moving averages, rates over various time-windows

Incrementally updated
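One way to keep such aggregates incrementally updated (an illustrative technique, not necessarily the talk's method) is an exponentially decayed counter per user and category: each daily batch decays the stored value and adds the day's count, so only one day of raw events is touched per run.

// Illustrative sketch only; the decay constant is made up.
class DecayedActivityCounter {
  private static final double DAILY_DECAY = 0.9; // tunes the effective time-window
  private double value;                          // decayed activity level

  // Called once per daily batch with that day's raw event count
  // for one (user, category) pair.
  void update(long todaysCount) {
    value = DAILY_DECAY * value + todaysCount;
  }

  double get() { return value; }
}

A decay of 0.9 gives an effective window of roughly ten days; running several constants in parallel yields the "rates over various time-windows" above.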
Joining Targets & Features

Target rates very low: 0.01% ~ 1%

First, construct targets

Filter user activity without targets

Join feature vector with targets
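A hedged sketch of this step as a reduce-side join on user id, assuming mappers tag each record with "T\t" for a target or "F\t" for a feature vector; the tags and record layout are assumptions. Filtering out users without targets is what makes the 0.01% ~ 1% target rates tractable.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class TargetFeatureJoinReducer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text user, Iterable<Text> records, Context ctx)
      throws IOException, InterruptedException {
    List<String> targets = new ArrayList<String>();
    String features = null;
    for (Text r : records) {
      String s = r.toString();
      if (s.startsWith("T\t")) targets.add(s.substring(2));
      else features = s.substring(2);
    }
    if (targets.isEmpty() || features == null) return; // no training example
    for (String t : targets)
      ctx.write(user, new Text(t + "\t" + features)); // one labeled example per target
  }
}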


Model Training

Regressions

Boosted Decision Trees

Naive Bayes

Support Vector Machines

Maximum Entropy modeling

Conditional Random Fields


Model Training

Some algorithms are difficult/inefficient to implement in Map-Reduce

Require fine-grain iterations

Different models in parallel

Model for each target response in parallel
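One way to realize "a model for each target response in parallel" (a sketch under assumptions, not necessarily the talk's approach) is to make the models, rather than the learning algorithm, the unit of parallelism: partition training examples by target id so each reduce group trains one independent model. A Naive-Bayes-style count model is shown because it fits the streaming reduce interface; fine-grain iterative learners would run outside Map-Reduce.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (targetId, "label \t feature1,feature2,..."); output: per-target
// feature counts, i.e. one independent model per reduce group.
class PerTargetTrainer extends Reducer<Text, Text, Text, Text> {
  protected void reduce(Text targetId, Iterable<Text> examples, Context ctx)
      throws IOException, InterruptedException {
    Map<String, long[]> counts = new HashMap<String, long[]>(); // feature -> [pos, neg]
    for (Text ex : examples) {
      String[] f = ex.toString().split("\t");
      int slot = f[0].equals("1") ? 0 : 1;
      for (String feature : f[1].split(",")) {
        long[] c = counts.get(feature);
        if (c == null) counts.put(feature, c = new long[2]);
        c[slot]++;
      }
    }
    for (Map.Entry<String, long[]> e : counts.entrySet())
      ctx.write(targetId,
          new Text(e.getKey() + "\t" + e.getValue()[0] + "\t" + e.getValue()[1]));
  }
}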


Offline Scoring & Evaluation

Apply model weights to features

Pleasantly parallel

Janino Embedded Compiler

Sort by scores and compute metrics

Evaluate metrics
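For a linear model, "apply model weights to features" is a map-only dot product, which is why the step is pleasantly parallel. A minimal sketch with hard-coded illustrative weights and an assumed "user \t feat=value,feat=value,..." input; per the slide, the real pipeline compiles model-specific scoring code at runtime with Janino rather than hand-writing it like this.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class ScoringMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, Double> weights = new HashMap<String, Double>();
  protected void setup(Context ctx) {
    // Model weights would be loaded from the distributed cache;
    // these values are made up for illustration.
    weights.put("Insurance:Auto", 1.7);
    weights.put("Finance:Brokerage", 0.4);
  }
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t");
    double dot = 0.0;
    for (String fv : f[1].split(",")) {
      String[] p = fv.split("=");
      Double w = weights.get(p[0]);
      if (w != null) dot += w * Double.parseDouble(p[1]);
    }
    double score = 1.0 / (1.0 + Math.exp(-dot)); // logistic link
    ctx.write(new Text(f[0]), new Text(Double.toString(score)));
  }
}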
Batch Scoring

Apply models to features from all user activity

Upload scores to serving systems


User Modeling Pipeline

Component | Data Volume | Time
Data Acquisition | 1 TB | 2-3 Hours
Feature & Target Generation | 1 TB * Feature Window Size | 4-6 Hours
Model Training | 50-100 GB | 1-2 Hours for 100s of Models
Scoring | 500 GB | 1 Hour
Acknowledgements

Vijay K Narayanan

Vishwanath Ramarao

Nitin Motgi

And Numerous Hadoop Application Developers at Yahoo!
Questions?
