Hadoop at Foursquare

Hadoop @ Foursquare
Joe Crobak, Software Engineer, @joecrobak
Blake Shaw, PhD, Data Scientist, @metablake
What is Foursquare?
An app that helps you explore your city and connect with friends. A platform for location-based services and data.
What is Foursquare?
People use foursquare to: share with friends, discover new places, get tips, get deals, earn points and badges, keep track of visits.
What is Foursquare?
Mobile, Social, Local
Stats
20,000,000+ people, 30,000,000+ places, 2,000,000,000+ check-ins, 1,500+ actions/second
Video: http://vimeo.com/29323612
Overview: Intro to Foursquare, Data Mining Signals from Check-ins, Data Pipeline, Data Infrastructure
Past Present Future
Explore
A social recommendation engine built from check-in data
What is a place?
Central Park JFK
Time signatures for places
Ice cream?
[Figure: % of checkins at ice cream shops and temperature (°F), by month in 2011 (New York)]
Check-ins and the weather

Warm weather spots: ice cream shops, lakes, roof decks, boats or ferries, harbors or marinas, sculpture gardens, tracks, basketball courts, parks
Cold weather spots: basketball stadiums, hockey stadiums, art galleries, skating rinks, boutiques, steakhouses, ramen or noodle houses
Finding similar items

Critical for our recommendation engine. Large sparse k-nearest neighbor problem. Items can be places, people, brands. Different distance metrics. Need to exploit sparsity, otherwise intractable.
Finding similar items

Metrics we find work best for recommending:
Places: cosine similarity, $\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$
Friends: intersection, $\mathrm{sim}(A, B) = |A \cap B|$
Brands: Jaccard similarity, $\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
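The three metrics on this slide (cosine similarity, intersection, Jaccard similarity) can be sketched in a few lines. This is only an illustration, not the code behind Explore: the function names are made up here, and the sparse vectors are assumed to be stored as dicts that hold only the non-zero entries.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * y[k] for k, w in x.items() if k in y)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_x * norm_y)

def intersection_size(a, b):
    """|A ∩ B|: number of items the two sets share."""
    return len(set(a) & set(b))

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 here when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Storing only non-zero entries is what keeps these cheap on sparse data: the cost depends on how many entries two items actually have, not on the full dimensionality.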
Computing venue similarity
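$X \in \mathbb{R}^{n \times d}$: each entry $X_{ij}$ is the log(# of checkins at place i by user j); one row for each of the 30m venues...
$K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$, with $K \in \mathbb{R}^{n \times n}$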
Computing venue similarity

Naive solution for computing $K$: $O(n^2 d)$
Requires ~4.5m machines to compute in < 24 hours!!! and 3.6PB to store!
$K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$, with $K \in \mathbb{R}^{n \times n}$
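The 3.6PB storage figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes each similarity score is stored as a 4-byte float and that PB means 10^15 bytes; neither assumption is stated on the slide. The machine count is not reproduced, since it depends on the number of users d and on per-machine throughput, which the slide does not give.

```python
# Size of a dense n x n venue similarity matrix.
n_venues = 30_000_000        # "30m venues" from the slide
bytes_per_entry = 4          # assumption: one 32-bit float per score

entries = n_venues ** 2                  # 9 * 10^14 pairwise scores
total_bytes = entries * bytes_per_entry  # 3.6 * 10^15 bytes
petabytes = total_bytes / 10 ** 15       # assumption: 1 PB = 10^15 bytes

print(f"{entries:,} entries, {petabytes:.1f} PB")
```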
Venue similarity w/ map reduce

map: input key = user, value = visited venues. Emit all pairs of visited venues (vi, vj) for each user, each with a score.
reduce: input key = (vi, vj), values = score, score, ... Sum up each user's score contribution to this pair of venues to get the final score.
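A minimal in-memory sketch of this map and reduce, to make the data flow concrete. It is not the production job: the function names are invented, each user's per-venue weights are assumed to be precomputed (for example from check-in counts), and the final division by the venue norms to turn the summed scores into cosine similarity is an assumption about how the "final score" is normalized.

```python
import math
from collections import defaultdict
from itertools import combinations

def map_user(venue_weights):
    """Map step: for one user, emit every pair of visited venues with a score."""
    for vi, vj in combinations(sorted(venue_weights), 2):
        # This user's contribution to the dot product x_i . x_j.
        yield (vi, vj), venue_weights[vi] * venue_weights[vj]

def reduce_scores(emitted):
    """Reduce step: sum the per-user contributions for each venue pair."""
    totals = defaultdict(float)
    for pair, score in emitted:
        totals[pair] += score
    return dict(totals)

def venue_similarity(user_visits):
    """user_visits: {user: {venue: weight}}. Returns {(vi, vj): cosine similarity}."""
    emitted = [kv for weights in user_visits.values() for kv in map_user(weights)]
    dots = reduce_scores(emitted)
    # Squared norm of each venue's row, accumulated over users.
    sq_norms = defaultdict(float)
    for weights in user_visits.values():
        for venue, w in weights.items():
            sq_norms[venue] += w * w
    return {
        (vi, vj): dot / math.sqrt(sq_norms[vi] * sq_norms[vj])
        for (vi, vj), dot in dots.items()
        if sq_norms[vi] > 0 and sq_norms[vj] > 0
    }
```

Only venue pairs that at least one user visited together are ever emitted, which is how this formulation exploits sparsity instead of filling in all n × n entries.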
Past Present Future
Data pipeline - stats
1,500,000,000+ log events / week
2,500,000,000,000+ bytes / week
[Chart: GB / day (compressed) for api collection, May 2011 to May 2012]

And lots of people are using it!
100+ hive users. several users with > 100 jobs. ~ 700 MR jobs / day.
Data pipeline - background

Foursquare's technology stack: Amazon EC2, MongoDB, Solr / elasticsearch, Scala, Lift web framework, Flume (0.9.x aka old-gen), Amazon S3
Data pipeline - overview

API / WWW -> JSON -> Flume Collector -> S3 (.../collection-name/dt=2012-06-19/...)
mongodb -> Export Process -> S3
S3 -> Hive, MapReduce
Data pipeline - logs

API / WWW -> JSON -> Flume Collector -> S3 (.../collection-name/dt=2012-06-19/...)
Applications log JSON

Some common fields (e.g. event id, timestamp, host). Data is partitioned by collection and date in S3. One table per collection in Hive.
Flume for data transport.
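To illustrate the partitioning by collection and date, here is a sketch that derives the `collection-name/dt=YYYY-MM-DD/` portion of the path from a JSON log event. The field name `timestamp`, its unit (epoch seconds), the use of UTC for the date, and the function name are all assumptions for this example. The leading part of the S3 path is elided on the slide and is left out here.

```python
import json
from datetime import datetime, timezone

def partition_for(collection, raw_event):
    """Return the 'collection/dt=YYYY-MM-DD/' partition for one JSON log line."""
    event = json.loads(raw_event)
    # Assumption: 'timestamp' holds epoch seconds, and partitions use UTC dates.
    day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).date()
    return f"{collection}/dt={day.isoformat()}/"
```

Because the date appears in the path as `dt=...`, a Hive table over the collection can treat it as a partition column and read only the days a query needs.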
Data pipeline - mongodb

Mongo data is nice to work with in MapReduce:
info in logs can be stale. certain attributes not in logs. can scan much less data.
mongodb -> Export Process -> S3
MapReduce against production mongo cluster would degrade performance and/or cause denial-of-service.
Data pipeline - analytics

Automated reporting
typically a Hive query -> google docs spreadsheet.
Ad hoc reporting
hive dashboard (RoR) for entering a query and receiving an email when results are ready.
Data pipeline - beekeeper
Data pipeline - Summary

Log data and snapshots of mongo data are stored in S3. Users query/analyze the data using Hive, Pig, and MapReduce. Compiled data is inserted to mongo or google spreadsheets for reporting.
Past Present Future
Data infrastructure - past

Elastic MapReduce
but we were keeping clusters continuously running.
Rudimentary workflow management

start daily reporting at time X. Hope that data is there. difficult to monitor.

Scaling the number of users was troublesome
most of the company uses hive. hive server continuously crashed. lots of memory issues. resource contention.
Mongo data
converted to delimited records, which doesn't always make sense. incremental dumps - some data not consistent (e.g. if two venues are merged). basic schema detection. single threaded per-collection.

Hive and EMR flows supported for automated reporting.
lots of mapreduce tools written in ruby:
everything else is scala.
want to use common utilities.
installing gems on system is brittle.
Past Present Future
Data infrastructure - present

Introduced a lot of new systems
Cloudera's Hadoop Distro - CDH3u3
Oozie for workflow / data management
Pig for reporting
Scaled back ruby / hive dashboard
BSON mongo dumps
Scala MapReduce
Scoobi
Data infrastructure - CDH3u3

12-node cluster in EC2 on cc2.8xlarge instances
data is in S3.
fair scheduler (jobs run as submitting user).
performance improvements:
skew means slowest reducer often defines wall-clock time.
signs of virtualization.
cpu bound (data compression).
Data infrastructure - oozie

Pros
better monitoring (though not perfect). coordinators for dataset management are great. oozie distributes job submission via map tasks. SPoF but recovers after a restart (state stored in DB).
Cons
deployment is not ideal, it's difficult to version workflows. configuration via XML - lots of boilerplate.
we have a scaffolding script to bootstrap a workflow.
Oozie coordinator (the good)

Coordinator: Workflow F depends on Dataset A (a dataset instance in S3 / HDFS).
Does data exist yet? Yes? Kick off workflow!
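The coordinator idea (check whether every dataset instance a workflow depends on exists, then kick the workflow off) can be sketched as follows. This is not Oozie's API or configuration: the function names, the `{dt}` URI template, and the injected `exists` and `run_workflow` callables are hypothetical stand-ins for illustration.

```python
from datetime import date

def dataset_instance(uri_template, day):
    """Resolve a dataset URI template such as 'dataset-a/dt={dt}/' for one day."""
    return uri_template.format(dt=day.isoformat())

def maybe_kick_off(day, dependencies, exists, run_workflow):
    """Start the workflow for `day` only if every input dataset instance exists.

    dependencies: list of URI templates the workflow depends on.
    exists:       callable(uri) -> bool, e.g. a check against S3 or HDFS.
    run_workflow: callable(day) that starts the workflow.
    Returns True if the workflow was started, False if data is still missing.
    """
    instances = [dataset_instance(t, day) for t in dependencies]
    if all(exists(uri) for uri in instances):
        run_workflow(day)
        return True
    return False
```

Calling `maybe_kick_off` on a schedule replaces "start at time X and hope that data is there" with an explicit check per dataset instance.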
Oozie XML (the bad)

Hello World in Oozie
just invoking HelloWorld#main
Pig for reporting

Converted some ruby streaming to Pig + Scala UDFs. More natural than Hive for some reports, especially those that output to multiple locations. Elephant-Bird (twitter), Piggybank (apache), Data-fu (LinkedIn) all great UDF resources.
Ad hoc reporting dashboard

Uses hive thrift server to validate syntax (via EXPLAIN). Submits jobs as Oozie workflows.
The query is a parameter to the workflow. Queries run as the users that submit them.
$QUERY -> Beekeeper
1. EXPLAIN $QUERY (Beekeeper -> Thrift Server)
2. submit workflow, query=$QUERY (Beekeeper -> Oozie REST)
3. (repeatedly) is workflow done? (Beekeeper -> Oozie REST)
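The three numbered steps can be sketched as a small control loop. The Thrift and Oozie REST calls are not shown; `explain`, `submit_workflow`, and `is_done` are hypothetical callables standing in for them, and the polling interval is an arbitrary choice for this example.

```python
import time

def run_query(query, explain, submit_workflow, is_done,
              poll_seconds=5.0, sleep=time.sleep):
    """Validate a query, submit it as a workflow, and wait for it to finish.

    explain:         callable(query); expected to raise if the query is invalid.
    submit_workflow: callable(query) -> workflow id.
    is_done:         callable(workflow_id) -> bool.
    """
    explain(query)                        # 1. EXPLAIN $QUERY
    workflow_id = submit_workflow(query)  # 2. submit workflow, query=$QUERY
    while not is_done(workflow_id):       # 3. (repeatedly) is workflow done?
        sleep(poll_seconds)
    return workflow_id
```

Validating with EXPLAIN first means a syntax error is reported immediately, before any workflow is submitted.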
Hive dashboard - error

BSON data dumps

Full loads each day, parallelized. MongoDB's native format is BSON:
Binary JSON. some extensions to JSON. schema-less.
BSON data infrastructure

From top of the stack to bottom:
- Hive SerDe and Scoobi Inputs (Recordv2 SerDe / Scoobi Input).
- InputFormat for Thrift objects to use in MR (ThriftBsonInputFormat).
- Scala Codegen converts to Thrift Object (Thrift, scala codegen).
- BSON InputFormat converts to BSONObject.
- Oozie Workflow to mount snapshot, split data, compress, upload to S3 (BSON split / LZO compress).
- Mongo stores BSON on EBS. Periodic EBS Snapshots.
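One property that makes the "split data" step tractable is that every BSON document begins with its own total length as a little-endian int32, so document boundaries in a dump of back-to-back documents can be found without decoding any fields. The sketch below illustrates that property only; it is not the export code from the talk. The tiny encoder handles just strings and 32-bit integers, and both function names are invented for this example.

```python
import struct

def encode_doc(fields):
    """Encode a flat {str: str | int} dict as a BSON document (string and int32 only)."""
    body = b""
    for name, value in fields.items():
        key = name.encode("utf-8") + b"\x00"       # element name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            body += b"\x02" + key + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            body += b"\x10" + key + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported type in this sketch: {type(value)!r}")
    # Total length counts the 4 length bytes and the trailing 0x00 terminator.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

def document_offsets(dump):
    """Return (offset, length) for each document in concatenated BSON bytes."""
    offsets, pos = [], 0
    while pos < len(dump):
        (length,) = struct.unpack_from("<i", dump, pos)
        if length < 5 or pos + length > len(dump):
            raise ValueError(f"corrupt document length at offset {pos}")
        offsets.append((pos, length))
        pos += length  # jump straight to the next document
    return offsets
```

With the offsets in hand, a dump can be cut into chunks that each contain whole documents, which can then be compressed and uploaded independently.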
Scooby
Not that Scooby!
Scoobi
A strongly-typed data flow language written in Scala. Much easier than writing MapReduce, but still very flexible. https://github.com/nicta/scoobi
Scoobi Example
Counting checkins
Data infrastructure - Data Joins

Joins in MapReduce are cumbersome.
Do them once!
Data infrastructure - Data Joins

[Diagram: separate Venue, Checkins, Tips, and Likes datasets go through a single Data Join, producing one record per Venue that carries its Checkins, Tips, and Likes]
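A small in-memory sketch of the "join once" idea: group the related records under their venue so downstream jobs can read one pre-joined record per venue. The field names `id` and `venue_id` and the function name are assumptions for illustration, and a real job would do this grouping in a reduce step rather than in memory.

```python
def join_by_venue(venues, **related):
    """Attach related records (e.g. checkins, tips, likes) to their venue.

    venues:  iterable of dicts with an 'id' field (assumed name).
    related: keyword arguments mapping a dataset name to an iterable of
             dicts, each with a 'venue_id' field (assumed name).
    Returns {venue_id: {'venue': ..., <dataset name>: [records], ...}}.
    """
    joined = {
        v["id"]: {"venue": v, **{name: [] for name in related}}
        for v in venues
    }
    for name, records in related.items():
        for record in records:
            row = joined.get(record["venue_id"])
            if row is not None:  # skip records for unknown venues
                row[name].append(record)
    return joined
```

For example, `join_by_venue(venues, checkins=checkins, tips=tips, likes=likes)` yields one entry per venue with three lists attached.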
Past Present Future
Future Work
HCatalog
Makes Hive tables (including input formats and serdes) available to Pig and MapReduce. Add support for Scoobi.
Indexing / Hive-indexing
Relational / MPP database for analytics dashboarding
Key-value store for easily serving hadoop data in prod.
Replacing Flume 0.9.4
Open Source
Let us know what you might find useful
Join us!
foursquare is hiring! 115+ people and growing. foursquare.com/jobs
