
Hadoop @ Foursquare

Joe Crobak, Software Engineer, @joecrobak
Blake Shaw, PhD, Data Scientist, @metablake

What is Foursquare?
An app that helps you explore your city and connect with friends
A platform for location based services and data

What is Foursquare?
People use foursquare to:
  share with friends
  discover new places
  get tips
  get deals
  earn points and badges
  keep track of visits

What is Foursquare?
Mobile Social

Local

Stats

20,000,000+ people
30,000,000+ places
2,000,000,000+ check-ins
1500+ actions/second

Video: http://vimeo.com/29323612

Overview
Intro to Foursquare Data Mining Signals from Check-ins
Data Pipeline
Data Infrastructure
  Past
  Present
  Future

Explore
A social recommendation engine built from check-in data

What is a place?
Central Park JFK

Time signatures for places

Ice cream?
[Figure: % of checkins at ice cream shops and temperature (°F), by month in 2011 (New York)]

Check-ins and the weather


Warm weather spots ice cream shops Cold weather spots lakes

roof decks boats or ferries harbors or marinas sculpture gardens tracks basketball courts parks

basketball stadiums hockey stadiums art galleries skating rinks boutiques steakhouses ramen or noodle house

Finding similar items


Critical for our recommendation engine
Large sparse k-nearest neighbor problem
Items can be places, people, brands
Different distance metrics
Need to exploit sparsity, otherwise intractable

Finding similar items


Metrics we find work best for recommending:

Places: cosine similarity
$$\mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$$

Friends: intersection
$$\mathrm{sim}(A, B) = |A \cap B|$$

Brands: Jaccard similarity
$$\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
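The three metrics can be sketched in a few lines of Python. This is a minimal illustration, not the production code: it assumes sparse vectors are stored as dicts (key to weight) and that friends and brands are represented as sets, and all function names are made up for the example.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(w * y[k] for k, w in x.items() if k in y)
    norm = sqrt(sum(w * w for w in x.values())) * sqrt(sum(w * w for w in y.values()))
    return dot / norm if norm else 0.0

def intersection_similarity(a, b):
    """Size of the overlap between two sets."""
    return len(a & b)

def jaccard_similarity(a, b):
    """Overlap divided by the size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Iterating only over the keys a vector actually contains is what lets the computation take advantage of sparsity.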

Computing venue similarity

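$$X \in \mathbb{R}^{n \times d}$$
each entry is the log(# of checkins at place i by user j)
one row for each of the 30m venues...

$$K \in \mathbb{R}^{n \times n}, \qquad K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$$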

Computing venue similarity

Naive solution for computing $K$: $O(n^2 d)$
Requires ~4.5m machines to compute in < 24 hours!!!
And 3.6PB to store!

$$K \in \mathbb{R}^{n \times n}, \qquad K_{ij} = \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$$
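The storage figure can be sanity-checked with a little arithmetic. The sketch below assumes the full n x n matrix is stored densely with 4 bytes per entry and that a petabyte is 10^15 bytes; neither assumption is stated on the slide, but together they reproduce the 3.6PB figure for 30m venues.

```python
n_venues = 30_000_000      # "30m venues" from the slide
bytes_per_entry = 4        # assumption: one 32-bit float per similarity score

entries = n_venues ** 2                      # dense n x n matrix K
total_bytes = entries * bytes_per_entry
petabytes = total_bytes / 10**15             # assumption: decimal petabytes

print(f"{entries:.1e} entries, {petabytes:.1f} PB")
```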

Venue similarity w/ map reduce


map
  input: key = user, value = visited venues
  emit all pairs of visited venues for each user
  output: key = (vi, vj), value = score

reduce
  input: key = (vi, vj), values = score, score, ...
  sum up each user's score contribution to this pair of venues
  output: key = (vi, vj), value = final score
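The map and reduce steps can be simulated in memory to show the idea. This is a sketch under assumptions, not the actual job: it takes each user's contribution to a pair to be the product of that user's two venue weights, and it normalizes by venue norms at the end to get a cosine score. The function names and the input layout (user to {venue: weight}) are illustrative.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def map_user(visits):
    """Map: emit ((venue_i, venue_j), partial score) for every pair one user visited."""
    for (vi, wi), (vj, wj) in combinations(sorted(visits.items()), 2):
        yield (vi, vj), wi * wj

def reduce_pairs(emitted):
    """Reduce: sum every user's contribution for each venue pair."""
    totals = defaultdict(float)
    for pair, score in emitted:
        totals[pair] += score
    return totals

def venue_similarity(user_visits):
    """Cosine similarity for every pair of venues that share at least one user."""
    norms_sq = defaultdict(float)
    for visits in user_visits.values():
        for venue, weight in visits.items():
            norms_sq[venue] += weight * weight
    emitted = (kv for visits in user_visits.values() for kv in map_user(visits))
    dots = reduce_pairs(emitted)
    return {
        (vi, vj): dot / sqrt(norms_sq[vi] * norms_sq[vj])
        for (vi, vj), dot in dots.items()
        if norms_sq[vi] and norms_sq[vj]
    }
```

Pairs of venues with no user in common are never emitted, so the work scales with the number of co-visited pairs rather than with n squared.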


Data pipeline - stats

1,500,000,000+ log events / week
2,500,000,000,000+ bytes / week

[Figure: GB / day (compressed) for api collection, May 2011 to May 2012]

Data pipeline - stats


And lots of people are using it!
  100+ Hive users.
  several users with > 100 jobs.
  ~700 MR jobs / day.

Data pipeline - stats

Data pipeline - background


Foursquare's technology stack
  Amazon EC2
  MongoDB
  Solr / elasticsearch
  Scala
    Lift web framework
  Flume (0.9.x aka old-gen)
  Amazon S3

Data pipeline - overview


[Diagram: API / WWW -> Flume Collector -> JSON in S3 under .../collection-name/dt=2012-06-19/...; mongodb -> Export Process -> S3; Hive and MapReduce read from S3]

Data pipeline - logs


[Diagram: API / WWW -> Flume Collector -> JSON in S3 under .../collection-name/dt=2012-06-19/...]

Applications log JSON
  some common fields (e.g. event id, timestamp, host)
  data is partitioned by collection and date in S3.
  one table per collection in Hive.

Flume for data transport.
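The collection-and-date layout can be illustrated with a small helper that builds and parses the `collection-name/dt=YYYY-MM-DD` part of a key. This is only a sketch: the bucket and leading prefix are elided in the slide and are left out here, and the function names and the `checkins` collection name are hypothetical.

```python
from datetime import date

def partition_path(collection, day):
    """Relative path of one collection's partition for one day."""
    return f"{collection}/dt={day.isoformat()}"

def parse_partition_path(path):
    """Split 'collection/dt=YYYY-MM-DD' back into (collection, date)."""
    collection, dt_part = path.strip("/").split("/")
    key, _, value = dt_part.partition("=")
    if key != "dt":
        raise ValueError(f"expected a dt= partition, got {dt_part!r}")
    return collection, date.fromisoformat(value)

# Hypothetical collection name, date taken from the slide's example path.
print(partition_path("checkins", date(2012, 6, 19)))
```

Because the date is part of the path, a query over a few days only has to read the matching partitions.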

Data pipeline - mongodb


Mongo data is nice to work with in MapReduce
  info in logs can be stale.
  certain attributes not in logs.
  can scan much less data.

[Diagram: mongodb -> Export Process -> S3]

MapReduce against production mongo cluster would degrade performance and/or cause denial-of-service.

Data pipeline - analytics


Automated reporting
  typically a Hive query -> Google Docs spreadsheet.

Ad hoc reporting
  Hive dashboard (RoR) for entering a query and receiving an email when results are ready.

Data pipeline - beekeeper

Data pipeline - Summary


Log data and snapshots of mongo data are stored in S3.
Users query/analyze the data using Hive, Pig, and MapReduce.
Compiled data is inserted to mongo or google spreadsheets for reporting.


Data infrastructure - past


Elastic MapReduce
  but we were keeping clusters continuously running.

Rudimentary workflow management
  start daily reporting at time X. Hope that data is there.
  difficult to monitor.

Data infrastructure - past


Scaling the number of users was troublesome
  most of the company uses hive.
  hive server continuously crashed.
  lots of memory issues.
  resource contention.

Mongo data
  converted to delimited records, which doesn't always make sense.
  incremental dumps - some data not consistent (e.g. if two venues are merged).
  basic schema detection.
  single threaded per-collection.

Data infrastructure - past


Hive and EMR flows supported for automated reporting.
lots of mapreduce tools written in ruby
  everything else is scala
  want to use common utilities
  installing gems on system is brittle


Data infrastructure - present


Introduced a lot of new systems
  Cloudera's Hadoop Distro - CDH3u3
  Oozie for workflow / data management
  Pig for reporting
  Scaled back ruby / hive dashboard
  BSON mongo dumps
  Scala MapReduce
  Scoobi

Data infrastructure - CDH3u3


12-node cluster in EC2 on cc2.8xlarge instances
  data is in S3
  fair scheduler (jobs run as submitting user)
  performance improvements
  skew means slowest reducer often defines wall-clock time
  signs of virtualization
  cpu bound (data compression)

Data infrastructure - oozie


Pros
  better monitoring (though not perfect).
  coordinators for dataset management are great.
  oozie distributes job submission via map tasks.
  SPoF but recovers after a restart (state stored in DB).

Cons
  deployment is not ideal, it's difficult to version workflows.
  configuration via XML - lots of boilerplate
    we have a scaffolding script to bootstrap a workflow.

Oozie coordinator (the good)


[Diagram: Workflow F depends on Dataset A, a dataset instance in S3 / HDFS. The coordinator asks "Does data exist yet?" and, if yes, kicks off the workflow.]
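The coordinator's check can be sketched as a plain loop: for each dataset instance, run the workflow only once its data exists, and keep waiting on the rest. This is an illustration of the dependency logic only and does not use any Oozie API; `data_exists` and `kickoff` are hypothetical callables supplied by the caller.

```python
def run_coordinator(instances, data_exists, kickoff):
    """Kick off the workflow for every instance whose data exists.

    Returns the instances that are still waiting on their data.
    """
    still_waiting = []
    for instance in instances:
        if data_exists(instance):
            kickoff(instance)
        else:
            still_waiting.append(instance)
    return still_waiting
```

Calling this repeatedly with the returned list replaces "start at time X and hope that data is there" with an explicit dependency on the data.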

Oozie XML (the bad)


Hello World in Oozie
just invoking HelloWorld#main

Pig for reporting


Converted some ruby streaming to Pig + Scala UDFs.
More natural than Hive for some reports, especially those that output to multiple locations.
Elephant-Bird (Twitter), Piggybank (Apache), DataFu (LinkedIn) all great UDF resources.

Ad hoc reporting dashboard


Uses hive thrift server to validate syntax (via EXPLAIN)
Submits jobs as Oozie workflows.
  The query is a parameter to the workflow.
  queries run as the users that submit them.

[Diagram: Beekeeper receives $QUERY, then:]
1. EXPLAIN $QUERY (Thrift Server)
2. submit workflow, query=$QUERY (Oozie REST)
3. (repeatedly) is workflow done? (Oozie REST)
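The three steps of the dashboard flow can be sketched as one function. This is a hedged outline, not Beekeeper's code: `explain`, `submit_workflow`, and `is_done` are hypothetical callables standing in for the Thrift and Oozie REST calls, whose real signatures are not shown in the slides.

```python
import time

def run_adhoc_query(query, explain, submit_workflow, is_done,
                    poll_seconds=30.0, sleep=time.sleep):
    """Validate a query, submit it as a workflow, and wait until it finishes."""
    # Step 1: validate syntax; explain() is expected to raise on a bad query.
    explain(query)
    # Step 2: submit the workflow with the query as a parameter.
    job_id = submit_workflow(query)
    # Step 3: poll repeatedly until the workflow is done.
    while not is_done(job_id):
        sleep(poll_seconds)
    return job_id
```

Validating before submitting means a syntax error is reported immediately instead of after a workflow has been scheduled.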

Hive dashboard - error



BSON data dumps


Full loads each day, parallelized.
mongodb's native format is BSON.
  Binary JSON
  some extensions to JSON
  schema-less

BSON data infrastructure


Hive SerDe and Scoobi Inputs
InputFormat for Thrift objects to use in MR.
Scala Codegen converts to Thrift Object
BSON InputFormat converts to BSONObject
Oozie Workflow to mount snapshot, split data, compress, upload to S3.
Mongo stores BSON on EBS.
  Periodic EBS Snapshots

[Diagram, layers top to bottom: Recordv2 SerDe / Scoobi Input, ThriftBsonInputFormat, Thrift (scala codegen), BSONObject, BSON Split / LZO compress, BSON, EBS Snapshots]
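Splitting a BSON dump is straightforward because every BSON document starts with a 4-byte little-endian length that covers the whole document. The sketch below walks a byte string and cuts it on document boundaries; it assumes the dump is a plain concatenation of documents, and it is an illustration rather than the workflow's actual split step.

```python
import struct

def split_bson_documents(data):
    """Split concatenated BSON documents into a list of raw document bytes."""
    docs = []
    offset = 0
    while offset < len(data):
        if offset + 4 > len(data):
            raise ValueError("truncated length prefix")
        # The int32 length prefix counts itself and the trailing 0x00 byte.
        (size,) = struct.unpack_from("<i", data, offset)
        if size < 5 or offset + size > len(data):
            raise ValueError("invalid or truncated document")
        docs.append(data[offset:offset + size])
        offset += size
    return docs
```

Cutting only on these boundaries keeps every chunk made of whole documents, so chunks can be compressed and processed independently.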

Scooby

Not that Scooby!

Scoobi
A strongly-typed data flow language written in Scala.
Much easier than writing MapReduce, but still very flexible.
https://github.com/nicta/scoobi

Scoobi Example
Counting checkins

Data infrastructure - Data Joins


Joins in MapReduce are cumbersome.
Do them once!

Data infrastructure - Data Joins


[Diagram: separate Venue, Checkins, Tips, and Likes datasets go through a Data Join into one dataset in which each Venue carries its own Checkins, Tips, and Likes]
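The "do them once" idea can be illustrated with an in-memory join that groups checkins, tips, and likes under their venue, so later jobs read one pre-joined record per venue. This is a sketch only: the record layout and the `id` and `venue_id` field names are assumptions made for the example.

```python
def join_by_venue(venues, checkins, tips, likes):
    """Group checkins, tips, and likes under the venue they belong to."""
    joined = {
        venue["id"]: {"venue": venue, "checkins": [], "tips": [], "likes": []}
        for venue in venues
    }
    for name, records in (("checkins", checkins), ("tips", tips), ("likes", likes)):
        for record in records:
            row = joined.get(record["venue_id"])
            if row is not None:  # skip records that reference an unknown venue
                row[name].append(record)
    return joined
```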


Future Work
HCatalog
  Makes Hive tables (including input formats and serdes) available to Pig and MapReduce
  Add support for Scoobi
Indexing/Hive-indexing
Relational / MPP database for analytics dashboarding
Key-value store for easily serving hadoop data in prod.
Replacing Flume 0.9.4

Open Source
Let us know what you might find useful

Join us!
foursquare is hiring!
115+ people and growing
foursquare.com/jobs

Joe Crobak, Software Engineer, @joecrobak
Blake Shaw, PhD, Data Scientist, @metablake
