Joe Crobak, Software Engineer, @joecrobak
Blake Shaw, PhD, Data Scientist, @metablake
What is Foursquare?
An app that helps you explore your city and connect with friends
A platform for location-based services and data
What is Foursquare?
People use foursquare to:
share with friends
discover new places
get tips
get deals
earn points and badges
keep track of visits
What is Foursquare?
Mobile Social
Local
Stats
Video: http://vimeo.com/29323612
Overview
Intro to Foursquare Data
Mining Signals from Check-ins
Data Pipeline
Data Infrastructure
Past Present Future
Explore
A social recommendation engine built from check-in data
What is a place?
Central Park
JFK
Ice cream?
[Figure: % of checkins at ice cream shops and temperature (°F), by month from Jan to Jan.]
roof decks, boats or ferries, harbors or marinas, sculpture gardens, tracks, basketball courts, parks
basketball stadiums, hockey stadiums, art galleries, skating rinks, boutiques, steakhouses, ramen or noodle house
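Friends: intersection
$\mathrm{sim}(A, B) = |A \cap B|$
$\mathrm{sim}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$
$\mathrm{sim}(x_i, x_j) = \dfrac{x_i^\top x_j}{\|x_i\| \, \|x_j\|}$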
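As a minimal sketch of the set-overlap scores above, assume each user's friends are held as a Python set. The user names and the function names `intersection_sim` and `jaccard_sim` are illustrative and do not come from the talk.

```python
def intersection_sim(a: set, b: set) -> int:
    """Raw overlap: the number of friends the two sets share."""
    return len(a & b)


def jaccard_sim(a: set, b: set) -> float:
    """Overlap normalised by the size of the union (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical friend lists, for illustration only.
friends_a = {"ann", "bob", "cat", "dan"}
friends_b = {"bob", "cat", "eve"}

print(intersection_sim(friends_a, friends_b))  # 2
print(jaccard_sim(friends_a, friends_b))       # 0.4
```

The raw intersection favours users with many friends, while the normalised ratio compares overlap relative to the combined size of both lists.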
$X \in \mathbb{R}^{n \times d}$: each entry is the log(# of checkins at place i by user j), one row for every 30m venues...
$K \in \mathbb{R}^{n \times n}$
$O(n^2 d)$
Requires ~4.5m
$K \in \mathbb{R}^{n \times n}$: map, reduce, final score
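The venue-by-user matrix and the map / reduce / final score steps above might be computed along these lines. This is a single-process sketch, not the talk's actual job: the sample data, the function names, and the use of log(1 + count) instead of a plain log (so that a single check-in does not get weight zero) are all assumptions. Emitting products only for venue pairs that share a user avoids materialising the full n x n matrix when the data is sparse.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical check-in counts: (venue, user) -> number of check-ins.
checkins = {
    ("cafe", "u1"): 3, ("cafe", "u2"): 1,
    ("park", "u1"): 3, ("park", "u2"): 1,
    ("gym", "u3"): 5,
}


def weight(count):
    # Assumption: log(1 + count), so one check-in still has a non-zero weight.
    return math.log1p(count)


def map_user(rows):
    """Map: for one user's (venue, weight) rows, emit a partial dot product per venue pair."""
    for (v1, w1), (v2, w2) in combinations(sorted(rows), 2):
        yield (v1, v2), w1 * w2


def venue_similarity(checkins):
    by_user = defaultdict(list)
    sq_norm = defaultdict(float)
    for (venue, user), count in checkins.items():
        w = weight(count)
        by_user[user].append((venue, w))
        sq_norm[venue] += w * w

    # Reduce: sum the partial products for each venue pair.
    dots = defaultdict(float)
    for rows in by_user.values():
        for pair, partial in map_user(rows):
            dots[pair] += partial

    # Final score: divide each dot product by the product of the two venue norms.
    return {
        (v1, v2): dot / math.sqrt(sq_norm[v1] * sq_norm[v2])
        for (v1, v2), dot in dots.items()
    }


scores = venue_similarity(checkins)
print(scores)
```

In this toy data, "cafe" and "park" have identical visitor profiles, so their cosine score is close to 1, and "gym" shares no users with either, so no pair is emitted for it.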
1,500,000,000+
2,500,000,000,000+ bytes / week
[Figure: GB / day (compressed) for api collection, May 2011 to May 2012.]
[Diagram components: mongodb, Hive, MapReduce, S3]
mongodb
Running MapReduce against the production mongo cluster would degrade performance and/or cause denial-of-service.
Ad hoc reporting
hive dashboard for entering a query and receiving an email when results are ready.
RoR
Mongo data converted to delimited records, which doesn't always make sense.
incremental dumps - some data not consistent (e.g. if two venues are merged).
basic schema detection.
single threaded per-collection.
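To illustrate why flattening documents into delimited records does not always make sense, here is a rough sketch. The document shape, the column list, and the helper names `detect_columns` and `to_delimited` are hypothetical, and the actual dump tool is not shown in the slides. Nested values have no natural column, so this sketch falls back to embedding them as JSON text.

```python
import json


def detect_columns(docs):
    """Basic schema detection: the union of top-level keys seen across documents."""
    return sorted({key for doc in docs for key in doc})


def to_delimited(doc, columns, sep="\t"):
    """Flatten one document into a delimited record using a fixed column list."""
    fields = []
    for col in columns:
        value = doc.get(col, "")
        if isinstance(value, (dict, list)):
            # Nested values do not fit a flat column; serialise them as JSON text.
            value = json.dumps(value, sort_keys=True)
        # Strip the separator and newlines so the record stays on one line.
        fields.append(str(value).replace(sep, " ").replace("\n", " "))
    return sep.join(fields)


# Hypothetical venue documents, for illustration only.
venues = [
    {"_id": "v1", "name": "Central Park", "categories": ["park", "outdoors"]},
    {"_id": "v2", "name": "JFK", "phone": "555-0100"},
]
columns = detect_columns(venues)
records = [to_delimited(v, columns) for v in venues]
print(columns)
print(records)
```

Missing fields become empty strings, and any downstream consumer has to parse the embedded JSON again, which is one way the flat format stops being a good fit.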
Cons
deployment is not ideal, it's difficult to version workflows.
configuration via XML - lots of boilerplate
we have a scaffolding script to bootstrap a workflow.
Dataset A
1. EXPLAIN $QUERY
Thrift Server
Oozie REST
EBS Snapshots
Scooby
Scoobi
A strongly-typed data flow language written in Scala. Much easier than writing MapReduce, but still very flexible. https://github.com/nicta/scoobi
Scoobi Example
Counting checkins
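The Scoobi code for this slide did not survive extraction. As a stand-in, the following sketch shows the same kind of computation (counting check-ins per venue) in plain Python. It does not use or imitate the Scoobi API, and the record layout and the name `count_checkins` are assumptions.

```python
from collections import Counter

# Hypothetical check-in records as (user_id, venue_id) pairs.
checkins = [("u1", "v1"), ("u2", "v1"), ("u1", "v2"), ("u3", "v1")]


def count_checkins(records):
    """Key each record by its venue, then sum one per record for each key."""
    return Counter(venue for _user, venue in records)


counts = count_checkins(checkins)
print(counts.most_common())  # [('v1', 3), ('v2', 1)]
```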
Venue Data Join
[Diagram: Checkins and Tips records.]
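One way to picture a join of check-ins and tips is to group both record types under a shared key. The sketch below assumes that key is a venue id. The record layouts and the name `join_by_venue` are hypothetical, and this is plain Python, not the job from the talk.

```python
from collections import defaultdict

# Hypothetical records, each keyed by venue id.
checkins = [("v1", "u1"), ("v1", "u2"), ("v2", "u1")]
tips = [("v1", "Try the espresso"), ("v3", "Cash only")]


def join_by_venue(checkins, tips):
    """Group check-ins and tips under their venue id, keeping venues seen on either side."""
    grouped = defaultdict(lambda: {"checkins": [], "tips": []})
    for venue, user in checkins:
        grouped[venue]["checkins"].append(user)
    for venue, text in tips:
        grouped[venue]["tips"].append(text)
    return dict(grouped)


joined = join_by_venue(checkins, tips)
for venue, data in sorted(joined.items()):
    print(venue, data)
```

This behaves like a full outer join: a venue with tips but no check-ins (or the reverse) still appears, with an empty list on the missing side.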
Future Work
HCatalog
Makes Hive tables (including input formats and serdes) available to Pig and MapReduce
Add support for Scoobi
Indexing/Hive-indexing
Relational / MPP database for analytics dashboarding
Key-value store for easily serving hadoop data in prod.
Replacing Flume 0.9.4
Open Source
Let us know what you might find useful
Join us!
foursquare is hiring!
115+ people and growing
foursquare.com/jobs