Data Workflows at Foursquare using Luigi

Foursquare

35 million users
Nearly 4 billion check-ins
More than 5 million check-ins per day
50 million point-of-interest database
100's of GB of log data per day

Tools We Use

Hive
  o Ad hoc analytics, data dumping ground
Raw MapReduce
  o 100's of MapReduce jobs in our codebase
Pig
  o Fits between structured Hive and free-form MapReduce
Vertica
  o Low latency analytics

Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on

Cron - Problems

Brittle
Hard to reason about / visualize
Spend a lot of time waiting
Difficult to tell what succeeded or failed
No one likes writing Bash scripts

Oozie
XML-based Workflow Engine, with support for Hadoop, Hive, and Pig
Workflows specify computations in a DAG, e.g. "Run this Hive query, then run these two MapReduce jobs in parallel"
Coordinators launch recurring workflows at a given frequency, when dependent data is available
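
The DAG idea behind the quoted workflow can be sketched in plain Python. This is not Oozie's API or its XML format; it is only an illustration of dependency ordering, using the standard library's `graphlib` module (Python 3.9+). The step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
# Step names are hypothetical and mirror the quoted example.
workflow = {
    "hive_query": set(),
    "mapreduce_job_a": {"hive_query"},
    "mapreduce_job_b": {"hive_query"},
}


def execution_waves(dag):
    """Group steps into waves; steps within one wave have no unmet
    dependencies on each other and could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished
    return waves


print(execution_waves(workflow))
# [['hive_query'], ['mapreduce_job_a', 'mapreduce_job_b']]
```

The Hive query forms the first wave, and the two MapReduce jobs become runnable together once it is marked done.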

Oozie - Example

Oozie - Problems

Workflows are all-or-nothing
  o Cannot just run step that failed
  o Very little code reuse
Little to no extensibility
Limited control flow
Extremely verbose
Difficult to test
No one likes writing XML

Luigi

Python framework for batch processing jobs
Created by Spotify, open-sourced Sept. 2012
Tasks are units of work that produce Targets
Tasks can depend on one or more other Tasks
A Task is only run if all of its dependent Tasks are done
Tasks are idempotent
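
The last two points can be illustrated with a toy model in plain Python. This is not Luigi's implementation; `ToyTask` and `build` are hypothetical names, and "done" is assumed to mean simply "the output file exists".

```python
import os
import tempfile


class ToyTask:
    """Toy stand-in for a task: it counts as done once its output file exists."""

    def __init__(self, name, workdir, requires=()):
        self.name = name
        self.path = os.path.join(workdir, name + ".out")
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.path)

    def run(self):
        with open(self.path, "w") as f:
            f.write("output of %s\n" % self.name)


def build(task, log):
    """Run dependencies first, then the task, skipping anything already done."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, log)
    task.run()
    log.append(task.name)


workdir = tempfile.mkdtemp()
extract = ToyTask("extract", workdir)
report = ToyTask("report", workdir, requires=[extract])

log = []
build(report, log)  # runs extract, then report
build(report, log)  # nothing left to do, so nothing runs
print(log)          # ['extract', 'report']
```

Because completion is judged by the output alone, asking for the same task twice does no extra work, and deleting one output file causes only that step to run again.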

Luigi - Example Task

Luigi - Running the Task


$ python word-count.py WordCount --date 2013-06-01

Luigi - Scheduler
Central scheduler ensures each Task is only run by a single worker
A Task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01)
Will retry failed Tasks after a configured timeout
Emails someone when a Task fails
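
The scheduling rules above can be sketched as a toy in plain Python. This is not Luigi's scheduler code; `task_id`, `ToyScheduler`, and the 900-second retry delay are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
def task_id(class_name, params):
    """Build an identifier such as 'WordCount(date=2013-06-01)'."""
    args = ", ".join("%s=%s" % (k, v) for k, v in sorted(params.items()))
    return "%s(%s)" % (class_name, args)


class ToyScheduler:
    """Hands each task to at most one worker and delays retries after failure."""

    def __init__(self, retry_delay_seconds):
        self.retry_delay_seconds = retry_delay_seconds
        self._state = {}  # task id -> (status, time of last status change)

    def request_work(self, tid, now):
        """Return True if the calling worker may run the task."""
        status, since = self._state.get(tid, ("PENDING", now))
        if status in ("RUNNING", "DONE"):
            return False  # another worker has it, or it already finished
        if status == "FAILED" and now - since < self.retry_delay_seconds:
            return False  # failed recently; wait before retrying
        self._state[tid] = ("RUNNING", now)
        return True

    def report(self, tid, succeeded, now):
        self._state[tid] = ("DONE" if succeeded else "FAILED", now)


scheduler = ToyScheduler(retry_delay_seconds=900)
tid = task_id("WordCount", {"date": "2013-06-01"})
print(tid)                                    # WordCount(date=2013-06-01)
print(scheduler.request_work(tid, now=0))     # True: first worker gets the task
print(scheduler.request_work(tid, now=1))     # False: already running elsewhere
scheduler.report(tid, succeeded=False, now=10)
print(scheduler.request_work(tid, now=100))   # False: retry delay not yet over
print(scheduler.request_work(tid, now=1000))  # True: eligible for retry
```

Keying the state on class name plus parameters is what lets two workers that ask for the same task on the same date be recognized as asking for the same unit of work.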

Luigi - Visualizer

Luigi - Advantages over Cron

Explicit dependencies
No wasted time waiting
Easy to tell what has failed
Avoid duplicate work / partial failures

Luigi - Advantages over Oozie

Explicit dependencies between workflows
Easier to write
Vastly more extensible
Code reuse
Can easily re-run individual steps

Thank you!
Check out Luigi: https://github.com/spotify/luigi
Drop me a line: Joe Ennever, jennever@foursquare.com
