Foursquare
35 million users
Nearly 4 billion check-ins
More than 5 million check-ins per day
Tools We Use
Hive
Raw MapReduce
Pig
Vertica
Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on
Cron - Problems
Oozie
XML-based workflow engine with support for Hadoop, Hive, and Pig
Workflows specify computations in a DAG, e.g. "Run this Hive query, then run these two MapReduce jobs in parallel"
Coordinators launch recurring workflows at a given frequency, once the data they depend on is available
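The DAG idea in the quoted example can be sketched in a few lines of Python. This is not Oozie's XML format or API; it only illustrates how a dependency graph decides which steps may run in parallel. The step names (`hive_query`, `mapreduce_a`, `mapreduce_b`) and the helper `execution_waves` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps it depends on.
workflow = {
    "hive_query": set(),
    "mapreduce_a": {"hive_query"},
    "mapreduce_b": {"hive_query"},
}


def execution_waves(dag):
    """Return a list of waves; steps in the same wave can run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every step whose dependencies are all finished is ready now.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(workflow)
```

Here the Hive query forms the first wave on its own, and the two MapReduce jobs share the second wave because neither depends on the other.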
Oozie - Example
Oozie - Problems
Cannot re-run just the step that failed
Very little code reuse
Little to no extensibility
Limited control flow
Extremely verbose
Difficult to test
No one likes writing XML
Luigi
Python framework for batch processing jobs
Created by Spotify, open-sourced Sept. 2012
Tasks are units of work that produce Targets
Tasks can depend on one or more other Tasks
A Task runs only once all of the Tasks it depends on are done
Tasks are idempotent
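The Task and Target model can be illustrated with a toy, standard-library-only sketch. It borrows the method names `requires`, `output`, and `run` from Luigi, but it is a simplified stand-in and not Luigi's implementation. The classes `FileTarget`, `ExtractCheckins`, `CountCheckins`, the `build` helper, and the sample data are all assumptions made for illustration.

```python
import os
import tempfile


class FileTarget:
    """Toy Target: counts as done when its file exists."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Toy Task: a unit of work that produces one Target."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return self.output().exists()


def build(task, log):
    """Run dependencies first; skip any Task whose Target already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


WORKDIR = tempfile.mkdtemp()


class ExtractCheckins(Task):
    def output(self):
        return FileTarget(os.path.join(WORKDIR, "checkins.txt"))

    def run(self):
        with open(self.output().path, "w") as f:
            f.write("alice\nbob\nalice\n")  # made-up sample data


class CountCheckins(Task):
    def requires(self):
        return [ExtractCheckins()]

    def output(self):
        return FileTarget(os.path.join(WORKDIR, "count.txt"))

    def run(self):
        with open(self.requires()[0].output().path) as f:
            count = sum(1 for _ in f)
        with open(self.output().path, "w") as f:
            f.write(str(count))


first_run = []
build(CountCheckins(), first_run)   # runs both Tasks, dependency first

second_run = []
build(CountCheckins(), second_run)  # nothing to do: Targets already exist

os.remove(CountCheckins().output().path)
third_run = []
build(CountCheckins(), third_run)   # only the step with a missing Target re-runs
```

Because completion is decided by whether the Target exists, building the same Task twice does no extra work, and deleting one output re-runs only that step.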
Luigi - Scheduler
Central scheduler ensures each Task is only run by a single worker
A Task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01)
Will retry failed Tasks after a configured timeout
Emails someone when a Task fails
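The identity rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not Luigi's scheduler code: `task_id`, `ToyScheduler`, and the worker names are hypothetical, and retries and email notification are left out.

```python
import datetime


def task_id(task_class_name, params):
    """Build an identifier from the class name and its sorted parameters."""
    rendered = ", ".join(f"{key}={value}" for key, value in sorted(params.items()))
    return f"{task_class_name}({rendered})"


class ToyScheduler:
    """Toy central scheduler: each task id is owned by at most one worker."""

    def __init__(self):
        self.owners = {}

    def claim(self, tid, worker):
        """Grant the task to the first worker that asks; refuse the others."""
        owner = self.owners.setdefault(tid, worker)
        return owner == worker


scheduler = ToyScheduler()
tid = task_id("WordCount", {"date": datetime.date(2013, 6, 1)})

granted_a = scheduler.claim(tid, "worker-a")  # first claim succeeds
granted_b = scheduler.claim(tid, "worker-b")  # same task id, so it is refused
```

Two workers that construct the same class with the same parameters produce the same identifier, which is what lets a central scheduler hand the work to only one of them.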
Luigi - Visualizer
Explicit dependencies
No wasted time waiting
Easy to tell what has failed
Code reuse
Can easily re-run individual steps
Thank you!
Check out Luigi: https://github.com/spotify/luigi
Drop me a line: Joe Ennever, jennever@foursquare.com