Practical Medium Data Analytics with Python
10 Things I Hate About pandas
PyData NYC 2013
Wes McKinney

@wesmckinn
Former quant and MIT math dude
Creator of the pandas project for Python
Author of Python for Data Analysis (O'Reilly)

Founder and CEO of DataPad

www.datapad.io
> 20k copies since Oct 2012
Bringing many new people
to Python and data analysis
with code

http://datapad.io
Founded in 2013, located in SF

In private beta, join us!

Hiring for engineering


Why hate on pandas?
pandas rocks!
So, pandas

Easy-to-use, fast in-memory data wrangling and analytics library

Enabled loads of complex data work to be done by mere mortals in Python

Might have kept R from taking over the world (hehe)

pandas, the project

170 distinct contributors

Over 5400 issues and pull requests on GitHub

Upcoming 0.13 release

But.

pandas's broad applicability also a liability

Only game in town for a lot of things

pandas being used in some unplanned ways

Some things to love

No more structured dtype drudgery!


Easy IO!
Data alignment!
Hierarchical indexing!
Time series analytics!
More things to love

Table reshaping
Missing data handling
pandas.merge, pandas.concat

Expressive groupby machinery

Some pandas use cases
General data wrangling
ETL jobs
Business analytics (incl. BI uses)

Time series analysis, statistical modeling

pandas does many things
that are tedious, slow, or
difficult to do correctly
without it
Unfortunately, pandas is
not a database
#1 Slightly too far from
the metal
DataFrame's internal structure intended to make row-oriented ops fast on numerical data

Python objects can be used as data, indices (a feature, not a bug)

#2 No support (yet) for
memory maps
Many analytics ops require a small portion of the data

Many ways to materialize the full data set in memory by accident

Axis indexes wouldn't necessarily make sense on out-of-core data sets
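To illustrate why memory maps matter when an operation touches only a small portion of the data, here is a minimal sketch using `numpy.memmap`. This is NumPy's memory-mapping facility, not a pandas feature, and the file name, array size, and slice bounds are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Write one million float64 values to a scratch file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "values.bin")
np.arange(1_000_000, dtype=np.float64).tofile(path)

# Map the file instead of loading it; pages are read lazily on access.
mapped = np.memmap(path, dtype=np.float64, mode="r", shape=(1_000_000,))

# Only the pages backing this 1,000-element window need to be read.
window_mean = float(mapped[500_000:501_000].mean())

# By contrast, np.array(mapped) would materialize the full data set in memory.
```

The last comment is the "by accident" case from the slide: a single conversion or copy anywhere in a pipeline is enough to pull everything into RAM.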

N.B. HDF5/PyTables support is a
partial solution

#3 No tight database
integration
Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system

Inadequacy of pandas/NumPy data type systems

Jobs with heavy SQL-reading are
slow and use tons of memory

TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays

#4 Best-efforts NA
representation
Inconsistent representation of missing data

No Boolean or Integer NA values

NA needs to be a first-class citizen in analytics operations
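The missing Boolean and Integer NA values can be seen with the default `pandas.Series` constructor. The following is a small sketch with made-up values; it shows the default dtype inference only.

```python
import pandas as pd

# Integers with no missing values keep an integer dtype.
ints = pd.Series([1, 2, 3])

# Introducing a missing value silently upcasts the column to float64,
# because NaN is a floating-point value.
ints_with_na = pd.Series([1, 2, None])

# Booleans with a missing value fall back to generic Python objects.
bools_with_na = pd.Series([True, False, None])

print(ints.dtype, ints_with_na.dtype, bools_with_na.dtype)
```

The same logical column can therefore change its physical type depending on whether any value happens to be missing, which is the inconsistency the slide refers to.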

#5 RAM management

Difficult to understand footprint of a pandas object

Ample data copying throughout library

Would benefit from being able to compress data in-memory or shuttle data temporarily to disk

#6 Weak support for
categorical data
Makes pandas not quite a fully-fledged R replacement

GroupBy and Joins slower than they could be
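One way to see why categorical data could speed up GroupBy: encode the string labels as integer codes once, then aggregate over the codes. The sketch below uses `pandas.factorize` and `numpy.bincount` on made-up data. It illustrates the general idea of dictionary encoding and is not a description of how pandas implements GroupBy internally.

```python
import numpy as np
import pandas as pd

# Hypothetical low-cardinality string column with matching amounts.
states = pd.Series(["NY", "CA", "NY", "TX", "CA", "NY"])
amounts = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Encode each distinct label as a small integer, in order of first appearance.
codes, labels = pd.factorize(states)

# Grouped sum over integer codes: no string hashing or comparison is needed
# once the encoding exists.
totals = np.bincount(codes, weights=amounts, minlength=len(labels))

result = dict(zip(labels, totals))
```

With a real categorical type the encoding step would be paid once, when the column is created, rather than on every GroupBy or join.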

#7 Complex GroupBy
operations get messy
Must write custom functions to pass to .apply(..)

Easy to run up against DRY problems and general Python syntax limitations
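As a small illustration of the `.apply(..)` pattern described above, here is a sketch with made-up data. The helper `share_of_large` and its `threshold` parameter are hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up contribution records.
df = pd.DataFrame({
    "cand": ["A", "A", "A", "B", "B"],
    "amt": [100.0, 250.0, 50.0, 75.0, 500.0],
})


def share_of_large(amounts, threshold=200.0):
    """Fraction of a group's total that comes from large contributions."""
    return amounts[amounts >= threshold].sum() / amounts.sum()


# No single built-in aggregation expresses this ratio, so a custom
# function is passed to .apply() and called once per group.
large_share = df.groupby("cand")["amt"].apply(share_of_large)
```

Each variation on the ratio (a different threshold, a different column) tends to need another small function like this one, which is where the DRY problems come from.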

#8 Appending data slow
and tedious
DataFrame not intended as a database table

Makes streaming data use a challenge

B+ tree tables interesting?
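The cost of appending can be sketched with `pandas.concat`. The record layout and counts below are made up. The second half shows a common workaround (buffering rows in a plain list), not something the slides prescribe.

```python
import pandas as pd

# Hypothetical stream of incoming records.
incoming = [{"ts": i, "value": i * 0.5} for i in range(1000)]

# Slow pattern: every concat allocates a new frame and copies all prior rows.
slow = pd.DataFrame(incoming[:1])
for record in incoming[1:50]:
    slow = pd.concat([slow, pd.DataFrame([record])], ignore_index=True)

# Common workaround: buffer rows in a plain list, then build the frame once.
buffer = []
for record in incoming:
    buffer.append(record)
fast = pd.DataFrame(buffer)
```

The workaround only helps when the data can be batched; a consumer that needs an up-to-date frame after every record still pays the copy each time.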


#9 Limited type system,
column metadata
Currencies, units
Time zones
Geographic data

Composite data types

#10 No true query
processing layer
Filter:      WHERE, HAVING
Group:       GROUP BY
Join:        JOIN
Aggregate:   SUM, MEAN, ...
Limit/TopK:  LIMIT
Sorting:     ORDER BY
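For comparison, here is a rough sketch of how each of the operations listed above is typically written in pandas. The tables and thresholds are made up. The point is that each step is a separate, eagerly evaluated call, with nothing planning the pipeline as a whole.

```python
import pandas as pd

# Made-up tables for illustration.
contrib = pd.DataFrame({
    "cand_id": [1, 1, 2, 2, 3],
    "amt": [100.0, 300.0, 50.0, 25.0, 500.0],
})
cands = pd.DataFrame({"cand_id": [1, 2, 3], "name": ["A", "B", "C"]})

filtered = contrib[contrib["amt"] >= 50.0]       # WHERE amt >= 50
joined = filtered.merge(cands, on="cand_id")     # JOIN ... USING (cand_id)
totals = joined.groupby("name")["amt"].sum()     # GROUP BY name, SUM(amt)
big = totals[totals > 100.0]                     # HAVING SUM(amt) > 100
top = big.sort_values(ascending=False).head(1)   # ORDER BY ... DESC LIMIT 1
```

Every intermediate (`filtered`, `joined`, `totals`, `big`) is fully materialized before the next step runs, whereas a query processor could, for example, push the filter and the limit into a single pass.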
#11 Slow: no multicore /
distributed algos
Hampered by use of Python data structures / GIL interactions

Object internals not designed for concurrent use

Oh no, what do we do?
Stop believing in the one
tool to rule them all
Real Artists Ship
- Steve Jobs
Focus on results

I am heavily biased by focus on business analytics/BI use cases

Need production-ready software to ship in relatively short time frame

A new project

In internal development at DataPad

Code-named "badger"

pandas-ish syntax: designed for data processing and analytical queries

Badger in a nutshell
Consistent data type system

Compressed columnar binary storage

High perf analytical query processor


Data preparation/cleaning tools

Time series analytics

Immutable array data, little copying

Analytics kernels: written in C with no dependencies

Caching of useful intermediates

Some benchmarks
Data set: 2012 Election data (FEC)
5.3 million records, 7 columns
Tools
pandas
badger
R: data.table
SQL: PostgreSQL, SQLite
Query 1
Total contributions by candidate
SELECT cand_nm,
sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm
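To make the comparison concrete, the sketch below runs the SQL above against an in-memory SQLite database and then expresses the same aggregation with pandas. The rows and candidate labels are made up; only the table and column names come from the query. It says nothing about the timings, which depend on the real 5.3 million row data set.

```python
import sqlite3

import pandas as pd

# Tiny made-up stand-in for the FEC table; column names follow the SQL above.
rows = [
    ("Candidate A", 50.0),
    ("Candidate B", 100.0),
    ("Candidate A", 25.0),
]

# Run the slide's SQL against an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fec (cand_nm TEXT, contb_receipt_amt REAL)")
con.executemany("INSERT INTO fec VALUES (?, ?)", rows)
sql_total = dict(con.execute(
    "SELECT cand_nm, sum(contb_receipt_amt) AS total "
    "FROM fec GROUP BY cand_nm"
))
con.close()

# The same aggregation expressed with pandas.
fec = pd.DataFrame(rows, columns=["cand_nm", "contb_receipt_amt"])
pandas_total = fec.groupby("cand_nm")["contb_receipt_amt"].sum().to_dict()
```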

Query 1
Total contributions by candidate
badger (in-memory) : 19ms (1x)
badger (from-disk) : 131ms (6.9x)
pandas (in-memory) : 273ms (14.3x)
R data.table 1.8.10: 382ms (20x)
PostgreSQL : 4.7s (247x)
SQLite : 72s (3800x)

Query 2
Total contributions by candidate
and state
SELECT cand_nm, contbr_st,
sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm, contbr_st
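A possible pandas equivalent of this two-key query, sketched on made-up rows; the column names are taken from the SQL above and everything else is illustrative.

```python
import pandas as pd

# Made-up rows; column names follow the SQL above.
fec = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate A", "Candidate B", "Candidate A"],
    "contbr_st": ["NY", "CA", "NY", "NY"],
    "contb_receipt_amt": [50.0, 20.0, 100.0, 25.0],
})

# GROUP BY cand_nm, contbr_st; as_index=False keeps the keys as columns,
# which matches the shape of the SQL result.
total = (
    fec.groupby(["cand_nm", "contbr_st"], as_index=False)["contb_receipt_amt"]
    .sum()
    .rename(columns={"contb_receipt_amt": "total"})
)
```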

Query 2
Total contributions by candidate and
state
badger (in-memory) : 269ms (1x)
badger (from-disk) : 391ms (1.5x)
R data.table 1.8.10: 500ms (1.8x)
pandas (in-memory) : 770ms (2.9x)
PostgreSQL : 5.96s (23x)

Query 3
Total contributions by candidate
and state with 2 filter predicates
SELECT cand_nm,
sum(contb_receipt_amt) as total
FROM fec
WHERE contb_receipt_dt BETWEEN
'2012-05-01' and '2012-11-05'
AND contb_receipt_amt BETWEEN
0 and 2500
GROUP BY cand_nm
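The two filter predicates can be sketched in pandas with `Series.between`, which, like SQL `BETWEEN`, includes both endpoints by default. The rows below are made up and chosen to sit on or just outside the boundaries; the column names come from the SQL above.

```python
import pandas as pd

# Made-up rows; column names follow the SQL above.
fec = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate A", "Candidate B", "Candidate B"],
    "contb_receipt_dt": pd.to_datetime(
        ["2012-04-30", "2012-05-01", "2012-11-05", "2012-06-15"]
    ),
    "contb_receipt_amt": [100.0, 200.0, 2500.0, 3000.0],
})

# Both predicates are inclusive at each end, matching SQL BETWEEN.
in_window = fec["contb_receipt_dt"].between(
    pd.Timestamp("2012-05-01"), pd.Timestamp("2012-11-05")
)
in_range = fec["contb_receipt_amt"].between(0, 2500)

# Apply both filters, then aggregate per candidate.
total = fec[in_window & in_range].groupby("cand_nm")["contb_receipt_amt"].sum()
```

Note that the boolean masks and the filtered frame are all materialized before the GroupBy runs.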
Query 3
Total contributions by candidate
and state with 2 filter predicates

badger (in-memory) : 96ms (1x)
badger (from-disk) : 275ms (2.9x)
pandas (in-memory) : 946ms (9.8x)
PostgreSQL : 6.2s (65x)

Badger, the future
Distributed in-memory analytics
Multicore algorithms
ETL job-building tools
Open source in some form someday
Looking for algorithms hackers to help

Thank you!
