
PyPy benchmarks in Principal Component Analysis (PCA)

(DRAFT)

Progress is always about changes

Valery A. Khamenya
May 19, 2012

This report is a quick attempt to apply Principal Component Analysis (PCA) to
benchmarking measurements of PyPy, the fastest (as of 2012) implementation of the
quite nice language Python. The target audience is anyone who loves PyPy and
perhaps hates statistics ;)
The factors that influence the variance in the benchmarking data are (the more powerful
ones tend to go first):
1. progress over time (huge!)
2. 32-bit vs 64-bit
3. the battle between sympy_expand/sympy_str and twisted_iteration
4. the opposition of ai vs the spitfire-related tests
5. crypto_pyaes vs the rest
6. spectral-norm vs the rest
7. meteor-contest vs the rest
8. fannkuch vs the rest
9. nbody_modified vs the rest

1 What is it good for?


In short, PCA helps to reveal the major factors causing variation in the data. This way one could
do the following:
1. guess what has been going on in PyPy over the last year from a benchmarking perspective;
2. figure out the biggest hidden games behind the PyPy efforts: what is primarily addressed, what the priorities are;
3. find benchmarking tests that are too similar to each other and probably just add
redundancy;
4. estimate how much a test increases the representativeness/coverage of the benchmarking suite
relative to the others;
5. decrease redundancy in the set of benchmarking tests that are used to state the final
speed-up of PyPy over CPython, and therefore let people outside the PyPy world rely
more on the PyPy speed-up factor at http://speed.pypy.org;
6. see what influence a particular source control revision had on the factors;
7. guess what kind of operations a new test mostly stresses (a group of simplistic
benchmarking unit tests would be needed for this, though: float arithmetic, integer
arithmetic, strings, dictionaries, loops, recursion, flow control, exception handling, OOP,
garbage collection, multiprocessing, multithreading, I/O, arrays, etc.)
One should clearly understand, however, that all conclusions about a single test are made based
on the analysis of the variation of its benchmarking measurements, i.e. on changes across multiple measurements. In that sense PCA is like a snake that could miss cold, motionless prey. And of course
this snake is only effective if one has a vector of measurements, e.g. multiple measurements for
a single benchmarking test across different git revisions or, alternatively, multiple measurements
from different benchmarking tests for one given git revision.

2 Details to skip during 1st reading


The measurements we analyze represent PyPy progress from Jan 2010 (the svn epoch) to May 2012
(the git epoch). First of all, let's read the data into a matrix. To keep things simple, let's respect
and consider only those benchmarking tests that have at least 300 measurements. Then let's
heartlessly kick out those benchmarking rounds where even one of these respected benchmark tests
failed to produce a time measurement.
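The actual loading code is not reproduced in this extract. A minimal sketch, assuming (hypothetically) a long-format CSV with columns revision, benchmark and time; the file name and column names are assumptions, not the report's actual input:
> raw <- read.csv("pypy-benchmarks.csv", stringsAsFactors = FALSE)  # assumed file and layout
> keep <- names(which(table(raw$benchmark) >= 300))                 # tests with at least 300 measurements
> raw <- raw[raw$benchmark %in% keep, ]
> m <- tapply(raw$time, list(raw$revision, raw$benchmark), mean)    # rounds x tests matrix
> m <- m[complete.cases(m), ]                                       # drop rounds where any respected test is missing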
Oops, the svn epoch seems to have been kicked out entirely, but we won't worry about that for the moment. Each column of the
benchmarking data matrix represents measurements for one benchmarking test. Here is, e.g., the
top-left part of the matrix:
> m[1:5, 1:4]
                             ai bm_chameleon   bm_mako     chaos
48277:39882f1dfd15-64 0.7149890    0.1392867 0.2094372 0.4828594
48277:39882f1dfd15    0.7056351    0.1474193 0.2078679 0.4821176
48354:10f7167b3e98-64 0.7221774    0.1460866 0.2096604 0.4682653
48354:10f7167b3e98    0.7279082    0.1362366 0.1952811 0.4751145
48400:adab424acda7-64 0.7506032    0.1451451 0.2010512 0.4759733

The fun thing about PCA is that we could analyse the data matrix and then ... transpose it and get
an alternative point of view. We'll discuss that later.
Currently we have 31 benchmarking tests and 332 benchmarking rounds. In terms of PCA this means
we could dream of up to 31 factors that we might manage to discover.
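These counts are simply the dimensions of the matrix:
> dim(m)    # 332 benchmarking rounds x 31 benchmarking tests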

3 Just to feed your interest before we go...


Statistical analysis is often underestimated. Just to feed the interest of those who are too far
away from all the PCA, ICA and whatever-else approaches, let's apply PCA to the data matrix with
benchmarking measurements as-is.
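The exact invocation is not shown in this extract; a minimal sketch of such an as-is PCA and of the PC1 x PC2 scatter shown in the next subsection (the 32/64 labels are assumed to come from the "-64" suffix on the row names):
> p <- prcomp(m)
> plot(p$x[, 1], p$x[, 2], type = "n", xlab = "PC1", ylab = "PC2", main = "PC1 x PC2")
> text(p$x[, 1], p$x[, 2], labels = ifelse(grepl("-64$", rownames(m)), "64", "32"))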

3.1 32 vs 64: an easy Factor Nr. 2


And here are the two strongest factors that influence the variation in the benchmarking measurements.

[Figure: PC1 x PC2. Each point is a benchmarking round, labeled "32" or "64" according to its target platform.]

Each point represents a round of benchmarking measurements for the corresponding PyPy
source control revision and CPU platform. By the way, factors in PCA are often called Principal
Components (PC), hence PC1, PC2, etc.
To give an idea about the second factor (Y-axis), the benchmarking rounds are marked with
64 for the 64-bit target platform and, similarly, with 32 for the 32-bit one. Of course, PCA is not
able to interpret factors completely or even magically name them right ;) However, it does help
us to sort them by the influence they have on the variance of the benchmarking measurements.

The 32-vs-64 split was just an easy puzzle about the second-strongest factor in the measurement variance.
How strong is it? The following graph shows the contribution of each factor to the squared variance of the
data.
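The plotting call itself is not reproduced in this extract; the standard scree plot of a prcomp object would look something like this, and the zoomed variant below simply limits the number of components (e.g. npcs = 7):
> plot(p, main = "Ordered factors, major factors are well-separated")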

[Figure: Variances. Ordered factors; the major factors are well separated.]

An important thing to know about this graph: the larger the difference between the factors,
the better the algorithmic separation of the factors (and the more reliable the interpretation).
Zoom in on the first 7:

[Figure: Variances (zoom). Ordered factors; the major factors are well separated.]

3.2 No 64-bit anymore, and Factor Nr. 1


OK, let's kick out the 64-bit data for the simplicity of the next steps. The graphs will be sparser and it
will be easier to find interesting things during our first approach to the data.
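The vector bitSuffixes used in the next line is not defined in this extract; one plausible sketch, assuming the "-64" suffix on the revision row names marks the 64-bit rounds:
> bitSuffixes <- ifelse(grepl("-64$", rownames(m)), "64", "32")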
> m <- m[bitSuffixes == "32", ]
Thus, don't get confused: PC2 will no longer refer to the 32-vs-64 story!

[Figure: PC1 x PC2 (no 32/64 anymore). Each point is a benchmarking round labeled with its source control revision.]

What is special about this stand-alone cluster of git revisions, in which, e.g., 48761:98bf21b80fc5
or, say, 48354:10f7167b3e98 are the most extreme points? The answer explains benchmarking measurement variance Factor Nr. 1 in terms of git revisions.
It simply looks like there was a considerable qualitative speed-up after the revisions in the range
around 48277-49500, i.e. after Nov/Dec 2011. Well, the first two factors were not that
interesting, but for those who never saw PCA before it was probably fun.

4 No old data
Let's kick out the old data and focus on the recent changes after git revision 49600.
> isRecent <- sapply(strsplit(rownames(m), ":"), function(x) as.numeric(x[1])) > 49600
> m <- m[isRecent, ]
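Presumably the PCA is recomputed on the reduced matrix after each such filtering step, e.g.:
> p <- prcomp(m)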
Do we have a major factor well-separated from the secondary one?

[Figure: Variances. No 64-bit data, no old data.]

[Figure: PC1 x PC2 (no 32/64 anymore, no old data). Each point is a benchmarking round labeled with its source control revision.]

PC1 (the X-axis) is rather about progress over time, but what about the Y-axis? What are its poles,
52382:00b830d7bd6a and 54509:b58494d41466?

5 Flip-flop!
As mentioned, we could transpose the data matrix to see things from a different point of view.
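The transposition step itself is not shown in this extract; a minimal sketch (the object name tm is taken from the code used later in this section):
> tm <- t(m)
> p <- prcomp(tm)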

[Figure: PC1 x PC2, point is a benchmarking test.]

What is special about json_bench or html5lib? Nothing much interesting: they always show a
higher avg_changed than the others. Let's normalize the avg_changed range for each test.
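The normalization code is not reproduced here; one plausible reading (an assumption, not necessarily the report's actual approach) is min-max scaling of each test's measurements:
> tm <- t(apply(tm, 1, function(x) (x - min(x)) / (max(x) - min(x))))  # scale each test (row) to [0, 1]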

[Figure: Variances. The first factors are well separated, hurray! But the others... :( ]

[Figure: PC1 x PC2, point is a benchmarking test (normalized).]

So the main battle of the last PyPy year seems to be between sympy_expand/sympy_str
and twisted_iteration.
The second big opposition is ai vs the spitfire-related tests.
Well, the problem is that these first two battles are the only well-recognizable factors.
OK, let's kick out these measurements to hear the rest of the chorus:
> bigSolo <- c("ai", "sympy_expand", "sympy_str", "twisted_iteration",
+     "slowspitfire", "spitfire_cstringio", "spitfire")
> toTake <- sapply(rownames(tm), function(r) !(r %in% bigSolo))
> tm <- tm[toTake, ]
Are the top factors separable now?

[Figure: Variances. No big solo; the poles changed a bit, but not much.]

[Figure: PC1 x PC2, no big solo, point is a benchmarking test (normalized).]

Let's kick out 3 more tests:


> backSolo <- c("spectral-norm", "crypto_pyaes", "meteor-contest")
> toTake <- sapply(rownames(tm), function(r) !(r %in% backSolo))
> tm <- tm[toTake, ]
Are the top factors separable now?

[Figure: Variances. No back solo either; the poles changed a bit, but not much.]

[Figure: PC1 x PC2, no back solo either, point is a benchmarking test (normalized).]

Let's kick out 2 more tests:


> backSolo2 <- c("fannkuch", "nbody_modified")
> toTake <- sapply(rownames(tm), function(r) !(r %in% backSolo2))
> tm <- tm[toTake, ]
Are the top factors separable now?

[Figure: Variances. No back solo 2 either; the poles changed a bit.]

[Figure: PC1 x PC2, no back solo 2 either, point is a benchmarking test (normalized).]

6 Appendix
An example of non-recognizable PCA factors, obtained from a 100 x 31 matrix of normally distributed
random data.
> p <- prcomp(matrix(rnorm(31 * 100), ncol = 31))
> plot(p, npcs = 31, main = "really a bad case, factors can't be separated")

[Figure: Variances. Really a bad case; the factors can't be separated.]

This report is generated using LaTeX, Sweave and R.

