METRICS EVERYWHERE
Saturday, April 9, 2011

Make better decisions by using numbers.


Coda Hale
@coda
github.com/codahale

The enterprise social network.
www.yammer.com

I write code.

But that’s not actually my job.

code → business value

What the hell is business value?


A new feature.

An improved existing feature.

Fewer bugs.

Not pissing our users off with a slow site.
ugly
pretty

Making future changes easier.

Adding a unit test before fixing that bug.

Business value is anything which makes people more likely to give us money.


We want to generate more business value.

We need to make better decisions about our code.

Our code generates business value when it runs, not when we write it.

We need to know what our code does when it runs.

We can’t do this unless we measure it.

Why measure it?


map ≠ territory

map of San Francisco ≠ city of San Francisco

the way we talk ≠ the way it is

the thing we think of ≠ the thing in itself

perception ≠ reality

MIND THE GAP


We have a mental model of what our code does.

It’s a mental model. It’s not the code.

It is often wrong.

Confusion.

“This code can’t possibly work.”

(It works.)

MIND THE GAP

“This code can’t possibly fail.”

(It fails.)

MIND THE GAP


Which is faster?
items.sort_by { |i| i.name }
items.sort { |a, b| a.name <=> b.name }

We don’t know.

def sort_by(&blk)
  sleep(100) # FIXME: I AM POISON
  super(&blk)
end

def sort(&blk)
  # TODO: make not explode
  raise Exception.new("Haw haw!")
end


We can’t know until
we measure it.
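
One way to settle the question is to time both calls on the same data. The following is a minimal sketch using Ruby's standard `Benchmark` module; the `Item` struct and the list size are assumptions made for illustration, and the timings will differ from machine to machine.

```ruby
require "benchmark"

# Hypothetical record type standing in for the slide's `items`.
Item = Struct.new(:name)

# Shuffle with a fixed seed so neither sort receives pre-sorted input.
items = Array.new(10_000) { |i| Item.new("city-#{i}") }.shuffle(random: Random.new(42))

# Time each approach on its own copy of the data.
by_key   = Benchmark.realtime { items.dup.sort_by { |i| i.name } }
by_block = Benchmark.realtime { items.dup.sort { |a, b| a.name <=> b.name } }

puts format("sort_by: %.2f ms", by_key * 1000)
puts format("sort:    %.2f ms", by_block * 1000)
```

A single run like this only describes one input on one machine, which is the deck's point: the answer comes from measuring, not from reading the code.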

This affects how we make decisions.

“Our application is slow. This page takes 500ms. Fix it.”

Find the bottleneck!
SQL Query
Template Rendering
Session Storage

We don’t know.

Find The Bottleneck 2.0!
SQL Query: 53ms
Template Rendering: 1ms
Session Storage: 315ms


Confusion.

We made a better decision.

We improve our mental model by measuring what our code does.

map ≠ territory

map → territory

We use our mental model to decide what to do.

A better mental model makes us better at deciding what to do.

A better mental model makes us better at generating business value.

Measuring makes your decisions better.


But only if we’re measuring the right thing.

We need to measure our code where it matters.

In the wild.

Generating business value.

PRODUCTION

Continuously measuring code in production.


Metrics
Java/Scala
github.com/codahale/metrics

Gauges
Counters
Meters
Histograms
Timers

Each metric is associated with a class and has a name.

An autocomplete service for city names.
> GET /complete?q=San%20Fra
< HTTP/1.1 200 RAD
<
< ["San Francisco"]

What does this code do that affects its business value?

And how can we measure that?

Gauge
The instantaneous value of something.


# of cities

metrics.gauge("cities") { cities.size }


“The service has 589
cities registered.”
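
For readers off the JVM, a gauge is small enough to sketch by hand. The Ruby below is an illustrative assumption, not the Metrics library's implementation: the `Gauge` class and the sample `cities` list are made up for the example.

```ruby
# A gauge wraps a block and evaluates it every time the value is read.
class Gauge
  def initialize(&block)
    raise ArgumentError, "a gauge needs a block" unless block
    @block = block
  end

  # Re-run the block so the reading reflects the current state.
  def value
    @block.call
  end
end

cities = ["San Francisco", "San Jose"]
cities_gauge = Gauge.new { cities.size }

first_reading = cities_gauge.value    # 2
cities << "San Diego"
second_reading = cities_gauge.value   # 3, with no explicit update call
```

Because the block is evaluated lazily, the gauge costs nothing until something actually reads it.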

Counter
An incrementing and decrementing value.

# of open connections

val counter = metrics.counter("connections")
counter.inc()
counter.dec()


“There are 594 active
sessions on that server.”
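
A counter can be sketched in a few lines of Ruby as well. The `Counter` class below is a hypothetical stand-in rather than the library's code; it guards the value with a `Mutex` on the assumption that connections are opened and closed from several threads.

```ruby
# A thread-safe value that can be incremented and decremented.
class Counter
  def initialize
    @count = 0
    @lock = Mutex.new
  end

  def inc(n = 1)
    @lock.synchronize { @count += n }
  end

  def dec(n = 1)
    @lock.synchronize { @count -= n }
  end

  def count
    @lock.synchronize { @count }
  end
end

connections = Counter.new
threads = 4.times.map do
  Thread.new do
    100.times { connections.inc }   # a connection was opened
    40.times  { connections.dec }   # a connection was closed
  end
end
threads.each(&:join)

open_connections = connections.count   # 4 * (100 - 40) = 240
```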

Meter
The average rate of events over a period of time.

# of requests/sec

val meter = metrics.meter("requests", SECONDS)
meter.mark()

$$\text{mean rate} = \frac{\text{\# of events}}{\text{elapsed time}}$$


[Plot: # of requests over time]

MIND THE GAP

Recency.

$$\text{mean rate} = \frac{\text{\# of events}}{\text{elapsed time}}$$


COGNITIVE HAZARD

Exponentially weighted moving average.

$$m_t = (1-\alpha)^k \, m_{t-1} + \left(1 - (1-\alpha)^k\right) Y_t$$

[Plot: # of requests over time]

1-minute rate
5-minute rate
15-minute rate


“We went from 3,000
requests/sec to
<500 a second.”
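
To make the difference between the lifetime mean rate and a recency-weighted rate concrete, here is a minimal Ruby sketch. It assumes the average is updated once per fixed tick, so each update is m ← m + α(Y − m) with α = 1 − e^(−interval/window). The `EwmaRate` class, the 5-second tick, and the traffic pattern are illustrative assumptions, not details taken from the library.

```ruby
# Exponentially weighted moving average of an event rate.
class EwmaRate
  attr_reader :rate   # events per second; nil until the first tick

  # window:   averaging window in seconds (60 for a 1-minute rate)
  # interval: seconds between ticks
  def initialize(window:, interval:)
    @interval = interval.to_f
    @alpha = 1.0 - Math.exp(-@interval / window)
    @uncounted = 0
    @rate = nil
  end

  def mark(n = 1)
    @uncounted += n
  end

  # Called once per interval: fold that interval's rate into the average.
  def tick
    instant = @uncounted / @interval
    @uncounted = 0
    @rate = @rate.nil? ? instant : @rate + @alpha * (instant - @rate)
  end
end

one_minute = EwmaRate.new(window: 60, interval: 5)
total = 0

# Ten minutes at 3,000 req/sec, then one minute at 500 req/sec.
traffic = Array.new(120, 3_000) + Array.new(12, 500)
traffic.each do |per_sec|
  one_minute.mark(per_sec * 5)
  total += per_sec * 5
  one_minute.tick
end

mean_rate   = total / (traffic.size * 5).to_f   # about 2,773/sec: barely moved
recent_rate = one_minute.rate                   # about 1,420/sec and still falling
```

The lifetime mean hides the drop because every past event counts equally, while the weighted average moves most of the way toward the new rate within one window.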

Histogram
The statistical distribution of values in a stream of data.

# of cities returned

val histogram = metrics.histogram("response-sizes")
histogram.update(response.cities.size)

minimum
maximum
mean
standard deviation


Quantiles
median
75th percentile
95th percentile
98th percentile
99th percentile
99.9th percentile

We can’t keep all of these values.
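
When every value is retained, the summary statistics and quantiles listed above are straightforward to compute, and that retention is exactly the cost the deck objects to next. The Ruby sketch below assumes the nearest-rank definition of a quantile and the sample (n − 1) standard deviation; the helper names and the sample data are hypothetical.

```ruby
# Summary statistics over a fully retained list of values.
def summarize(values)
  sorted = values.sort
  n = sorted.size
  mean = sorted.sum.to_f / n
  variance = sorted.sum { |v| (v - mean)**2 } / (n - 1)   # sample variance
  { min: sorted.first, max: sorted.last, mean: mean, stddev: Math.sqrt(variance) }
end

# Nearest-rank quantile: the smallest value with at least a fraction q
# of the data at or below it.
def quantile(values, q)
  sorted = values.sort
  rank = (q * sorted.size).ceil.clamp(1, sorted.size)
  sorted[rank - 1]
end

sizes  = [1, 1, 1, 2, 2, 3, 3, 3, 5, 40]   # cities returned per response
stats  = summarize(sizes)
median = quantile(sizes, 0.5)    # 2
p95    = quantile(sizes, 0.95)   # 40
```

Note how far the mean (6.1) sits from the median (2) once a single large response enters the data, which is why the deck reports quantiles rather than averages alone.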




1,000 req/sec × 1,000 actions/req × 1 day = >86 billion values
>640GB of data/day
Not gonna happen.
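
The estimate can be checked with a few lines of arithmetic. The sketch below assumes each measurement is stored as one 8-byte value, which the slide does not state but which is consistent with its ">640GB" figure when gigabytes are counted in binary units.

```ruby
requests_per_sec    = 1_000
actions_per_request = 1_000
seconds_per_day     = 24 * 60 * 60   # 86,400
bytes_per_value     = 8              # assumption: one 64-bit value per measurement

values_per_day = requests_per_sec * actions_per_request * seconds_per_day
bytes_per_day  = values_per_day * bytes_per_value
gib_per_day    = bytes_per_day / (1024.0**3)

puts values_per_day          # 86400000000
puts gib_per_day.round(1)    # 643.7
```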

COGNITIVE HAZARD

Reservoir sampling.
Keep a statistically representative sample of measurements as they happen.


Vitter’s Algorithm R.

Vitter, J. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1), 37–57.
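
A minimal Ruby sketch of Algorithm R follows. The first values fill the reservoir, and after that the i-th value replaces a random slot with probability capacity / i. The `UniformReservoir` name, the capacity of 100, and the seeded random source are assumptions made for the example.

```ruby
# Uniform sample of at most `capacity` values from a stream of unknown length.
class UniformReservoir
  attr_reader :values

  def initialize(capacity, random: Random.new)
    @capacity = capacity
    @random = random
    @values = []
    @seen = 0
  end

  def update(value)
    @seen += 1
    if @values.size < @capacity
      @values << value                 # fill the reservoir first
    else
      slot = @random.rand(@seen)       # uniform integer in 0...@seen
      @values[slot] = value if slot < @capacity
    end
  end
end

reservoir = UniformReservoir.new(100, random: Random.new(1))
(1..100_000).each { |v| reservoir.update(v) }
# reservoir.values now holds 100 values, each stream element equally likely
```

Memory stays fixed at the reservoir size no matter how many values pass through, which is what makes per-request histograms affordable.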

[Plot: # of cities over time]

MIND THE GAP


Vitter’s Algorithm R produces uniform samples.

Recency.

SUPER-DUPER COGNITIVE HAZARD

Forward-decaying priority sampling.

Cormode, G., Shkapenyuk, V., Srivastava, D., & Xu, B. (2009). Forward Decay: A Practical Time Decay Model for Streaming Systems. ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering.

Maintain a statistically representative sample of the last 5 minutes.
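
The idea can be sketched in Ruby as follows: each value gets a weight that grows exponentially with its timestamp, the weight is divided by a uniform random draw to produce a priority, and only the highest-priority values are kept. This is a simplified illustration under stated assumptions, not the library's code. The `DecayingSample` class, the `alpha` decay factor, and the fixed `landmark` are hypothetical, and a production version would need to rescale the landmark periodically so the exponential does not overflow.

```ruby
# Keeps the `capacity` values with the highest priority, where priority
# favours values with later timestamps.
class DecayingSample
  def initialize(capacity:, alpha:, landmark: 0.0, random: Random.new)
    @capacity = capacity
    @alpha = alpha          # decay factor per second (assumed tuning knob)
    @landmark = landmark    # fixed reference time for forward decay
    @random = random
    @entries = []           # [priority, value] pairs
  end

  def update(value, timestamp)
    weight = Math.exp(@alpha * (timestamp - @landmark))
    priority = weight / (1.0 - @random.rand)   # uniform draw in (0, 1]
    if @entries.size < @capacity
      @entries << [priority, value]
    else
      # Linear scan for the weakest entry; fine for a sketch.
      lowest = @entries.each_index.min_by { |i| @entries[i][0] }
      @entries[lowest] = [priority, value] if priority > @entries[lowest][0]
    end
  end

  def values
    @entries.map { |_, value| value }
  end
end

sample = DecayingSample.new(capacity: 10, alpha: 1.0, random: Random.new(7))
100.times { sample.update(:old, 0.0) }     # measurements taken at t = 0s
100.times { sample.update(:new, 100.0) }   # measurements taken at t = 100s
# With this steep alpha, the sample now contains only :new values.
```

A smaller `alpha` lets older values linger longer, so the choice of `alpha` determines roughly how far back the sample reaches.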

[Plots: # of cities over time; panels: Uniform, Biased]

“95% of autocomplete results return 3 cities or less.”


Timer
A histogram of durations and a meter of calls.

# of ms to respond

val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
timer.time { handle(req, resp) }


“At ~2,000 req/sec, our
99% latency jumps
from 13ms to 453ms.”
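
A timer can be approximated in Ruby by wrapping a block with a monotonic clock. The `Timer` class below is a hypothetical sketch: it keeps every duration in an array, where a real implementation would feed a bounded sample and a rate meter instead.

```ruby
# Records how long each call takes and how many calls were made.
class Timer
  attr_reader :count, :durations_ms

  def initialize
    @count = 0
    @durations_ms = []   # unbounded here; a real histogram would sample
  end

  # Runs the block, records its duration in milliseconds (even if the
  # block raises), and returns the block's result.
  def time
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @durations_ms << elapsed * 1000.0
    @count += 1
  end
end

timer = Timer.new
result = timer.time do
  sleep(0.01)   # stand-in for handle(req, resp)
  :handled
end
```

The monotonic clock is used rather than wall-clock time so that system clock adjustments cannot produce negative or inflated durations.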

Now what?

Instrument it.
If it could affect your code’s business value, add a metric.
Our services have 40-50 metrics.


Collect it.
JSON via HTTP.
Every minute.
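
The collection step amounts to serializing the current readings so that a poller can fetch them. The Ruby sketch below covers only the serialization. The metric names, the lambda-based registry, and the `snapshot` helper are assumptions for illustration, and the HTTP endpoint and the once-a-minute poller are left out.

```ruby
require "json"

# Hypothetical registry: metric name => callable returning the current reading.
metrics = {
  "cities"      => -> { 589 },
  "connections" => -> { 594 },
}

# Evaluate every metric once to get a consistent point-in-time view.
def snapshot(metrics)
  metrics.transform_values(&:call)
end

payload = JSON.generate(snapshot(metrics))
puts payload   # {"cities":589,"connections":594}
```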

Monitor it.
Nagios/Zabbix/Whatever
If it affects business value, someone should get woken up.


Aggregate it.
Ganglia/Graphite/Cacti/Whatever
Place current values in historical context.
See long-term patterns.

Go faster.

Shorten our decision-making cycle.

Observe
Orient
Decide
Act

Observe
What is the 99% latency of our autocomplete service right now?
~500ms

Orient
How does this compare to other parts of our system, both currently and historically?
way slower

Decide
Should we make it faster? Or should we add feature X?
make it faster

Act!
Write some code.

def sort_by(&blk)
  #sleep(100) # WTF DUDE
  super(&blk)
end


10 Print "Rinse"
20 Print "Repeat"
30 Goto 10

If we do this faster we will win.

Fewer bugs.

More features.

Happier users.

Money.

tl;dr

We might write code.

We have to generate business value.

In order to know how well our code is generating business value, we need metrics.


Gauges
Counters
Meters
Histograms
Timers

Monitor them for current problems.

Aggregate them for historical perspective.

map ≠ territory

map → territory

Improve our mental model of our code.

MIND THE GAP

Observe
Orient
Decide
Act


If you’re on the JVM, use Metrics.
github.com/codahale/metrics

If not, you can build this.

Please build this.

Make better decisions by using numbers.

Thank you.