
Profile Guided Optimizations

And other optimization details

Shachar Shemesh
Lingnu Open Source Consulting Ltd.

http://lingnu.com
Credits and License
This lecture is free to use under the Creative
Commons Attribution Share Alike license (cc-by-sa).
Please give credit to Shachar Shemesh and a link to
http://www.lingnu.com
All syntax highlighting courtesy of enscript.
http://www.iki.fi/~mtr/genscript/

An Apology to People at Home
This lecture makes extensive use of "objdump" to
view the compilation results' assembly code.
There is no sane way to capture that short of
taking videos.
If you are reading the slides outside the lecture
– my apologies.

Optimization
Optimization – minimizing or maximizing a
certain program attribute (Wikipedia).
Run time, memory usage, power consumption, etc.
A crucial part of allowing the production of
readable code.

Platform Independent Optimizations

Optimizations that are independent of the platform
the program is compiled for

example1.1.cpp

One Program – Unoptimized

#include <stdio.h>

int main( int argc, char *argv[] )
{
	int i;
	double f;

	for( i=1; i<=10; ++i ) {
		printf("%d\n", i);
	}

	for( f=0; f<=1; f+=0.1 ) {
		printf("%.1f\n", f );
	}
}

example1.2.cpp

"Optimize for Memory Use"

#include <stdio.h>

int main( int argc, char *argv[] )
{
	{
		int i;
		for( i=1; i<=10; ++i ) {
			printf("%d\n", i);
		}
	}

	{
		double f;
		for( f=0; f<=1; f+=0.1 ) {
			printf("%.1f\n", f );
		}
	}
}
example1.3.cpp

"Optimize for Speed"

#include <stdio.h>

int main( int argc, char *argv[] )
{
	printf("1\n");
	printf("2\n");
	printf("3\n");
	printf("4\n");
	printf("5\n");
	printf("6\n");
	printf("7\n");
	printf("8\n");
	printf("9\n");
	printf("10\n");
	printf("0.0\n");
	printf("0.1\n");
	printf("0.2\n");
	printf("0.3\n");
	printf("0.4\n");
	printf("0.5\n");
	printf("0.6\n");
	printf("0.7\n");
	printf("0.8\n");
	printf("0.9\n");
	printf("1.0\n");
}
example1.4.cpp

"Optimize for Speed" Even More

#include <stdio.h>

int main( int argc, char *argv[] )
{
	printf("1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n"
	       "0.0\n0.1\n0.2\n0.3\n0.4\n0.5\n0.6\n0.7\n"
	       "0.8\n0.9\n1.0\n");

	return 0;
}

Purpose of Optimizer
The first two optimizations are automatically
done by the compiler, given the right compilation
flags.
The third is out of scope of almost any optimizer I
know.
Sort of.
A good optimizer allows producing reasonably
efficient code without changing coding style or
paradigm.

<something> is not a Panacea
Will not fix inefficient algorithms
Will not fix bugs
In fact, may cause bugs!
May aggravate bugs that were, otherwise, minor

Optimization's Effect on Debugging
Debugging an optimized binary with a debugger
can be very difficult:
Program flow is non-linear.
Inline functions "get in the way" all the time.
Worse for C++
Variables, local, static and global, may not be where
the debugger can even find them.
They can move around as the code progresses.
And without a debugger:
Adding "debug printf" actually changes the
optimizer's output.
Optimizer's Limitations
Essentially a compile-time/run-time efficiency
trade-off.
Lacking "total awareness", some supposedly
obvious optimizations are outside the compiler's
reach.

example2.cpp

What Does This Program Do?

#include <stdio.h>
#include "custom_type.h"

int main(int argc, char *argv[] )
{
	int a;
	custom_type b;

	a=5;
	a+=3;	// If you don't know, how
	a+=2;	// should the optimizer?

	b=5;
	b+=3;
	b+=2;

	printf("%d %d\n", a, (int)b );

	return 0;
}
custom_type.h

What it Really Does

#ifndef CUSTOM_TYPE_H
#define CUSTOM_TYPE_H

class custom_type {
	int var;
public:
	custom_type() {}
	custom_type( int val ) : var(val) {}
	operator int() const { return var; }

	custom_type &operator+=( const custom_type &rhs )
	{ var+=rhs.var+1; return *this; }	// I cheated
	...
};
#endif // CUSTOM_TYPE_H
The Importance of Inline Functions
You couldn't have known about the cheat.
Neither could the optimizer.
Yet it did!
The "operator +=" method was inlined into
"main"
Once that happens, the optimizer has context for
the operation.
It can aggregate the entire set of operations, and
replace it with the final result.
Inlining can also happen in C.
example3.1.c

The Great Divide

#include <stdio.h>

int process( int num )
{
	return 4500/num;
}

int main( int argc, char *argv[] )
{
	printf("%d\n", process(2) );

	return 0;
}
Dividing by Powers of Two
Most CPUs have an assembly instruction for
dividing two integers (as well as one for two
floating-point numbers).
Dividing by a power of two can be done more
efficiently with a shift operation.
The compiler obviously needs to know it's a power of
two.
Why did it keep "process" around if it inlined it?

example3.2.c

A Lesser Divide

#include <stdio.h>

static int process( int num )
{
	return 4500/num;
}

int main( int argc, char *argv[] )
{
	printf("%d\n", process(2) );

	return 0;
}
Static and Inline
A static function can only be used within the
same source file.
If the compiler sees that all uses have been inlined, it
will not bother emitting the original function.
If the program only has one file, you can pass
-fwhole-program to make it assume all functions
are static.
If the function is not defined in the same file, it
cannot be inlined at all.

Platform Dependent Optimizations

Optimizations that take the CPU's internal structure
into account

Revolution With a RISC
RISC – Reduced Instruction Set Computer.
Core idea – benchmark programs spend 90% of
their time executing the same 3 assembly
commands, 95% executing 5.
Leave only those 5. Make them very quick.

Smoking Commands in a Pipe
Use a pipeline to execute the commands:
Split the entire command processing into distinct
parts.
Execute each part in a separate clock cycle
You can now reduce the time each clock cycle takes –
higher clock rate.
Start executing the next command as soon as the
previous one is done with the first pipeline stage.
Work on as many commands as there are pipe segments at
once.
Average throughput is 1 command per clock cycle!
DLX – a Didactic RISC Processor

IF ID EX MEM WB

IF – Instruction Fetch
ID – Instruction Decode
EX – ALU operations
MEM – Memory fetch
WB – Store to memory or registers

A Few General Notes
An instruction stopped before it reaches WB has
no effect.
The design dictates the assembly.
Are the following commands possible?
store r2, (r3+r4)
Yes: ALU step before memory access step
load (r3+r4), r2
No, for precisely the same reason.

Bubbles (Soft)
It can happen that a later command's operands
come from an earlier command's pipe step that
has not been performed yet.
add r2, r3
add r4, r3
In the above case, we can "short path" (forward) the
data and have it ready in time.
Most CPUs actually do that.

Bubbles (Hard)
How about this sequence?
load (r3), r4
add r2, r4
The memory read of the first line happens in the same
cycle as the ALU for the second command.
The data is, physically, not present inside the CPU when
we need it.
Solution:
Delay the second command for one cycle until the
data is ready.
This is called a pipeline bubble.
Optimizer as Bubble Popper
We expect the optimizer to minimize the bubbles
in the pipe.
Put an unrelated instruction between the two and
prevent a wasted cycle.
This requires that the optimizer know the precise
details of the CPU's pipeline.
RISC, in general, assumes a compiler. Efficient
manual assembly programming of RISC is
between very tough and impossible.

The Branch Problem
Consider the following sequence:
compare r2, r3
beq location
The branch requires an ALU operation (though
DLX pretends that it doesn't).
We only know where to branch to at the end of
the third cycle.
We need to fetch the next instruction at the
beginning of the second cycle.
Two cycles of bubbles for each branch!
Branching (cont.)
How serious is the problem?
Statistics claim that a branch happens every 4
assembly instructions, on average.
Turning every 4 instructions into 6 is a 50%
slowdown!

Branch Solutions:
Unconditional Execution
A solution employed by many RISC platforms:
Execute the instruction right after the branch –
always.
Fills an unconditional bubble with meaning.
Almost always:
Do not perform this fill if we are going to have a
bubble anyway.

Branch Solutions:
Branch Prediction
A priori, a branch pointing backwards has a 90%
chance of being taken (it is probably a loop).
Branches pointing forward have only a 50% chance.
The CPU can keep a list of branches, and where
they, likely, will go.
This list is called "branch prediction".
Some platforms have means of "helping" with
this guess.
If you know what will likely happen, you can code it
into the assembly.
Branch Prediction and the Optimizer
How can the optimizer know what is likely to
happen?
Option 1 – guess.
Not really a wild guess. Uses static program flow analysis.
Option 2 – benchmark.
Run the program. For each branch, keep track of how many
times it was taken, and how many times it was not.
Compile the program again, using this information as an
optimization helper.

Profile Guided Optimization
Optimization that eliminates guesses by using
real life data.
If done properly, can significantly speed up a
program.
If not done properly, it is useless at best.

Cache Locality
The CPU keeps recently processed code and data
in an internal cache.
Works best if data is in proximity to other data.
PGO allows the optimizer to identify frequently used
and rarely used areas of code, and keep each group
together.
Maximizing cache efficiency.
May even split single functions into different
ELF sections.
Using PGO with GCC
To turn on all PGO collection compile with
-fprofile-generate.
Make sure to pass it during compilation AND linkage.
Run the program through a typical use scenario.
Do NOT run it through all program features. This will
actually hurt optimization.
Compile again, this time with -fprofile-use.
Again – during linkage as well.
Profit!
Caveats
Build environment – of three projects I tried, only
one had a build environment where PGO could
be just plugged in.
Profile location – hard-coded to the source location.
Prevents use of ccache and other compiler wrappers.
Fixed in gcc 4.4.0 – may override path.
Test cases
Not always easy to find.
Sometimes interactive.
PGO and Cross Compilation
The profile files are created in the same directory
as the source files.
If cross compiling, need to make sure these
directories exist.
Need to transfer result files back to build machine
for rebuild.

PGO and the Kernel
The kernel is not compiled with PGO
Seems to be possible, but would require
non-trivial work.
Mostly in making sure the profile files are created
correctly.
There is a report from 2004 of someone running
PGO with the Intel compiler and gaining 40%
performance.
Idea rejected because it is impossible to reproduce a
PGO kernel binary twice – debugging is hard.
PGO Domain in GCC
Branch prediction statistics
Variable value statistics
Function use for increased cache locality

The Intel Assembly Family
RISC in CISC Clothing
The Intel assembly of today still descends from
the 8080 CPU:
First released April 1974.
8 bit CPU.
Accumulator based machine language
CISC
Runs Wordstar on CP/M

Intel Assembly
Still contains many CISC constructs.
The CPU has several pipelines internally (2
for the original Pentium).
When commands are transferred into the cache, they
are translated into RISC-like micro-operations.

RISC and CPU Compiler Familiarity
RISC assumes intimate familiarity between the
compiler and the CPU.
Familiarity to the level that a minor CPU revision
may invalidate.
Sometimes this is feasible (embedded).
In modern PCs, not so much.
The CPU has its own optimizer in hardware, which
re-does some of the things the compiler's optimizer does.
That's why "memory barriers" exist.

Optimizer Induced Limitations
Some optimization options assume attributes of
your program:
-fstrict-aliasing – the compiler assumes that pointers
to different types never point at the same object.
-fstrict-overflow – the compiler assumes that no signed
integer overflow can happen.
If your program does not live up to those
assumptions, compiling with -O2 or -Os may
break your code.
Will try to issue a warning, but no promises....
Subjects Not Covered
Tail recursion optimization
Copy constructor optimization

Bibliography
GCC online manual:
http://gcc.gnu.org/onlinedocs/
Make sure you explicitly pick the version you are
using!

Thank You

Visit us at http://www.lingnu.com

