
MATH49111/69111: Scientific Computing

Lecture 10
26th October 2017

Dr Chris Johnson
chris.johnson@manchester.ac.uk
Optimisation

. Arguments for and against optimisation

. Algorithms and complexity

. Measuring performance

. Worked example: Mandelbrot set

. Parallel code

. Memory and cache optimisation

Nothing in this lecture is essential knowledge for the projects.


Writing fast code

. Scientific programs can run for a long time (> 100 CPU-years)

. Therefore, worth investing time optimising them for speed


. Every level of a program's design affects its speed:
. Program architecture
. Algorithm choice
. Algorithm implementation
. Compilation / assembly language code
. Hardware

. Decisions to be taken both before we start writing code, and during refactoring of existing code

. Optimisation does not respect abstraction: the fastest code might not have the best structure.
Algorithms: time complexity

. The time an algorithm takes to execute is called its time complexity

. This is approximated by its operation count: the number of basic scalar operations (+, −, ×, ÷) it performs

. We say that the time complexity T(n) of an algorithm is O(f(n)) (we say "order f of n"), where n is the size of the input to the algorithm, if

      lim sup_{n→∞} T(n) / f(n) < ∞,

i.e. if T(n) is no greater than a constant multiple of f(n).


Algorithms: time complexity

Example: for dense n × n matrices and n-vectors


. Vector dot product: O(n) operations
. Matrix-vector product: O(n²) operations
. Matrix-matrix product: O(n³) operations

More slowly growing complexity is (almost) always better:


. Strassen's algorithm calculates the matrix-matrix product in O(n^(log2 7)) = O(n^2.81...) operations
. In practice, slower than the naïve algorithm unless n ≳ 1000

Recursive algorithms (e.g. Fast Fourier Transform, heap sort, fast multipole method) can often reduce O(n²) problems to O(n log n)
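As an illustration of these operation counts, a minimal C++ sketch of my own (not from the lecture code), with matrices stored as flat row-major std::vectors:

#include <iostream>
#include <vector>

// Dot product of two n-vectors: n multiply-adds, so O(n) operations
double dot(const std::vector<double>& x, const std::vector<double>& y)
{
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); i++)
        s += x[i] * y[i];
    return s;
}

// Naive product of two n-by-n matrices: three nested loops of length n,
// so O(n^3) operations
std::vector<double> matmul(const std::vector<double>& A,
                           const std::vector<double>& B, std::size_t n)
{
    std::vector<double> C(n*n, 0.0);
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = 0; j < n; j++)
            for (std::size_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
    return C;
}

int main()
{
    std::size_t n = 200;
    std::vector<double> A(n*n, 1.0), B(n*n, 2.0);
    std::vector<double> C = matmul(A, B, n);
    std::cout << "C[0] = " << C[0] << ", dot = " << dot(A, B) << std::endl;
    return 0;
}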
Writing fast code: optimisation

. In most programs, the execution time is dominated by a small number of lines of code

. For most lines of code, the trade-off of clearest structure vs. highest speed is not in favour of optimising for speed.
. Optimisation impacts adversely on code:
. portability
. flexibility
. maintainability

. Important to measure which lines of code are taking the time, by profiling.
Mandelbrot set: an example of optimisation

The Mandelbrot set M is the set of complex numbers c ∈ ℂ such that the iteration

      z_0 = 0,    z_{k+1} = z_k² + c

remains bounded as k → ∞.

How can we convert this ‘infinite’ problem into a finite one?

There are two ‘infinities’ in the definition of M:


. |z_k| becomes infinite when z_k diverges
. We are interested in divergence in the limit k → ∞
Mandelbrot set: Divergence of z_k
If |z_k| > 2 and |z_k| ≥ |c|, then

      |z_{k+1}| = |z_k² + c| ≥ |z_k|² − |c| ≥ |z_k|² − |z_k| > 2|z_k| − |z_k| = |z_k|,

so |z_k| keeps increasing and the iteration diverges. Hence c is not in the Mandelbrot set (c ∉ M) if

      |c| > 2,    and/or    |z_k| > 2 for some k.

We approximate

      "z_k remains bounded as k → ∞"

by

      "|z_k| < 2 for k ≤ K"

for some (large) constant K, and check that our approximation to M converges as K increases.
Mandelbrot set: unoptimised C++ code
#include <complex> // mandelbrot_1.cpp
#include <iostream>

int main()
{
    for (int i=-500; i<=250; i++)       // loop over real part of c
    {
        for (int j=-375; j<=375; j++)   // loop over imag part of c
        {
            std::complex<double> z(0,0), c(i/250.0, j/250.0);
            int k = 0;                  // loop up to 2500 times, or until |z|>2
            for (; k < 2500 && std::abs(z)<2.0; k++)
                z = z*z + c;            // apply iteration
            std::cout << k << " ";      // output # of iterations
        }
        std::cout << std::endl;
    }
    return 0;
}
Mandelbrot set: Program output
[Figure: the iteration counts plotted as an image of the Mandelbrot set]
Mandelbrot set: Run times (unoptimised)
Mandelbrot set run time (using single core of Core i7-4770K):

Language            Time (s)    Speedup
MATLAB (R2015a)     213.4       16.4× slower
Python (2.7.10)     116.4       8.94× slower
C++ (gcc 4.8.5)     13.02       (reference)

. MATLAB and Python are interpreted languages


. C++ is compiled to machine code

. Further optimisation possible with all languages


. We will profile and optimise the C++ code, by:
1. Improving the algorithm, so that less work is required
2. Improving the code/compilation, so that the required work is done more quickly
Timing code

. On Linux, time shows total program execution time:


$ time ./my_program

real 0m13.015s
user 0m12.998s
sys 0m0.004s

. Timer functions in C++ allow timing of program sections:
. Platform-dependent timers
. std::chrono::high_resolution_clock::now() in the <chrono> header (C++11 only); a sketch is given below
. See examples on course website

. Using a profiler gives a breakdown of the time taken by each section of code
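A minimal sketch of timing a section of code with std::chrono (my own illustration, not one of the course examples; the loop being timed is an arbitrary placeholder):

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();

    double sum = 0.0;                       // section of code to be timed
    for (int i = 1; i <= 100000000; i++)
        sum += 1.0 / i;

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;   // duration in seconds
    std::cout << "Elapsed time: " << elapsed.count() << " s"
              << " (sum = " << sum << ")" << std::endl;
    return 0;
}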
Profiling output
We can measure which parts of the code are using the most time
with a profiler (e.g. gprof (Linux), VS Diagnostic Tools (Windows))
% time   seconds      calls    name
16.87 1.75 239353549 abs<D>(complex<D> const&)
14.46 1.50 main
13.34 1.39 238883944 complex<D>::operator*=<D>(complex<D> const&)
11.99 1.25 239353549 __complex_abs(Dcomplex )
10.15 1.05 238883944 operator+<D>(complex<D> const&, complex<D> const&)
8.90 0.92 238883944 complex<D>::operator+=<D>(complex<D> const&)
7.74 0.80 238883944 operator*<D>(complex<D> const&, complex<D> const&)
5.61 0.58 477767888 complex<D>::imag() const
5.08 0.53 239353549 complex<D>::__rep() const
4.74 0.49 477767888 complex<D>::real() const
1.06 0.11 1128002 complex<D>::complex(D, D)

. Algorithm evaluated for ≈ 5.6 × 10^5 values of c
. Innermost loop repeated ≈ 2.3 × 10^8 times
. Most expensive single function is abs (used in abs(z)<2.0)
. Cost is due to evaluation of square root
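For reference, the usual gprof workflow on Linux looks roughly like this (the file names are my own; the listing above is a trimmed flat profile):

$ g++ -pg mandelbrot_1.cpp -o mandelbrot_1   # compile with profiling instrumentation
$ ./mandelbrot_1 > output.txt                # running the program writes gmon.out
$ gprof mandelbrot_1 gmon.out                # print the flat profile and call graph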
Square-root free algorithm C++ code
. Replace abs(z)<2.0 test with the equivalent
z.real()*z.real()+z.imag()*z.imag()<4.0

for (int i=-500; i<=250; i++)
{
    for (int j=-375; j<=375; j++)
    {
        std::complex<double> z(0,0), c(i/250.0, j/250.0);
        int k = 0;
        for (; k < 2500 &&
               z.real()*z.real()+z.imag()*z.imag()<4.0; k++)
            z = z*z + c;
        std::cout << k << " ";
    }
    std::cout << std::endl;
}

. Program takes 8.472 seconds (1.54× faster)


Compiler flags
. When calling gcc we can set optimisation flags:
-O2 Set the Optimisation level to 2 (produces faster
code, but takes longer to compile)
-ffast-math Makes floating point maths faster, but doesn’t
guarantee floating point associativity rules,
behaviour of infinity, NaN, etc.
-march=native Optimises code for the CPU type, cache
size etc. of the current machine

. Compiled with
g++ -O2 -ffast-math -march=native mandelbrot_2.cpp

program takes 0.776s (10.9× speedup)


. Dramatic speedup mainly due to -O2 inlining functions.
. In Visual Studio, use ‘Release’ target to enable optimisations
Improving the algorithm: Periodicity checking

What happens to z_k at values of c inside the Mandelbrot set? (Here c = −0.9 + 0.1i.)

k      z_k
0       0.000000 + 0.000000i
1      −0.900000 + 0.100000i
2      −0.100000 − 0.080000i
...     ...
36     −0.093627 − 0.123035i
37     −0.906372 + 0.123039i
38     −0.093629 − 0.123038i
39     −0.906372 + 0.123040i
...     ...

z_k approaches an oscillation with period ≥ 1


Improving the algorithm: Periodicity checking
. If z_k = z_i for some i < k, the iteration will never diverge.
. Abort the iteration if z_k repeats an earlier value (to FP precision)
. Most efficient to check only a few i (here i is a power of 2)

std::complex<double> z(0,0), c(i/250.0, j/250.0), p(0,0);
int k = 0, pIndex = 2, maxIters = 2500;   // maxIters declared here for completeness
for (; k < maxIters && abs(z)<2.0; k++)
{
    z = z*z + c;
    if (z == p)           // if we are in an orbit, quit
    { k = maxIters; break; }
    if (k == pIndex)      // update p every 2^n iterations
    { p = z; pIndex *= 2; }
}

. Reduces average iterations per c value from 424 to 92


. Program takes 0.188 seconds (4.12× speedup)
Mandelbrot set: optimisations so far
Optimisations                        Time (s)    Speedup
No optimisations                     13.02       (reference)
+ square-root-free |z|² test         8.472       1.54×
+ -O2 -ffast-math -march=native      0.776       16.8×
+ periodicity checking               0.188       69.3×

What more can we do to optimise?


. Vector SIMD instructions (beyond the scope of this course)
. Memory/cache optimisation
. Data layout in memory can affect speed by a factor of > 10
. Not important for Mandelbrot calculation; see example later
. Parallelisation
. Extremely important, especially for large programs
Parallel programs
. Modern processors have several independent cores, each of
which runs one or more threads
. Each thread executes instructions independently
. All our code so far has executed on just one thread

1955–2003: Processors double in clock speed every three years


. A ‘free lunch’: all existing code gets faster
2003–: Speed increases come through more cores
. From ~4 cores (PCs) to ~10^6 cores (supercomputers)
. Only efficiently parallelised code gets faster

. Threads are relatively slow at communicating with one another


. The challenge is to split algorithms into tasks for each thread
Parallelisation of Mandelbrot algorithm
Our Mandelbrot set program calculates the limit of an iteration z_0, z_1, ... for many values of c:
. For each iteration we need z_k in order to calculate z_{k+1}
  ⇒ cannot parallelise across each iteration of z_k
. The iterations at different values of c are independent
  ⇒ easy to parallelise across different values of c

[Diagram: four threads running in parallel, thread m computing the iteration z_0 = 0, z_1 = z_0² + c, z_2 = z_1² + c, ... for its own value c = c_m]
Parallelisation example: C++11 threads
#include <thread>
#include <vector>

void DoWork(int threadID)
{ /* Parallel work done in this function body */ }

int main()
{
    int nThreads = 8;
    std::vector<std::thread> threads(nThreads);

    // Start parallel work on each of the 8 threads
    for (int i=0; i<nThreads; i++)
        threads[i] = std::thread(DoWork, i);

    // Wait here until all thread functions have returned
    for (int i=0; i<nThreads; i++)
        threads[i].join();
}
Parallelisation example: Mandelbrot set
Use C++11 threads to parallelise the Mandelbrot set program
. Each thread calculates a subset of the values of c
(every Nth column, where we have N threads)

[Diagram: the columns of the image are assigned to threads 1–4 in turn]

. Display of the iteration counts must be single-threaded.
  We now store the iteration counts in memory, then display them at the end of the program.
. The algorithm is otherwise the same as in the single-threaded case.
Parallelisation example: Mandelbrot set
#include <complex>
#include <iostream>
#include <thread>
#include <vector>

int iSize=751, iStart=-500, jSize=751, jStart=-375;   // loop lengths and start points
std::vector<int> iters(iSize*jSize);                  // storage for iteration counts

void CalcMandelbrot(int skip, int ofs)                // skip = nThreads, ofs = threadID
{
    for (int i=ofs; i<iSize; i+=skip)
        for (int j=0; j<jSize; j++)
        {
            int maxIters = 2500;
            std::complex<double> z(0,0), c((i+iStart)/250.0, (j+jStart)/250.0), p(0,0);
            int k = 0, pIndex = 2;
            for (; k < maxIters && z.real()*z.real()+z.imag()*z.imag()<4.0; k++)
            {
                z = z*z + c;
                if (z == p)           // if we are in an orbit, quit
                { k = maxIters; break; }
                if (k == pIndex)      // update p every 2^n iterations
                { p = z; pIndex *= 2; }
            }
            iters[i*jSize + j] = k;
        }
}
Parallelisation example: Mandelbrot set
int main()
{
    int nThreads=2;
    std::vector<std::thread> threads(nThreads);

    // Start parallel work
    for (int i=0; i<nThreads; i++)
        threads[i] = std::thread(CalcMandelbrot, nThreads, i);

    // Finish parallel work
    for (int i=0; i<nThreads; i++)
        threads[i].join();

    // Display output in single-threaded code
    for (int i=0; i<iSize; i++)
    {
        for (int j=0; j<jSize; j++)
            std::cout << iters[i*jSize + j] << " ";

        std::cout << "\n";
    }
    return 0;
}

Optimised code is usually more complex!


Parallelisation example: Mandelbrot set
Threads    Time (s)    Speedup
1          0.182       (reference)
2          0.095       1.91×
3          0.067       2.71×
4          0.055       3.30×

. Speedup is a little less than the number of threads.


. Discrepancy due to remaining serial code (Amdahl’s law)
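As a rough illustration of Amdahl's law (my own example numbers, not measurements from this program): if a fraction p of the run time can be parallelised over N threads, the best possible speedup is

      1 / ((1 − p) + p/N),

so with p = 0.95 and N = 4 the speedup is at most 1 / (0.05 + 0.2375) ≈ 3.5×, however fast the parallel part runs.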

Optimisations                        Time (s)    Speedup
No optimisations                     13.02       (reference)
+ square-root-free |z|² test         8.472       1.54×
+ -O2 -ffast-math -march=native      0.776       16.8×
+ periodicity checking               0.188       69.3×
+ parallelisation (4 cores)          0.055       236.7×
Memory optimisation
Computers have a hierarchy of memory, with latency/size tradeoff

Memory               Size          Latency (ns)
Registers            ~50 bytes     0
L1 Cache             ~32KB         1.5
L2 Cache             ~256KB        4
L3 Cache             ~8MB          25
RAM                  ~4GB          100
Solid-state drive    ~256GB        16000
Hard disk drive      ~2TB          4000000

. These latencies are improving only very slowly over time
. Need to arrange the most frequently-used data in the fastest memory
. The layout of data in memory affects the speed at which it can be accessed
Memory optimisation: caches
. The cache stores copies of frequently-accessed data from memory
. Accessing a memory address puts the cache line (64 bytes of adjacent memory) containing that address in the cache
. Once this is done, it is much faster to access data from this line
. 'Nearby' memory accesses are much faster than random accesses

[Plot: time per access (ns) against buffer size (256 bytes to 256M); the access time rises in steps, from about 1 ns to about 4 ns, as the buffer outgrows first the L1 and then the L3 cache]
Memory optimisation example: matrix-vector product

b = Ax,        b_i = Σ_j a_ij x_j

for (int i=0; i<A.rows(); i++)
    for (int j=0; j<A.cols(); j++)
        b[i] += A(i, j)*x[j];

Faster¹ to store A = (a_ij) in memory in row-major order

      a_11  a_12  a_13  a_14  ...  a_21  a_22  a_23  a_24  ...  a_31  ...

than in column-major order

      a_11  a_21  a_31  a_41  ...  a_12  a_22  a_32  a_42  ...  a_13  ...

N.B.: The opposite is true for b = xA. The data format must suit the algorithm
¹ I found 1.42ms (row-major) vs. 3.22ms (column-major) for a 1024 × 1024 matrix
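A minimal, self-contained sketch of this comparison (my own illustration; the matrix size and values are arbitrary), storing A as a flat std::vector and timing the same product with the two access patterns:

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    const int n = 1024;
    std::vector<double> A(n*n, 1.0), x(n, 1.0), b(n, 0.0);

    // Row-major access A[i*n + j]: the inner loop walks adjacent memory
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            b[i] += A[i*n + j] * x[j];
    auto t1 = std::chrono::high_resolution_clock::now();

    // Column-major access A[j*n + i]: the inner loop strides by n doubles,
    // touching a different cache line on every access
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            b[i] += A[j*n + i] * x[j];
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << "row-major:    " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "column-major: " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
              << "(b[0] = " << b[0] << ")\n";   // use b so the loops are not optimised away
    return 0;
}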
Summary
. Optimisation has costs: only optimise where necessary

. Measure/profile to work out where optimisation is required

. Finding a better algorithm usually leads to a greater speedup than 'micro-optimising' individual lines of code

. Turn on compiler optimisations


. Three aims when optimising code:
. Use the minimum number of operations
. Use multiple cores effectively
. Use memory efficiently

. Further reading:
. Code Complete (S. McConnell) chapters 25 and 26
. What every programmer should know about memory
(U. Drepper) http://www.akkadia.org/drepper/cpumemory.pdf
