Compiler Support For Multicore

Compiler Support for Multi-Core
Stephen Blair-Chappell
12/13/2007 2
Copyright 2007, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners
Agenda
Optimisation
Security
Quality
Compatibility
Optimisations
Global Compiler Options
Inter-procedural Optimisations
Profile Guided Optimisations
Vectorisation
Parallelisation
Common Optimization Switches
Linux & Mac OS*        Windows                  Description
-O0                    /Od                      Disable optimization
-O1                    /O1                      Optimize for speed (no code size increase)
-O2                    /O2                      Optimize for speed (default)
-O3                    /O3                      High-level optimizer, including prefetch, unroll
-g                     /Zi                      Create symbols for debugging
-prof-gen, -prof-use   /Qprof-gen, /Qprof-use   Profile guided optimization (multi-step build)
-ipo                   /Qipo                    Inter-procedural optimization
-parallel              /Qparallel               Automatic parallelization
-fast                  /fast                    Optimize for speed, including IPO
-openmp                /Qopenmp                 OpenMP 2.5 support
Itanium and the Intel logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States or other countries
Interprocedural Optimizations
Enables inlining, better register usage, dead code elimination, etc.
Usage:
icpc -ip: single-file IPO
icpc -ipo: multi-file IPO
Link-time code generation increases build time
IPO: Two-Step Process
Pass 1 (compile): source files -> ipo objects
Pass 2 (link): ipo objects -> executable
Compiling:
icpc -c -ipo a.cxx b.cxx
Linking:
icpc -ipo a.o b.o
Usability Tips:
Try IPO on performance-critical files/libs
Don't run IPO on tens of thousands of object files; avoid unnecessarily increased build time
Remember to link with the -ipo option
Interprocedural Optimization
Extends optimizations across file boundaries
Without IPO: file1.c, file2.c, file3.c and file4.c are each compiled and optimized on their own
With IPO: file1.c, file2.c, file3.c and file4.c are compiled and optimized together
-ip: only between modules of one source file
-ipo: modules of multiple files/whole application
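A minimal sketch of code that can benefit from multi-file IPO. The names scale and sum_scaled are hypothetical, and the two functions are shown in one block only to keep the example self-contained; the comments mark the separate source files they are assumed to live in. Without -ipo the call in the loop stays a real call across the file boundary; with -ipo the compiler is free to inline it.

```c
#include <stddef.h>

/* --- file1.c (hypothetical): small helper, a typical inlining candidate --- */
float scale(float x, float factor) {
    return x * factor;
}

/* --- file2.c (hypothetical): hot loop that calls the helper --- */
float sum_scaled(const float *v, size_t n, float factor) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += scale(v[i], factor);   /* cross-file call inside the loop */
    }
    return total;
}
```

Following the two-step process above, both files would be compiled with -c -ipo and then linked with -ipo.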
Profile-Guided Optimizations
Optimizing with runtime feedback
Enhances all optimizations, especially IPO, register allocation, instruction cache usage, switch statement optimization, etc.
The Code-Coverage and Test-Prioritization Tools use PGO technology
Usability Tips:
- Run on typical input dataset(s)
- Each run generates a data file
- Compiler calculates averages of all runs
Profile-Guided Optimizations (PGO)
Use execution-time feedback to guide (final) optimization
Helps I-cache, paging, branch-prediction
Enabled optimizations:
Basic block ordering
Better register allocation
Better decision on which functions to inline
Function ordering
Switch-statement optimization
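A small sketch of the kind of code where execution-time feedback can matter. The function count_kinds, its constants and the file name pgo.c are hypothetical. If the profile shows that one case dominates, the compiler can test for it first and place its basic block on the fall-through path. The build steps in the trailing comment use the -prof-gen and -prof-use switches listed earlier.

```c
/* Hypothetical message kinds; KIND_DATA is assumed to dominate at run time. */
enum { KIND_DATA = 0, KIND_ACK = 1, KIND_ERROR = 2 };

/* Counts how often each kind occurs; anything that is not DATA or ACK
   is counted as an error. */
void count_kinds(const int *kinds, int n, int counts[3]) {
    for (int i = 0; i < n; i++) {
        switch (kinds[i]) {
        case KIND_DATA: counts[0]++; break;
        case KIND_ACK:  counts[1]++; break;
        default:        counts[2]++; break;
        }
    }
}

/* Multi-step build (Linux), assuming the code is in pgo.c:
     icc -prof-gen pgo.c -o pgo     instrumented build
     ./pgo                          run on a typical input data set
     icc -prof-use pgo.c -o pgo     optimized build using the profile data */
```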
Automatic Compiler Vectorization
Processor Specific Optimizations
Automatically generates vector SSE/SSE2/SSE3/SSSE3/SSE4 instructions
Vector processing: operate at once on:
4 floating point values
2 double precision floating point values
4 integer values
etc.
Optimal code generation and instruction scheduling
Large number of options for advanced control of vectorization
Specify trip count, ignore dependencies (ivdep), specify alignment, disable vectorization, etc.
Auto-Vectorization (IA-32 and Intel 64):
Optimizing Loops with SSE/SSE2/SSE3/SSSE3/SSE4
Your Task: convert this
$ cat w.c
void work(float *a, float *b, float *c, int MAX) {
    for (int I = 0; I <= MAX; I++)
        c[I] = a[I] + b[I];
}
[Figure: scalar execution in 128-bit registers. One pair of elements is added at a time (A[0] + B[0] = C[0], then A[1] + B[1] = C[1]); the remaining three lanes of each register are not used.]
Auto-Vectorization (IA-32 and Intel 64)
void work(float *a, float *b, float *c, int MAX) {
    for (int I = 0; I <= MAX; I++)
        c[I] = a[I] + b[I];
}
$ icc w.c -c -xT
w.c(2) : (col. 3) remark: LOOP WAS VECTORIZED.
[Figure: vectorized execution in 128-bit registers. Four pairs of elements are added at once: A[3] A[2] A[1] A[0] + B[3] B[2] B[1] B[0] = C[3] C[2] C[1] C[0].]
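A hedged variant of the work() loop, assuming C99 and assuming the caller guarantees that the three arrays never overlap. The restrict qualifiers state that guarantee in the source, which removes the possible overlap between c and the inputs that a vectorizer otherwise has to allow for. Unlike the slide's version, the last parameter here is an element count n, not a maximum index.

```c
/* restrict (C99) promises that a, b and c refer to non-overlapping memory.
   n is the number of elements (an assumption; the slide uses an inclusive MAX). */
void work(const float *restrict a, const float *restrict b,
          float *restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   /* independent, unit-stride iterations */
    }
}
```

Whether a given compiler version actually vectorizes this loop should still be checked with the vectorization report.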
Vectorization Report
Loop was not vectorized because:
Existence of vector dependence
Non-unit stride used
Mixed Data Types
Condition too Complex
Condition may protect exception
Low trip count
Subscript too complex
Unsupported Loop Structure
Contains unvectorizable statement at line XX
Not Inner Loop
"vectorization possible but seems inefficient"
Operator unsuited for vectorization
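Two hypothetical loops that illustrate the first report message. In prefix_sum each iteration reads the value written by the previous one, which is a flow dependence that prevents straightforward vectorization. In add_offset every iteration is independent and unit-stride, so it is a typical candidate. The function names are illustrative only.

```c
/* Flow dependence: a[i] needs the a[i - 1] computed in the previous iteration. */
void prefix_sum(float *a, int n) {
    for (int i = 1; i < n; i++) {
        a[i] = a[i] + a[i - 1];
    }
}

/* No dependence between iterations: each element is updated on its own. */
void add_offset(float *a, int n, float offset) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + offset;
    }
}
```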
Compiler Based Vectorization
Automatic Processor Dispatch: -ax[?]
Single executable
Optimized for Intel Core Duo processors and generic code that runs on all
IA32 processors.
For each target processor it uses:
Processor-specific instructions
Vectorization
Low overhead
Some increase in code size
Processor Specific Options
Windows*   Linux* and Mac OS*   Processor Target
/QxP       -xP                  Generate SSE3 on supported Intel processors
/QaxT      -axT                 CPU dispatch: generate SSSE3 for supported Intel processors, and
                                generic code for Intel 64 processors, such as AMD* Opteron*, via
                                CPU dispatch. Can use axS to generate SSE4 instructions.
/QxT       -xT                  Generate SSSE3 on supported Intel processors with Intel Core
                                microarchitecture
/QxS       -xS                  Generate SSE4 on future Intel processors, code name Penryn
/QxO       -xO                  Generate SSE3 where possible on Intel and any Intel-compatible
                                system, such as AMD* Opteron*, without using CPU dispatch.
                                Applications will crash with an illegal instruction on systems
                                that don't support SSE3/SSE2/SSE. Will not utilize SSE4/SSSE3,
                                and may not be as optimal as axT or xT.
Optimisations
Vectorisation
Parallelisation
Auto Parallelisation
OpenMP
Auto-parallelization
Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives.
The compiler can identify easy candidates for parallelization, but large applications are difficult to analyze.
OS         Options
Windows*   /Qparallel, /Qpar_report[n]
Linux*     -parallel, -par_report[n]
Mac*       -parallel, -par_report[n]
OpenMP* Support
Supports the OpenMP 2.5 standard for Fortran and C++
Higher level of abstraction to simplify generating multi-threaded
applications, compiler handles
Usage Model:
Windows*             Linux*
/Qopenmp             -openmp
/Qopenmp_report[n]   -openmp_report[n]
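A minimal sketch of the usage model, using a hypothetical dot() function. The single directive asks the compiler to split the loop across threads and to combine the per-thread partial sums; thread creation and work distribution are left to the compiler and runtime. Built without the OpenMP switch, the pragma is ignored and the loop runs serially with the same result.

```c
/* Dot product with an OpenMP worksharing loop.
   reduction(+:sum) gives each thread a private partial sum and adds them at the end. */
double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```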
Cluster OpenMP
Since Intel Compilers 9.1: Cluster OpenMP*
Extends OpenMP* from Shared Memory Processors (SMP) to distributed memory systems (clusters)
Not a single system image
Minor language extensions: only one new directive (SHARABLE)
Optimization Strategy
Turn on the reporting feature of the compiler
Use Representative workload
Use VTune Analyzer to find Hot Spots
Focus effort on Hot Spots
Try advanced compiler optimizations on Hot spots
Re-Run workload
Seeing the expected benefits? If not, look at the optimization reports
Which Option First?
First try compiler vectorization options
Try -O3 for loop-bound hot functions
Try Interprocedural (IPO) & Profile Guided Optimization (PGO)
Recommended: use IPO on hot functions / libraries
Can use PGO on hot functions / libraries or the entire application
Compiler Optimization Reports
Tell what optimizations were done and, most importantly, give hints on what prevented a given optimization
Turn on Optimization Reports: -opt-report
Can be read by VTune Performance Analyzer
Default report is verbose; recommend selecting an optimization phase
Enable Vectorizer reports: -vec-report3
Enable Loop Optimizer (-O3) reports: -opt-report-phase hlo
Vectorization Example: aliasing problem prevented vectorization:
icc hpo.c -c -O3 -xT -vec-report3
loop was not vectorized: existence of vector dependence.
vector dependence: proven FLOW dependence between a line 48, and b line 48.
HLO Example: compiler able to optimize; generated multiple versions of loop, did loop interchange:
icc hpo.c -c -O3 -fargument-noalias -xT -opt-report-phase hlo
LOOP DISTRIBUTION in doit at line 43
LOOP INTERCHANGE in loops at line: 43 47
Loopnest permutation ( 1 2 ) --> ( 2 1 )
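The source of hpo.c is not shown, so the following is only an illustration of what a loop interchange does, using a hypothetical 4 x 3 array. In the first version the inner loop walks down a column and jumps COLS elements on every step. In the interchanged version the inner loop walks along a row with unit stride, which suits both the cache and the vectorizer. Both compute the same result.

```c
#define ROWS 4
#define COLS 3

/* Before: loop order ( j i ). The inner loop strides through memory by COLS elements. */
void scale_by_columns(double m[ROWS][COLS], double f) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            m[i][j] *= f;
}

/* After interchange: loop order ( i j ). The inner loop is unit-stride. */
void scale_by_rows(double m[ROWS][COLS], double f) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] *= f;
}
```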
Security
Static Verifier
Stack Checking & Buffer Overflow
Detecting x87 FP Stack Corruption
Mudflap Support
Static Verifier
New in Intel C++ and Fortran Compilers version 10.0
Detects defects or questionable code for C, C++, Fortran & OpenMP*
- Can analyze mixed C/C++/Fortran applications
Multi-file analysis
Static Verifier analysis is done at compile/link time; it doesn't detect run-time errors, such as passing incorrect parameters to a function
Defects Detected
- Inconsistent object or function declarations in different parts of the application; also verifies function arguments
- Uninitialized variables
- Memory leaks & memory corruption
- Incorrect usage of pointers and allocatable arrays
- Incorrect OpenMP usage
Detecting Buffer Overflow
Buffer Overflow Example
$ cat Buffer_overflow.c
#include <string.h>
void example(char *s) {
    char buf[8];
    strcpy(buf, s);   /* no bounds check: overflows buf for long arguments */
}
int main(int argc, char **argv) {
    example(argv[1]);
    return 0;
}
$ icc Buffer_overflow.c
$ ./a.out AhhhBustMyBuffers
Segmentation fault
$ icc -fstack-security-check Buffer_overflow.c
$ ./a.out AhhhBustMeBuffers
Error: Buffer overrun occurred, forced exit
The compiler generates code to detect some buffer overflows that overwrite the return address.
Helps prevent commonly exploited security vulnerabilities
Compiler Options:
Linux* and Mac OS* X: icc -fstack-security-check
Windows*: icl /GS
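The stack check detects the overrun after it has happened. A complementary source-level fix is to bound the copy. The sketch below uses a hypothetical helper copy_bounded built on the standard snprintf, which never writes more than the given size and always terminates the string when the size is nonzero.

```c
#include <stdio.h>
#include <stddef.h>

/* Copies src into dst without writing more than dst_size bytes.
   Returns 1 if the whole string fit, 0 if it was truncated or an error occurred. */
int copy_bounded(char *dst, size_t dst_size, const char *src) {
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

In example(), strcpy(buf, s) would become copy_bounded(buf, sizeof buf, s), with the return value deciding how to handle over-long input.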
Improved Floating Point Model (C++)
-fp:source
- Rounding model: intermediate results in source precision; rounding after each operation
- Algebraic transform: use FP non-associative, non-distributive algebra
- FP env access: Disabled
- FP exception: Disabled
-fp:precise
- Rounding model: intermediate results evaluated at register precision; rounding at assignment, type casting, function call
- Algebraic transform: same as fp:source
- FP env access: Disabled
- FP exception: Disabled
-fp:fast
- Rounding model: intermediate result precision and rounding determined by the compiler
- Algebraic transform: use real algebra
- FP env access: Disabled
- FP exception: Disabled
-fp:strict
- Rounding model: same as fp:precise
- Algebraic transform: same as fp:source
- FP env access: Enabled
- FP exception: Enabled
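A short illustration of why "non-associative algebra" and "real algebra" can give different answers. The two hypothetical functions compute the same sum under real-number rules, but in IEEE double precision the grouping decides whether the small term survives. A mode that lets the compiler reassociate may turn one form into the other.

```c
/* (a + b) + c: with a = 1e20, b = -1e20 the large terms cancel first,
   so c is added to exactly zero. */
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

/* a + (b + c): c = 1.0 is far below the rounding granularity of -1e20,
   so b + c rounds back to b and c is lost. */
double sum_right(double a, double b, double c) {
    return a + (b + c);
}
```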
Quality
Scott Meyers "Effective C++" Diagnostics
Porting from 32 to 64-bits
Assisting Threaded App Development
Code Coverage and Test Prioritization
10.0: Better C++ Diagnostics (Effective C++)
Based on:
Effective C++, Second Edition: 50 Specific Ways to Improve Your Programs and Designs (Scott Meyers)
More Effective C++: 35 New Ways to Improve Your Programs and Designs (Scott Meyers)
Enabled via -Weffc++ (/Qeffc++)
Examples include:
Use const and inline rather than #define
Use <iostream> rather than <stdio.h>
Use new and delete rather than malloc and free
Use delete on pointer members in destructors (diagnoses any pointer that does not have a delete)
Have a user-defined copy constructor and assignment operator in classes containing pointers
Use initialization rather than assignment to members in constructors
etc.
Porting from 32 to 64 bit
Moving from 32 to 64 bit can result in porting errors
-Wp64: enables 64-bit porting diagnostics
Table of Programming Models
Operating System   IA-32   Intel 64   IA-64
Windows*           ILP32   P64        P64
Linux*             ILP32   LP64       LP64
Mac OS* 10         ILP32   LP64       N/A
Key
ILP32: Integer, long and pointers are 32 bit
P64: Integer & long 32 bit; pointers 64 bit
LP64: Integer 32 bit, long & pointers are 64 bit
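A sketch of the most common porting error behind this table, assuming a C99 <stdint.h> that provides uintptr_t. Code that stores a pointer in an int works under ILP32 but loses the upper half of the address under P64 and LP64, and storing it in a long is still wrong under P64. The helper names are hypothetical.

```c
#include <stdint.h>

/* Unsafe on 64-bit targets (shown only as a comment):
     int handle = (int)ptr;      int is 32 bits in ILP32, P64 and LP64
     long handle = (long)ptr;    long is only 32 bits under P64            */

/* uintptr_t is an integer type wide enough to hold a pointer in every model above. */
uintptr_t to_handle(const void *p) {
    return (uintptr_t)p;
}

void *from_handle(uintptr_t h) {
    return (void *)h;
}
```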
Threading Legacy Applications
Compiler Global Variable Accesses Diagnostic
Problem: threading legacy code that contains a large number of global variables. Need to protect access to globals throughout the application.
The Intel C++ Compiler has compile-time diagnostics to identify when global variables are accessed, available since the 9.0 release (2005)
Linux* / Mac OS*: Enabled via -ww1710,1711,1712 -fsyntax-only
Windows*: /Qww1710,1711,1712 /Zs
Can enable each diagnostic separately:
1710 warns about reference to statically allocated variables
1711 warns about assignment to statically allocated variables
1712 warns about address taken of statically allocated variables
Threading Legacy Applications
Identifying Global Variable Accesses
$ cat a.cpp
1: static int x;
2: void foo(int *);
3: void funcx(void){
4: int y;
5: x=2;
6: y=x;
7: foo(&x);
8: }
9:
10: extern int q;
11: int p;
12: void funcy(void) {
13: q=10;
14: p=5;
15: }
$ icc -ww1710,1711,1712 a.cpp
a.cpp(5): warning #1711: assignment to statically allocated variable "x"
x=2;
a.cpp(6): warning #1710: reference to statically allocated variable "x"
y=x;
a.cpp(7): warning #1712: address taken of statically allocated variable "x"
foo(&x);
a.cpp(13): warning #1711: assignment to statically allocated variable "q"
q=10;
a.cpp(14): warning #1711: assignment to statically allocated variable "p"
p=5;
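Once the diagnostics have located the accesses, each one needs protection. One possible approach, sketched here in C11 rather than the slide's C++ and assuming <stdatomic.h> is available, is to make the variable atomic and route the flagged reads and writes through small accessors. The accessor names are hypothetical.

```c
#include <stdatomic.h>

/* Was: static int x;  Every access now goes through an atomic operation. */
static atomic_int x;

/* Replaces the assignment flagged by #1711. */
void set_x(int value) {
    atomic_store(&x, value);
}

/* Replaces the reference flagged by #1710. */
int get_x(void) {
    return atomic_load(&x);
}

/* Read-modify-write as one indivisible step; returns the new value. */
int add_x(int delta) {
    return atomic_fetch_add(&x, delta) + delta;
}
```

Atomics only cover accesses to a single variable. Where several globals must change together, a lock around the whole update is the usual alternative.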
Intel Code Coverage Tool
Example of code coverage summary for a project. The workload applied in this test exercised 34 of 143 blocks, representing 5 of 19 functions in 2 of 3 modules. In the file SAMPLE.C, 4 of 5 functions were exercised.
Clicking on SAMPLE.C produces a listing that highlights the code that was exercised. In this example, the pink-highlighted code was never exercised, the yellow was run but not exercised by any of the tests set up by the developer, and the beige was partially covered.
Intel Test Prioritization Tool
Helps guide and speed software testing
Helps produce better code more quickly
Helps improve programmer productivity
Example:
These 3 tests achieve 52.17% block and 50.00% function coverage
Test 3 alone covers 45.65% of basic blocks, or 87.50% of the total block coverage from all tests
By adding Test 2, cumulative block coverage goes to 52.17%, or 100% of the total block coverage of Test 1, Test 2, and Test 3
Eliminating Test 1 has no negative impact on block coverage and saves time
Number of Tests   %RatCvrg   %BlkCvrg   %FuncCvrg   Test Names @ Options
1                 87.50      45.65      37.50       Test3.dpi
2                 100.00     52.17      50.00       Test2.dpi
Total Number of Tests = 3
Total Block Coverage ~ 52.17%
Total Function Coverage ~ 50.00%
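The selection shown in the example can be approximated with a greedy ordering: keep picking the test that adds the most blocks not yet covered, and stop when no test adds anything. This is only a sketch of that idea, not a description of how the Intel tool works internally. The function names and the bitmask representation (one bit per basic block, at most 32 blocks) are assumptions.

```c
#include <stdint.h>

/* Counts set bits, i.e. the number of basic blocks in a coverage mask. */
static int count_blocks(uint32_t mask) {
    int count = 0;
    while (mask != 0) {
        mask &= mask - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* coverage[i] is the set of blocks hit by test i.
   Writes the chosen test indices to order[] (room for ntests entries)
   and returns how many tests are needed to reach the combined coverage. */
int prioritize(const uint32_t *coverage, int ntests, int *order) {
    uint32_t covered = 0;
    int picked = 0;
    for (;;) {
        int best = -1;
        int best_gain = 0;
        for (int i = 0; i < ntests; i++) {
            int gain = count_blocks(coverage[i] & ~covered);
            if (gain > best_gain) {
                best_gain = gain;
                best = i;
            }
        }
        if (best < 0) {
            break;          /* no remaining test adds new blocks */
        }
        covered |= coverage[best];
        order[picked++] = best;
    }
    return picked;
}
```

With three tests where the first adds nothing beyond the other two, the function returns 2 and leaves the first test out, mirroring the outcome on the slide.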
Compatibility
C++ Compatibility with Microsoft
Source & binary compatible with VC 2003 under /Qvc71
Source & binary compatible with VC 2005 under /Qvc8
Microsoft* & Intel OpenMP binaries are compatible.
Use the option
Support for Code Targeting AMD*
Goal: Competitive on AMD*; Best on Intel
Compilers and Libraries support AMD* Opteron* processor-based systems
Our analysis tools (Intel VTune Analyzer and Threading Tools) do NOT support AMD* processors
May use specific features present only on Intel processors
Intel Compilers and Performance Libraries offer leadership performance on Intel processors; competitive performance on AMD*.
Linux C/C++: Intel and GNU Compatibility History
Established C++ ABI Industry Group
Intel Compiler for Linux* Version 5.0.1
- C language binary compatibility, using glibc for C library
Versions 6.0 and 7.1
- C++ ABI compliant
- Subtle differences in ABI compliance with gcc prevent full binary compatibility
Version 8.0
- Matches gcc 3.2, 3.3, & 3.4 C++ ABI
- Full C++ binary interoperability
Version 8.1
- gcc binary compatibility is the default for gcc 3.2, 3.3, & 3.4
Version 9.0
- No g++ compatibility changes required, adds gcc 4.0 support
Version 9.1
- No g++ compatibility changes required, adds gcc 4.1 support
Version 10.0
- Requires g++ compatibility (removed Intel-provided C++ libraries), adds gcc 4.2 support
10.0 OS Support Matrix
Architectures: IA32, IPF, Intel 64
Linux Distros:
Red Hat EL3
SuSE SLES 10
SGI Propack v4.0
SGI Propack v5.0
Red Flag DC Server 5.0
Red Hat EL4
Red Hat Fedora Core 5
Turbo Linux 10
Mandriva/Mandrake 10.1
Red Hat Fedora Core 4
Haansoft Linux 2006 Server
Miracle Linux v4.0
SuSE SLES9
Additional Resources
Intel C++ Compiler for Linux* product website
http://www.intel.com/software/products/compilers/clin
Active User Forum http://softwarecommunity.intel.com/isn/Community/en-US/forums/1016/ShowForum.aspx
C++ Compiler White Papers at http://www3.intel.com/cd/software/products/asmo-na/eng/278608.htm
Useful White Papers
Quick Reference Guide White Paper - http://cache-www.intel.com/cd/00/00/22/23/222300_222300.pdf
Optimization Guide White Paper - http://cache-
gcc/g++ compatibility White Paper - http://cache-
Code Coverage White Paper - http://cache-www.intel.com/cd/00/00/21/92/219280_compiler_code-coverage.pdf
Security White Paper - http://cache-
