Sie sind auf Seite 1von 120

GPU Tracking Activities at Jlich

PANDA Collaboration Meeting 3-2013 (Bochum)


11. September 2013 | Andreas Herten
Mitglied der Helmholtz-Gemeinschaft

Outline
Introduction & Status of Algorithms
Hough Transformation GPU Riemann Track Finder GPU Triplet Finder

Summaries

Mitglied der Helmholtz-Gemeinschaft

Mitglied der Helmholtz-Gemeinschaft

A quick reminder

GPUS

Mitglied der Helmholtz-Gemeinschaft

Bene ts of GPUs

Bene ts of GPUs
Extension card to PC
Although stand-alone GPU boards are coming up

Mitglied der Helmholtz-Gemeinschaft

Bene ts of GPUs
Extension card to PC
Although stand-alone GPU boards are coming up

Many computing cores Run in parallel Speed

ALU Control ALU

ALU ALU

Cache

CPU

DRAM

GPU

DRAM

Mitglied der Helmholtz-Gemeinschaft

Bene ts of GPUs
Extension card to PC
Although stand-alone GPU boards are coming up

Many computing cores Run in parallel Speed Active hardware (& software) development
Mitglied der Helmholtz-Gemeinschaft

ALU Control ALU

ALU ALU

Cache

CPU

DRAM

GPU

DRAM

Bene ts of GPUs
Extension card to PC
Although stand-alone GPU boards are coming up

Many computing cores Run in parallel Speed Active hardware (& software) development CUDA C
Mitglied der Helmholtz-Gemeinschaft

ALU Control ALU

ALU ALU

Cache

CPU

DRAM

GPU

DRAM

C/C++ extension Mixed host/device code OpenACC 2.0 parallelizer

__global__, __device__ Thrust, cuBLAS #pragma acc kernels


4

Mitglied der Helmholtz-Gemeinschaft

ALGORITHMS
Hough Transform

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Algorithm

Hough Transform Algorithm


1. Conformal Mapping

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Algorithm


1. Conformal Mapping 2. Hit point i (xi,yi,i) Line parameter rij
j (0,360] of desired granularity j

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Algorithm


1. Conformal Mapping 2. Hit point i (xi,yi,i) Line parameter rij
j (0,360] of desired granularity j
rij = cosj xi + sinj yi + i

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Algorithm


1. Conformal Mapping 2. Hit point i (xi,yi,i) Line parameter rij
j (0,360] of desired granularity j
rij = cosj xi + sinj yi + i

3. Fill rij into histogram

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Algorithm


1. Conformal Mapping 2. Hit point i (xi,yi,i) Line parameter rij
j (0,360] of desired granularity j
rij = cosj xi + sinj yi + i

3. Fill rij into histogram 4. Find peak

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Algorithm


1. Conformal Mapping 2. Hit point i (xi,yi,i) Line parameter rij
j (0,360] of desired granularity j
rij = cosj xi + sinj yi + i

3. Fill rij into histogram 4. Find peak 5. Extract track parameters (rj,j)

Mitglied der Helmholtz-Gemeinschaft

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

x*
(r, )
1

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

r
(r, )
1

x*

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

r
(r, )
1

x*

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

r
(r,
) (r,
2

x*
)
1

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

x*

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

x*

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

x*

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

x*

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

x*

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Principle


y*
rij = cosj xi + sinj yi + i

x*

Mitglied der Helmholtz-Gemeinschaft

Bin with highest multiplicity gives track parameters

Hough Transform Example


Implementation in Thrust
r Hough transformed
0.06 10 Entries 324000 Mean x 90 9 Mean y 0.02791 RMS x 51.96 8 RMS y 0.02133 7 6 0.02 5 0 4 3 -0.02
Mitglied der Helmholtz-Gemeinschaft

0.04

2 1

-0.04 0

20

40

60

80

100

120

140

160 180 Angle /

PANDA STT
180 x 180 Grid
8

Hough Transform Example


Implementation in Thrust
r Hough transformed
0.06 10 Entries 324000 Mean x 90 9 Mean y 0.02791 RMS x 51.96 8 RMS y 0.02133 7 6 0.02 5 0 4 3 -0.02
Mitglied der Helmholtz-Gemeinschaft

0.04

2 1

-0.04 0

20

40

60

80

100

120

140

160 180 Angle /

PANDA STT
180 x 180 Grid
8

Hough Transform Example


Implementation in Thrust
r Hough transformed
0.6 0.5 0.4 0.3 0.2 0.1 0 -0.1 -0.2
Mitglied der Helmholtz-Gemeinschaft

68 (x,y) 0 points
Entries Mean x Mean y RMS x RMS y 2.2356e+08 25 90 0.02905 51.96 0.1063 20

15

10

-0.3 -0.4 0 20 40 60 80 100 120 140 160 180 Angle / 0

PANDA STT+MVD
1800 x 1800 Grid
8

Hough Transform Example


Implementation in Thrust
r Hough transformed
0.6 0.5 0.4 0.3 0.2 0.1 0 -0.1 -0.2
Mitglied der Helmholtz-Gemeinschaft

68 (x,y) 0 points
Entries Mean x Mean y RMS x RMS y 2.2356e+08 25 90 0.02905 51.96 0.1063 20

15

10

-0.3 -0.4 0 20 40 60 80 100 120 140 160 180 Angle / 0

PANDA STT+MVD
1800 x 1800 Grid
8

Hough Transform Example


Implementation in Thrust
r Hough transformed
0.6 0.5 0.4 0.3 0.2 0.1 0 -0.1 -0.2
Mitglied der Helmholtz-Gemeinschaft

68 (x,y) 0 points
Entries Mean x Mean y RMS x RMS y 2.2356e+08 25 90 0.02905 51.96 0.1063 20

15

10

-0.3 -0.4 0 20 40 60 80 100 120 140 160 180 Angle / 0

PANDA STT+MVD
1800 x 1800 Grid
8

Hough Transform Example


Implementation in Thrust
r Hough transformed
0.6 0.5 0.4 0.3 0.2 0.1 0 -0.1 -0.2
Mitglied der Helmholtz-Gemeinschaft

68 (x,y) 0 points
Entries Mean x Mean y RMS x RMS y 2.2356e+08 25 90 0.02905 51.96 0.1063 20

15

10

-0.3 -0.4 0 20 40 60 80 100 120 140 160 180 Angle / 0

PANDA STT+MVD
1800 x 1800 Grid
8

Hough Transform Numbers


Thrust: < 3 ms/event Plain CUDA: 0.5 ms/event

Andrew Adinetz, NVIDIA Application Lab

Mitglied der Helmholtz-Gemeinschaft

Hough Transform Notes


Thrust: < 3 ms/event
Running code Independent of granularity (occupancy; see below) Parallel
Parallel for all hits of one event Not parallel for all events

Reduced problem(s) to set of standard routines (~STL)


Fast (uses clever pre-made algorithms) In exible (has its limits, hard to customize)
Mitglied der Helmholtz-Gemeinschaft

No peak nding included


Even possible? Adds to time!
10

Hough Transform Notes


Thrust: < 3 ms/event Plain CUDA: 0.5 ms/event
Running code Parallel
Fully parallel Coarser grid faster computing times

Built completely from scratch


Customizable / tting to every problem A bit more complicated at parts
Mitglied der Helmholtz-Gemeinschaft

Simple peak nder implemented (threshold)

11

Hough Transform Summary


Running code from Andrew Adinetz & me Big issue: Multipeak nder

Mitglied der Helmholtz-Gemeinschaft

12

Hough Transform Summary


Running code from Andrew Adinetz & me Big issue: Multipeak nder
To Dos / Plans Advantages
of HT algorithm

Draw Backs

Problems Challenges

Parallelism beyond
events Thrust Isochrones (fully) Integrate to PandaRoot Time-based

Easy algorithm With grid

Stuck in
Thrust infrastructure

Multipeak
nder/ Aliasing
(Algorithm out of image processing usually used to detect continuous lines)

Mitglied der Helmholtz-Gemeinschaft

granularity parallelism increases Flexibility of HTed Plain Isochrones (fully) Integrate to PandaRoot equation CUDA Time-based

12

Hough Transform Summary


Running code from Andrew Adinetz & me Big issue: Multipeak nder
To Dos / Plans Advantages
of HT algorithm

Draw Backs

Problems Challenges

Parallelism beyond
events Thrust Isochrones (fully) Integrate to PandaRoot Time-based

Easy algorithm With grid

Stuck in
Thrust infrastructure

Multipeak
nder/ Aliasing
(Algorithm out of image processing usually used to detect continuous lines)

Mitglied der Helmholtz-Gemeinschaft

granularity parallelism increases Flexibility of HTed Plain Isochrones (fully) Integrate to PandaRoot equation CUDA Time-based

Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/ GpuHoughTransform https://github.com/AndiH/CUDA/

12

Mitglied der Helmholtz-Gemeinschaft

ALGORITHMS
Riemann Track Finder

13

Mitglied der Helmholtz-Gemeinschaft

Riemann Algorithm

14

Riemann Algorithm
1

Create triplet of MVD hit points


All possible three hit combinations need to become triplets

Mitglied der Helmholtz-Gemeinschaft

14

Riemann Algorithm
1

Create triplet of MVD hit points


All possible three hit combinations need to become triplets

Grow triplets to tracks: Continuously test next hit if it ts to triplet track


Circle t in (x,y) plane
Project hit points of triplet to Riemann paraboloid Create plane (track) Test closeness of new hit: good add hit; bad dismiss hit

Mitglied der Helmholtz-Gemeinschaft

Continue with next hit

Helix t: arc length s vs. z position


14

1 Triplets Riemann Algorithm 1

Mitglied der Helmholtz-Gemeinschaft

15

1 Triplets Riemann Algorithm 1

Mitglied der Helmholtz-Gemeinschaft

15

1 Triplets Riemann Algorithm 1

3 11 21 31

Mitglied der Helmholtz-Gemeinschaft

15

1 Triplets Riemann Algorithm 1

3 11 11 2 21 31 31 41

Mitglied der Helmholtz-Gemeinschaft

15

1 Triplets Riemann Algorithm 1

3 11 11 2 11 21 31 31 31 41 32

Mitglied der Helmholtz-Gemeinschaft

15

1 Triplets Riemann Algorithm 1

3 11 11 2 11 21 31 31 31 41 32

Mitglied der Helmholtz-Gemeinschaft

15

Mitglied der Helmholtz-Gemeinschaft

2 Expansion Riemann Algorithm 1

16

2 Expansion Riemann Algorithm 1

x
x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

16

2 Expansion Riemann Algorithm 1

x x
x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

2 Expansion Riemann Algorithm 1

x x
x x

x x

Mitglied der Helmholtz-Gemeinschaft

Expand to z

Riemann Surface (paraboloid)

16

Riemann Algorithm Triplet Generation


CPU

Three loops to generate triplets serially


for (int i = 0; i < hitsInLayerOne.size(); i++) { for (int j = 0; j < hitsInLayerTwo.size(); j++) { for (int k = 0; k < hitsInLayerThree.size(); k++) { /* Triplet Generation */ } } }

Mitglied der Helmholtz-Gemeinschaft

17

Riemann Algorithm Triplet Generation


CPU GPU

Three loops to generate triplets serially


for (int i = 0; i < hitsInLayerOne.size(); i++) { for (int j = 0; j < hitsInLayerTwo.size(); j++) { for (int k = 0; k < hitsInLayerThree.size(); k++) { /* Triplet Generation */ } } }

Loops are not good parallelizable! Needed: Mapping of inherent GPU indexing variable to triplet index
int ijk = threadIdx.x + blockIdx.x * blockDim.x;

Mitglied der Helmholtz-Gemeinschaft

17

Riemann Algorithm GPU Version


Triplet generation: transition via equations
for () {for () {for () {}}} int ijk = threadIdx.x + blockIdx.x * blockDim.x;

nLayerx =

8x + 1 1 2 p p p 3 3 243x2 1 + 27x 1 p pos(nLayerx ) = +p p p 2 / 3 3 3 3 3 3 243x2

1 p

1 + 27x

Work by Jonathan Timcheck


RISE summer student of Ohio State University at FZJ
Mitglied der Helmholtz-Gemeinschaft

18

Mitglied der Helmholtz-Gemeinschaft

GPU Riemann Algorithm

19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet

Mitglied der Helmholtz-Gemeinschaft

19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet

Mitglied der Helmholtz-Gemeinschaft

19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet

Mitglied der Helmholtz-Gemeinschaft

19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet

Mitglied der Helmholtz-Gemeinschaft

19

GPU Riemann Algorithm


Triplet generation
Layer triplet: Combinations of three layers, where each layer has at least one hit in it. Hit triplet: Zoom in to layer triplets. All three hit combinations of succeeding layers.

Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet

Mitglied der Helmholtz-Gemeinschaft

19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet

Mitglied der Helmholtz-Gemeinschaft

19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet List of hit triplets as seeds

Mitglied der Helmholtz-Gemeinschaft

19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet List of hit triplets as seeds

Expand triplets to tracks

Mitglied der Helmholtz-Gemeinschaft

19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet List of hit triplets as seeds

Expand triplets to tracks


Add hit one by one if quality criteria are passed (from new layer; distance to Riemann plane; s-z quality)
19

Mitglied der Helmholtz-Gemeinschaft

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet List of hit triplets as seeds

Expand triplets to tracks


Add hit one by one if quality criteria are passed (from new layer; distance to Riemann plane; s-z quality)
19

Mitglied der Helmholtz-Gemeinschaft

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet List of hit triplets as seeds

Expand triplets to tracks


Add hit one by one if quality criteria are passed (from new layer; distance to Riemann plane; s-z quality)

Mitglied der Helmholtz-Gemeinschaft

Save successfully found track


19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread 1 thread = 1 layer triplet creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet List of hit triplets as seeds

Expand triplets to tracks


Add hit one by one if quality criteria are passed (from new layer; distance to Riemann plane; s-z quality)

Mitglied der Helmholtz-Gemeinschaft

Save successfully found track


19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread 1 thread = 1 layer triplet creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet List of hit triplets as seeds

Expand triplets to tracks


Add hit one by one if quality criteria are passed (from new layer; distance to Riemann plane; s-z quality)

1 thread = 1 hit triplet

Mitglied der Helmholtz-Gemeinschaft

Save successfully found track


19

GPU Riemann Algorithm


Triplet generation
Start with layer triplet: Each thread 1 thread = 1 layer triplet creates unique layer triplet Grow to hit triplet: Each thread expands layer triplet to unique hit triplet List of hit triplets as seeds

Expand triplets to tracks

1 thread = 1 hit triplet

Mitglied der Helmholtz-Gemeinschaft

Add hit one by one if quality criteria are passed 1 thread = 1 hit triplet seed (from new layer; distance to Riemann plane; s-z quality)

Save successfully found track


19

GPU Riemann Algorithm Performance


Eciency & Ghost ratio
Eciency | Ghost Ratio Dist. 0.06 Cut 0.05 Param 0.04 0.575 | 0.205 0.562 | 0.205 0.541 | 0.204 0.04 s-z Chi Square Cut Parameter 0.601 | 0.255 0.589 | 0.254 0.566 | 0.254 0.05 0.623 | 0.306 0.612 | 0.304 0.588 | 0.303 0.06 4.0 3.0 2.0 0.877 | 4.652 0.881 | 4.032 0.881 | 3.360 0.5 0.902 | 7.621 0.910 | 6.730 0.914 | 5.708 1.0 0.897 | 10.08 0.913 | 9.030 0.918 | 7.793 1.5

CPU Riemann Cuts

Benchmark Extreme Cuts

Mitglied der Helmholtz-Gemeinschaft

20

GPU Riemann Algorithm Performance


Eciency & Ghost ratio
Eciency | Ghost Ratio Dist. 0.06 Cut 0.05 Param 0.04 0.575 | 0.205 0.562 | 0.205 0.541 | 0.204 0.04 s-z Chi Square Cut Parameter 0.601 | 0.255 0.589 | 0.254 0.566 | 0.254 0.05 0.623 | 0.306 0.612 | 0.304 0.588 | 0.303 0.06 4.0 3.0 2.0 0.877 | 4.652 0.881 | 4.032 0.881 | 3.360 0.5 0.902 | 7.621 0.910 | 6.730 0.914 | 5.708 1.0 0.897 | 10.08 0.913 | 9.030 0.918 | 7.793 1.5

CPU Riemann Cuts

Benchmark Extreme Cuts

Time for one event (NV Pro ler; @Juhydra: Tesla K20X)
Time(%) 75.55% 5.96% 4.36% 4.26% 2.57% 2.44% 1.30% 1.11% 1.11% 0.89% 0.45% Time 439.49us 34.656us 25.344us 24.800us 14.976us 14.176us 7.5520us 6.4640us 6.4640us 5.1520us 2.6240us Calls 1 4 1 6 1 1 1 1 1 5 1 Avg 439.49us 8.6640us 25.344us 4.1330us 14.976us 14.176us 7.5520us 6.4640us 6.4640us 1.0300us 2.6240us Min 439.49us 2.3360us 25.344us 3.7760us 14.976us 14.176us 7.5520us 6.4640us 6.4640us 928ns 2.6240us Max 439.49us 22.432us 25.344us 5.3440us 14.976us 14.176us 7.5520us 6.4640us 6.4640us 1.3440us 2.6240us Name extend_cut_hit_triplets_k [CUDA memcpy DtoH] cut_hit_triplets_k [CUDA memset] generate_hit_triplet generate_layer_triplets void thrust void thrust void thrust [CUDA memcpy HtoD] project_onto_paraboloid_k
20

Mitglied der Helmholtz-Gemeinschaft

GPU Riemann Algorithm Summary


Running port of CPU Riemann to GPU (J. Timcheck)
+ Improvements / needed changes wrt to CPU version

Mitglied der Helmholtz-Gemeinschaft

21

GPU Riemann Algorithm Summary


Running port of CPU Riemann to GPU (J. Timcheck)
+ Improvements / needed changes wrt to CPU version
To Dos / Plans
of Riemann algorithm

Advantages

Draw Backs

Problems Challenges

Measurement
Uncertainties Cuts (Extension: Hit to close, Zero crossing) Parallelism (32 threads per seed, not 1) Track merger Integrate to PandaRoot
Mitglied der Helmholtz-Gemeinschaft

Secondaries Combinatorically Runs also only with explosive MVD Many combinations (if Basis for more
sophisticated algorithms Uncertainties Fast track tter (track parameters) used bluntly) - Esp. if used as nder - Pre-steps needed Jonathans internship is over

Include more
subdetectors Make timebased

21

GPU Riemann Algorithm Summary


Running port of CPU Riemann to GPU (J. Timcheck)
+ Improvements / needed changes wrt to CPU version
To Dos / Plans
of Riemann algorithm

Advantages

Draw Backs

Problems Challenges

Measurement
Uncertainties Cuts (Extension: Hit to close, Zero crossing) Parallelism (32 threads per seed, not 1) Track merger Integrate to PandaRoot
Mitglied der Helmholtz-Gemeinschaft

Secondaries Combinatorically Runs also only with explosive MVD Many combinations (if Basis for more
sophisticated algorithms Uncertainties Fast track tter (track parameters) used bluntly) - Esp. if used as nder - Pre-steps needed Jonathans internship is over

Include more
subdetectors Make timebased

Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/ GpuRiemann Extensive documentation at: http://panda-wiki.gsi.de/cgi-bin/view/Computing/ RiemannTrackFinder (+Summary of theory behind Riemann algorithm)
21

Mitglied der Helmholtz-Gemeinschaft

ALGORITHMS
Triplet Finder

22

Mitglied der Helmholtz-Gemeinschaft

Triplet Finder Method


STT

23

Mitglied der Helmholtz-Gemeinschaft

Triplet Finder Method


STT

23

Mitglied der Helmholtz-Gemeinschaft

Triplet Finder Method


STT

23

Mitglied der Helmholtz-Gemeinschaft

Triplet Finder Method


STT

23

Triplet Finder Method


STT hit in pivot straw
STT

Mitglied der Helmholtz-Gemeinschaft

23

Triplet Finder Method


STT hit in pivot straw Find surrounding hits Create virtual hit (triplet) at center of gravity (cog)
STT

Mitglied der Helmholtz-Gemeinschaft

23

Triplet Finder Method


STT hit in pivot straw Find surrounding hits Create virtual hit (triplet) at center of gravity (cog) Combine with
STT

Mitglied der Helmholtz-Gemeinschaft

23

Triplet Finder Method


STT hit in pivot straw Find surrounding hits Create virtual hit (triplet) at center of gravity (cog) Combine with
1.Second STT pivot-cog virtual hit
STT

Mitglied der Helmholtz-Gemeinschaft

23

Triplet Finder Method


STT hit in pivot straw Find surrounding hits Create virtual hit (triplet) at center of gravity (cog) Combine with
1.Second STT pivot-cog virtual hit
STT

Mitglied der Helmholtz-Gemeinschaft

23

Triplet Finder Method


STT hit in pivot straw Find surrounding hits Create virtual hit (triplet) at center of gravity (cog) Combine with
1.Second STT pivot-cog virtual hit
STT

Mitglied der Helmholtz-Gemeinschaft

23

Triplet Finder Method


STT hit in pivot straw Find surrounding hits Create virtual hit (triplet) at center of gravity (cog) Combine with
1.Second STT pivot-cog virtual hit 2.Interaction point
Interaction Point

STT

Mitglied der Helmholtz-Gemeinschaft

23

Triplet Finder Method


STT hit in pivot straw Find surrounding hits Create virtual hit (triplet) at center of gravity (cog) Combine with
1.Second STT pivot-cog virtual hit 2.Interaction point
STT

Mitglied der Helmholtz-Gemeinschaft

Calculate circle through three points

Interaction Point

23

Triplet Finder Method


STT hit in pivot straw Find surrounding hits Create virtual hit (triplet) at center of gravity (cog) Combine with
1.Second STT pivot-cog virtual hit 2.Interaction point
STT

Mitglied der Helmholtz-Gemeinschaft

Calculate circle through three points Track Candidate

Interaction Point

23

Triplet Finder Animation


Isochrone early Isochrone early & skewed Isochrone close Isochrone late MVD hit Triplet Track current Track timed out

Mitglied der Helmholtz-Gemeinschaft

24

Triplet Finder GPU Port


Original algorithm by Marius C. Mertens et al.
Implemented in PandaRoot (also using skewed straws) Reconstruction eciency & other quality criteria: see Marius talks on Triplet Finder (eg. last PANDA meeting) Features
Fast & robust algorithm No isochrones needed Many tuning possibilities

Ported to GPU by Andrew A. Adinetz (NVIDIA Application Lab)


Mitglied der Helmholtz-Gemeinschaft

Thrust, CUDA, some dynamic parallelism Supporting skewed straws Quality of results comparable to CPU version ( oat vs. double)
25

GPU Triplet Finder Speed


1 Burst

Processing Time
3,00

2,25

Time / s
Mitglied der Helmholtz-Gemeinschaft

1,50

0,75

0 500 9000 17500 26000 34500 43000 51500 60000 68500 77000 85500 94000 102500 111000 119500 128000 136500 145000

#Hits

26

GPU Triplet Finder Speed


1 Burst

Performance
0,500

0,375

Performance / Mhits/s
Mitglied der Helmholtz-Gemeinschaft

0,250

0,125

0 500 9000 17500 26000 34500 43000 51500 60000 68500 77000 85500 94000 102500 111000 119500 128000 136500 145000

#Hits

26

GPU Triplet Finder Speed


1 Burst

Performance
0,500

0,375

Algorithm: O(n O 2)n2 Bunching (look at subset of hits; slice all hits in pieces Burst)

Performance / Mhits/s
Mitglied der Helmholtz-Gemeinschaft

0,250

0,125

0 500 9000 17500 26000 34500 43000 51500 60000 68500 77000 85500 94000 102500 111000 119500 128000 136500 145000

#Hits

26

GPU Triplet Finder Summary


Running port of CPU Triplet Finder to GPU by Andrew Adinetz (NVIDIA Application Lab)

Mitglied der Helmholtz-Gemeinschaft

27

GPU Triplet Finder Summary


Running port of CPU Triplet Finder to GPU by Andrew Adinetz (NVIDIA Application Lab)
To Dos / Plans Advantages Draw Backs Problems Challenges

Bunching (As

exible wrapper, to be used for dierent algorithms) Integrate to PandaRoot

Fast No isochrones
needed Already built for time-based hits (no events)

Needs compute capability


3.5

Mitglied der Helmholtz-Gemeinschaft

27

GPU Triplet Finder Summary


Running port of CPU Triplet Finder to GPU by Andrew Adinetz (NVIDIA Application Lab)
To Dos / Plans Advantages Draw Backs Problems Challenges

Bunching (As

exible wrapper, to be used for dierent algorithms) Integrate to PandaRoot

Fast No isochrones
needed Already built for time-based hits (no events)

Needs compute capability


3.5

Mitglied der Helmholtz-Gemeinschaft

Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/ GpuTripletFinder

27

Mitglied der Helmholtz-Gemeinschaft

SUMMARY

28

Summary General
Three algorithms under GPU investigation
All are fast and GPU-suited Dierent advantages & disadvantages

Mitglied der Helmholtz-Gemeinschaft

Great help by Andrew Adinetz & others of NVIDIA Application Lab Not yet compared in terms of /purity/ghost/etc. Tobias talk @discussion round Not yet looked into GPUs beyond algorithmic stage (data distribution etc.)

29

Summary General
Three algorithms under GPU investigation
All are fast and GPU-suited Dierent advantages & disadvantages

Mitglied der Helmholtz-Gemeinschaft

Great help by Andrew Adinetz & others of NVIDIA Application Lab Not yet compared in terms of /purity/ghost/etc. Tobias talk @discussion round Not yet looked into GPUs beyond algorithmic stage (data distribution etc.)
GTLI (General Topics to Look Into) Bunching Time-based structure Track Merger Which algorithm for which sub-detector Algorithm interplay / hybridization Speci cally test of sub-parts of the algorithm run faster on CPUs (CPU-GPU re-distribution) Quality analytics
29

Summary Algorithm
10 Entries 324000 Mean x 90 9 Mean y 0.02791 RMS x 51.96 8 RMS y 0.02133

Algorithm

Note Fast but in exible

Biggest Challenge

Detectors used MVD STT MVD MVD GEM STT

7 Hough Transform 6 Thrust

Hough Transform 4 Plain CUDA


3 1 of dynamic (with parallelism) 2

Fast Promising initial implementation Fast, robust, scalable

No satisfying multipeak nder

100

120
y

140

160 180 Angle /

GPU1 Riemann Track Finder 0 GPU Triplet Finder

Testing, Combinatorics Manpower

x
x

Mitglied der Helmholtz-Gemeinschaft

GPU STT Cell Finder

see next talk by Tobias

30

Summary Algorithm
10 Entries 324000 Mean x 90 9 Mean y 0.02791 RMS x 51.96 8 RMS y 0.02133

Algorithm

Note Fast but in exible

Biggest Challenge

Detectors used MVD STT MVD MVD GEM STT

7 Hough Transform 6 Thrust

Hough Transform 4 Plain CUDA


3 1 of dynamic (with parallelism) 2

Fast Promising initial implementation Fast, robust, scalable

No satisfying multipeak nder

100

120
y

140

160 180 Angle /

GPU1 Riemann Track Finder 0 GPU Triplet Finder

Testing, Combinatorics Manpower

x
x

Mitglied der Helmholtz-Gemeinschaft

GPU STT Cell Finder

see next talk by Tobias

30

Summary Algorithm
10 Entries 324000 Mean x 90 9 Mean y 0.02791 RMS x 51.96 8 RMS y 0.02133

Algorithm

Note Fast but in exible

Biggest Challenge

Detectors used MVD STT MVD MVD GEM STT

7 Hough Transform 6 Thrust

Hough Transform 4 Plain CUDA


3 1 of dynamic (with parallelism) 2

Fast Promising initial implementation Fast, robust, scalable

No satisfying multipeak nder

100

120
y

140

160 180 Angle /

GPU1 Riemann Track Finder 0 GPU Triplet Finder

Testing, Combinatorics Manpower

x
x

Mitglied der Helmholtz-Gemeinschaft

GPU STT Cell Finder

see next talk by Tobias

! u o y k n Tha
rten Andreas He elich.de ju z f @ n e t r a.he
30

Resources Used in This Talk


[4] NVIDIA CARMA development kit [4] GFLOPS graph by NVIDIA (CUDA documentation) [24] Hit animation by Marius C. Mertens [26] Performance graphs by Andrew Adinetz The rest is mine

Mitglied der Helmholtz-Gemeinschaft

31

Mitglied der Helmholtz-Gemeinschaft

APPENDIX

32

GTLI (General Topics to Look Into)


Bunching Time-based structure Track Merger Which algorithm for which sub-detector Algorithm interplay / hybridization Speci cally test of sub-parts of the algorithm run faster on CPUs (CPU-GPU re-distribution) Quality analytics
Mitglied der Helmholtz-Gemeinschaft

33

Other Works NVIDIA App Lab


Hough transform (r >= 0)
HT size size (n (n x xn) n) threshold threshold # rec. tracks tracks # tracks # failed failed tracks # positives # false false positives % % succ succ % % fail fail % pos % false false pos time, s time, s time/event, s time/event, s

% fail % false pos time, s time/event, s

52,3% 44,8% 34,2% 24,0% 31,9% 14,8% 8,2% 4,3% 0,051526 0,159233 0,535796 1,907615 0,00001031 0,00003185 0,00010716 0,00038152

15,7% 0,6% 6,69 0,001338

L E R P
Mitglied der Helmholtz-Gemeinschaft

Hough transform in shared memory

R E T N I , Y R A N I T IM NO
D N A
256 512 1024 2048 4096 256 512 1024 2048 4096 256 256 256 512 1024 256 256 256 512 1024 14 14 13 11 10 14 14 13 11 10 6616 8218 11109 15246 16930 6616 8218 11109 15246 16930 13440 11838 8947 4810 3126 13440 11838 8947 4810 3126 5496 2925 1444 698 115 5496 2925 1444 698 115 33,0 41,0 55,4 76,0 84,4 33,0% % 41,0% % 55,4% % 76,0% % 84,4% % 67,0 59,0 44,6 24,0 15,6 67,0% % 59,0% % 44,6% % 24,0% % 15,6% % 27,4 14,6 7,2 3,5 0,6 27,4% % 14,6% % 7,2% % 3,5% % 0,6% % 0,030202 0,064211 0,150491 0,585119 2,536495 0,030202 0,064211 0,150491 0,585119 2,536495 6,0404E-06 0,0000301 0,000117024 0,00011702 0,000507299 0,0005073 6,0404E-06 0,00001284 1,28422E-05 3,00982E-05

256 512 1024 2048 4096 8192 256 512 1024 2048 4096 8192 14 14 13 11 10 8 14 14 13 11 10 6616 8218 11109 15246 16930 19088 6616 8218 11109 15246 16930 19088 13440 11838 8947 4810 3126 968 13440 11838 8947 4810 3126 968 5496 2925 1444 698 115 33 5496 2925 1444 698 115 33 33,0 41,0 55,4 76,0 84,4 95,2 33,0% % 41,0% % 55,4% % 76,0% % 84,4% % 95,2% % 67,0 59,0 44,6 24,0 15,6 4,8 67,0% % 59,0% % 44,6% % 24,0% % 15,6% % 4,8% % 27,4 14,6 7,2 3,5 0,6 0,2 27,4% % 14,6% % 7,2% % 3,5% % 0,6% % 0,2% % 0,047084 0,141719 0,491206 1,706892 6,018243 22,202198 0,047084 0,141719 0,491206 1,706892 6,018243 22,202198 9,4168E-06 0,00002834 9,82412E-05 0,00009824 0,000341378 0,00034138 0,001203649 0,00120365 0,00444044 9,4168E-06 2,83438E-05 0,00444044

L A N

HT size size (n (n x xn) n) threads/block # threads/block threshold threshold # rec. tracks tracks failed tracks # failed tracks # false positives positives succ % succ % fail % false pos false pos time, s s time/event, s time/event, s

T A H W

34

Other Works NVIDIA App Lab


Hough transform (r >= 0)
HT size size (n (n x xn) n) threshold threshold # rec. tracks tracks # tracks # failed failed tracks # positives # false false positives % % succ succ % % fail fail % pos % false false pos time, s time, s time/event, s time/event, s

% fail % false pos time, s time/event, s

52,3% 44,8% 34,2% 24,0% 31,9% 14,8% 8,2% 4,3% 0,051526 0,159233 0,535796 1,907615 0,00001031 0,00003185 0,00010716 0,00038152

15,7% 0,6% 6,69 0,001338

L E R P
Mitglied der Helmholtz-Gemeinschaft

Hough transform in shared memory

R E T N I , Y R A N I T IM NO
D N A
256 512 1024 2048 4096 256 512 1024 2048 4096 256 256 256 512 1024 256 256 256 512 1024 14 14 13 11 10 14 14 13 11 10 6616 8218 11109 15246 16930 6616 8218 11109 15246 16930 13440 11838 8947 4810 3126 13440 11838 8947 4810 3126 5496 2925 1444 698 115 5496 2925 1444 698 115 33,0 41,0 55,4 76,0 84,4 33,0% % 41,0% % 55,4% % 76,0% % 84,4% % 67,0 59,0 44,6 24,0 15,6 67,0% % 59,0% % 44,6% % 24,0% % 15,6% % 27,4 14,6 7,2 3,5 0,6 27,4% % 14,6% % 7,2% % 3,5% % 0,6% % 0,030202 0,064211 0,150491 0,585119 2,536495 0,030202 0,064211 0,150491 0,585119 2,536495 6,0404E-06 0,0000301 0,000117024 0,00011702 0,000507299 0,0005073 6,0404E-06 0,00001284 1,28422E-05 3,00982E-05

256 512 1024 2048 4096 8192 256 512 1024 2048 4096 8192 14 14 13 11 10 8 14 14 13 11 10 6616 8218 11109 15246 16930 19088 6616 8218 11109 15246 16930 19088 13440 11838 8947 4810 3126 968 13440 11838 8947 4810 3126 968 5496 2925 1444 698 115 33 5496 2925 1444 698 115 33 33,0 41,0 55,4 76,0 84,4 95,2 33,0% % 41,0% % 55,4% % 76,0% % 84,4% % 95,2% % 67,0 59,0 44,6 24,0 15,6 4,8 67,0% % 59,0% % 44,6% % 24,0% % 15,6% % 4,8% % 27,4 14,6 7,2 3,5 0,6 0,2 27,4% % 14,6% % 7,2% % 3,5% % 0,6% % 0,2% % 0,047084 0,141719 0,491206 1,706892 6,018243 22,202198 0,047084 0,141719 0,491206 1,706892 6,018243 22,202198 9,4168E-06 0,00002834 9,82412E-05 0,00009824 0,000341378 0,00034138 0,001203649 0,00120365 0,00444044 9,4168E-06 2,83438E-05 0,00444044

L A N

HT size size (n (n x xn) n) threads/block # threads/block threshold threshold # rec. tracks tracks failed tracks # failed tracks # false positives positives succ % succ % fail % false pos false pos time, s s time/event, s time/event, s

T A H W

34

Other Works NVIDIA App Lab


Hough transform (r >= 0)
HT size size (n (n x xn) n) threshold threshold # rec. tracks tracks # tracks # failed failed tracks # positives # false false positives % % succ succ % % fail fail % pos % false false pos time, s time, s time/event, s time/event, s

% fail % false pos time, s time/event, s

52,3% 44,8% 34,2% 24,0% 31,9% 14,8% 8,2% 4,3% 0,051526 0,159233 0,535796 1,907615 0,00001031 0,00003185 0,00010716 0,00038152

15,7% 0,6% 6,69 0,001338

L E R P
Mitglied der Helmholtz-Gemeinschaft

Hough transform in shared memory

R E T N I , Y R A N I T IM NO
D N A
256 512 1024 2048 4096 256 512 1024 2048 4096 256 256 256 512 1024 256 256 256 512 1024 14 14 13 11 10 14 14 13 11 10 6616 8218 11109 15246 16930 6616 8218 11109 15246 16930 13440 11838 8947 4810 3126 13440 11838 8947 4810 3126 5496 2925 1444 698 115 5496 2925 1444 698 115 33,0 41,0 55,4 76,0 84,4 33,0% % 41,0% % 55,4% % 76,0% % 84,4% % 67,0 59,0 44,6 24,0 15,6 67,0% % 59,0% % 44,6% % 24,0% % 15,6% % 27,4 14,6 7,2 3,5 0,6 27,4% % 14,6% % 7,2% % 3,5% % 0,6% % 0,030202 0,064211 0,150491 0,585119 2,536495 0,030202 0,064211 0,150491 0,585119 2,536495 6,0404E-06 0,0000301 0,000117024 0,00011702 0,000507299 0,0005073 6,0404E-06 0,00001284 1,28422E-05 3,00982E-05

256 512 1024 2048 4096 8192 256 512 1024 2048 4096 8192 14 14 13 11 10 8 14 14 13 11 10 6616 8218 11109 15246 16930 19088 6616 8218 11109 15246 16930 19088 13440 11838 8947 4810 3126 968 13440 11838 8947 4810 3126 968 5496 2925 1444 698 115 33 5496 2925 1444 698 115 33 33,0 41,0 55,4 76,0 84,4 95,2 33,0% % 41,0% % 55,4% % 76,0% % 84,4% % 95,2% % 67,0 59,0 44,6 24,0 15,6 4,8 67,0% % 59,0% % 44,6% % 24,0% % 15,6% % 4,8% % 27,4 14,6 7,2 3,5 0,6 0,2 27,4% % 14,6% % 7,2% % 3,5% % 0,6% % 0,2% % 0,047084 0,141719 0,491206 1,706892 6,018243 22,202198 0,047084 0,141719 0,491206 1,706892 6,018243 22,202198 9,4168E-06 0,00002834 9,82412E-05 0,00009824 0,000341378 0,00034138 0,001203649 0,00120365 0,00444044 9,4168E-06 2,83438E-05 0,00444044

L A N

HT size size (n (n x xn) n) threads/block # threads/block threshold threshold # rec. tracks tracks failed tracks # failed tracks # false positives positives succ % succ % fail % false pos false pos time, s s time/event, s time/event, s

T A H W

34