Status of GPU Tracking Algorithms

IKP-1 Group Meeting Seminar
19. September 2013 | Andreas Herten
Member of the Helmholtz Association

PANDA & GPUS
A quick reminder

Motivation
• PANDA: Triggerless read out
– Many benchmark channels
– Background & signal similar
→ No hardware (L1/HLT) trigger

[Diagram: Rates (20 MEvents/s, 300 GB/s) → Reduction ~1/1000 (60 000 CPU cores; GPU Tracking) → Offline Storage (20 kEvents/s, <1 GB/s)]

Benefits of GPUs
• Extension card to PC
Although stand-alone GPU boards are coming up
• Many computing cores
Run in parallel → Speed
• Active hardware (& software) development
• CUDA C
– C/C++ extension: __global__, __device__
– Mixed host/device code: Thrust, cuBLAS
– OpenACC 2.0 parallelizer: #pragma acc kernels

[Figure: chip layout of CPU vs. GPU (Control, ALUs, Cache, DRAM)]

ALGORITHMS
Hough Transform

Hough Transform — Algorithm Recipe

1. Conformal mapping
2. Hit point i $(x_i, y_i, \rho_i)$ → line parameter $r_{ij}$
   $\alpha_j \in (0^\circ, 360^\circ]$ of desired granularity j
   $r_{ij} = \cos\alpha_j \cdot x_i + \sin\alpha_j \cdot y_i + \rho_i$
3. Fill $r_{ij}$ into histogram
4. Find peak
5. Extract track parameters $(r_j, \alpha_j)$
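Steps 2 to 5 of the recipe can be sketched on the host side as follows. This is a minimal illustration, not the Thrust or plain CUDA code of the talk: the names `Hit`, `Peak`, and `houghPeak`, the angle range [0°, 180°), and the fixed r range are assumptions made for the example, and the conformal mapping of step 1 is taken as already done.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One hit after conformal mapping: position (x, y) and offset term rho.
struct Hit {
    double x, y, rho;
};

// Result of the peak search: angle in degrees, r at the bin center, votes in the bin.
struct Peak {
    double alphaDeg, r;
    int votes;
};

// Hypothetical helper: fills an nAlpha x nR histogram with
// r_ij = cos(alpha_j) * x_i + sin(alpha_j) * y_i + rho_i
// and returns the bin with the highest multiplicity.
// Assumption: alpha_j covers [0 deg, 180 deg), r covers [rMin, rMax).
Peak houghPeak(const std::vector<Hit>& hits, int nAlpha, int nR,
               double rMin, double rMax) {
    assert(nAlpha > 0 && nR > 0 && rMax > rMin);
    const double pi = std::acos(-1.0);
    std::vector<int> hist(static_cast<std::size_t>(nAlpha) * nR, 0);

    // Steps 2 and 3: transform every hit for every angle and fill the histogram.
    for (const Hit& h : hits) {
        for (int j = 0; j < nAlpha; ++j) {
            const double alpha = pi * j / nAlpha;
            const double r = std::cos(alpha) * h.x + std::sin(alpha) * h.y + h.rho;
            const int bin = static_cast<int>(std::floor((r - rMin) / (rMax - rMin) * nR));
            if (bin >= 0 && bin < nR) {
                ++hist[static_cast<std::size_t>(j) * nR + bin];
            }
        }
    }

    // Steps 4 and 5: find the fullest bin and read off (r, alpha).
    Peak best{0.0, 0.0, 0};
    for (int j = 0; j < nAlpha; ++j) {
        for (int b = 0; b < nR; ++b) {
            const int votes = hist[static_cast<std::size_t>(j) * nR + b];
            if (votes > best.votes) {
                best = {180.0 * j / nAlpha, rMin + (b + 0.5) * (rMax - rMin) / nR, votes};
            }
        }
    }
    return best;
}
```

The two nested loops over hits and angles are independent of each other, which is what makes the transform step a natural candidate for one GPU thread per (hit, angle) pair. The single-maximum search at the end only finds one track.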

Hough Transform — Principle

[Figure: hit points in the (x, y) plane and, after conformal mapping, in the (x*, y*) plane. Each hit contributes a curve $r_{ij} = \cos\alpha_j \cdot x_i + \sin\alpha_j \cdot y_i + \rho_i$ in the (α, r) Hough space; line candidates are labeled (r, α)_1 and (r, α)_2.]
→ Bin with highest multiplicity gives track parameters

Hough Transform — Example
• Implementation in Thrust
[Figure: Hough transformed hits, r vs. angle α / ° (0° to 180°). PANDA STT, 180 x 180 grid, 324000 entries]
[Figure: Hough transformed hits, r vs. angle α / ° (0° to 180°). PANDA STT+MVD, 68 (x,y) points, 1800 x 1800 grid, 2.2356e+08 entries]

Hough Transform — Numbers
• Thrust: < 3 ms/event
• Plain CUDA: 0.5 ms/event
Andrew Adinetz, JSC NVIDIA Application Lab

Hough Transform — Notes
• Thrust: < 3 ms/event
– Running code
– Independent of α granularity (occupancy; see below)
– Parallel
  • Parallel for all hits of one event
  • Not parallel for all events
– Reduced problem(s) to set of standard routines (~STL)
  • Fast (uses clever pre-made algorithms)
  • Inflexible (has its limits, hard to customize)
– No peak finding included
  • Even possible?
  • Adds to time!
[Figure: HT histogram (HoughHist), r vs. α / ° for α from 70° to 105°, bin contents printed in the cells; 10800 entries]
• Plain CUDA: 0.5 ms/event
– Running code
– Parallel
  • Fully parallel
  • Coarser grid → faster computing times
– Built completely from scratch
  • Customizable / fitting to every problem
  • A bit more complicated at parts
– Simple peak finder implemented (threshold)

Hough Transform — Summary
• Running code from Andrew Adinetz & me
• Big issue: Multipeak finder

Advantages of HT algorithm
• Parallelism beyond events
• Easy algorithm
• With grid granularity parallelism increases
• Flexibility of HT'ed equation

To Dos / Plans (Thrust and Plain CUDA)
• Isochrones (fully)
• Integrate to PandaRoot
• Time-based

Draw Backs
• Stuck in Thrust infrastructure

Problems / Challenges
• Multipeak finder / Aliasing (Algorithm out of image processing – usually used to detect continuous lines)

• Code at:
– https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuHoughTransform
– https://github.com/AndiH/CUDA/

ALGORITHMS
Riemann Track Finder

Riemann Algorithm

1. Create triplets of MVD hit points
– All possible three-hit combinations need to become triplets
[Figure: hit points on MVD layers 1 to 5; hit combinations across layers are collected into triplets]

2. Grow triplets to tracks: continuously test whether the next hit fits the triplet track
– Use Riemann paraboloid to circle-fit the track
  • Test closeness of new hit: good → add hit; bad → dismiss hit
  • Continue with next hit
– Helix fit: arc length s vs. z position
[Figure: hits in the (x, y) plane are expanded to z′ onto the Riemann surface (paraboloid)]
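The closeness test on the Riemann surface can be illustrated with a small host-side sketch. It assumes the common convention z′ = x² + y² for the paraboloid: points on one circle in the (x, y) plane then lie in one plane after the lift, so the distance of a new lifted hit to the plane of a seed triplet measures how well it fits the circle. The names `liftToParaboloid` and `distanceToTripletPlane` are hypothetical and not taken from the talk's code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Lift a 2D hit (x, y) onto the paraboloid z' = x^2 + y^2 (assumed convention).
Vec3 liftToParaboloid(double x, double y) {
    return Vec3{{x, y, x * x + y * y}};
}

// Distance of the lifted point p from the plane through the lifted seed hits a, b, c.
double distanceToTripletPlane(const Vec3& a, const Vec3& b, const Vec3& c, const Vec3& p) {
    const Vec3 u{{b[0] - a[0], b[1] - a[1], b[2] - a[2]}};
    const Vec3 v{{c[0] - a[0], c[1] - a[1], c[2] - a[2]}};
    // Plane normal n = u x v.
    const Vec3 n{{u[1] * v[2] - u[2] * v[1],
                  u[2] * v[0] - u[0] * v[2],
                  u[0] * v[1] - u[1] * v[0]}};
    const double norm = std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
    assert(norm > 0.0);  // the three seed hits must not be collinear
    const double dot = n[0] * (p[0] - a[0]) + n[1] * (p[1] - a[1]) + n[2] * (p[2] - a[2]);
    return std::fabs(dot) / norm;
}
```

A hit would be added to the track if this distance stays below a cut value, and dismissed otherwise. The cut value itself is a tuning parameter and is not fixed by this sketch.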

Riemann Algorithm — Triplet Generation
CPU
• Three loops to generate triplets serially
    for (int i = 0; i < hitsInLayerOne.size(); i++) {
      for (int j = 0; j < hitsInLayerTwo.size(); j++) {
        for (int k = 0; k < hitsInLayerThree.size(); k++) {
          /* Triplet Generation */
        }
      }
    }

GPU
• Nested loops do not parallelize well!
• Needed: Mapping of inherent GPU indexing variable to triplet index
    int ijk = threadIdx.x + blockIdx.x * blockDim.x;

Riemann Algorithm — GPU Version
• Triplet generation: transition via equations
    for () {for () {for () {}}}  →  int ijk = threadIdx.x + blockIdx.x * blockDim.x;

$$\mathrm{nLayer}_x = \frac{\sqrt{8x+1}-1}{2}$$

$$\mathrm{pos}(\mathrm{nLayer}_x) = \frac{\sqrt[3]{\sqrt{3}\,\sqrt{243x^2-1}+27x}}{3^{2/3}} + \frac{1}{\sqrt[3]{3}\,\sqrt[3]{\sqrt{3}\,\sqrt{243x^2-1}+27x}} - 1$$

• Work by Jonathan Timcheck
– RISE summer student of Ohio State University at FZJ
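How a flat thread index can stand in for nested loops is sketched below in host-side C++. Two simplified cases are shown, both as assumptions for illustration rather than the equations used in the GPU port: `decodeTripletIndex` decodes a flat index for three independent loops by division and modulo, and `inverseTriangular` evaluates the square-root form of the inverse triangular number with an integer correction against floating-point rounding. In a CUDA kernel, the flat index would be the `ijk` computed from `threadIdx.x`, `blockIdx.x`, and `blockDim.x`.

```cpp
#include <cassert>
#include <cmath>

struct TripletIndex {
    int i, j, k;
};

// Decode a flat index into (i, j, k) in the same order three nested loops
// over n1, n2, n3 elements would visit them (k runs fastest).
TripletIndex decodeTripletIndex(long long ijk, int n1, int n2, int n3) {
    assert(n1 > 0 && n2 > 0 && n3 > 0);
    assert(ijk >= 0 && ijk < 1LL * n1 * n2 * n3);
    TripletIndex t;
    t.k = static_cast<int>(ijk % n3);
    t.j = static_cast<int>((ijk / n3) % n2);
    t.i = static_cast<int>(ijk / (1LL * n2 * n3));
    return t;
}

// Largest n with n(n+1)/2 <= x, starting from (sqrt(8x + 1) - 1) / 2.
long long inverseTriangular(long long x) {
    assert(x >= 0);
    long long n = static_cast<long long>((std::sqrt(8.0 * x + 1.0) - 1.0) / 2.0);
    while (n * (n + 1) / 2 > x) --n;         // correct a result rounded up
    while ((n + 1) * (n + 2) / 2 <= x) ++n;  // correct a result rounded down
    return n;
}
```

The triangular case is the building block for index spaces without repeated combinations, where the number of valid combinations grows like n(n+1)/2 instead of n².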

GPU Riemann Algorithm

• Triplet generation
– Start with unique layer triplets
  Layer triplet: combination of three layers, where each layer has at least one hit in it.
– Grow to unique hit triplets
  Hit triplet: zoom in to layer triplets. All three-hit combinations of succeeding layers.
→ List of hit triplets (= track seeds)
• Expand triplets to tracks
– Add hits one by one if quality criteria are passed
• Save successfully found track

GPU Riemann Algorithm — Performance
• Efficiency & Ghost ratio (each cell: Efficiency | Ghost Ratio)

CPU Riemann Cuts
s-z Chi Square Cut Param.   Dist. Cut Param. 0.06   0.05            0.04
0.04                        0.575 | 0.205           0.562 | 0.205   0.541 | 0.204
0.05                        0.601 | 0.255           0.589 | 0.254   0.566 | 0.254
0.06                        0.623 | 0.306           0.612 | 0.304   0.588 | 0.303

Benchmark Extreme Cuts
s-z Chi Square Cut Param.   Dist. Cut Param. 4.0    3.0             2.0
0.5                         0.877 | 4.652           0.881 | 4.032   0.881 | 3.360
1.0                         0.902 | 7.621           0.910 | 6.730   0.914 | 5.708
1.5                         0.897 | 10.08           0.913 | 9.030   0.918 | 7.793

• Time for one event (NV Profiler; @Juhydra: Tesla K20X)
Time(%)   Time       Calls  Avg        Min        Max        Name
75.55%    439.49us   1      439.49us   439.49us   439.49us   extend_cut_hit_triplets_k
5.96%     34.656us   4      8.6640us   2.3360us   22.432us   [CUDA memcpy DtoH]
4.36%     25.344us   1      25.344us   25.344us   25.344us   cut_hit_triplets_k
4.26%     24.800us   6      4.1330us   3.7760us   5.3440us   [CUDA memset]
2.57%     14.976us   1      14.976us   14.976us   14.976us   generate_hit_triplet
2.44%     14.176us   1      14.176us   14.176us   14.176us   generate_layer_triplets
1.30%     7.5520us   1      7.5520us   7.5520us   7.5520us   void thrust
1.11%     6.4640us   1      6.4640us   6.4640us   6.4640us   void thrust
1.11%     6.4640us   1      6.4640us   6.4640us   6.4640us   void thrust
0.89%     5.1520us   5      1.0300us   928ns      1.3440us   [CUDA memcpy HtoD]
0.45%     2.6240us   1      2.6240us   2.6240us   2.6240us   project_onto_paraboloid_k

GPU Riemann Algorithm — Summary
• Running port of CPU Riemann to GPU (J. Timcheck)
+ Improvements / needed changes with respect to the CPU version

Advantages of Riemann algorithm
• Secondaries
• Runs also only with MVD
• Basis for more sophisticated algorithms
• Uncertainties
• Fast track fitter (track parameters)

To Dos / Plans
• Measurement uncertainties
• Cuts (Extension: Hit too close, Zero crossing)
• Parallelism (32 threads per seed, not 1)
• Track merger
• Integrate to PandaRoot

Draw Backs
• Combinatorially explosive
  - Many combinations (if used bluntly)
  - Esp. if used as finder
  - Pre-steps needed
• Jonathan's internship is over…

Problems / Challenges
• Include more subdetectors
• Make time-based

• Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuRiemann
• Extensive documentation at: http://panda-wiki.gsi.de/cgi-bin/view/Computing/RiemannTrackFinder (+ Summary of theory behind Riemann algorithm)

ALGORITHMS
Triplet Finder

Triplet Finder — Method
• STT hit in pivot straw
• Find surrounding hits → Create virtual hit (triplet) at center of gravity (cog)
• Combine with
1. Second STT pivot-cog virtual hit
2. Interaction point
• Calculate circle through three points → Track Candidate
[Figure: STT cross section with pivot straws, virtual hits, the interaction point, and the circle through these three points]
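The last step, a circle through three points, has a closed-form solution. The sketch below uses the standard circumcircle formula; the type and function names are hypothetical, and the way the Triplet Finder code itself computes the circle is not shown in the slides.

```cpp
#include <cassert>
#include <cmath>

struct Point {
    double x, y;
};

struct Circle {
    double cx, cy, r;
};

// Circle through three non-collinear points (circumcircle formula).
// In the Triplet Finder setting the inputs would be two virtual hits
// and the interaction point.
Circle circleThrough(const Point& a, const Point& b, const Point& c) {
    const double d = 2.0 * (a.x * (b.y - c.y) + b.x * (c.y - a.y) + c.x * (a.y - b.y));
    assert(std::fabs(d) > 1e-12);  // collinear points define no circle
    const double a2 = a.x * a.x + a.y * a.y;
    const double b2 = b.x * b.x + b.y * b.y;
    const double c2 = c.x * c.x + c.y * c.y;
    const double cx = (a2 * (b.y - c.y) + b2 * (c.y - a.y) + c2 * (a.y - b.y)) / d;
    const double cy = (a2 * (c.x - b.x) + b2 * (a.x - c.x) + c2 * (b.x - a.x)) / d;
    return {cx, cy, std::hypot(a.x - cx, a.y - cy)};
}
```

For example, the interaction point at (0, 0) combined with virtual hits at (2, 0) and (1, 1) gives a circle centered at (1, 0) with radius 1.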

Triplet Finder — Animation
[Animation, legend: Isochrone early; Isochrone early & skewed; Isochrone close; Isochrone late; MVD hit; Triplet; Track current; Track timed out]

Triplet Finder — GPU Port
• Original algorithm by Marius C. Mertens et al.
– Implemented in PandaRoot (also using skewed straws)
– Reconstruction efficiency & other quality criteria: see Marius' talks on Triplet Finder (e.g. last PANDA meeting)
– Features
  • Fast & robust algorithm
  • No isochrones needed
  • Many tuning possibilities
• Ported to GPU by Andrew A. Adinetz (NVIDIA Application Lab)
– Thrust, CUDA, some dynamic parallelism
– Supporting skewed straws
– Quality of results comparable to CPU version (float vs. double)

GPU Triplet Finder — Speed
[Figure: Processing time / s (0 to 3.00) vs. #Hits (500 to 145000), 1 burst]
[Figure: Performance / Mhits/s (0 to 0.500) vs. #Hits (500 to 145000), 1 burst]
Algorithm: O(n²) → Bunching (look at subset of hits; slice all hits in pieces → Burst)
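Why bunching helps an O(n²) algorithm can be seen from a simple count. The sketch below models the cost as n² pair tests, which is an assumption for illustration and not a measured cost model, and compares one large burst against slicing the same hits into bursts of fixed size. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Cost model (assumption): an O(n^2) algorithm performs n * n pair tests on n hits.
std::uint64_t pairTests(std::uint64_t n) {
    return n * n;
}

// Total pair tests when nHits are sliced into bursts of burstSize hits each,
// with one smaller burst for the remainder.
std::uint64_t bunchedPairTests(std::uint64_t nHits, std::uint64_t burstSize) {
    assert(burstSize > 0);
    const std::uint64_t fullBursts = nHits / burstSize;
    const std::uint64_t rest = nHits % burstSize;
    return fullBursts * pairTests(burstSize) + pairTests(rest);
}
```

In this model, 1000 hits in one burst cost 1 000 000 pair tests, while ten bursts of 100 hits cost 100 000. The sketch ignores tracks that cross burst boundaries, which a real bunching wrapper would have to handle.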

GPU Triplet Finder — Summary
• Running port of CPU Triplet Finder to GPU by Andrew Adinetz (NVIDIA Application Lab)

Advantages
• Fast
• No isochrones needed
• Already built for time-based hits (no events)

To Dos / Plans
• Bunching (as flexible wrapper, to be used for different algorithms)
• Integrate to PandaRoot

Draw Backs, Problems / Challenges
• Needs compute capability 3.5

• Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuTripletFinder

SUMMARY

Summary — General
• Three algorithms under GPU investigation
– All are fast and GPU-suited
– Different advantages & disadvantages
• Great help by Andrew Adinetz & others of NVIDIA Application Lab
• Not yet compared in terms of ε/purity/ghost/etc.
• Not yet looked into GPUs beyond algorithmic stage (data distribution etc.)

GTLI (General Topics to Look Into)
• Bunching
• Time-based structure
• Track Merger
• Which algorithm for which sub-detector
• Algorithm interplay / hybridization
• Specifically: test whether sub-parts of the algorithms run faster on CPUs (CPU-GPU re-distribution)
• Quality analytics

Summary — Algorithm
[Table with thumbnail figures; columns listed in slide order]
Algorithm: Hough Transform (Thrust); Hough Transform (Plain CUDA, with 1° of dynamic parallelism); GPU Riemann Track Finder; GPU Triplet Finder; GPU STT Cell Finder
Note: Fast but inflexible; Fast; Promising initial implementation; Fast, robust, scalable
Biggest Challenge: No satisfying multipeak finder; Testing, Combinatorics; Manpower; Future!
Detectors used: MVD, STT; MVD; MVD, GEM, STT

Thank you!
Andreas Herten
a.herten@fz-juelich.de

Resources Used in This Talk
• Slide 4 (Benefits of GPUs): NVIDIA CARMA development kit
• Slide 4 (Benefits of GPUs): GFLOPS graph by NVIDIA (CUDA documentation)
• Slide 15 (Riemann Algorithm, triplets): Explosion by Bohdan Burmich / Noun Project
• Slide 24 (Triplet Finder animation): Hit animation by Marius C. Mertens
• Slide 26 (GPU Triplet Finder speed): Performance graphs by Andrew Adinetz
• The rest is mine

APPENDIX

CPU & GPU

CPU
• Powerful
• Flexible

GPU
• Stupid
• Inflexible
• Poor power
• But massively parallelizable

GTLI (General Topics to Look Into)
• Bunching
• Time-based structure
• Track Merger
• Which algorithm for which sub-detector
• Algorithm interplay / hybridization
• Specifically: test whether sub-parts of the algorithms run faster on CPUs (CPU-GPU re-distribution)
• Quality analytics

Other Works — NVIDIA App Lab
Hough transform (r >= 0)
HT size (n x n)  threshold  # rec. tracks  # failed tracks  # false positives  % succ  % fail  % false pos  time, s    time/event, s
256              14         6616           13440            5496               33,0 %  67,0 %  27,4 %       0,047084   9,4168E-06
512              14         8218           11838            2925               41,0 %  59,0 %  14,6 %       0,141719   2,83438E-05
1024             13         11109          8947             1444               55,4 %  44,6 %  7,2 %        0,491206   9,82412E-05
2048             11         15246          4810             698                76,0 %  24,0 %  3,5 %        1,706892   0,000341378
4096             10         16930          3126             115                84,4 %  15,6 %  0,6 %        6,018243   0,001203649
8192             8          19088          968              33                 95,2 %  4,8 %   0,2 %        22,202198  0,00444044

Partially visible table (only these columns survive):

% fail  % false pos  time, s   time/event, s
52,3 %  31,9 %       0,051526  0,00001031
44,8 %  14,8 %       0,159233  0,00003185
34,2 %  8,2 %        0,535796  0,00010716
24,0 %  4,3 %        1,907615  0,00038152
15,7 %  0,6 %        6,69      0,001338

Hough transform in shared memory

HT size (n x n)  # threads/block  threshold  # rec. tracks  # failed tracks  # false positives  % succ  % fail  % false pos  time, s   time/event, s
256              256              14         6616           13440            5496               33,0 %  67,0 %  27,4 %       0,030202  6,0404E-06
512              256              14         8218           11838            2925               41,0 %  59,0 %  14,6 %       0,064211  1,28422E-05
1024             256              13         11109          8947             1444               55,4 %  44,6 %  7,2 %        0,150491  3,00982E-05
2048             512              11         15246          4810             698                76,0 %  24,0 %  3,5 %        0,585119  0,000117024
4096             1024             10         16930          3126             115                84,4 %  15,6 %  0,6 %        2,536495  0,000507299

[Overlay across the tables, only partly legible: "PRELIMINARY, INTER… AND … WHAT …"]
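The percentage and per-event columns follow from the raw counts and the total run time. The short Python sketch below recomputes them for the shared-memory rows as a consistency check. The function name `derived` and the dictionary keys are hypothetical, chosen only for this illustration. The event count is not stated on the slide; it is inferred here as the ratio of time to time/event, and the false-positive percentage is assumed to be relative to the total number of tracks (reconstructed plus failed).

```python
# Rows copied from the "Hough transform in shared memory" table:
# (HT size, # rec. tracks, # failed tracks, # false positives, time [s], time/event [s])
rows = [
    (256, 6616, 13440, 5496, 0.030202, 6.0404e-06),
    (512, 8218, 11838, 2925, 0.064211, 1.28422e-05),
    (1024, 11109, 8947, 1444, 0.150491, 3.00982e-05),
    (2048, 15246, 4810, 698, 0.585119, 0.000117024),
    (4096, 16930, 3126, 115, 2.536495, 0.000507299),
]


def derived(rec, failed, false_pos, time_s, time_per_event_s):
    """Recompute the derived columns from the raw counts and timings."""
    total = rec + failed  # assumption: total tracks = reconstructed + failed
    return {
        "total_tracks": total,
        "pct_succ": round(100 * rec / total, 1),
        "pct_fail": round(100 * failed / total, 1),
        "pct_false_pos": round(100 * false_pos / total, 1),
        # inferred, not stated on the slide: number of processed events
        "events": round(time_s / time_per_event_s),
    }


for size, *counts in rows:
    print(size, derived(*counts))
```

Under these assumptions every row reproduces the tabulated percentages, with the same total track count and the same inferred event count throughout.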
