Outline
Introduction & Status of Algorithms
Hough Transformation
GPU Riemann Track Finder
GPU Triplet Finder
Summaries
A quick reminder
GPUS
Benefits of GPUs
Extension card to PC (although stand-alone GPU boards are coming up)
Many computing cores
Run in parallel
Speed
Active hardware (& software) development
CUDA C
Member of the Helmholtz Association
[Figure: schematic comparison of a CPU and a GPU (ALUs, cache, DRAM)]
ALGORITHMS
Hough Transform
3. Fill r_ij into histogram
4. Find peak
5. Extract track parameters (r_j and the corresponding angle)
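Steps 3 to 5 can be sketched on the CPU as follows. This is a minimal illustration, not the code behind the talk: it assumes the textbook line parametrisation r = x cos(angle) + y sin(angle), and the names `Hit`, `HoughPeak` and `hough_peak` as well as the binning parameters are hypothetical. On a GPU, each (hit, angle) pair could be handled by its own thread instead of the two nested loops.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Hit { double x, y; };

// Peak of the Hough histogram: bin-centre values plus the number of votes.
struct HoughPeak { double r; double angle_deg; int votes; };

// Hypothetical helper: n_angle angle bins over [0, 180) degrees,
// n_r bins in r over [-r_max, r_max).
HoughPeak hough_peak(const std::vector<Hit>& hits, int n_angle, int n_r, double r_max) {
    assert(n_angle > 0 && n_r > 0 && r_max > 0.0);
    const double pi = std::acos(-1.0);
    const double bin_width = 2.0 * r_max / n_r;
    const std::size_t nr = static_cast<std::size_t>(n_r);
    std::vector<int> hist(static_cast<std::size_t>(n_angle) * nr, 0);

    // Step 3: fill r_ij for every hit i and every angle bin j.
    for (const Hit& h : hits) {
        for (int j = 0; j < n_angle; ++j) {
            const double angle = pi * j / n_angle;
            const double r = h.x * std::cos(angle) + h.y * std::sin(angle);
            const int bin = static_cast<int>(std::floor((r + r_max) / bin_width));
            if (bin < 0 || bin >= n_r) continue;  // outside the histogram range
            ++hist[static_cast<std::size_t>(j) * nr + static_cast<std::size_t>(bin)];
        }
    }

    // Step 4: find the fullest bin (the first one wins on ties).
    std::size_t best = 0;
    for (std::size_t b = 1; b < hist.size(); ++b)
        if (hist[b] > hist[best]) best = b;

    // Step 5: convert the bin index back to track parameters.
    const std::size_t j = best / nr;
    const std::size_t k = best % nr;
    return {-r_max + (static_cast<double>(k) + 0.5) * bin_width,
            180.0 * static_cast<double>(j) / n_angle,
            hist[best]};
}
```

The returned r is a bin centre, so its resolution is limited by the grid size, which is one reason the grid dimensions matter.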
[Figure: hit points x* and lines through them, each labelled by its (r, angle) pair]
[Figure: Hough space histogram, PANDA STT, 180 x 180 grid]
[Figure: Hough space histogram, PANDA STT+MVD, 68 (x,y) points, 1800 x 1800 grid; Entries 2.2356e+08, Mean x 90, Mean y 0.02905, RMS x 51.96, RMS y 0.1063]
Advantages:
Parallelism increases with granularity
Flexibility of HT'ed equation
Drawbacks:
Stuck in Thrust infrastructure
Multipeak finder / Aliasing
(Algorithm out of image processing, usually used to detect continuous lines)
Problems / Challenges:
Parallelism beyond events
Thrust to plain CUDA
Isochrones (fully)
Integrate to PandaRoot
Time-based
ALGORITHMS
Riemann Track Finder
Riemann Algorithm
[Figure sequence: step-by-step illustration of the Riemann algorithm; panel labels not recoverable]
Expand to z
Loops do not parallelize well! Needed: a mapping of the inherent GPU indexing variable to the triplet index
int ijk = threadIdx.x + blockIdx.x * blockDim.x;
nLayer_x = [closed-form expression in x involving a root of a term with 27x; not recoverable from this copy]
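One way to map a linear index to a unique triplet is sketched below. It uses the combinatorial number system (ijk = C(k,3) + C(j,2) + i with i < j < k), which is an assumption for illustration and not necessarily the closed form shown on the slide. `Triplet`, `unrank_triplet` and `triplet_count` are hypothetical names. In a kernel, the argument would be the `ijk` computed from `threadIdx`, `blockIdx` and `blockDim`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Triplet { std::uint64_t i, j, k; };  // always i < j < k

// Binomial coefficients C(n,2) and C(n,3).
inline std::uint64_t choose2(std::uint64_t n) { return n < 2 ? 0 : n * (n - 1) / 2; }
inline std::uint64_t choose3(std::uint64_t n) { return n < 3 ? 0 : n * (n - 1) * (n - 2) / 6; }

// Number of distinct triplets that can be formed from n_layers layers.
inline std::uint64_t triplet_count(std::uint64_t n_layers) { return choose3(n_layers); }

// Map a linear index ijk = 0, 1, 2, ... to a unique triplet without nested loops
// over all combinations.
Triplet unrank_triplet(std::uint64_t ijk) {
    // Largest k with C(k,3) <= ijk: floating-point estimate, then integer fix-up.
    std::uint64_t k = static_cast<std::uint64_t>(std::cbrt(6.0 * static_cast<double>(ijk))) + 2;
    while (choose3(k) > ijk) --k;
    while (choose3(k + 1) <= ijk) ++k;
    ijk -= choose3(k);

    // Largest j with C(j,2) <= remainder, same estimate-and-fix pattern.
    std::uint64_t j = static_cast<std::uint64_t>(std::sqrt(2.0 * static_cast<double>(ijk))) + 1;
    while (choose2(j) > ijk) --j;
    while (choose2(j + 1) <= ijk) ++j;
    ijk -= choose2(j);

    assert(ijk < j && j < k);  // what is left of the index is i
    return {ijk, j, k};
}
```

Threads with an index at or above `triplet_count(n_layers)` have no triplet to work on and would simply return early.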
Start with layer triplet: each thread creates a unique layer triplet
Grow to hit triplet: each thread expands its layer triplet to a unique hit triplet
Add hits one by one if the quality criteria are passed (hit from a new layer; distance to Riemann plane; s-z quality)
1 thread = 1 hit triplet seed
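The "distance to Riemann plane" criterion can be illustrated with a short sketch. It relies on the known property that hits on a common circle in the x-y plane, once lifted onto the paraboloid z = x^2 + y^2, lie in a common plane. All names (`Vec3`, `Plane`, `plane_through`, `distance_to_plane`) are hypothetical, and any cut value applied to the distance is left to the caller.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Lift a 2D hit onto the paraboloid z = x^2 + y^2.
inline Vec3 project_onto_paraboloid(double x, double y) { return {x, y, x * x + y * y}; }

inline Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Plane in the form n . p = c with a unit normal n.
struct Plane { Vec3 n; double c; };

// Plane through the three lifted hits of a seed triplet.
Plane plane_through(const Vec3& a, const Vec3& b, const Vec3& c) {
    Vec3 n = cross(sub(b, a), sub(c, a));
    const double len = std::sqrt(dot(n, n));
    assert(len > 0.0);  // fails only if two hits coincide
    n = {n.x / len, n.y / len, n.z / len};
    return {n, dot(n, a)};
}

// Distance of a further lifted hit to the plane; a small value means the hit
// is compatible with the circle defined by the seed.
double distance_to_plane(const Plane& p, const Vec3& q) { return std::fabs(dot(p.n, q) - p.c); }
```

For a seed taken from the circle of radius 5 around (3, 4), a fourth hit on that circle gives a distance of zero up to rounding, while the circle centre itself lies about 2.5 away from the plane.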
Time for one event (NV Profiler; @Juhydra: Tesla K20X)
Time(%)  Time      Calls  Avg       Min       Max       Name
75.55%   439.49us  1      439.49us  439.49us  439.49us  extend_cut_hit_triplets_k
 5.96%   34.656us  4      8.6640us  2.3360us  22.432us  [CUDA memcpy DtoH]
 4.36%   25.344us  1      25.344us  25.344us  25.344us  cut_hit_triplets_k
 4.26%   24.800us  6      4.1330us  3.7760us  5.3440us  [CUDA memset]
 2.57%   14.976us  1      14.976us  14.976us  14.976us  generate_hit_triplet
 2.44%   14.176us  1      14.176us  14.176us  14.176us  generate_layer_triplets
 1.30%   7.5520us  1      7.5520us  7.5520us  7.5520us  void thrust
 1.11%   6.4640us  1      6.4640us  6.4640us  6.4640us  void thrust
 1.11%   6.4640us  1      6.4640us  6.4640us  6.4640us  void thrust
 0.89%   5.1520us  5      1.0300us  928ns     1.3440us  [CUDA memcpy HtoD]
 0.45%   2.6240us  1      2.6240us  2.6240us  2.6240us  project_onto_paraboloid_k
Advantages:
Secondaries
Runs also only with MVD
Basis for more sophisticated algorithms
Fast track fitter (track parameters)
Drawbacks:
Combinatorially explosive
Many combinations (if used bluntly)
- Esp. if used as finder
- Pre-steps needed
Uncertainties
Problems / Challenges:
Measurement uncertainties
Cuts (Extension: hit too close, zero crossing)
Parallelism (32 threads per seed, not 1)
Track merger
Integrate to PandaRoot
Jonathan's internship is over
Include more subdetectors
Make time-based
Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuRiemann
Extensive documentation at: http://panda-wiki.gsi.de/cgi-bin/view/Computing/RiemannTrackFinder (+ summary of theory behind Riemann algorithm)
ALGORITHMS
Triplet Finder
[Figure sequence: step-by-step illustration of the Triplet Finder in the STT, with the interaction point marked]
Thrust, CUDA, some dynamic parallelism
Supporting skewed straws
Quality of results comparable to CPU version (float vs. double)
Processing Time
[Figure: processing time (Time / s, axis from 0 to 3.00) versus #Hits (500 to 145000)]
Performance
[Figure: performance (Mhits/s, axis from 0 to 0.500) versus #Hits (500 to 145000)]
Algorithm: O(n²)
Bunching (look at a subset of hits; slice all hits into pieces / bursts)
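Why bunching helps with a quadratic algorithm can be shown with a simple cost model. The sketch below only counts hit pairs, which is an assumed stand-in for the real work per hit combination, and the function names are hypothetical. It also ignores any pairs that would have to be formed across bunch borders.

```cpp
#include <cassert>
#include <cstdint>

// Number of hit pairs if all n hits are combined with each other: n(n-1)/2.
inline std::uint64_t pair_ops(std::uint64_t n) { return n < 2 ? 0 : n * (n - 1) / 2; }

// Number of hit pairs if the hits are sliced into bunches of bunch_size hits
// and only hits within the same bunch are combined.
std::uint64_t bunched_pair_ops(std::uint64_t n_hits, std::uint64_t bunch_size) {
    assert(bunch_size > 0);
    const std::uint64_t full = n_hits / bunch_size;  // completely filled bunches
    const std::uint64_t rest = n_hits % bunch_size;  // hits in the last, partial bunch
    return full * pair_ops(bunch_size) + pair_ops(rest);
}
```

For 145000 hits, the largest value on the plots, the model gives about 1.05e10 pairs without bunching and about 7.2e7 pairs with an arbitrarily chosen bunch size of 1000, so the cost grows roughly linearly in the number of hits once the bunch size is fixed.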
Fast
No isochrones needed
Already built for time-based hits (no events)
Bunching (As
SUMMARY
Summary General
Three algorithms under GPU investigation
All are fast and GPU-suited
Different advantages & disadvantages
Great help by Andrew Adinetz & others of NVIDIA Application Lab
Not yet compared in terms of purity/ghost/etc. (Tobias' talk at the discussion round)
Not yet looked into GPUs beyond the algorithmic stage (data distribution etc.)
GTLI (General Topics to Look Into):
Bunching
Time-based structure
Track Merger
Which algorithm for which sub-detector
Algorithm interplay / hybridization
Specifically: test if sub-parts of the algorithm run faster on CPUs (CPU-GPU re-distribution)
Quality analytics
Summary Algorithm
Algorithm / Biggest Challenge
[Figure: table of algorithms with thumbnail plots; rows not recoverable. Hough histogram thumbnail: Entries 324000, Mean x 90, Mean y 0.02791, RMS x 51.96, RMS y 0.02133]
Thank you!
Andreas Herten, a.herten@fz-juelich.de
APPENDIX
Hough Transform parameter scans (slides watermarked as preliminary)

Table 1
HT size (n x n)     256         512         1024        2048        4096
# threads/block     256         256         256         512         1024
threshold           14          14          13          11          10
# rec. tracks       6616        8218        11109       15246       16930
# failed tracks     13440       11838       8947        4810        3126
# false positives   5496        2925        1444        698         115
succ %              33.0        41.0        55.4        76.0        84.4
fail %              67.0        59.0        44.6        24.0        15.6
false pos %         27.4        14.6        7.2         3.5         0.6
time, s             0.030202    0.064211    0.150491    0.585119    2.536495
time/event, s       6.0404E-06  1.2842E-05  3.0098E-05  1.1702E-04  5.0730E-04

Table 2
HT size (n x n)     256         512         1024        2048        4096        8192
threshold           14          14          13          11          10          8
# rec. tracks       6616        8218        11109       15246       16930       19088
# failed tracks     13440       11838       8947        4810        3126        968
# false positives   5496        2925        1444        698         115         33
succ %              33.0        41.0        55.4        76.0        84.4        95.2
fail %              67.0        59.0        44.6        24.0        15.6        4.8
false pos %         27.4        14.6        7.2         3.5         0.6         0.2
time, s             0.047084    0.141719    0.491206    1.706892    6.018243    22.202198
time/event, s       9.4168E-06  2.8344E-05  9.8241E-05  3.4138E-04  1.2036E-03  4.4404E-03

Table 3 (fragment; column headers and the labels of the two percentage rows are lost)
(label lost) %      52.3        44.8        34.2        24.0
(label lost) %      31.9        14.8        8.2         4.3
time, s             0.051526    0.159233    0.535796    1.907615
time/event, s       1.031E-05   3.185E-05   1.0716E-04  3.8152E-04