GPU Tracking Activities at Jülich
PANDA Collaboration Meeting 3-2013 (Bochum)

11 September 2013 | Andreas Herten
Member of the Helmholtz Association
Outline
Introduction & Status of Algorithms
  Hough Transformation
  GPU Riemann Track Finder
  GPU Triplet Finder
Summaries
A quick reminder
GPUs
Benefits of GPUs
Extension card to PC
  Although stand-alone GPU boards are coming up
Many computing cores
  Run in parallel
  Speed
Active hardware (& software) development
[Figure: block diagrams of a CPU and a GPU (ALU, Control, Cache, DRAM)]
CUDA C
C/C++ extension          Mixed host/device code   OpenACC 2.0 parallelizer
__global__, __device__   Thrust, cuBLAS           #pragma acc kernels
ALGORITHMS
Hough Transform
Hough Transform Algorithm

1. Conformal mapping
2. Hit point $i$ with $(x_i, y_i, \rho_i)$ gives line parameter $r_{ij}$, for $\alpha_j \in (0^\circ, 360^\circ]$ of desired granularity

   $r_{ij} = \cos\alpha_j \cdot x_i + \sin\alpha_j \cdot y_i + \rho_i$

3. Fill $r_{ij}$ into histogram
4. Find peak
5. Extract track parameters $(r_j, \alpha_j)$
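
The steps above can be sketched in host-side C++ as follows. This is a minimal, serial illustration and not the Thrust or CUDA code of the talk. It assumes the line equation r = cos(a) x + sin(a) y + rho, where `a` and `rho` are my names for the angle and the third hit coordinate; the histogram layout, the angle sampling and all identifiers (`Hit`, `Peak`, `houghPeak`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A hit after conformal mapping; rho is the third hit coordinate entering the
// line equation additively (set it to 0 for point-like hits).
struct Hit { double x, y, rho; };

// Centre of the most populated histogram bin and its content.
struct Peak { double r, alphaDeg; int votes; };

// Fills an nAlpha x nR histogram with r = cos(a) x + sin(a) y + rho for every
// hit and every angle step, then returns the bin with the highest multiplicity.
// Assumption: angles are sampled at j * alphaMaxDeg / nAlpha, j = 0 .. nAlpha-1.
Peak houghPeak(const std::vector<Hit>& hits, int nAlpha, double alphaMaxDeg,
               int nR, double rMin, double rMax) {
    assert(nAlpha > 0 && nR > 0 && rMax > rMin && alphaMaxDeg > 0.0);
    const double pi = std::acos(-1.0);
    const double dAlpha = alphaMaxDeg / nAlpha;
    const double dR = (rMax - rMin) / nR;
    std::vector<int> histo(static_cast<std::size_t>(nAlpha) * nR, 0);

    // Step 2 and 3: transform every hit for every angle and fill the histogram.
    for (const Hit& h : hits) {
        for (int j = 0; j < nAlpha; ++j) {
            const double a = j * dAlpha * pi / 180.0;
            const double r = std::cos(a) * h.x + std::sin(a) * h.y + h.rho;
            const double pos = std::floor((r - rMin) / dR);
            if (pos >= 0.0 && pos < nR) {
                ++histo[static_cast<std::size_t>(j) * nR + static_cast<std::size_t>(pos)];
            }
        }
    }

    // Step 4 and 5: the fullest bin gives the track parameters (r, alpha).
    Peak best{0.0, 0.0, -1};
    for (int j = 0; j < nAlpha; ++j) {
        for (int b = 0; b < nR; ++b) {
            const int votes = histo[static_cast<std::size_t>(j) * nR + b];
            if (votes > best.votes) {
                best = Peak{rMin + (b + 0.5) * dR, j * dAlpha, votes};
            }
        }
    }
    return best;
}
```

For four hits on the horizontal line y = 1.05, a call such as `houghPeak(hits, 180, 180.0, 40, -2.0, 2.0)` puts all four votes into the bin at 90 degrees. On a GPU the two outer loops are the natural candidates for parallel execution, since every (hit, angle) pair is independent.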
Hough Transform Principle

[Figure: animation of the principle. Hit points in the (x, y) plane and in the conformal (x*, y*) plane; lines through a hit are labeled by their parameter pairs (r, α)_1, (r, α)_2]
Bin with highest multiplicity gives track parameters
Hough Transform Example

Implementation in Thrust
[Figure: Hough-transformed r vs. angle / ° (0 to 180), PANDA STT, 180 x 180 grid; entries 324000, mean x 90, mean y 0.02791, RMS x 51.96, RMS y 0.02133]
[Figure: Hough-transformed r vs. angle / ° (0 to 180), PANDA STT+MVD, 68 (x,y) points, 1800 x 1800 grid; entries 2.2356e+08, mean x 90, mean y 0.02905, RMS x 51.96, RMS y 0.1063]
Hough Transform Numbers

Thrust: < 3 ms/event
Plain CUDA: 0.5 ms/event
Andrew Adinetz, NVIDIA Application Lab
Hough Transform Notes

Thrust: < 3 ms/event
Running code
Independent of granularity (occupancy; see below)
Parallel
  Parallel for all hits of one event
  Not parallel for all events
Reduced problem(s) to set of standard routines (~STL)
  Fast (uses clever pre-made algorithms)
  Inflexible (has its limits, hard to customize)
No peak finding included
  Even possible? Adds to time!
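
To illustrate what "reducing the problem to a set of standard routines" can look like, here is a hedged C++ sketch that counts histogram entries using only sorting and a run-length pass. It is not the code from the talk. On the GPU the analogous Thrust building blocks would be `thrust::sort` and `thrust::reduce_by_key`; the function names `countBins` and `mostPopulatedBin` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Counts how often each bin id occurs using only standard routines:
// sort the ids, then collapse runs of equal ids into (id, count) pairs.
std::vector<std::pair<int, int>> countBins(std::vector<int> binIds) {
    std::sort(binIds.begin(), binIds.end());
    std::vector<std::pair<int, int>> counts;
    for (int id : binIds) {
        if (!counts.empty() && counts.back().first == id) {
            ++counts.back().second;
        } else {
            counts.emplace_back(id, 1);
        }
    }
    return counts;
}

// Returns the id of the bin with the largest count (first one on ties).
int mostPopulatedBin(const std::vector<std::pair<int, int>>& counts) {
    assert(!counts.empty());
    return std::max_element(counts.begin(), counts.end(),
                            [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                                return a.second < b.second;
                            })->first;
}
```

The appeal of this style is that each step is a pre-made, well-tuned primitive; the price is that anything not expressible as such a primitive is hard to add.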
Hough Transform Notes

Plain CUDA: 0.5 ms/event
Running code
Parallel
  Fully parallel
  Coarser grid: faster computing times
Built completely from scratch
  Customizable / fitting to every problem
  A bit more complicated at parts
Simple peak finder implemented (threshold)
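
A threshold peak finder of the kind mentioned above could look like the following C++ sketch. The talk does not show the actual implementation, so the row-major histogram layout and the names `BinPeak` and `thresholdPeaks` are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices and content of a histogram bin that passed the threshold.
struct BinPeak { int alphaBin, rBin, votes; };

// Returns every bin whose content is at least `threshold`.
// Assumption: the histogram is stored row-major as histo[alphaBin * nR + rBin].
std::vector<BinPeak> thresholdPeaks(const std::vector<int>& histo,
                                    int nAlpha, int nR, int threshold) {
    assert(nAlpha > 0 && nR > 0);
    assert(histo.size() == static_cast<std::size_t>(nAlpha) * nR);
    std::vector<BinPeak> peaks;
    for (int j = 0; j < nAlpha; ++j) {
        for (int b = 0; b < nR; ++b) {
            const int votes = histo[static_cast<std::size_t>(j) * nR + b];
            if (votes >= threshold) {
                peaks.push_back(BinPeak{j, b, votes});
            }
        }
    }
    return peaks;
}
```

Each bin is tested independently, which maps well onto one GPU thread per bin. The weakness is that neighbouring bins belonging to the same track can all pass the threshold, so a plain threshold does not by itself separate multiple tracks.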
Hough Transform Summary

Running code from Andrew Adinetz & me
Big issue: multipeak finder

HT algorithm: to dos, advantages, drawbacks, challenges
To Dos / Plans
  Thrust: parallelism beyond events; isochrones (fully); integrate to PandaRoot; time-based
  Plain CUDA: isochrones (fully); integrate to PandaRoot; time-based
Advantages
  Easy algorithm
  With grid granularity, parallelism increases
  Flexibility of HTed equation
Drawbacks
  Stuck in Thrust infrastructure
Problems / Challenges
  Multipeak finder / aliasing (algorithm out of image processing, usually used to detect continuous lines)

Code at:
https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuHoughTransform
https://github.com/AndiH/CUDA/
ALGORITHMS
Riemann Track Finder
Riemann Algorithm
1. Create triplets of MVD hit points
   All possible three-hit combinations need to become triplets
2. Grow triplets to tracks: continuously test whether the next hit fits the triplet track
   Circle fit in (x, y) plane
     Project hit points of triplet onto Riemann paraboloid
     Create plane (track)
     Test closeness of new hit: good, add hit; bad, dismiss hit
     Continue with next hit
   Helix fit: arc length s vs. z position
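
The circle-fit step can be made concrete with a small C++ sketch. It relies on a standard geometric fact: points on a circle in the (x, y) plane map onto a plane when lifted to the paraboloid z = x² + y². The sketch builds that plane from a triplet and tests the distance of a further hit. The function names and the single `maxDistance` cut are hypothetical, and the actual track finder applies further criteria (for example the s-z quality).

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

// Lifts a hit (x, y) onto the paraboloid z = x^2 + y^2.
Vec3 projectToParaboloid(double x, double y) { return {x, y, x * x + y * y}; }

// Plane n.p + c = 0 with unit normal n.
struct Plane { Vec3 n; double c; };

// Plane through three projected hits (the triplet); the hits must not be collinear.
Plane planeThrough(const Vec3& a, const Vec3& b, const Vec3& c) {
    const Vec3 u{b.x - a.x, b.y - a.y, b.z - a.z};
    const Vec3 v{c.x - a.x, c.y - a.y, c.z - a.z};
    Vec3 n{u.y * v.z - u.z * v.y, u.z * v.x - u.x * v.z, u.x * v.y - u.y * v.x};
    const double len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    assert(len > 0.0);
    n = Vec3{n.x / len, n.y / len, n.z / len};
    return Plane{n, -(n.x * a.x + n.y * a.y + n.z * a.z)};
}

// Absolute distance of a projected hit from the plane.
double distanceToPlane(const Plane& p, const Vec3& q) {
    return std::fabs(p.n.x * q.x + p.n.y * q.y + p.n.z * q.z + p.c);
}

// "Good: add hit; bad: dismiss hit", reduced to a single distance cut.
bool hitFitsTriplet(const Plane& track, double x, double y, double maxDistance) {
    return distanceToPlane(track, projectToParaboloid(x, y)) <= maxDistance;
}
```

For a triplet on the circle with centre (1, 2) and radius 5, for example (6, 2), (1, 7) and (-4, 2), the hit (1, -3) lies on the same circle and passes even a very tight cut, while the circle centre itself is rejected.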
Riemann Algorithm: 1 Triplets
[Figure: animation of triplet creation from numbered hit points]
Riemann Algorithm: 2 Expansion
[Figure: animation of hits (x) being placed on the Riemann surface (paraboloid), followed by the expansion to z]
Riemann Algorithm Triplet Generation

CPU
Three loops to generate triplets serially

    for (int i = 0; i < hitsInLayerOne.size(); i++) {
        for (int j = 0; j < hitsInLayerTwo.size(); j++) {
            for (int k = 0; k < hitsInLayerThree.size(); k++) {
                /* Triplet Generation */
            }
        }
    }

GPU
Loops do not parallelize well!
Needed: mapping of the inherent GPU indexing variable to a triplet index

    int ijk = threadIdx.x + blockIdx.x * blockDim.x;
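
For the simplest case, where every hit of layer one is combined with every hit of layers two and three exactly as in the three nested loops, the linear index can be decoded with integer division and modulo. The C++ sketch below shows this on the host; in a CUDA kernel `ijk` would come from `threadIdx.x + blockIdx.x * blockDim.x`. The names `HitTriplet` and `decodeTripletIndex` are hypothetical, and this is not the mapping used in the talk.

```cpp
#include <cassert>

// Loop indices (i, j, k) of the three nested loops.
struct HitTriplet { int i, j, k; };

// Inverts ijk = (i * n2 + j) * n3 + k, i.e. the order in which the
// nested loops over n1, n2 and n3 hits would visit the triplets.
HitTriplet decodeTripletIndex(int ijk, int n1, int n2, int n3) {
    assert(n1 > 0 && n2 > 0 && n3 > 0);
    assert(ijk >= 0 && ijk < n1 * n2 * n3);
    const int k = ijk % n3;
    const int j = (ijk / n3) % n2;
    const int i = ijk / (n2 * n3);
    return HitTriplet{i, j, k};
}
```

With 2, 3 and 4 hits in the three layers there are 24 triplets; index 0 decodes to (0, 0, 0) and index 23 to (1, 2, 3), matching the first and last iteration of the loops.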
Riemann Algorithm GPU Version

Triplet generation: transition via equations
for () {for () {for () {}}} becomes int ijk = threadIdx.x + blockIdx.x * blockDim.x;

$n_{\mathrm{Layer}\,x} = \frac{\sqrt{8x + 1} - 1}{2}$

$\mathrm{pos}(n_{\mathrm{Layer}\,x}) = \frac{\sqrt[3]{\sqrt{3}\,\sqrt{243x^2 - 1} + 27x}}{3^{2/3}} + \frac{1}{\sqrt[3]{3}\,\sqrt[3]{\sqrt{3}\,\sqrt{243x^2 - 1} + 27x}} - 1$

Work by Jonathan Timcheck
RISE summer student of Ohio State University at FZJ
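
Equations of this kind invert the triangular numbers n(n+1)/2 and the tetrahedral numbers n(n+1)(n+2)/6, which is what is needed to turn a linear thread index into a unique combination of three distinct layers. The C++ sketch below shows one way to do that. It is an assumption-laden illustration, not Jonathan Timcheck's code: it uses the square-root formula for the triangular root, but replaces the closed-form cube-root expression by a simple cube-root estimate with integer correction, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Largest n with n(n+1)/2 <= x, starting from (sqrt(8x+1) - 1) / 2.
long long triangularRoot(long long x) {
    assert(x >= 0);
    long long n = static_cast<long long>((std::sqrt(8.0 * x + 1.0) - 1.0) / 2.0);
    while (n > 0 && n * (n + 1) / 2 > x) --n;        // guard against rounding up
    while ((n + 1) * (n + 2) / 2 <= x) ++n;          // guard against rounding down
    return n;
}

// Largest n with n(n+1)(n+2)/6 <= x, starting from the estimate cbrt(6x).
long long tetrahedralRoot(long long x) {
    assert(x >= 0);
    long long n = static_cast<long long>(std::cbrt(6.0 * x));
    while (n > 0 && n * (n + 1) * (n + 2) / 6 > x) --n;
    while ((n + 1) * (n + 2) * (n + 3) / 6 <= x) ++n;
    return n;
}

// Layer indices with a < b < c.
struct LayerTriplet { long long a, b, c; };

// Maps idx = 0, 1, 2, ... to the unique triplet with
// idx = c(c-1)(c-2)/6 + b(b-1)/2 + a (combinatorial number system).
LayerTriplet unrankLayerTriplet(long long idx) {
    assert(idx >= 0);
    const long long t = tetrahedralRoot(idx);
    const long long rem = idx - t * (t + 1) * (t + 2) / 6;
    const long long s = triangularRoot(rem);
    return LayerTriplet{rem - s * (s + 1) / 2, s + 1, t + 2};
}
```

For six layers the indices 0 to 19 produce the 20 distinct layer triplets, from (0, 1, 2) up to (3, 4, 5), so each thread can compute its own triplet without any loop over the others.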
GPU Riemann Algorithm

Triplet generation
  Start with layer triplet: each thread creates a unique layer triplet (1 thread = 1 layer triplet)
  Grow to hit triplet: each thread expands a layer triplet to a unique hit triplet (1 thread = 1 hit triplet)
  List of hit triplets as seeds
  Layer triplet: combination of three layers, where each layer has at least one hit in it.
  Hit triplet: zoom in to layer triplets. All three-hit combinations of succeeding layers.
Expand triplets to tracks
  Add hits one by one if quality criteria are passed (from new layer; distance to Riemann plane; s-z quality) (1 thread = 1 hit triplet seed)
  Save successfully found track
GPU Riemann Algorithm Performance

Efficiency & ghost ratio
Each cell: efficiency | ghost ratio

CPU Riemann cuts
s-z chi-square    Dist. cut param
cut parameter     0.06             0.05             0.04
0.04              0.575 | 0.205    0.562 | 0.205    0.541 | 0.204
0.05              0.601 | 0.255    0.589 | 0.254    0.566 | 0.254
0.06              0.623 | 0.306    0.612 | 0.304    0.588 | 0.303

Benchmark extreme cuts
s-z chi-square    Dist. cut param
cut parameter     4.0              3.0              2.0
0.5               0.877 | 4.652    0.881 | 4.032    0.881 | 3.360
1.0               0.902 | 7.621    0.910 | 6.730    0.914 | 5.708
1.5               0.897 | 10.08    0.913 | 9.030    0.918 | 7.793

Time for one event (NV Profiler; @Juhydra: Tesla K20X)
Time(%)   Time       Calls  Avg        Min        Max        Name
75.55%    439.49us   1      439.49us   439.49us   439.49us   extend_cut_hit_triplets_k
5.96%     34.656us   4      8.6640us   2.3360us   22.432us   [CUDA memcpy DtoH]
4.36%     25.344us   1      25.344us   25.344us   25.344us   cut_hit_triplets_k
4.26%     24.800us   6      4.1330us   3.7760us   5.3440us   [CUDA memset]
2.57%     14.976us   1      14.976us   14.976us   14.976us   generate_hit_triplet
2.44%     14.176us   1      14.176us   14.176us   14.176us   generate_layer_triplets
1.30%     7.5520us   1      7.5520us   7.5520us   7.5520us   void thrust
1.11%     6.4640us   1      6.4640us   6.4640us   6.4640us   void thrust
1.11%     6.4640us   1      6.4640us   6.4640us   6.4640us   void thrust
0.89%     5.1520us   5      1.0300us   928ns      1.3440us   [CUDA memcpy HtoD]
0.45%     2.6240us   1      2.6240us   2.6240us   2.6240us   project_onto_paraboloid_k
GPU Riemann Algorithm Summary

Running port of CPU Riemann to GPU (J. Timcheck)
+ Improvements / needed changes with respect to the CPU version

Riemann algorithm: to dos, advantages, drawbacks, challenges
To Dos / Plans
  Measurement uncertainties
  Cuts (extension: hit too close, zero crossing)
  Parallelism (32 threads per seed, not 1)
  Track merger
  Integrate to PandaRoot
  Include more subdetectors
  Make time-based
Advantages
  Runs also only with MVD
  Basis for more sophisticated algorithms
  Fast track fitter (track parameters)
Drawbacks
  Combinatorically explosive: many combinations (if used bluntly)
    Esp. if used as finder
    Pre-steps needed
Problems / Challenges
  Secondaries
  Uncertainties
  Jonathan's internship is over

Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuRiemann
Extensive documentation at: http://panda-wiki.gsi.de/cgi-bin/view/Computing/RiemannTrackFinder (+ summary of theory behind Riemann algorithm)
ALGORITHMS
Triplet Finder
Triplet Finder Method

[Figure: animation on the STT straw layout with the interaction point]
STT hit in pivot straw
Find surrounding hits
Create virtual hit (triplet) at center of gravity (cog)
Combine with
  1. Second STT pivot-cog virtual hit
  2. Interaction point
Calculate circle through three points
Result: track candidate
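
The two geometric ingredients of the method, the center of gravity of a hit group and the circle through three points, can be sketched in C++ as follows. This is only an illustration under assumptions: the hits are treated as plain 2D points, the circle is computed with the textbook circumcircle formula, and the names `centerOfGravity` and `circleThrough` are hypothetical rather than taken from the PandaRoot or GPU implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point { double x, y; };
struct Circle { double cx, cy, r; };

// Virtual hit: unweighted center of gravity of the pivot hit and its neighbours.
Point centerOfGravity(const std::vector<Point>& hits) {
    assert(!hits.empty());
    Point sum{0.0, 0.0};
    for (const Point& h : hits) {
        sum.x += h.x;
        sum.y += h.y;
    }
    return Point{sum.x / hits.size(), sum.y / hits.size()};
}

// Circle through three points (circumcircle). Returns false if the points
// are (nearly) collinear, in which case no finite circle exists.
bool circleThrough(const Point& a, const Point& b, const Point& c, Circle& out) {
    const double d = 2.0 * (a.x * (b.y - c.y) + b.x * (c.y - a.y) + c.x * (a.y - b.y));
    if (std::fabs(d) < 1e-12) return false;
    const double a2 = a.x * a.x + a.y * a.y;
    const double b2 = b.x * b.x + b.y * b.y;
    const double c2 = c.x * c.x + c.y * c.y;
    out.cx = (a2 * (b.y - c.y) + b2 * (c.y - a.y) + c2 * (a.y - b.y)) / d;
    out.cy = (a2 * (c.x - b.x) + b2 * (a.x - c.x) + c2 * (b.x - a.x)) / d;
    out.r = std::hypot(a.x - out.cx, a.y - out.cy);
    return true;
}
```

A track candidate would then be the circle through the interaction point, taken here as the origin, and two virtual hits. With virtual hits at (2, 0) and (0, 4) the circle has its centre at (1, 2) and radius √5.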
Triplet Finder Animation

[Animation legend: isochrone early; isochrone early & skewed; isochrone close; isochrone late; MVD hit; triplet; track current; track timed out]
Triplet Finder GPU Port

Original algorithm by Marius C. Mertens et al.
  Implemented in PandaRoot (also using skewed straws)
  Reconstruction efficiency & other quality criteria: see Marius' talks on Triplet Finder (e.g. last PANDA meeting)
  Features
    Fast & robust algorithm
    No isochrones needed
    Many tuning possibilities
Ported to GPU by Andrew A. Adinetz (NVIDIA Application Lab)
  Thrust, CUDA, some dynamic parallelism
  Supporting skewed straws
  Quality of results comparable to CPU version (float vs. double)
GPU Triplet Finder Speed

1 Burst
[Figure: processing time (time / s, axis 0 to 3.00) vs. #hits (500 to 145000)]
[Figure: performance (Mhits/s, axis 0 to 0.500) vs. #hits (500 to 145000)]
Algorithm: O(n²)
  Bunching (look at subset of hits; slice all hits into pieces: bursts)
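
Why bunching helps an O(n²) algorithm can be seen from a simple count of hit pairs. The C++ sketch below compares the number of pairs when all hits are processed at once with the number when the hits are sliced into bunches and only pairs inside a bunch are considered. The pair-count model, the fixed bunch size and the function names are assumptions for illustration; the sketch ignores tracks that cross bunch boundaries.

```cpp
#include <cassert>

// Number of hit pairs when all n hits are processed together: n(n-1)/2.
long long pairsAllAtOnce(long long n) {
    assert(n >= 0);
    return n * (n - 1) / 2;
}

// Number of hit pairs when the n hits are sliced into bunches of at most
// bunchSize hits and only pairs within the same bunch are considered.
long long pairsBunched(long long n, long long bunchSize) {
    assert(n >= 0 && bunchSize > 0);
    const long long fullBunches = n / bunchSize;
    const long long rest = n % bunchSize;
    return fullBunches * pairsAllAtOnce(bunchSize) + pairsAllAtOnce(rest);
}
```

For 145000 hits this gives 10512427500 pairs in one piece, but only 72427500 pairs in bunches of 1000 hits, a reduction by a factor of about 145. The work then grows linearly with the number of hits for a fixed bunch size.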
GPU Triplet Finder Summary

Running port of CPU Triplet Finder to GPU by Andrew Adinetz (NVIDIA Application Lab)

To Dos / Plans
  Bunching (as flexible wrapper, to be used for different algorithms)
  Integrate to PandaRoot
Advantages
  Fast
  No isochrones needed
  Already built for time-based hits (no events)
Drawbacks / Problems / Challenges
  Needs compute capability ≥ 3.5

Code at: https://subversion.gsi.de/trac/fairroot/browser/pandaroot/development/aherten/GpuTripletFinder
SUMMARY
Summary General
Three algorithms under GPU investigation
  All are fast and GPU-suited
  Different advantages & disadvantages
Great help by Andrew Adinetz & others of NVIDIA Application Lab
Not yet compared in terms of efficiency/purity/ghost/etc.
  See Tobias' talk @ discussion round
Not yet looked into GPUs beyond algorithmic stage (data distribution etc.)
GTLI (General Topics to Look Into)
  Bunching
  Time-based structure
  Track merger
  Which algorithm for which sub-detector
  Algorithm interplay / hybridization
    Specifically: test if sub-parts of the algorithms run faster on CPUs (CPU-GPU re-distribution)
  Quality analytics
Summary Algorithm

Algorithm                     Note                                                         Biggest Challenge
Hough Transform, Thrust       Fast but inflexible                                          No satisfying multipeak finder
Hough Transform, Plain CUDA   Fast                                                         No satisfying multipeak finder
GPU Riemann Track Finder      Promising initial implementation                             Testing, combinatorics, manpower
GPU Triplet Finder            Fast, robust, scalable (with [...] of dynamic parallelism)
GPU STT Cell Finder           see next talk by Tobias

Detectors used: MVD, STT, MVD, MVD, GEM, STT

Thank you!
Andreas Herten
a.herten@fz-juelich.de
Resources Used in This Talk

[4] NVIDIA CARMA development kit
[4] GFLOPS graph by NVIDIA (CUDA documentation)
[24] Hit animation by Marius C. Mertens
[26] Performance graphs by Andrew Adinetz
The rest is mine
APPENDIX
Other Works NVIDIA App Lab
[Watermark: preliminary, internal and whatnot]

Hough transform (r >= 0)
HT size (n x n)  threshold  # rec. tracks  # failed tracks  # false positives  % succ  % fail  % false pos  time, s    time/event, s
256              14         6616           13440            5496               33.0    67.0    27.4         0.047084   9.4168E-06
512              14         8218           11838            2925               41.0    59.0    14.6         0.141719   2.83438E-05
1024             13         11109          8947             1444               55.4    44.6    7.2          0.491206   9.82412E-05
2048             11         15246          4810             698                76.0    24.0    3.5          1.706892   0.000341378
4096             10         16930          3126             115                84.4    15.6    0.6          6.018243   0.001203649
8192             8          19088          968              33                 95.2    4.8     0.2          22.202198  0.00444044

Hough transform in shared memory
HT size (n x n)  # threads/block  threshold  # rec. tracks  # failed tracks  # false positives  % succ  % fail  % false pos  time, s   time/event, s
256              256              14         6616           13440            5496               33.0    67.0    27.4         0.030202  6.0404E-06
512              256              14         8218           11838            2925               41.0    59.0    14.6         0.064211  1.28422E-05
1024             256              13         11109          8947             1444               55.4    44.6    7.2          0.150491  3.00982E-05
2048             512              11         15246          4810             698                76.0    24.0    3.5          0.585119  0.000117024
4096             1024             10         16930          3126             115                84.4    15.6    0.6          2.536495  0.000507299

% fail  % false pos  time, s   time/event, s
52.3    31.9         0.051526  0.00001031
44.8    14.8         0.159233  0.00003185
34.2    8.2          0.535796  0.00010716
24.0    4.3          1.907615  0.00038152
15.7    0.6          6.69      0.001338
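
The derived columns of these benchmark tables can be reproduced from the raw counts. The C++ sketch below does this for one row. The formulas are inferred from the numbers, not stated on the slide: the percentages are taken relative to the sum of reconstructed and failed tracks, and the number of events (5000) is an assumption that follows from dividing "time, s" by "time/event, s". All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Raw values of one table row.
struct HoughRun {
    int reconstructed;     // "# rec. tracks"
    int failed;            // "# failed tracks"
    int falsePositives;    // "# false positives"
    double timeSeconds;    // "time, s"
};

// "% succ": reconstructed tracks relative to reconstructed + failed.
double successPercent(const HoughRun& r) {
    assert(r.reconstructed + r.failed > 0);
    return 100.0 * r.reconstructed / (r.reconstructed + r.failed);
}

// "% false pos": false positives relative to reconstructed + failed.
double falsePositivePercent(const HoughRun& r) {
    assert(r.reconstructed + r.failed > 0);
    return 100.0 * r.falsePositives / (r.reconstructed + r.failed);
}

// "time/event, s": total time divided by the (assumed) number of events.
double timePerEvent(const HoughRun& r, int nEvents) {
    assert(nEvents > 0);
    return r.timeSeconds / nEvents;
}
```

For the row with 6616 reconstructed, 13440 failed and 5496 false-positive tracks in 0.030202 s, this gives about 33.0 % success, 27.4 % false positives and 6.0404E-06 s per event, in agreement with the tabulated values.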
