Post on 14-May-2020
transcript
Vitaly Osipov: GPU Sample Sort
Fakultät für Informatik, Institut für Theoretische Informatik - Algorithmik II

GPU Sample Sort
Nikolaj Leischner, Vitaly Osipov, Peter Sanders
KIT – Universität des Landes Baden-Württemberg und nationales Grossforschungszentrum in der Helmholtz-Gemeinschaft
www.kit.edu
Overview

- Introduction
- Tesla architecture
- Compute Unified Device Architecture model
- Performance guidelines
- Sample sort algorithm overview
- High-level GPU algorithm design
- Flavor of implementation details
- Experimental evaluation
- Future trends
Introduction: multi-way sorting algorithms

Sorting is important. Divide-and-conquer approaches:
- recursively split the input into tiles until the tile size is M (e.g. the cache size)
- sort each tile independently
- combine the intermediate results

Two-way approaches:
- two-way distribution (quicksort): log2(n/M) scans to partition the input
- two-way merge sort: log2(n/M) scans to combine intermediate results

Multi-way approaches:
- k-way distribution (sample sort): only logk(n/M) scans to partition
- k-way merge sort: only logk(n/M) scans to combine

Multi-way approaches are beneficial when memory bandwidth is an issue!
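The scan-count argument can be made concrete with a small helper; a sketch with illustrative values of n, M, and k (none of these numbers are from the talk):

```cpp
#include <cmath>

// Number of k-way passes over the data needed to reduce an input of
// n elements down to tiles of size M: ceil(log_k(n / M)).
int scans(double n, double M, double k) {
    return static_cast<int>(std::ceil(std::log(n / M) / std::log(k)));
}
// For n = 2^27 and M = 2^20 (illustrative):
//   scans(n, M, 2)   == 7   two-way: seven passes over the data
//   scans(n, M, 128) == 1   128-way: a single pass
```

Since every pass reads and writes the whole input, the k-way variant moves roughly log2(k) times less data through memory.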
NVidia Tesla Architecture

- 30 Streaming Multiprocessors (SM) with 8 Scalar Processors (SP) each, 240 physical cores overall
- 16 KB shared memory per SM, similar to a CPU L1 cache
- 4 GB global device memory
Compute Unified Device Architecture Model
[Figure: each SM contains 8 SPs with shared memory; a grid of thread blocks TBlock(0,0) ... TBlock(1,2) virtualizes the hardware; all blocks access global memory.]
C code:

int main() {
    // serial code
    // parallel part runs as a kernel on the GPU:
    Kernel<<<gridDim, blockDim>>>();
    // serial code
}
Similar to the SPMD (single-program multiple-data) model: a block of concurrent threads executes a scalar sequential program, a kernel; the thread blocks constitute a grid.
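The grid/block/thread hierarchy can be emulated serially to make the model concrete; a sketch (the `launch` helper is hypothetical, and a real CUDA launch runs the blocks concurrently across the SMs rather than in a loop):

```cpp
// Serial emulation of the CUDA execution model: every thread of every
// block runs the same scalar kernel and differs only in its IDs.
template <typename Kernel>
void launch(int gridDim, int blockDim, Kernel kernel) {
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)          // grid of blocks
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx)  // block of threads
            kernel(blockIdx, threadIdx);
}
```

For example, `launch(2, 4, f)` invokes `f(blockIdx, threadIdx)` for all eight (block, thread) pairs; each thread typically derives a global index as `blockIdx * blockDim + threadIdx`.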
Performance Guidelines

General pattern in GPU algorithm design:
- decompose the problem into many data-independent sub-problems
- solve sub-problems by blocks of cooperative parallel threads

Performance guidelines:
- conditional branching: threads should follow the same execution path
- shared memory: exploit the fast on-chip memory
- coalesced global memory operations: load/store requests to the same memory block mean fewer memory accesses
Algorithm Overview

Algorithm 1: Serial Sample Sort

SampleSort(e = 〈e1, ..., en〉, k)
begin
    if n < M then return SmallSort(e)
    choose a random sample S = S1, ..., S_{ak−1} of e
    Sort(S)
    〈s0, s1, ..., sk〉 = 〈−∞, S_a, ..., S_{a(k−1)}, ∞〉
    for 1 ≤ i ≤ n do
        find j ∈ {1, ..., k} such that s_{j−1} ≤ e_i ≤ s_j
        place e_i in bucket b_j
    return Concatenate(SampleSort(b1, k), ..., SampleSort(bk, k))
end
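Algorithm 1 can be translated into runnable C++; a sketch under illustrative assumptions (std::sort standing in for SmallSort, oversampling factor a = 4, base-case size M = 32, and a fallback for degenerate samples, none of which are values from the talk):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Serial sample sort following Algorithm 1. SmallSort is std::sort;
// a = 4 and M = 32 are illustrative choices.
void sample_sort(std::vector<int>& e, int k = 4) {
    const std::size_t M = 32, a = 4;
    if (e.size() < M) { std::sort(e.begin(), e.end()); return; }

    // Choose and sort a random sample of a*k - 1 elements.
    std::vector<int> S(a * k - 1);
    for (int& s : S) s = e[std::rand() % e.size()];
    std::sort(S.begin(), S.end());
    // Guard against a degenerate sample (e.g. many duplicate keys),
    // which could otherwise keep recursing without progress.
    if (S.front() == S.back()) { std::sort(e.begin(), e.end()); return; }

    // Every a-th sample element is a splitter; element x goes to the
    // first bucket j with x < s_{j+1} (s_0 = -inf, s_k = +inf).
    std::vector<std::vector<int>> buckets(k);
    for (int x : e) {
        std::size_t j = 0;
        while (j + 1 < static_cast<std::size_t>(k) && x >= S[a * (j + 1) - 1]) ++j;
        buckets[j].push_back(x);
    }

    // Recurse on each bucket and concatenate the results.
    e.clear();
    for (auto& b : buckets) {
        sample_sort(b, k);
        e.insert(e.end(), b.begin(), b.end());
    }
}
```

The linear splitter search here is the part the GPU version replaces with the branch-free tree traversal described later.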
High Level GPU Algorithm Design

[Figure: thread blocks scan tiles of t·l input elements and count elements per bucket; a prefix sum over the counts yields offsets for buckets 0 ... k−1; a second pass scatters the input into its final positions in the output.]

Parameters:
- distribution degree k = 128
- threads per block t = 256
- elements per thread l = 8
- number of blocks p = n/(t · l)

Phase 1. Choose splitters.
Phase 2. Each of the p thread blocks computes the bucket indices id, 0 ≤ id ≤ k − 1, of its elements and stores the bucket sizes in DRAM.
Phase 3. A prefix sum over the k × p table yields the global offsets.
Phase 4. As in Phase 2, compute local offsets; local + global offsets give the final positions.
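Phases 2 and 3 can be illustrated with a serial stand-in for the GPU kernels; a sketch (the function name and the bucket-major layout are assumptions, and on the GPU the scan itself runs in parallel):

```cpp
#include <vector>

// Serial stand-in for Phases 2-3: counts[block][bucket] is the k x p
// count table written to DRAM in Phase 2. An exclusive prefix sum in
// bucket-major order (all blocks' counts for bucket 0 first, then
// bucket 1, ...) gives each block its global write offset per bucket.
std::vector<std::size_t> global_offsets(
    const std::vector<std::vector<std::size_t>>& counts, std::size_t k) {
    std::size_t p = counts.size();
    std::vector<std::size_t> offsets(k * p);
    std::size_t running = 0;
    for (std::size_t b = 0; b < k; ++b)
        for (std::size_t blk = 0; blk < p; ++blk) {
            offsets[b * p + blk] = running;  // where block blk writes bucket b
            running += counts[blk][b];
        }
    return offsets;
}
```

Because bucket b's offsets for all p blocks are contiguous in the scan order, every block can scatter its elements in Phase 4 without any coordination beyond this table.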
Flavor of Implementation Details: computing element bucket indices

bt = 〈s_{k/2}, s_{k/4}, s_{3k/4}, s_{k/8}, s_{3k/8}, s_{5k/8}, s_{7k/8}, ...〉

TraverseTree(e_i)
begin
    j := 1
    repeat log k times        // go left or right?
        j := 2j + (e_i > bt[j])
    j := j − k + 1            // bucket index
end

[Figure: the splitters form an implicit binary search tree with s_{k/2} at the root, s_{k/4} and s_{3k/4} as its children, and so on down to the leaves, the buckets b_1 ... b_8; each comparison goes left (<) or right (>).]

- Store the tree in fast shared memory
- Use predicated instructions: no path divergence
- Unroll the loop
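On the CPU, TraverseTree looks like this; a sketch for power-of-two k, with bt in the breadth-first layout given above (bt is 1-indexed, so bt[0] is unused):

```cpp
#include <vector>

// Branch-free traversal of the implicit splitter tree: bt[1..k-1]
// holds the splitters in breadth-first order (bt[1] = s_{k/2},
// bt[2] = s_{k/4}, bt[3] = s_{3k/4}, ...). The comparison yields
// 0 or 1, steering left/right with no divergent control flow.
int bucket_index(int x, const std::vector<int>& bt, int k) {
    int j = 1;
    for (int step = k; step > 1; step /= 2)  // log2(k) iterations
        j = 2 * j + (x > bt[j]);
    return j - k + 1;  // bucket index in 1 .. k, as on the slide
}
```

On the GPU the comparison maps to a predicated instruction, so all threads in a warp follow the same path, and since k is fixed at 128 the loop can be fully unrolled.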
Experimental Evaluation

Hardware: NVidia Tesla C1060, 30 SMs × 8 SPs = 240 cores, 4 GB RAM

Data types:
- 32- and 64-bit integers
- key-value pairs

Distributions: Uniform, Gaussian, Bucket Sorted, Staggered, Deterministic Duplicates

GPU sorting algorithms compared:
- CUDPP and Thrust radix sort
- Thrust merge sort
- quicksort
- hybrid sort
- bbsort
Experimental Evaluation: uniform 32-bit integers

[Plot: sorted elements / time [µs] versus number of elements (2^17 to 2^28) for cudpp radix, thrust radix, quick, bbsort, hybrid (float), and sample sort.]
Experimental Evaluation: uniform key-value pairs

[Plot: sorted elements / time [µs] versus number of elements (2^19 to 2^27) for thrust radix, cudpp radix, thrust merge, and sample sort.]
Experimental Evaluation: uniform 64-bit integers

[Plot: sorted elements / time [µs] versus number of elements (2^17 to 2^27) for thrust radix and sample sort.]
Future Trends: Fermi architecture

… bandwidth constrained. For existing applications that use shared memory as a software-managed cache, code can be streamlined to take advantage of the hardware caching system, while still having access to at least 16 KB of shared memory for explicit thread cooperation. Best of all, applications that do not use shared memory automatically benefit from the L1 cache, allowing high-performance CUDA programs to be built with minimum time and effort.
Summary Table

GPU                                  G80                GT200               Fermi
Transistors                          681 million        1.4 billion         3.0 billion
CUDA Cores                           128                240                 512
Double Precision FP Capability       None               30 FMA ops/clock    256 FMA ops/clock
Single Precision FP Capability       128 MAD ops/clock  240 MAD ops/clock   512 FMA ops/clock
Special Function Units (SFUs) / SM   2                  2                   4
Warp Schedulers (per SM)             1                  1                   2
Shared Memory (per SM)               16 KB              16 KB               Configurable 48 KB or 16 KB
L1 Cache (per SM)                    None               None                Configurable 16 KB or 48 KB
L2 Cache                             None               None                768 KB
ECC Memory Support                   No                 No                  Yes
Concurrent Kernels                   No                 No                  Up to 16
Load/Store Address Width             32-bit             32-bit              64-bit
Second Generation Parallel Thread Execution ISA

Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set. PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor. At program install time, PTX instructions are translated to machine instructions by the GPU driver.

The primary goals of PTX are:

- Provide a stable ISA that spans multiple GPU generations
- Achieve full GPU performance in compiled applications
- Provide a machine-independent ISA for C, C++, Fortran, and other compiler targets
- Provide a code distribution ISA for application and middleware developers
- Provide a common ISA for optimizing code generators and translators, which map PTX to specific target machines
- Facilitate hand-coding of libraries and performance kernels
- Provide a scalable programming model that spans GPU sizes from a few cores to many parallel cores
What about memory bandwidth? No significant improvements there? Then multi-way approaches are likely to be even more beneficial. Multi-way merge sort?
Thank you!