Ruprecht-Karls-Universität Heidelberg
Institut für Informatik
Lehrstuhl für Parallele und Verteilte Systeme
Bachelor Thesis
Integrating GPU accelerated matrix operations into the Groovy programming language
Name: Robert Kühl
Matriculation number: 2551645
Supervisor: Arthur Andrzejak
Date of submission: 30.06.2013
I affirm that I have written this bachelor's thesis independently and have used only the cited sources and aids.
Heidelberg, 27.06.2013
Abstract
Due to the demand for more and more realistic graphics in the consumer market, graphics cards have seen a huge increase in performance since the mid-2000s. Because of this and the advent of programmable shaders, and later CUDA and OpenCL, they became more and more interesting for high performance computing. This thesis examines how GPU computing can be integrated into the Groovy-Matrix project. Groovy-Matrix is a general purpose matrix class for the Groovy programming language which offers easy to use matrix operations.
The integration is achieved by creating a Java back-end which utilizes the JavaCL API in order to access the graphics hardware. The matrix operations fall into four different categories: pure data parallel operations, reduction, searching and sorting. While the first two categories greatly benefit from the GPU integration, the other two show little to no speedup, depending on the matrix structure.
Zusammenfassung
Given the market's increasing demand for ever more realistic graphics, the performance of graphics cards has risen considerably since the mid-2000s. Because of this, and with the introduction of programmable shaders and later of CUDA and OpenCL, they became increasingly interesting for high performance computing. This thesis examines how computing on graphics cards can be integrated into the Groovy-Matrix project. Groovy-Matrix is a general purpose matrix class for the Groovy programming language which offers easy to use matrix operations.
The integration is achieved by creating a Java back-end that accesses the graphics hardware via the JavaCL API. The matrix operations can be divided into four categories: purely data parallel operations, reduction, searching and sorting. While the first two categories benefit greatly from the GPU integration, the other two show little to no speedup, depending on the structure of the matrix.
Contents
1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Thesis Structure
2 Background
  2.1 Similar Works
  2.2 Groovy-Matrix
    2.2.1 Functionality
    2.2.2 Structure
3 GPU Programming
  3.1 OpenCL / JavaCL
    3.1.1 OpenCL
    3.1.2 JavaCL
    3.1.3 General Code Structure
    3.1.4 Programming Models
  3.2 GPU Architecture
    3.2.1 Basic Architecture
    3.2.2 Execution Model
    3.2.3 Memory Model
4 GPU Implementation
  4.1 Limitations of the Current Implementation
  4.2 Internal Data-Structure
  4.3 Kernel Implementations
    4.3.1 Initialization
    4.3.2 Pure Data Parallel Operations
    4.3.3 Reduction
    4.3.4 Sorting
    4.3.5 Searching
5 Performance
  5.1 Experimental Setup
  5.2 Initialization
  5.3 Pure Data Parallel Operations
    5.3.1 Best-Case-Scenario
    5.3.2 Worst-Case-Scenario
  5.4 Reduction
  5.5 Sorting
  5.6 Searching
6 Conclusion
Bibliography
1 Introduction
1.1 Motivation
Groovy is an agile and dynamic language for the Java Virtual Machine (JVM). Its syntax is close to that of the Java programming language, but it adds several features on top of it. The main features are the inclusion of closures, dynamic typing and the ability to run a program either as a script or as compiled byte code. While mostly compatible with Java, Groovy avoids many unnecessary language constructs and is often shorter and easier to read. For example, Groovy does not need a class with an encapsulated main method, or semicolons at the end of a statement. So a typical "Hello World" program in Java:
Listing 1.1: Java Hello World
public class HelloWorld {
public static void main(String[] args) {
System.out.println("Hello World!");
}
}
can be written in Groovy as:
Listing 1.2: Groovy Hello World
println 'Hello World!'
This, however, comes at a price: performance. While the performance gap between Java and Groovy was somewhat closed in 2012 with the release of Groovy 2.0 and the inclusion of static typing, Java still outperforms Groovy in the average case, especially when dynamic typing is used. This makes Groovy interesting for cases where performance limitations are not an issue, like web applications, which are often bound by IO bandwidth. Such limitations are not acceptable in many scientific fields, where typically large numerical systems have to be solved. Here, low level programming languages like C or C++ are dominant because they offer maximum performance with little overhead.
With the advent of CUDA and OpenCL, GPGPU computing became more and more important because traditional CPUs are nowadays heavily outperformed by GPUs.
Yet this comes with a problem: GPU programming is not very user-friendly and can use up a considerable amount of time. Parallelizing a problem, determining the correct ND-ranges, and taking care of correct memory usage and memory access patterns can be a very intimidating task for the average user who has no prior experience in high performance computing.
It would be desirable to combine these two worlds: the ease of use of the Groovy programming language and the efficiency of GPU programming. Currently, the only practical way to achieve this is through APIs. The user is given an easy to use front-end in which only certain functions are available; these are then executed transparently for the user on the GPU. As part of the diploma thesis "Improving Programming Efficiency via Rich Data Structures", Aram Altschudjian created the Groovy-Matrix project. The goal of this project was to provide easy to handle matrix and vector structures for the Groovy, Scala and Java programming languages. At this time, the Groovy-Matrix runs solely on the CPU. Since matrix operations generally perform very well on GPUs, it is interesting to examine whether it is possible to integrate GPU support into the Groovy-Matrix.
1.2 Goals
The goal of this thesis is to enhance the Groovy-Matrix such that the various available operations are executed transparently for the user on the GPU. To achieve this goal, it is necessary to analyze the structure of the GPU hardware. Since GPU programming is not natively supported by the Java programming language, a way has to be found to integrate it.
To assess the performance, benchmarks have to be run in order to decide whether the Groovy-Matrix benefited from the GPU integration or not.
1.3 Thesis Structure
Chapter 2 gives an overview of similar projects and introduces the Groovy-Matrix. Then, the principles of the GPU architecture as well as GPU programming with OpenCL and JavaCL are discussed in Chapter 3. The GPU implementation of the Groovy-Matrix is detailed in Chapter 4. Chapter 5 shows how well the GPU implementation performs in comparison to the CPU version. Chapter 6 presents the conclusion.
2 Background
2.1 Similar Works
Although the Groovy-Matrix is the first GPU accelerated matrix class for the Groovy programming language, there are multiple similar projects in other programming languages. The most similar to this project is the "Universal Java Matrix Package" (UJMP) [1], which is currently integrated into the JavaCL API [2], yet at the current stage only a few operations are supported. A less user friendly, but more powerful GPU matrix integration into Java is JCublas from the JCuda API [3]. For the C/C++ programming languages, there are various projects of this kind. The most famous is CUBLAS from Nvidia [4], which is also utilized in JCublas. Other noteworthy libraries are KALDI [5], MAGMA [6] and CUSP [7].
2.2 Groovy-Matrix
2.2.1 Functionality
At the current stage, the Groovy-Matrix supports only basic arithmetic functions like matrix addition or element-wise matrix multiplication. More complex functions like a true matrix multiplication, matrix inversion or the calculation of eigenvalues and eigenvectors are not available. Instead, there are various operations to reorganize the matrix, like adding or removing dimensions, rows and columns. The user has the option to filter out unique values or to search for specific elements. Furthermore, typical reduce operations exist: it is possible to sum up all entries or to search for the minimum or maximum entry of the matrix. Additionally, it is possible to label the rows and columns, which gives an alternate way to access each entry. A full list of the operations is given in Chapter 4 (Table 4.1).
Listing 2.1: Sample code of the Groovy-Matrix
//creating two matrices
mat = new MatrixJC()
mat.init(5.0f, [5, 5])
mat2 = new MatrixJC()
mat2.init(2.0f, [5,5])
//summation of two matrices
mat3 = mat + mat2
//append 2 entries to dimension 1
mat3.extend(1,2)
//print unique values of mat3
print mat3.unique()
2.2.2 Structure
The interface de�ning the methods of Groovy-Matrix is simply called �Matrix� and
is implemented by the class MatrixJC in the Groovy, Scala and Java programming
languages. For performance reasons, the core functionality of the Groovy-Matrix is
implemented in the Java version of the MatrixJC. The Groovy and Scala version of the
MatrixJC are facade classes which give the user a nice interface, but bind internally into
the java version.
The data of the matrix is simply stored into a multidimensional ArrayList, which makes
the Groovy-Matrix more suitable for dense matrices.
Figure 2.1: Class hierarchy of the Groovy-Matrix
3 GPU Programming
3.1 OpenCL / JavaCL
3.1.1 OpenCL
The Open Computing Language (OpenCL) is an open standard for cross-platform parallel programming. The goal of OpenCL is to unify the programming of different accelerators (such as CPUs, GPUs, CELL/B.E. etc.) in one language. This is done by splitting the source code into two parts: host code and kernel code.
The host code contains the main program and typically runs on the CPU. The host code is normally written in the C programming language, but there are also several bindings for other programming languages. The kernel code contains the code sections for the accelerator. It is written in an OpenCL-specific, C-like language.
Figure 3.1: The OpenCL platform model [8]
3.1.2 JavaCL
To integrate OpenCL code into the Groovy-Matrix, it is necessary to create bindings
between Java and OpenCL. This is already done by the JavaCL project. JavaCL is an
API which wraps the OpenCL library in Java code via the JNA framework. JavaCL
currently fully supports OpenCL version 1.0. Furthermore, JavaCL adds various tools,
like an OpenCL code generator, a random number generator and multiple parallel reduce
functions. However, the code for the kernels still has to be written in OpenCL.
3.1.3 General Code Structure
Creating an OpenCL application can be divided into four steps: setup, building the kernel, executing the kernel, and cleanup. The code samples are taken from the getting started guide of JavaCL [9].
Setup
First, the application has to be prepared to run an OpenCL kernel. In this step, basic IO buffers have to be created to transfer data between the host and the kernel device.
Listing 3.1: setup phase
// Imports needed by this and the following samples:
import com.nativelibs4java.opencl.*;
import com.nativelibs4java.opencl.CLMem.Usage;
import com.nativelibs4java.util.IOUtils;
import org.bridj.Pointer;
import java.nio.ByteOrder;
import static org.bridj.Pointer.allocateFloats;
import static java.lang.Math.*;

CLContext context = JavaCL.createBestContext();
CLQueue queue = context.createDefaultQueue();
ByteOrder byteOrder = context.getByteOrder();

int n = 1024;
Pointer<Float>
    aPtr = allocateFloats(n).order(byteOrder),
    bPtr = allocateFloats(n).order(byteOrder);
for (int i = 0; i < n; i++) {
    aPtr.set(i, (float) cos(i));
    bPtr.set(i, (float) sin(i));
}
Building the Kernel
In order to build a kernel, the API has to know for which device and in what kind of context it has to be built. In OpenCL, it is necessary to select a device and a context manually via the corresponding API calls. In JavaCL, however, it is possible to let JavaCL create them automatically with the JavaCL.createBestContext() function. This function takes a "device feature" as an argument, which allows the user to prefer certain devices like the GPU. This is very useful when it comes to integrating JavaCL into the Groovy-Matrix, since it is unknown which devices the end-user has. Therefore, the enhanced Groovy-Matrix can run on any device supported by OpenCL.
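For example, a context preferring GPU devices can be requested as follows (a minimal sketch using the JavaCL API):
// Prefer a GPU device; JavaCL falls back to other OpenCL devices if none is present.
CLContext context = JavaCL.createBestContext(CLPlatform.DeviceFeature.GPU);
CLQueue queue = context.createDefaultQueue();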
Once this is done, the kernel code has to be loaded as a string. The string can be passed to the context to build a program, and finally this program can create the kernel.
Listing 3.2: kernel building phase
// Read the program sources and compile them:
String src = IOUtils.readText(JavaCLTutorial1.class.getResource("TutorialKernels.cl"));
CLProgram program = context.createProgram(src);
CLKernel addFloatsKernel = program.createKernel("add_floats");
Executing the Kernel
First, memory has to be allocated on the device. This is done by calling the createBuffer functions of the context. Additionally, a command queue has to be created to issue commands to the device. Then the kernel arguments have to be set, and the kernel is executed by enqueueing it into the command queue. After the kernel execution has finished, the result can be read via the command queue.
Listing 3.3: execution phase
// Create OpenCL input buffers (using the native memory pointers aPtr and bPtr):
CLBuffer<Float>
    a = context.createBuffer(Usage.Input, aPtr),
    b = context.createBuffer(Usage.Input, bPtr);
// Create an OpenCL output buffer:
CLBuffer<Float> out = context.createBuffer(Usage.Output, n);
// Set arguments and call the kernel
addFloatsKernel.setArgs(a, b, out, n);
CLEvent addEvt = addFloatsKernel.enqueueNDRange(queue, new int[] { n });
Cleanup
The cleanup phase consists of transferring memory back to the original data structure and cleaning up the buffers.
Listing 3.4: cleanup phase
// Blocks until add_floats has finished:
Pointer<Float> outPtr = out.read(queue, addEvt);
// Cleaning of the buffers is done by the garbage collector.
3.1.4 Programming Models
OpenCL features two programming models: data parallelism and task parallelism. There is no strict separation between the two models, so hybrid solutions are also possible. However, the primary model behind OpenCL is the data parallel one, since it is best supported by GPUs [8].
Data Parallel Programming Model
The data parallel model is the main focus of OpenCL and the one best supported by GPUs. In this model, a series of instructions is applied in parallel to different data, similar to the SIMD architecture of Flynn's taxonomy. In OpenCL, this is done by splitting a task into multiple work-items. Each work-item runs the same kernel code in parallel and has its own set of data. Work-items can be grouped into work-groups. This can be done explicitly, where the programmer determines how many work-items are executed in parallel and how the work-items are distributed into the work-groups. Alternatively, the programmer only specifies how many work-items are executed in parallel; the distribution into work-groups is then done implicitly by the OpenCL implementation. A typical example of a data parallel operation is a vector addition, since the addition of one vector element is independent of the addition of the others and can therefore easily be done in parallel.
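As an illustration, the add_floats kernel invoked in Listings 3.2 and 3.3 could be written as follows (a sketch in the spirit of the JavaCL tutorial; the exact tutorial source may differ):
__kernel void add_floats(__global const float* a, __global const float* b,
                         __global float* out, int n)
{
    int i = get_global_id(0);   // one work-item per vector element
    if (i >= n)
        return;                 // guard against padded work-items
    out[i] = a[i] + b[i];
}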
Task Parallel Programming Model
In the task parallel programming model, a program is divided into multiple concurrent tasks. Each task is implemented in its own kernel, and each kernel runs on its own compute unit, with a single work-group containing one work-item.
An example of this is video processing, where the audio and the video are processed in parallel. This programming model, though, is not recommended for GPU programming, because launching work-groups with only one work-item is generally very inefficient: only warps of size one can then be executed, and concurrency is lost.
3.2 GPU Architecture
This section discusses basic GPU architecture principles. At the current time, there are two main vendors producing GPGPUs: NVIDIA and AMD. Since it is unknown what kind of GPU the end-user has, only fundamental principles are discussed. This chapter is based on the NVIDIA OpenCL programming guide [10] and therefore uses the NVIDIA terminology, but it is also easily applicable to the AMD architecture [11].
3.2.1 Basic Architecture
A basic model of a GPU can be divided into two parts: main memory and computation units. The main memory is just a large DRAM and is globally accessible by all computation units. The computation units consist of an array of streaming multiprocessors, each having its own shared memory and cache. Each streaming multiprocessor is capable of performing SIMD vector operations.
Figure 3.2: Comparison between CPUs and GPUs [10]
3.2.2 Execution Model
The NVIDIA programming guide describes the GPU architecture as "single instruction, multiple threads" (SIMT), analogous to the SIMD model from Flynn's taxonomy. In the SIMT model, one set of instructions is executed concurrently by multiple threads. Contrary to SIMD, the execution paths of the threads may diverge. Generally, this should be avoided, since a branching execution path of concurrent threads is serialized and not executed in parallel.
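A short kernel fragment illustrating such divergence (an illustrative example, not taken from the Groovy-Matrix code):
__kernel void divergent(__global float* data)
{
    int i = get_global_id(0);
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // even work-items take this branch
    else
        data[i] = data[i] + 1.0f;   // odd work-items take this branch
    // within one warp, the two branches are executed one after the other
}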
A single thread is represented in OpenCL as a work-item. Multiple work-items can be bundled into a work-group, and work-groups can be further bundled into a grid. A work-group is assigned to a single streaming multiprocessor, on which a set of work-items is processed concurrently. This set is called a warp and typically has a size of 32. This model makes a GPU especially suitable for data-parallel tasks, because only then can full warps be utilized.
Figure 3.3: Thread organization in OpenCL [10]
3.2.3 Memory Model
The memory on the GPU is divided into multiple parts: host memory, global memory, constant memory, local memory and private memory.
Figure 3.4: OpenCL memory hierarchy [8]
Host Memory
The host memory is the main memory of the CPU. The GPU has no direct access to it, so in order to access the data, it has to be transferred to the GPU first. This memory transaction can only be issued by the host.
Global Memory
The global memory resides in the main memory of the GPU. Therefore, it is the biggest in size and has a high bandwidth, but also a high latency. The global memory is generally not cached, except on GPUs with a higher compute capability (CUDA 2.0 and upward). The host as well as the kernel has full read and write access to the global memory, but new memory can only be allocated by the host. Every data transfer between CPU and GPU goes through the global memory.
Constant Memory
The constant memory also resides in the main memory of the GPU, but has different characteristics than the global memory. First of all, reads from constant memory are cached. Secondly, only the host has full read and write privileges and can allocate memory during runtime. The kernel has read-only access, but is allowed to allocate memory statically at compile time.
Local Memory
The local memory is the shared memory of a computation unit. Accessing it is fast, but on the flip side it is also very limited in size. Reading and writing is only permitted to the work-group which is currently active on the computation unit, so data exchanges across multiple work-groups have to be done via the global memory. The host may allocate local memory, but can neither read nor write to it.
Private Memory
The private memory belongs to a single work-item, which has exclusive read and write privileges. If the memory needed is small enough, it is cached locally on the computation unit. Otherwise, the data is allocated in the global memory.
4 GPU Implementation
4.1 Limitations of the Current Implementation
Due to the time constraints of a bachelor's thesis, the current implementation of the GPU version of the Groovy-Matrix has the following limitations:
• only n×m matrices are available,
• the data type of the matrix is limited to "float",
• the kernels are not fully optimized.
4.2 Internal Data-Structure
Structurally, the GPU implementation of the Groovy-Matrix is very similar to the CPU implementation. The data storage changed from an ArrayList to a CLBuffer. Since the Groovy-Matrix is meant for general purpose matrix operations, the whole matrix is stored as a flat array.
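Assuming row-major order (an assumption made here for illustration), the entry (row, col) of a matrix with cols columns is then addressed as:
// Row-major mapping of a 2D index to the flat array:
int index = row * cols + col;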
The second difference is the inclusion of a new class, JCLMKernels, which manages the GPU specific operations. The device, context and command queue are independent of the concrete matrices and are therefore stored as static members. The JCLMKernels class also manages the building of the OpenCL kernels. This has the advantage that every kernel has to be built only once and can then be reused by every matrix.
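A minimal sketch of this pattern (hypothetical names and structure; the actual JCLMKernels class is not reproduced here):
import java.util.HashMap;
import java.util.Map;
import com.nativelibs4java.opencl.*;

public class JCLMKernels {
    // device, context and command queue are shared by all matrices
    private static final CLContext context =
            JavaCL.createBestContext(CLPlatform.DeviceFeature.GPU);
    private static final CLQueue queue = context.createDefaultQueue();
    // each kernel is built once and cached for reuse
    private static final Map<String, CLKernel> cache = new HashMap<String, CLKernel>();

    public static CLContext getContext() { return context; }
    public static CLQueue getQueue() { return queue; }

    public static synchronized CLKernel getKernel(String name, String source) {
        CLKernel kernel = cache.get(name);
        if (kernel == null) {
            kernel = context.createProgram(source).createKernel(name);
            cache.put(name, kernel);
        }
        return kernel;
    }
}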
Figure 4.1: Class hierarchy of the GPU implementation
4.3 Kernel Implementations
The different methods of the Groovy-Matrix can be classified into a set of problem categories. Table 4.1 lists all methods with their classification. This chapter discusses how these categories are implemented in the GPU version of the Groovy-Matrix.
Table 4.1: Groovy-Matrix methods and classification

Category: Initialization
Methods: constructor, init(...) : void, reset() : void, extend(...) : void, delete(...) : void, setOrder(...) : void, assign() : void, setSize(...) : void
Description: Methods in this category require a (re)initialization of the complete matrix.

Category: Pure Data Parallel Operations
Methods: plus(...) : Matrix, minus(...) : Matrix, mult(...) : Matrix, div(...) : Matrix, log() : Matrix, subFrom(...) : Matrix
Description: One operation is executed concurrently over multiple data. Every operation is independent of the others.

Category: Reduction
Methods: equals(...) : boolean, max() : Object, maxIndex() : int, min() : Object, minIndex() : int, sum() : Object, sumAll() : Object, count() : int
Description: These operations take a set as input and reduce it down to a single value with a reduction function.

Category: Sorting
Methods: unique() : ArrayList<>
Description: unique() returns a sorted subset of distinct values from a given set.

Category: Searching
Methods: find(entry) : ArrayList<>, nfind(entry) : ArrayList<>
Description: find (nfind) returns all indices of the matrix (not) matching the entry.

Category: Other
Methods: data() : Object, val() : Object, valid() : boolean, entries() : int, indices() : ArrayList<>, get() : Matrix, put() : void, getDim() : int, size(...) : int, list() : ArrayList<>, isEmpty() : boolean, setName(...) : void, setNames(...) : void, getNames() : HashMap<>, resolve(...) : int, isAll(...) : boolean, convert(...) : Integer[], toString() : String
Description: These operations do not invoke any operations on the GPU.
4.3.1 Initialization
Methods of this category either initialize a new matrix or reinitialize an existing matrix. This is done in two steps:
• create a new buffer on the GPU,
• either fill the buffer with a default value or copy data appropriately from an old buffer.
The implementation of this category of algorithms is trivial. In the case of a first initialization, the data is simply uploaded into the GPU buffer. Otherwise the data has to be copied from an existing buffer. In the case of extend() and delete(), the size of the buffer can vary, which has to be taken into account. Listing 4.1 shows the implementation of the extend-kernel, which serves as an example of this category. The goal of this function is to append new rows or columns to a matrix. The existing rows and columns are to be retained, and the new ones are filled with a default value.
Listing 4.1: Extension Kernel
__kernel void extend(__global float* Old, __global float* New,
                     int oldRows, int oldCols, int newRows, int newCols,
                     float defaultval)
{
    int i = get_global_id(0);          // one work-item per entry of the new matrix
    if (i >= newRows * newCols)
        return;                        // padding work-items do nothing
    int row = i / newCols;             // row-major layout
    int col = i % newCols;
    if (row < oldRows && col < oldCols)
        New[i] = Old[row * oldCols + col];   // retained entry
    else
        New[i] = defaultval;                 // appended entry
}
The basic idea is to launch one work-item per entry of the new matrix. This can be problematic in certain cases. For example, if the new matrix has a size of 1 × p, where p is prime and bigger than the maximum work-group size, it is impossible to distribute the work-items effectively into multiple work-groups. The result would be that p work-groups are created, each containing only one work-item, and a lot of concurrency would be lost. To solve this problem, the number of work-items is increased to match either a multiple of the work-group size or of the warp size. In Chapter 5, the different approaches are tested with the log() function. Padding to the full work-group size showed a slightly better result and is therefore used.
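A possible implementation of the chosen padding (an illustrative helper, not necessarily the code used in Groovy-Matrix):
// Round the work-item count up to the next multiple of the work-group size:
static int padToWorkGroupSize(int workItems, int workGroupSize) {
    return ((workItems + workGroupSize - 1) / workGroupSize) * workGroupSize;
}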
4.3.2 Pure Data Parallel Operations
Methods of this category execute a single operation over one dataset. The operations have no internal data dependencies, so the implementation is trivial.
Listing 4.2: Addition Kernel
__kernel void AplBeqC(__global float* A, __global float* B,
                      __global float* C, int n)
{
    int i = get_global_id(0);
    if (i >= n)
        return;
    C[i] = A[i] + B[i];
}
This is very similar to the initialization category, with the advantage that the input and output buffers always have the same size. As in the initialization category, a bad matrix size can affect the performance, so the same solutions can be applied here. Since the operations are lightweight, the performance is heavily bandwidth-bound.
4.3.3 Reduction
Reduction operations take a data set and reduce it to a single value with a given commutative function. The JavaCL library already provides reductors for minimum, maximum, sum and multiplication, so these are used where possible.
Listing 4.3: JavaCL reduction
Reductor<Float> reductor = ReductionUtils.createReductor(
        Kernels.getContext(), ReductionUtils.Operation.Add,
        OpenCLType.Float, 1);
Float Out = reductor.reduce(Kernels.getQueue(), Buffer).get();
In the remaining cases, it is necessary to create OpenCL kernels. GPU accelerated reduction is a well-known problem; [12] discusses a basic reduction algorithm and various optimizations for it. The basic idea is that every work-item takes two values of the input array and reduces them to a single value with a given function, which is written to the output array, effectively halving the input size. The same operation is then repeated recursively on the output array until a single value is left.
Figure 4.2: Basic reduction without bank conflicts [12]
In order to parallelize it efficiently on the GPU, the input is split into multiple sub-arrays, one for each work-group. Every work-group loads its sub-array into the local memory and reduces it to a single value. Then, each work-group writes its result into the output array in the global memory, which has the size of the input array divided by the number of work-groups. If the output array is sufficiently small, the final reduction is done on the CPU. Otherwise, the output array is taken as the new input and the process is repeated.
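The host-side control flow for one pass of this scheme might look roughly as follows (a hypothetical sketch; buffer handling and the exact JavaCL signatures, e.g. LocalSize, may differ in detail):
import org.bridj.Pointer;
import com.nativelibs4java.opencl.*;

// One reduction pass over the kernel of Listing 4.4, finished on the CPU.
static boolean equalsOnGpu(CLContext context, CLQueue queue, CLKernel reduceEquals,
                           CLBuffer<Float> in, CLBuffer<Float> comp,
                           int n, int groupSize) {
    int groups = (n + groupSize - 1) / groupSize;   // one sub-array per work-group
    CLBuffer<Integer> out = context.createIntBuffer(CLMem.Usage.Output, groups);
    // the last argument allocates __local memory for the kernel
    reduceEquals.setArgs(in, n, comp, out, new LocalSize(4L * groupSize));
    CLEvent evt = reduceEquals.enqueueNDRange(queue,
            new int[]{ groups * groupSize }, new int[]{ groupSize });
    Pointer<Integer> partial = out.read(queue, evt);   // blocks until finished
    boolean result = true;                             // final reduction on the CPU
    for (int i = 0; i < groups; i++)
        result &= partial.get(i) != 0;
    return result;
}
If the partial-result array were still too large, it would instead be fed into a further kernel pass, as described above.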
Listing 4.4: Equals Kernel
__kernel void ReduceEquals(const __global float* In,
                           const uint n,
                           const __global float* Comp,
                           __global int* Out,
                           __local int* localMem)
{
    // each work-item compares one pair of entries; padding work-items store
    // the neutral element of the AND-reduction (true)
    if (get_global_id(0) < n)
        localMem[get_local_id(0)] = (In[get_global_id(0)] == Comp[get_global_id(0)]);
    else
        localMem[get_local_id(0)] = true;
    barrier(CLK_LOCAL_MEM_FENCE);

    // tree reduction in local memory
    for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (get_local_id(0) < stride) {
            localMem[get_local_id(0)] &= localMem[get_local_id(0) + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // the first work-item of each work-group writes the group's result
    if (get_local_id(0) == 0)
        Out[get_group_id(0)] = localMem[0];
}
Listing 4.4 shows the reduction kernel for the equals operation. Since the idea of the reduction algorithm is to always halve the input size, the input should have a size of 2^n. This, however, may not be the case. The problem is solved by running the kernel with k work-items, where k is the input size rounded up to the next power of 2. The work-items which access valid data fill the local memory with the correct data. The other work-items fill the local memory with the neutral element of the reduce operation, in this case the value "true". Since the data is now padded to fulfill the requirements, the reduction can proceed without any special cases. In the end, the first work-item of each work-group writes its result back into the global memory.
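The rounding to the next power of 2 could be implemented like this (an illustrative helper):
// Smallest power of two that is >= n (for n > 0):
static int nextPowerOfTwo(int n) {
    int k = Integer.highestOneBit(n);   // largest power of two <= n
    return (k == n) ? k : k << 1;
}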
4.3.4 Sorting
The only method in this category is the unique() function. The goal of this function is to extract every unique value of the matrix into a new vector. In the original Groovy-Matrix, this is done by first building a tree-set and then saving the nodes back into an ArrayList. The ArrayList then contains only sorted, unique values.
This method is not feasible on the GPU, so an alternative has to be found. Since the output should be sorted in the end, the idea is to modify a known sorting algorithm accordingly. Currently, there are three prominent approaches: radix sort, merge sort and sorting networks.
Out of these three, merge sort is the most promising because it is easy to implement and needs only small modifications to create the desired output. Merge sort takes two sorted sets as input; the output is the two sets merged together into a new sorted set. For the unique operation, only a small change is necessary: instead of a sorted set, the new requirement is a sorted set containing unique values. To achieve this, only a single line of code has to be changed (Listings 4.5 and 4.6).
Listing 4.5: Mergesort pseudocode
void merge(IN set1, IN set2, OUT outset)
{
    for (i = 0, j = 0; i < set1.size() && j < set2.size(); )
    {
        outset.add(min(set1[i], set2[j]));
        set1[i] <= set2[j] ? ++i : ++j;   // advance the set the minimum came from
    }
    // fill outset with remaining values from set1 or set2
}

Listing 4.6: Unique mergesort pseudocode
void merge(IN set1, IN set2, OUT outset)
{
    for (i = 0, j = 0; i < set1.size() && j < set2.size(); )
    {
        outset.add(min(set1[i], set2[j]));
        if (set1[i] <= set2[j]) ++i;      // on equal values both indices advance,
        if (set2[j] <= set1[i]) ++j;      // so duplicates are skipped
    }
    // fill outset with remaining values from set1 or set2 (skipping duplicates)
}
This can easily be ported to the GPU: each work-item merges two sets, similar to the reduction operation. But since the merging now involves sets instead of single items, this can be very inefficient when the matrix has many unique values. In the worst case, the sets double in size after each merge, while at the same time the concurrency is halved after each merge.
Figure 4.3: Example of the unique() merging
This problem is well known, and the papers [13], [14], [15] and [16] offer different solutions for general parallel sorting algorithms, yet they all require that the size of the data set does not change, which is not the case here. Since the expected output of the unique() operation is an ArrayList on the host side, one could consider sorting the entries of the matrix on the GPU with a sorting algorithm from one of the aforementioned papers; during the copy process from the buffer to the ArrayList, duplicate entries could then easily be identified and left out. This would, however, introduce the memory transfers from the GPU to the CPU and from the GPU buffer to the ArrayList as major bottlenecks, and it is questionable whether more performance could be achieved this way (see Section 5.5). Due to the time constraints, only the simple version of the merge sort was implemented and tested.
4.3.5 Searching
The find() function of the Groovy-Matrix falls into this category. It returns all indices of the matrix which match a given value. Since there is no further information on the structure of the matrix, the fastest search algorithm is a linear search. In order to parallelize the search, the matrix array is broken down into equally sized subgroups which are searched in parallel.
Because it is not possible to allocate memory dynamically on the GPU, it is necessary to preallocate the memory for the output. This is done in a pre-processing step, in which the matching items per subgroup are counted. The counting kernel creates exactly the desired output: by calling it only once, an array is created in the global memory of the GPU containing the match count for each work-group. If the subgroup size for the search is also chosen as the work-group size, the newly created array has the exact count of matches per subgroup. Furthermore, each work-item is able to calculate its write offset by summing up all entries of the array whose index is smaller than the global id of the work-item. The total amount of memory necessary for the output can be calculated by summing up the whole array.
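The counting kernel is not shown here; mirroring the structure of the find-kernel in Listing 4.7, it might look like this (a sketch under that assumption):
__kernel void countMatches(const __global float* In,
                           const uint n,
                           const uint itemsPerGroup,
                           const float comp,
                           __global int* numEntries)
{
    int g = get_global_id(0);              // one work-item per subgroup
    int startIndex = itemsPerGroup * g;
    int count = 0;
    for (int i = 0; i < itemsPerGroup && (startIndex + i) < n; ++i)
        if (In[startIndex + i] == comp)
            ++count;
    numEntries[g] = count;                 // one match count per subgroup
}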
Figure 4.4: Example of finding the element 3 in an array
Listing 4.7: Find Kernel
__kernel void findIndices(const __global float* In,
                          const uint n,
                          const uint itemsPerGroup,
                          const float comp,
                          __global int* Out,
                          __global int* numEntries)
{
    // one work-item per subgroup; subgroups without matches are skipped
    if (numEntries[get_global_id(0)] == 0)
        return;
    int startIndex = itemsPerGroup * get_global_id(0);
    // the write offset is the sum of the match counts of all preceding subgroups
    int writeIndex = 0;
    for (int i = 0; i < get_global_id(0); ++i)
        writeIndex += numEntries[i];
    // linear search within the subgroup
    for (int i = 0; i < itemsPerGroup && (startIndex + i) < n; ++i)
    {
        if (In[startIndex + i] == comp)
        {
            Out[writeIndex++] = startIndex + i;
        }
    }
}
5 Performance
This chapter examines the performance of the GPU implementation of the Groovy-Matrix in comparison to the CPU version. The performance is measured by the speedup s = t_CPU / t_GPU, where t_CPU is the execution time of the CPU version of the Groovy-Matrix and t_GPU the execution time of the GPU version. In Chapter 4, categories of algorithms were introduced. Since it is expected that all algorithms in a category behave similarly, only one representative of each category was tested.
5.1 Experimental Setup
The benchmarks were run on an AMD Opteron 2220 SE CPU with an Nvidia Quadro FX 5400 graphics card with CUDA compute capability 1.0. The Java version used was 1.6.0_27 with the OpenJDK runtime environment. The JavaCL version used was the javacl-1.0 snapshot-20120722 build.
The comparisons between the GPU and the CPU were done with multiple matrix entry sizes. The tested entry sizes were 2^n, n ∈ {2, ..., 24}, because most kernels require an input size of a power of 2 and are padded to match it otherwise.
One problem with benchmarking in Java is that the JVM has a warm-up phase in which a program is profiled to make code optimizations. To avoid this problem, the execution time of an operation was measured one hundred times for each entry size, and the best time of the series was taken for the benchmark. In some cases, the warm-up is nevertheless still visible for the smallest entry sizes in the CPU execution times. In order to get the most accurate results, the benchmarks were done directly with the Java version of the Groovy-Matrix.
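The measurement scheme can be summarized by the following sketch (illustrative; the actual benchmark harness may differ):
// Take the best of one hundred runs to mask JVM warm-up effects:
static long bestTimeNanos(Runnable operation) {
    long best = Long.MAX_VALUE;
    for (int run = 0; run < 100; run++) {
        long t0 = System.nanoTime();
        operation.run();
        best = Math.min(best, System.nanoTime() - t0);
    }
    return best;
}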
5.2 Initialization
Initialization is not performance-critical, and for this reason only the execution times were measured (Figure 5.1). The focus of this section is to explore the general behavior of Java and JavaCL. The benchmarks were done with the copy-constructor, mainly because of its similarity to the pure data parallel operations.
The first observation is that JavaCL kernel calls are asynchronous, which has to be taken into account for the benchmarks, as otherwise the wrong time would be measured. Figure 5.1 shows two measurements: one was made by calling the copy-constructor normally (non-blocking), the other one had a barrier implemented which waited until the kernel finished (blocking).
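In JavaCL terms, the two variants differ roughly as follows (a sketch; copyKernel stands for the copy-constructor's kernel and is a hypothetical name):
// Non-blocking: enqueueNDRange returns as soon as the kernel is queued.
CLEvent evt = copyKernel.enqueueNDRange(queue, new int[]{ n });
// Blocking: additionally wait until all queued commands have completed.
queue.finish();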
One thing to note is that the JVM profiles a program during execution and is able to optimize code sections at runtime, which is why the times decrease a bit until the matrix reaches 2^17 entries.
The sudden increase of the execution time after the matrix exceeds 2^17 elements is odd. Since both the blocking and the non-blocking case are affected by it, it is assumed that it is caused by the enqueueNDRange() function of the JavaCL API.
Up to 2^14 matrix elements, the execution time of the copy-constructor is not affected by the matrix size, because until then the maximum concurrency of the GPU is not reached. The kernel copies one entry per work-item from the old matrix to the new one, so for a matrix size of 2^14 = 16384 the same number of work-items is created. The Nvidia Quadro FX 5400 can execute 24 warps with warp size 32 concurrently per core. The total number of cores of the Quadro FX 5400 is 16, which means that a total of 24 × 32 × 16 = 12288 work-items can be executed in parallel, which lies right between 2^13 and 2^14.
Figure 5.1: Comparison between the blocking and non-blocking copy constructor
5.3 Pure Data Parallel Operations
The log() function is taken as the only representative of this category, since all operations of this category performed very similarly. Two sets of measurements were taken, the first with 2^n matrix entries, and the second with p matrix entries, where p is a prime number, p ∈ {3581, 8831, 47533, 105767, 285721, 610639, 1299821, 3496217, 7367197, 13832281}. The goal of the first measurement (Figures 5.2 and 5.3) is to examine the speedup under optimal conditions. The goal of the second one (Figures 5.4 and 5.5) is to examine how the performance is affected in the worst-case scenario and how much influence different padding strategies have.
5.3.1 Best-Case-Scenario
The best-case scenario shows little to no variation compared to the copy-constructor (Figures 5.2 and 5.3), and the same effects on performance are also present in this case. Overall, the log() operation performs very well on the GPU, and for big matrices a speedup of almost a factor of 1000 is achieved.
Figure 5.2: Time used by the log() operation
Figure 5.3: Speedup of the log() operation
5.3.2 Worst-Case-Scenario
For the worst case, three different padding strategies were tested for the GPU version: no padding, padding to a multiple of the maximum work-group size and padding to a multiple of the warp size.
The effects of a bad matrix size are clearly visible if no padding is used (Figures 5.4 and 5.5). Because the matrix size is a prime number, it is impossible to have more than one work-item per work-group. As a consequence, each warp also contains only a single work-item, and maximum concurrency is already reached at 24 warps × 16 cores = 384 work-items. Therefore, only a small speedup is achieved.
If padding is used, both strategies show good results, except for 3496217 and 13832281 entries, where the padding to the warp size performs worse than the padding to the work-group size. The distribution of the work-items into work-groups is left to the JavaCL API, which apparently creates less efficient work-group sizes in these cases. Therefore, the work-group padding is used for the pure data parallel operations.
Figure 5.4: Time comparison of different padding strategies
Figure 5.5: Speedup comparison of different padding strategies
5.4 Reduction
For this category, two representatives were chosen: the sumAll() and the equals() operation. sumAll() is natively supported by JavaCL; equals() had to be implemented anew. Besides examining how these operations perform in comparison to their CPU counterparts, another goal is to evaluate whether the native or the newly implemented reduction functions are more efficient.
Looking at the times of the sumAll() operation (Figures 5.6 and 5.7), it is striking that the execution time is almost constant (about 4000 ms) until the matrix reaches 2^20 elements. Because of this, no speedup is achieved until the matrix exceeds 2^16 elements, so this operation is very inefficient for small matrices. For big matrices, however, a good amount of speedup is achieved.
Figure 5.6: Time used by the sumAll() operation
Figure 5.7: Speedup of the sumAll() operation
The newly implemented reduction operation equals() behaves more as expected (Figures 5.8 and 5.9). Up to a matrix size of 2^15, the execution time is almost constant at 100 ms; from this point on, the time starts to scale linearly with the size. The speedup, however, is rather moderate: the GPU version becomes more efficient at a matrix size of about 2^13 elements, and only a maximum speedup of about a factor of 15 is achieved. This is mainly attributed to the fact that the CPU version of the equals() function is very fast.
Figure 5.8: Time used by the equals() operation
Figure 5.9: Speedup of the equals() operation
Compared to the native JavaCL reduction, the newly implemented reduction performs much better for small matrices. For large entry sizes, the execution times of both functions start to converge. Since the new implementation generally shows better performance, the JavaCL reduction should be replaced.
5.5 Sorting
This section discusses how well the unique() function performs, since it is the only operation in the sorting category. The performance of unique() depends strongly on how many values are unique. In order to fully examine the behavior of the unique() function, multiple measurements were taken with different levels of uniqueness: all values are the same, or 1%, 25%, 50% or 100% of the values are unique.
The CPU version behaves overall as expected: the bigger the matrix and the more unique values it contains, the more time the unique() operation needs (Figure 5.10). However, the case where every value is unique is an exception. The output is produced by creating a tree-set from the matrix entries; the values of the tree are then copied back into an ArrayList. Since the 100% case has to copy the most data, it should also use the most time, yet it is the second fastest and uses only about a third of the expected time. The only explanation for this is that the JVM was able to make better byte-code optimizations in this case.
Figure 5.10: CPU times of the unique() operation
The GPU version of the unique() function is the least optimized one, and this becomes visible in the execution times (Figure 5.11). As of now, there are three bottlenecks: no local memory is used, the merging of two sets is done by only a single thread, and the result has to be copied back to the host, since the expected output of the function is an ArrayList. With increasing uniqueness of the values, these bottlenecks gain more and more weight, because more global memory accesses are made, each thread has more work to do and more data has to be copied.
However, looking at the speedup, the GPU version still gains a speedup of a factor of 2 (50% case) up to a factor of 35 (single value) for big matrices (Figure 5.12). Because the 100% case is unusually fast on the CPU, the GPU version is overall slower there.
Figure 5.11: GPU times of the unique() operation
37
5 Performance
0.1
1
10
100
26 28 210 212 214 216 218 220 222
Spe
edup
Matrix entries
Speedup unique GPU operation
1 value1%
25%50%
100%
Figure 5.12: Speedup of the unique() operation.
While there is a lot of room for improvement in the unique kernel, it is questionable whether much more performance can actually be achieved. On the CPU, copying the complete data from the ArrayList into a TreeSet has a complexity of O(n * log(n)), and copying the data back into a new ArrayList takes another O(n) operations, where n is the entry size. The GPU version also needs to copy its data back to the host and then into an ArrayList. Assuming that the kernel scales optimally and needs only constant time to execute, the maximum achievable speedup is determined by the copy operations on the CPU, which also have a complexity of O(n). That means that, on average, the speedup in relation to the matrix size is approximately (n * log(n) + n) / n = log(n) + 1 ≈ log(n).
5.6 Searching
find() is the only operation in the searching category. The performance of this operation is also very dependent on how many matching entries are present in the matrix, so measurements were taken where 0%, 25%, 50% and 100% of the entries match the searched item.
The CPU implementation is realized as a linear search over an ArrayList. The output is a new ArrayList which contains the indices of the matching items. Therefore, the execution time is determined by the matrix size and by how many items are found, as can be seen in Figure 5.13.
Figure 5.13: CPU times of the find() operation
In the GPU implementation, the find() operation is realized in two steps: an initialization phase, in which the matching elements are counted in order to create the output buffer, and a second phase, in which the data is copied. The initialization is done by calling the count-kernel once. If no element is found, the operation can be terminated and an empty list is returned; because of that, the case where no element is found is much faster than the other cases (see Figure 5.14). Otherwise the second kernel is called, in which the matrix is split into multiple groups that are searched in parallel.
The kernel itself is independent of the number of matches, since each work-item always has to access all elements of its group. But since more data is copied from the GPU to the CPU with a higher match rate, the execution time increases with the number of items found.
Figure 5.14: GPU times of the find() operation
Overall, this operation benefited the least from the GPU implementation and shows no speedup at all in the average case. Like for the unique() operation, the bottleneck of find() lies mainly in the transfer of the data from the GPU into the ArrayList on the host. Since both the linear search on the CPU and the data transfer are O(n) operations, it is unlikely that much better results can be achieved.
Figure 5.15: Speedup of the find() operation
6 Conclusion
Overall, the GPU integration can be considered a success: most operations showed a good amount of speedup. As for the find() and unique() operations, it should be considered to change the returned data type from ArrayList to Matrix, so that the bottleneck of the data transfer from the GPU to the host would be removed.
Beyond that, the current implementation of the Groovy-Matrix lacks many matrix operations, like a real matrix multiplication, matrix inversion or the calculation of the determinant. Furthermore, the GPU implementation currently has multiple limitations. The only data type supported is float. The extension to other data types can easily be done by moving the .cl files for the kernels directly into the source code, in which the static data type is replaced with a placeholder for each type. Also, the matrix is currently stored as a flat array, which is fine for dense matrices. For sparse matrices, however, this is inefficient, since most entries are zero and storing them is not necessary. Supporting them would require a complete rewrite of the existing kernels, because entirely different data structures and access patterns would be required.
Finally, during the benchmarks a general problem with the JavaCL API surfaced: the cleanup of the GPU memory depends on the Java garbage collector, yet the memory occupation of the GPU is not tracked by the JVM. Unless this is fixed in a future release of JavaCL, it is advisable to enhance the GPU implementation with its own memory manager.
Bibliography
[1] Universal Java Matrix Package. http://sourceforge.net/projects/ujmp/, May 2013.
[2] JavaCL, UJMP Matrices. http://code.google.com/p/javacl/wiki/Goodies, May 2013.
[3] JCuda. http://www.jcuda.org/jcuda/JCuda.html, May 2013.
[4] CUBLAS. https://developer.nvidia.com/cublas, May 2013.
[5] KALDI, The CUDA Matrix library. http://kaldi.sourceforge.net/cudamatrix.html, May 2013.
[6] MAGMA, Matrix Algebra on GPU and Multicore Architectures. http://icl.cs.utk.edu/magma/index.html, May 2013.
[7] CUSP: A C++ Templated Sparse Matrix Library. http://cusplibrary.github.io/, May 2013.
[8] Khronos OpenCL Working Group. The OpenCL Specification. 2009.
[9] Olivier Chafik. Getting Started with JavaCL. http://code.google.com/p/javacl/wiki/GettingStarted, April 2012.
[10] Nvidia. OpenCL Programming Guide for the CUDA Architecture. 2012.
[11] Advanced Micro Devices (AMD). ATI Stream Computing OpenCL. 2010.
[12] Mark Harris. Optimizing Parallel Reduction in CUDA. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf, November 2007.
[13] Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne. High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs. 2010.
[14] Kliang Zhang, Jiajia Li, Gang Chen, and Baifen Wu. GPU Accelerate Parallel Odd-Even Merge Sort: An OpenCL Method. 2011.
[15] Nadathur Satish, Mark Harris, and Michael Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. IEEE, 2009.
[16] Erik Sintorn and Ulf Assarsson. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, pages 1381-1387, 2008.