Ruprecht-Karls-Universität Heidelberg
Institut für Informatik
Lehrstuhl für Parallele und Verteilte Systeme
Bachelor Thesis
Integrating GPU accelerated matrix operations into the Groovy programming language
Name: Robert Kühl
Matriculation number: 2551645
Supervisor: Arthur Andrzejak
Date of submission: 30.06.2013
I affirm that I have written this bachelor's thesis independently and have used only the cited sources and aids.
Heidelberg, 27.06.2013
Abstract
Due to the demand for more and more realistic graphics in the consumer market, graphics cards have seen a huge increase in performance since the mid-2000s. Because of this and the advent of programmable shaders, and later CUDA and OpenCL, they became more and more interesting for high performance computing. This thesis examines how GPU computing can be integrated into the Groovy-Matrix project. Groovy-Matrix is a general purpose matrix class for the Groovy programming language which offers easy to use matrix operations.
The integration is achieved by creating a Java back-end which utilizes the JavaCL API in order to access the graphics hardware. The matrix operations fall into four different categories: pure data parallel operations, reduction, searching and sorting. While the first two categories greatly benefit from the GPU integration, the other two show little to no speedup, depending on the matrix structure.
Zusammenfassung
Given the market's increasing demand for ever more realistic graphics, the performance of graphics cards has risen considerably since the mid-2000s. Because of this, and with the introduction of programmable shaders and later of CUDA and OpenCL, they became increasingly interesting for high performance computing. This thesis examines how computing on graphics cards can be integrated into the Groovy-Matrix project. Groovy-Matrix is a general purpose matrix class for the Groovy programming language which offers easy to use matrix operations.
The integration is achieved by creating a Java back-end that accesses the graphics hardware via the JavaCL API. The matrix operations can be divided into four categories: purely data parallel operations, reduction, searching and sorting. While the first two categories benefit greatly from the GPU integration, the other two show little to no speedup, depending on the structure of the matrix.
Contents
1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Thesis Structure
2 Background
  2.1 Similar Works
  2.2 Groovy-Matrix
    2.2.1 Functionality
    2.2.2 Structure
3 GPU Programming
  3.1 OpenCL / JavaCL
    3.1.1 OpenCL
    3.1.2 JavaCL
    3.1.3 General Code Structure
    3.1.4 Programming Models
  3.2 GPU Architecture
    3.2.1 Basic Architecture
    3.2.2 Execution Model
    3.2.3 Memory Model
4 GPU Implementation
  4.1 Limitations of the Current Implementation
  4.2 Internal Data-Structure
  4.3 Kernel Implementations
    4.3.1 Initialization
    4.3.2 Pure Data Parallel Operations
    4.3.3 Reduction
    4.3.4 Sorting
    4.3.5 Searching
5 Performance
  5.1 Experimental Setup
  5.2 Initialization
  5.3 Pure Data Parallel Operations
    5.3.1 Best-Case-Scenario
    5.3.2 Worst-Case-Scenario
  5.4 Reduction
  5.5 Sorting
  5.6 Searching
6 Conclusion
Bibliography
1 Introduction
1.1 Motivation
Groovy is an agile and dynamic language for the Java Virtual Machine (JVM). Its syntax is close to that of the Java programming language, but it adds several features on top of it. The main features are the inclusion of closures, dynamic typing and the ability to run a program either as a script or as compiled byte code. While mostly compatible with Java, Groovy avoids many unnecessary language constructs and is often shorter and easier to read. For example, Groovy does not need a class with an encapsulated main method, or semicolons at the end of a statement. So a typical "Hello World" program in Java:
Listing 1.1: Java Hello World
public class HelloWorld {
public static void main(String[] args) {
System.out.println("Hello World!");
}
}
can be written in Groovy as:
Listing 1.2: Groovy Hello World
println 'Hello World!'
This, however, comes at a price: performance. While the performance gap between Java and Groovy was somewhat closed in 2012 with the release of Groovy 2.0 and the inclusion of static typing, Java still outperforms Groovy in the average case, especially when dynamic typing is used. This makes Groovy interesting for cases where performance limitations are not an issue, like web applications, which are often bound by IO bandwidth. Such limitations are not acceptable in many scientific fields, where typically large numerical systems have to be solved. Here, low level programming languages like C or C++ are dominant because they offer maximum performance with little overhead.
With the advent of CUDA and OpenCL, GPGPU computing became more and more important because traditional CPUs are nowadays heavily outperformed by GPUs.
Yet this comes with a problem: GPU programming is not very user-friendly and can use up a considerable amount of time. Parallelizing a problem, determining the correct ND-ranges, and taking care of correct memory usage and memory access patterns can be a very intimidating task for the average user who has no prior experience in high performance computing.
It would be desirable to combine these two worlds: the ease of use of the Groovy programming language and the efficiency of GPU programming. Currently, the only practical way to achieve this is through APIs. The user is given an easy to use front-end in which only certain functions are available; these are then executed transparently for the user on the GPU. As part of the diploma thesis "Improving Programming Efficiency via Rich Data Structures", Aram Altschudjian created the Groovy-Matrix project. The goal of this project was to provide easy to handle matrix and vector structures for the Groovy, Scala and Java programming languages. At this time, the Groovy-Matrix runs solely on the CPU. Since matrix operations generally perform very well on GPUs, it is interesting to examine whether it is possible to integrate GPU support into the Groovy-Matrix.
1.2 Goals
The goal of this thesis is to enhance the Groovy-Matrix such that the various available operations are executed transparently for the user on the GPU. To achieve this goal, it is necessary to analyze the structure of the GPU hardware. Since GPU programming is not natively supported by the Java programming language, a way has to be found to integrate it.
To assess the performance, benchmarks have to be run in order to decide whether the Groovy-Matrix benefited from the GPU integration or not.
1.3 Thesis Structure
Chapter 2 gives an overview of similar projects and introduces the Groovy-Matrix. Then, the principles of the GPU architecture as well as GPU programming with OpenCL and JavaCL are discussed in Chapter 3. The GPU implementation of the Groovy-Matrix is detailed in Chapter 4. Chapter 5 shows how well the GPU implementation performs in comparison to the CPU version. Chapter 6 presents the conclusion.
2 Background
2.1 Similar Works
Although the Groovy-Matrix is the first GPU accelerated matrix class for the Groovy programming language, there are multiple similar projects in other programming languages. The most similar to this project is the "Universal Java Matrix Package" (UJMP) [1], which is currently integrated into the JavaCL API [2], yet at the current stage only a few operations are supported. A less user friendly, but more powerful GPU matrix integration into Java is JCublas from the JCuda API [3]. For the C/C++ programming languages, there are various projects of this kind. The most famous is CUBLAS from Nvidia [4], which is also utilized in JCublas. Other noteworthy libraries are KALDI [5], MAGMA [6] and CUSP [7].
2.2 Groovy-Matrix
2.2.1 Functionality
At the current stage, the Groovy-Matrix supports only basic arithmetic functions like matrix addition or element-wise matrix multiplication. More complex functions like a true matrix multiplication, matrix inversion or the calculation of eigenvalues and eigenvectors are not available. Instead, there are various operations to reorganize the matrix, like adding or removing dimensions, rows and columns. The user has the option to filter out unique values or to search for specific elements. Furthermore, typical reduce operations exist: it is possible to sum up all entries or to search for the minimum or maximum entry of the matrix. Additionally, it is possible to label the rows and columns, which gives an alternate way to access each entry. A full list of the operations is given in Chapter 4 (Table 4.1).
Listing 2.1: Sample code of the Groovy-Matrix
//creating two matrices
mat = new MatrixJC()
mat.init(5.0f, [5, 5])
mat2 = new MatrixJC()
mat2.init(2.0f, [5,5])
//summation of two matrices
mat3 = mat + mat2
//append 2 entries to dimension 1
mat3.extend(1,2)
//print unique values of mat3
print mat3.unique()
2.2.2 Structure
The interface de�ning the methods of Groovy-Matrix is simply called �Matrix� and
is implemented by the class MatrixJC in the Groovy, Scala and Java programming
languages. For performance reasons, the core functionality of the Groovy-Matrix is
implemented in the Java version of the MatrixJC. The Groovy and Scala version of the
MatrixJC are facade classes which give the user a nice interface, but bind internally into
the java version.
The data of the matrix is simply stored into a multidimensional ArrayList, which makes
the Groovy-Matrix more suitable for dense matrices.
Figure 2.1: Class hierarchy of the Groovy-Matrix
3 GPU Programming
3.1 OpenCL / JavaCL
3.1.1 OpenCL
The Open Computing Language (OpenCL) is an open standard for cross-platform parallel programming. The goal of OpenCL is to unify the programming of different accelerators (such as CPUs, GPUs, CELL/B.E. etc.) in one language. This is done by splitting the source code into two parts: host code and kernel code.
The host code contains the main program and typically runs on the CPU. The host code is normally written in the C programming language, but there are also several bindings for other programming languages. The kernel code contains the code sections for the accelerator. It is written in an OpenCL-specific, C-like language.
Figure 3.1: The OpenCL platform model [8]
3.1.2 JavaCL
To integrate OpenCL code into the Groovy-Matrix, it is necessary to create bindings
between Java and OpenCL. This is already done by the JavaCL project. JavaCL is an
API which wraps the OpenCL library in Java code via the JNA framework. JavaCL
currently fully supports OpenCL version 1.0. Furthermore, JavaCL adds various tools,
like an OpenCL code generator, a random number generator and multiple parallel reduce
functions. However, the code for the kernels still has to be written in OpenCL.
3.1.3 General Code Structure
Creating an OpenCL application can be divided into four steps: setup, building the kernel, executing the kernel, and cleanup. The code samples are taken from the getting started guide of JavaCL [9].
Setup
First, the application has to be prepared to run an OpenCL kernel. In this step, basic IO buffers have to be created to transfer data between the host and the kernel device.
Listing 3.1: setup phase
// Imports needed by this and the following samples:
import com.nativelibs4java.opencl.*;
import com.nativelibs4java.opencl.CLMem.Usage;
import com.nativelibs4java.util.IOUtils;
import org.bridj.Pointer;
import java.nio.ByteOrder;
import static org.bridj.Pointer.allocateFloats;
import static java.lang.Math.*;

CLContext context = JavaCL.createBestContext();
CLQueue queue = context.createDefaultQueue();
ByteOrder byteOrder = context.getByteOrder();

int n = 1024;
Pointer<Float>
    aPtr = allocateFloats(n).order(byteOrder),
    bPtr = allocateFloats(n).order(byteOrder);
for (int i = 0; i < n; i++) {
    aPtr.set(i, (float) cos(i));
    bPtr.set(i, (float) sin(i));
}
Building the Kernel
In order to build a kernel, the API has to know for which device and in what kind of context it has to be built. In OpenCL, it is necessary to select a device and a context manually via the corresponding API calls. In JavaCL, however, it is possible to let JavaCL create them automatically with the JavaCL.createBestContext() function. This function takes a "device feature" as an argument, which allows the user to prefer certain devices like the GPU. This is very useful when it comes to integrating JavaCL into the Groovy-Matrix, since it is unknown which devices the end-user has. Therefore, the enhanced Groovy-Matrix can run on any device supported by OpenCL.
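For example, a context preferring GPU devices can be requested as follows (a minimal sketch using the JavaCL API):
// Prefer a GPU device; JavaCL falls back to other OpenCL devices if none is present.
CLContext context = JavaCL.createBestContext(CLPlatform.DeviceFeature.GPU);
CLQueue queue = context.createDefaultQueue();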
Once this is done, the kernel code has to be loaded as a string. The string can be passed to the context to build a program, and finally this program can create the kernel.
Listing 3.2: kernel building phase
// Read the program sources and compile them:
String src = IOUtils.readText(JavaCLTutorial1.class.getResource("TutorialKernels.cl"));
CLProgram program = context.createProgram(src);
CLKernel addFloatsKernel = program.createKernel("add_floats");
Executing the Kernel
First, memory has to be allocated on the device. This is done by calling the createBuffer functions of the context. Additionally, a command queue has to be created to issue commands to the device. Then the kernel arguments have to be set, and the kernel is executed by enqueueing it into the command queue. After the kernel execution has finished, the result can be read via the command queue.
Listing 3.3: execution phase
// Create OpenCL input buffers (using the native memory pointers aPtr and bPtr):
CLBuffer<Float>
    a = context.createBuffer(Usage.Input, aPtr),
    b = context.createBuffer(Usage.Input, bPtr);
// Create an OpenCL output buffer:
CLBuffer<Float> out = context.createBuffer(Usage.Output, n);
// Set arguments and call the kernel
addFloatsKernel.setArgs(a, b, out, n);
CLEvent addEvt = addFloatsKernel.enqueueNDRange(queue, new int[] { n });
Cleanup
The cleanup phase consists of transferring memory back to the original data structure and cleaning up the buffers.
Listing 3.4: cleanup phase
// Blocks until add_floats has finished:
Pointer<Float> outPtr = out.read(queue, addEvt);
// Cleaning of the buffers is done by the garbage collector.
3.1.4 Programming Models
OpenCL features two programming models: data parallelism and task parallelism. There is no strict separation between the two models, so hybrid solutions are also possible. However, the primary model behind OpenCL is the data parallel one, since it is best supported by GPUs [8].
Data Parallel Programming Model
The data parallel model is the main focus of OpenCL and the one best supported by GPUs. In this model, a series of instructions is applied in parallel to different data, similar to the SIMD architecture of Flynn's taxonomy. In OpenCL, this is done by splitting a task into multiple work-items. Each work-item runs the same kernel code in parallel and has its own set of data. Work-items can be grouped into work-groups. This can be done explicitly, where the programmer determines how many work-items are executed in parallel and how the work-items are distributed into the work-groups. Alternatively, the programmer only specifies how many work-items are executed in parallel; the distribution into work-groups is then done implicitly by the OpenCL implementation. A typical example of a data parallel operation is a vector addition, since the addition of one vector element is independent of the addition of the others and can therefore easily be done in parallel.
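As an illustration, the add_floats kernel invoked in Listings 3.2 and 3.3 could be written as follows (a sketch in the spirit of the JavaCL tutorial; the exact tutorial source may differ):
__kernel void add_floats(__global const float* a, __global const float* b,
                         __global float* out, int n)
{
    int i = get_global_id(0);   // one work-item per vector element
    if (i >= n)
        return;                 // guard against padded work-items
    out[i] = a[i] + b[i];
}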
Task Parallel Programming Model
In the task parallel programming model, a program is divided into multiple concurrent tasks. Each task is implemented in its own kernel, and each kernel runs on its own compute unit, with a single work-group containing one work-item.
An example of this is video processing, where the audio and the video are processed in parallel. This programming model, though, is not recommended for GPU programming, because launching work-groups with only one work-item is generally very inefficient: only warps of size one can then be executed, and concurrency is lost.
3.2 GPU Architecture
This section discusses basic GPU architecture principles. At the current time, there are two main vendors producing GPGPUs: NVIDIA and AMD. Since it is unknown what kind of GPU the end-user has, only fundamental principles are discussed. This chapter is based on the NVIDIA OpenCL programming guide [10] and therefore uses the NVIDIA terminology, but it is also easily applicable to the AMD architecture [11].
3.2.1 Basic Architecture
A basic model of a GPU can be divided into two parts: main memory and computation units. The main memory is just a large DRAM and is globally accessible by all computation units. The computation units consist of an array of streaming multiprocessors, each having its own shared memory and cache. Each streaming multiprocessor is capable of performing SIMD vector operations.
Figure 3.2: Comparison between CPUs and GPUs [10]
3.2.2 Execution Model
The NVIDIA programming guide describes the GPU architecture as "single instruction, multiple threads" (SIMT), analogous to the SIMD model from Flynn's taxonomy. In the SIMT model, one set of instructions is executed concurrently by multiple threads. Contrary to SIMD, the execution paths of the threads may diverge. Generally, this should be avoided, since a branching execution path of concurrent threads is serialized and not executed in parallel.
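A short kernel fragment illustrating such divergence (an illustrative example, not taken from the Groovy-Matrix code):
__kernel void divergent(__global float* data)
{
    int i = get_global_id(0);
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // even work-items take this branch
    else
        data[i] = data[i] + 1.0f;   // odd work-items take this branch
    // within one warp, the two branches are executed one after the other
}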
A single thread is represented in OpenCL as a work-item. Multiple work-items can be bundled into a work-group, and work-groups can be further bundled into a grid. A work-group is assigned to a single streaming multiprocessor, on which a set of work-items is processed concurrently. This set is called a warp and typically has a size of 32. This model makes a GPU especially suitable for data-parallel tasks, because only then can full warps be utilized.
Figure 3.3: Thread organization in OpenCL [10]
3.2.3 Memory Model
The memory on the GPU is divided into multiple parts: host memory, global memory, constant memory, local memory and private memory.
Figure 3.4: OpenCL memory hierarchy [8]
Host Memory
The host memory is the main memory of the CPU. The GPU has no direct access to it, so in order to access the data, it has to be transferred to the GPU first. This memory transaction can only be issued by the host.
Global Memory
The global memory resides in the main memory of the GPU. Therefore, it is the biggest in size and has a high bandwidth, but also a high latency. The global memory is generally not cached, except on GPUs with a higher compute capability (CUDA 2.0 and upward). The host as well as the kernel has full read and write access to the global memory, but new memory can only be allocated by the host. Every data transfer between CPU and GPU goes through the global memory.
Constant Memory
The constant memory also resides in the main memory of the GPU, but has different characteristics than the global memory. First of all, reads from constant memory are cached. Secondly, only the host has full read and write privileges and can allocate memory during runtime. The kernel has read-only access, but is allowed to allocate memory statically at compile time.
Local Memory
The local memory is the shared memory of a computation unit. Accessing it is fast, but on the flip side it is also very limited in size. Reading and writing is only permitted to the work-group which is currently active on the computation unit, so data exchanges across multiple work-groups have to be done via the global memory. The host may allocate local memory, but can neither read nor write to it.
Private Memory
The private memory belongs to a single work-item, which has exclusive read and write privileges. If the memory needed is small enough, it is cached locally on the computation unit. Otherwise, the data is allocated in the global memory.
4 GPU Implementation
4.1 Limitations of the Current Implementation
Due to the time constraints of a bachelor's thesis, the current implementation of the GPU version of the Groovy-Matrix has the following limitations:
• only n×m matrices are available,
• the data type of the matrix is limited to "float",
• the kernels are not fully optimized.
4.2 Internal Data-Structure
Structurally, the GPU implementation of the Groovy-Matrix is very similar to the CPU implementation. The data storage changed from an ArrayList to a CLBuffer. Since the Groovy-Matrix is meant for general purpose matrix operations, the whole matrix is stored as a flat array.
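Assuming row-major order (an assumption made here for illustration), the entry (row, col) of a matrix with cols columns is then addressed as:
// Row-major mapping of a 2D index to the flat array:
int index = row * cols + col;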
The second difference is the inclusion of a new class, JCLMKernels, which manages the GPU specific operations. The device, context and command queue are independent of the concrete matrices and are therefore stored as static members. The JCLMKernels class also manages the building of the OpenCL kernels. This has the advantage that every kernel has to be built only once and can then be reused by every matrix.
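A minimal sketch of this pattern (hypothetical names and structure; the actual JCLMKernels class is not reproduced here):
import java.util.HashMap;
import java.util.Map;
import com.nativelibs4java.opencl.*;

public class JCLMKernels {
    // device, context and command queue are shared by all matrices
    private static final CLContext context =
            JavaCL.createBestContext(CLPlatform.DeviceFeature.GPU);
    private static final CLQueue queue = context.createDefaultQueue();
    // each kernel is built once and cached for reuse
    private static final Map<String, CLKernel> cache = new HashMap<String, CLKernel>();

    public static CLContext getContext() { return context; }
    public static CLQueue getQueue() { return queue; }

    public static synchronized CLKernel getKernel(String name, String source) {
        CLKernel kernel = cache.get(name);
        if (kernel == null) {
            kernel = context.createProgram(source).createKernel(name);
            cache.put(name, kernel);
        }
        return kernel;
    }
}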
Figure 4.1: Class hierarchy of the GPU implementation
4.3 Kernel Implementations
The different methods of the Groovy-Matrix can be classified into a set of problem categories. Table 4.1 lists all methods with their classification. This chapter discusses how these categories are implemented in the GPU version of the Groovy-Matrix.
Table 4.1: Groovy-Matrix methods and classification

Category: Initialization
Methods: constructor, init(...) : void, reset() : void, extend(...) : void, delete(...) : void, setOrder(...) : void, assign() : void, setSize(...) : void
Description: Methods in this category require a (re)initialization of the complete matrix.

Category: Pure Data Parallel Operations
Methods: plus(...) : Matrix, minus(...) : Matrix, mult(...) : Matrix, div(...) : Matrix, log() : Matrix, subFrom(...) : Matrix
Description: One operation is executed concurrently over multiple data. Every operation is independent of the others.

Category: Reduction
Methods: equals(...) : boolean, max() : Object, maxIndex() : int, min() : Object, minIndex() : int, sum() : Object, sumAll() : Object, count() : int
Description: These operations take a set as input and reduce it down to a single value with a reduction function.

Category: Sorting
Methods: unique() : ArrayList<>
Description: unique() returns a sorted subset of distinct values from a given set.

Category: Searching
Methods: find(entry) : ArrayList<>, nfind(entry) : ArrayList<>
Description: find (nfind) returns all indices of the matrix (not) matching the entry.

Category: Other
Methods: data() : Object, val() : Object, valid() : boolean, entries() : int, indices() : ArrayList<>, get() : Matrix, put() : void, getDim() : int, size(...) : int, list() : ArrayList<>, isEmpty() : boolean, setName(...) : void, setNames(...) : void, getNames() : HashMap<>, resolve(...) : int, isAll(...) : boolean, convert(...) : Integer[], toString() : String
Description: These operations do not invoke any operations on the GPU.
4.3.1 Initialization
Methods of this category either initialize a new matrix or reinitialize an existing matrix. This is done in two steps:
• create a new buffer on the GPU,
• either fill the buffer with a default value or copy data appropriately from an old buffer.
The implementation of this category of algorithms is trivial. In the case of a first initialization, the data is simply uploaded into the GPU buffer. Otherwise the data has to be copied from an existing buffer. In the case of extend() and delete(), the size of the buffer can vary, which has to be taken into account. Listing 4.1 shows the implementation of the extend-kernel, which serves as an example of this category. The goal of this function is to append new rows or columns to a matrix. The existing rows and columns are to be retained, and the new ones are filled with a default value.
Listing 4.1: Extension Kernel
__kernel void extend(__global float* Old, __global float* New,
                     int oldRows, int oldCols, int newRows, int newCols,
                     float defaultval)
{
    int i = get_global_id(0);          // one work-item per entry of the new matrix
    if (i >= newRows * newCols)
        return;                        // padding work-items do nothing
    int row = i / newCols;             // row-major layout
    int col = i % newCols;
    if (row < oldRows && col < oldCols)
        New[i] = Old[row * oldCols + col];   // retained entry
    else
        New[i] = defaultval;                 // appended entry
}
The basic idea is to launch one work-item per entry of the new matrix. This can be problematic in certain cases. For example, if the new matrix has a size of 1 × p, where p is prime and bigger than the maximum work-group size, it is impossible to distribute the work-items effectively into multiple work-groups. The result would be that p work-groups are created, each containing only one work-item, and a lot of concurrency would be lost. To solve this problem, the number of work-items is increased to match either a multiple of the work-group size or of the warp size. In Chapter 5, the different approaches are tested with the log() function. Padding to the full work-group size showed a slightly better result and is therefore used.
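A possible implementation of the chosen padding (an illustrative helper, not necessarily the code used in Groovy-Matrix):
// Round the work-item count up to the next multiple of the work-group size:
static int padToWorkGroupSize(int workItems, int workGroupSize) {
    return ((workItems + workGroupSize - 1) / workGroupSize) * workGroupSize;
}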
4.3.2 Pure Data Parallel Operations
Methods of this category execute a single operation over one dataset. The operations have no internal data dependencies, so the implementation is trivial.
Listing 4.2: Addition Kernel
__kernel void AplBeqC(__global float* A, __global float* B,
                      __global float* C, int n)
{
    int i = get_global_id(0);
    if (i >= n)
        return;
    C[i] = A[i] + B[i];
}
This is very similar to the initialization category, with the advantage that the input and output buffers always have the same size. As in the initialization category, a bad matrix size can affect the performance, so the same solutions can be applied here. Since the operations are lightweight, the performance is heavily bandwidth-bound.
4.3.3 Reduction
Reduction operations take a data set and reduce it to a single value with a given commutative function. The JavaCL library already provides reductors for minimum, maximum, sum and multiplication, so these are used where possible.
Listing 4.3: JavaCL reduction
Reductor<Float> reductor = ReductionUtils.createReductor(
        Kernels.getContext(), ReductionUtils.Operation.Add,
        OpenCLType.Float, 1);
Float Out = reductor.reduce(Kernels.getQueue(), Buffer).get();
In the remaining cases, it is necessary to create OpenCL kernels. GPU accelerated reduction is a well-known problem; [12] discusses a basic reduction algorithm and various optimizations for it. The basic idea is that every work-item takes two values of the input array and reduces them to a single value with a given function, which is written to the output array, effectively halving the input size. The same operation is then repeated recursively on the output array until a single value is left.
Figure 4.2: Basic reduction without bank conflicts [12]
In order to parallelize it efficiently on the GPU, the input is split into multiple sub-arrays, one for each work-group. Every work-group loads its sub-array into the local memory and reduces it to a single value. Then, each work-group writes its result into the output array in the global memory, which has the size of the input array divided by the number of work-groups. If the output array is sufficiently small, the final reduction is done on the CPU. Otherwise, the output array is taken as the new input and the process is repeated.
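The host-side control flow for one pass of this scheme might look roughly as follows (a hypothetical sketch; buffer handling and the exact JavaCL signatures, e.g. LocalSize, may differ in detail):
import org.bridj.Pointer;
import com.nativelibs4java.opencl.*;

// One reduction pass over the kernel of Listing 4.4, finished on the CPU.
static boolean equalsOnGpu(CLContext context, CLQueue queue, CLKernel reduceEquals,
                           CLBuffer<Float> in, CLBuffer<Float> comp,
                           int n, int groupSize) {
    int groups = (n + groupSize - 1) / groupSize;   // one sub-array per work-group
    CLBuffer<Integer> out = context.createIntBuffer(CLMem.Usage.Output, groups);
    // the last argument allocates __local memory for the kernel
    reduceEquals.setArgs(in, n, comp, out, new LocalSize(4L * groupSize));
    CLEvent evt = reduceEquals.enqueueNDRange(queue,
            new int[]{ groups * groupSize }, new int[]{ groupSize });
    Pointer<Integer> partial = out.read(queue, evt);   // blocks until finished
    boolean result = true;                             // final reduction on the CPU
    for (int i = 0; i < groups; i++)
        result &= partial.get(i) != 0;
    return result;
}
If the partial-result array were still too large, it would instead be fed into a further kernel pass, as described above.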
Listing 4.4: Equals Kernel
__kernel void ReduceEquals(const __global float* In,
                           const uint n,
                           const __global float* Comp,
                           __global int* Out,
                           __local int* localMem)
{
    // each work-item compares one pair of entries; padding work-items store
    // the neutral element of the AND-reduction (true)
    if (get_global_id(0) < n)
        localMem[get_local_id(0)] = (In[get_global_id(0)] == Comp[get_global_id(0)]);
    else
        localMem[get_local_id(0)] = true;
    barrier(CLK_LOCAL_MEM_FENCE);

    // tree reduction in local memory
    for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (get_local_id(0) < stride) {
            localMem[get_local_id(0)] &= localMem[get_local_id(0) + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // the first work-item of each work-group writes the group's result
    if (get_local_id(0) == 0)
        Out[get_group_id(0)] = localMem[0];
}
Listing 4.4 shows the reduction kernel for the equals operation. Since the idea of the reduction algorithm is to always halve the input size, the input should have a size of 2^n. This, however, may not be the case. The problem is solved by running the kernel with k work-items, where k is the input size rounded up to the next power of 2. The work-items which access valid data fill the local memory with the correct data. The other work-items fill the local memory with the neutral element of the reduce operation, in this case the value "true". Since the data is now padded to fulfill the requirements, the reduction can proceed without any special cases. In the end, the first work-item of each work-group writes its result back into the global memory.
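The rounding to the next power of 2 could be implemented like this (an illustrative helper):
// Smallest power of two that is >= n (for n > 0):
static int nextPowerOfTwo(int n) {
    int k = Integer.highestOneBit(n);   // largest power of two <= n
    return (k == n) ? k : k << 1;
}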
4.3.4 Sorting
The only method in this category is the unique() function. The goal of this function is to extract every unique value of the matrix into a new vector. In the original Groovy-Matrix, this is done by first building a tree-set and then saving the nodes back into an ArrayList. The ArrayList then contains only sorted, unique values.
This method is not feasible on the GPU, so an alternative has to be found. Since the output should be sorted in the end, the idea is to modify a known sorting algorithm accordingly. Currently, there are three prominent approaches: radix sort, merge sort and sorting networks.
Out of these three, merge sort is the most promising because it is easy to implement and needs only small modifications to create the desired output. Merge sort takes two sorted sets as input; the output is the two sets merged together into a new sorted set. For the unique operation, only a small change is necessary: instead of a sorted set, the new requirement is a sorted set containing unique values. To achieve this, only a single line of code has to be changed (Listings 4.5 and 4.6).
Listing 4.5: Mergesort pseudocode
void merge(IN set1, IN set2, OUT outset)
{
    for (i = 0, j = 0; i < set1.size() && j < set2.size(); )
    {
        outset.add(min(set1[i], set2[j]));
        set1[i] <= set2[j] ? ++i : ++j;   // advance the set the minimum came from
    }
    // fill outset with remaining values from set1 or set2
}

Listing 4.6: Unique mergesort pseudocode
void merge(IN set1, IN set2, OUT outset)
{
    for (i = 0, j = 0; i < set1.size() && j < set2.size(); )
    {
        outset.add(min(set1[i], set2[j]));
        if (set1[i] <= set2[j]) ++i;      // on equal values both indices advance,
        if (set2[j] <= set1[i]) ++j;      // so duplicates are skipped
    }
    // fill outset with remaining values from set1 or set2 (skipping duplicates)
}
This can easily be ported to the GPU: each work-item merges two sets, similar to the reduction operation. But since the merging now involves sets instead of single items, this can be very inefficient when the matrix has many unique values. In the worst case, the sets double in size after each merge, while at the same time the concurrency is halved after each merge.
Figure 4.3: Example of the unique() merging
This problem is well known, and the papers [13], [14], [15] and [16] offer different solutions for general parallel sorting algorithms, yet they all require that the size of the data set does not change, which is not the case here. Since the expected output of the unique() operation is an ArrayList on the host side, one could consider sorting the entries of the matrix on the GPU with a sorting algorithm from one of the aforementioned papers; during the copy process from the buffer to the ArrayList, duplicate entries could then easily be identified and left out. This would, however, introduce the memory transfers from the GPU to the CPU and from the GPU buffer to the ArrayList as major bottlenecks, and it is questionable whether more performance could be achieved this way (see Section 5.5). Due to the time constraints, only the simple version of the merge sort was implemented and tested.
4.3.5 Searching
The find() function of the Groovy-Matrix falls into this category. It returns all indices of the matrix which match a given value. Since there is no further information on the structure of the matrix, the fastest search algorithm is a linear search. In order to parallelize the search, the matrix array is broken down into equally sized subgroups which are searched in parallel.
Because it is not possible to allocate memory dynamically on the GPU, it is necessary to preallocate the memory for the output. This is done in a pre-processing step, in which the matching items per subgroup are counted. The counting kernel creates exactly the desired output: by calling it only once, an array is created in the global memory of the GPU containing the match count for each work-group. If the subgroup size for the search is also chosen as the work-group size, the newly created array has the exact count of matches per subgroup. Furthermore, each work-item is able to calculate its write offset by summing up all entries of the array whose index is smaller than the global id of the work-item. The total amount of memory necessary for the output can be calculated by summing up the whole array.
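The counting kernel is not shown here; mirroring the structure of the find-kernel in Listing 4.7, it might look like this (a sketch under that assumption):
__kernel void countMatches(const __global float* In,
                           const uint n,
                           const uint itemsPerGroup,
                           const float comp,
                           __global int* numEntries)
{
    int g = get_global_id(0);              // one work-item per subgroup
    int startIndex = itemsPerGroup * g;
    int count = 0;
    for (int i = 0; i < itemsPerGroup && (startIndex + i) < n; ++i)
        if (In[startIndex + i] == comp)
            ++count;
    numEntries[g] = count;                 // one match count per subgroup
}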
Figure 4.4: Example of finding the element 3 in an array
Listing 4.7: Find Kernel
__kernel void findIndices(const __global float* In,
                          const uint n,
                          const uint itemsPerGroup,
                          const float comp,
                          __global int* Out,
                          __global int* numEntries)
{
    // one work-item per subgroup; subgroups without matches are skipped
    if (numEntries[get_global_id(0)] == 0)
        return;
    int startIndex = itemsPerGroup * get_global_id(0);
    // the write offset is the sum of the match counts of all preceding subgroups
    int writeIndex = 0;
    for (int i = 0; i < get_global_id(0); ++i)
        writeIndex += numEntries[i];
    // linear search within the subgroup
    for (int i = 0; i < itemsPerGroup && (startIndex + i) < n; ++i)
    {
        if (In[startIndex + i] == comp)
        {
            Out[writeIndex++] = startIndex + i;
        }
    }
}
5 Performance
This chapter examines the performance of the GPU implementation of the Groovy-Matrix in comparison to the CPU version. The performance is measured by the speedup s = t_CPU / t_GPU, where t_CPU is the execution time of the CPU version of the Groovy-Matrix and t_GPU the execution time of the GPU version. In Chapter 4, categories of algorithms were introduced. Since it is expected that all algorithms in a category behave similarly, only one representative of each category was tested.
5.1 Experimental Setup
The benchmarks were run on an AMD Opteron 2220 SE CPU with an Nvidia Quadro FX 5400 graphics card with CUDA compute capability 1.0. The Java version used was 1.6.0_27 with the OpenJDK runtime environment. The JavaCL version used was the javacl-1.0 snapshot-20120722 build.
The comparisons between the GPU and the CPU were done with multiple matrix entry sizes. The tested entry sizes were 2^n, n ∈ {2, ..., 24}, because most kernels require an input size of a power of 2 and are padded to match it otherwise.
One problem with benchmarking in Java is that the JVM has a warm-up phase in which a program is profiled to make code optimizations. To avoid this problem, the execution time of an operation was measured one hundred times for each entry size, and the best time of the series was taken for the benchmark. In some cases, the warm-up is nevertheless still visible for the smallest entry sizes in the CPU execution times. In order to get the most accurate results, the benchmarks were done directly with the Java version of the Groovy-Matrix.
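The measurement scheme can be summarized by the following sketch (illustrative; the actual benchmark harness may differ):
// Take the best of one hundred runs to mask JVM warm-up effects:
static long bestTimeNanos(Runnable operation) {
    long best = Long.MAX_VALUE;
    for (int run = 0; run < 100; run++) {
        long t0 = System.nanoTime();
        operation.run();
        best = Math.min(best, System.nanoTime() - t0);
    }
    return best;
}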
5.2 Initialization
Initialization is not performance-critical, and for this reason only the execution times were measured (Figure 5.1). The focus of this section is to explore the general behavior of Java and JavaCL. The benchmarks were done with the copy-constructor, mainly because of its similarity to the pure data parallel operations.
The first observation is that JavaCL kernel calls are asynchronous, which has to be taken into account for the benchmarks, as otherwise the wrong time would be measured. Figure 5.1 shows two measurements: one was made by calling the copy-constructor normally (non-blocking), the other one had a barrier implemented which waited until the kernel finished (blocking).
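In JavaCL terms, the two variants differ roughly as follows (a sketch; copyKernel stands for the copy-constructor's kernel and is a hypothetical name):
// Non-blocking: enqueueNDRange returns as soon as the kernel is queued.
CLEvent evt = copyKernel.enqueueNDRange(queue, new int[]{ n });
// Blocking: additionally wait until all queued commands have completed.
queue.finish();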
One thing to note is that the JVM profiles a program during execution and is able to optimize code sections at runtime, which is why the times decrease a bit until the matrix reaches 2^17 entries.
The sudden increase of the execution time after the matrix exceeds 2^17 elements is odd. Since both the blocking and the non-blocking case are affected by it, it is assumed that it is caused by the enqueueNDRange() function of the JavaCL API.
Up to 2^14 matrix elements, the execution time of the copy-constructor is not affected by the matrix size, because until then the maximum concurrency of the GPU is not reached. The kernel copies one entry per work-item from the old matrix to the new one, so for a matrix size of 2^14 = 16384 the same number of work-items is created. The Nvidia Quadro FX 5400 can execute 24 warps with warp size 32 concurrently per core. The total number of cores of the Quadro FX 5400 is 16, which means that a total of 24 × 32 × 16 = 12288 work-items can be executed in parallel, which lies right between 2^13 and 2^14.
Figure 5.1: Comparison between the blocking and non-blocking copy constructor
5.3 Pure Data Parallel Operations
The log() function is taken as the only representative of this category, since all operations of this category performed very similarly. Two sets of measurements were taken, the first with 2^n matrix entries, and the second with p matrix entries, where p is a prime number, p ∈ {3581, 8831, 47533, 105767, 285721, 610639, 1299821, 3496217, 7367197, 13832281}. The goal of the first measurement (Figures 5.2 and 5.3) is to examine the speedup under optimal conditions. The goal of the second one (Figures 5.4 and 5.5) is to examine how the performance is affected in the worst-case scenario and how much influence different padding strategies have.
5.3.1 Best-Case-Scenario
The best-case scenario shows little to no variation compared to the copy-constructor (Figures 5.2 and 5.3), and the same effects on performance are also present in this case. Overall, the log() operation performs very well on the GPU, and for big matrices a speedup of almost a factor of 1000 is achieved.
Figure 5.2: Time used by the log() operation
Figure 5.3: Speedup of the log() operation
5.3.2 Worst-Case-Scenario
For the worst case, three different padding strategies were tested for the GPU version: no padding, padding to a multiple of the maximum work-group size and padding to a multiple of the warp size.
The effects of a bad matrix size are clearly visible if no padding is used (Figures 5.4 and 5.5). Because the matrix size is a prime number, it is impossible to have more than one work-item per work-group. As a consequence, each warp also contains only a single work-item, and maximum concurrency is already reached at 24 warps × 16 cores = 384 work-items. Therefore, only a small speedup is achieved.
If padding is used, both strategies show good results, except for 3496217 and 13832281 entries, where the padding to the warp size performs worse than the padding to the work-group size. The distribution of the work-items into work-groups is left to the JavaCL API, which apparently creates less efficient work-group sizes in these cases. Therefore, the work-group padding is used for the pure data parallel operations.
Figure 5.4: Time comparison of different padding strategies
Figure 5.5: Speedup comparison of different padding strategies
5.4 Reduction
For this category, two representatives were chosen: the sumAll() and the equals() operation. sumAll() is natively supported by JavaCL; equals() had to be implemented anew. Besides examining how these operations perform in comparison to their CPU counterparts, another goal is to evaluate whether the native or the newly implemented reduction functions are more efficient.
Looking at the times of the sumAll() operation (Figures 5.6 and 5.7), it is striking that the execution time is almost constant (about 4000 ms) until the matrix reaches 2^20 elements. Because of this, no speedup is achieved until the matrix exceeds 2^16 elements, so this operation is very inefficient for small matrices. For big matrices, however, a good amount of speedup is achieved.
Figure 5.6: Time used by the sumAll() operation
Figure 5.7: Speedup of the sumAll() operation
The newly implemented reduction operation equals() behaves more as expected (Figures 5.8 and 5.9). Up to a matrix size of 2^15, the execution time is almost constant at 100 ms; from this point on, the time starts to scale linearly with the size. The speedup, however, is rather moderate: the GPU version becomes more efficient at a matrix size of about 2^13 elements, and only a maximum speedup of about a factor of 15 is achieved. This is mainly attributed to the fact that the CPU version of the equals() function is very fast.
Figure 5.8: Time used by the equals() operation
Figure 5.9: Speedup of the equals() operation
Compared to the native JavaCL reduction, the newly implemented reduction performs much better for small matrices. For large entry sizes, the execution times of both functions start to converge. Since the new implementation generally shows better performance, the JavaCL reduction should be replaced.
5.5 Sorting
This section discusses how well the unique() function performs, since it is the only operation in the sorting category. The performance of unique() depends strongly on how many values are unique. In order to fully examine the behavior of the unique() function, multiple measurements were taken with different levels of uniqueness: all values are the same, or 1%, 25%, 50% or 100% of the values are unique.
The CPU version behaves overall as expected: the bigger the matrix and the more unique values it contains, the more time the unique() operation needs (Figure 5.10). However, the case where every value is unique is an exception. The output is produced by creating a tree-set from the matrix entries; the values of the tree are then copied back into an ArrayList. Since the 100% case has to copy the most data, it should also use the most time, yet it is the second fastest and uses only about a third of the expected time. The only explanation for this is that the JVM was able to make better byte-code optimizations in this case.
Figure 5.10: CPU times of the unique() operation
The GPU version of the unique() function is the least optimized one, and this becomes visible in the execution times (Figure 5.11). As of now, there are three bottlenecks: no local memory is used, the merging of two sets is done by only a single thread, and the result has to be copied back to the host, since the expected output of the function is an ArrayList. With increasing uniqueness of the values, these bottlenecks gain more and more weight, because more global memory accesses are made, each thread has more work to do and more data has to be copied.
However, looking at the speedup, the GPU version still gains a speedup of a factor of 2 (50% case) up to a factor of 35 (single value) for big matrices (Figure 5.12). Because the 100% case is unusually fast on the CPU, the GPU version is overall slower there.
Figure 5.11: GPU times of the unique() operation
37
5 Performance
0.1
1
10
100
26 28 210 212 214 216 218 220 222
Spe
edup
Matrix entries
Speedup unique GPU operation
1 value1%
25%50%
100%
Figure 5.12: Speedup of the unique() operation.
While there is a lot of room for improvement in the unique kernel, it is questionable whether much more performance can actually be achieved. On the CPU, copying the complete data from the ArrayList into a TreeSet has a complexity of O(n * log(n)), and copying the data back into a new ArrayList takes another O(n) operations, where n is the entry size. The GPU version also needs to copy its data back to the host and then into an ArrayList. Assuming that the kernel scales optimally and needs only constant time to execute, the maximum achievable speedup is determined by the copy operations on the CPU, which also have a complexity of O(n). That means that, on average, the speedup in relation to the matrix size is approximately (n * log(n) + n) / n = log(n) + 1 ≈ log(n).
5.6 Searching
find() is the only operation in the searching category. The performance of this operation is also very dependent on how many matching entries are present in the matrix, so measurements were taken where 0%, 25%, 50% and 100% of the entries match the searched item.
The CPU implementation is realized as a linear search over an ArrayList. The output is a new ArrayList which contains the indices of the matching items. Therefore, the execution time is determined by the matrix size and by how many items are found, as can be seen in Figure 5.13.
Figure 5.13: CPU times of the find() operation
In the GPU implementation, the find() operation is realized in two steps: an initialization phase, in which the matching elements are counted in order to create the output buffer, and a second phase, in which the data is copied. The initialization is done by calling the count-kernel once. If no element is found, the operation can be terminated and an empty list is returned; because of that, the case where no element is found is much faster than the other cases (see Figure 5.14). Otherwise the second kernel is called, in which the matrix is split into multiple groups that are searched in parallel.
The kernel itself is independent of the number of matches, since each work-item always has to access all elements of its group. But since more data is copied from the GPU to the CPU with a higher match rate, the execution time increases with the number of items found.
Figure 5.14: GPU times of the find() operation
Overall, this operation benefited the least from the GPU implementation and shows no speedup at all in the average case. Like for the unique() operation, the bottleneck of find() lies mainly in the transfer of the data from the GPU into the ArrayList on the host. Since both the linear search on the CPU and the data transfer are O(n) operations, it is unlikely that much better results can be achieved.
Figure 5.15: Speedup of the find() operation
6 Conclusion
Overall, the GPU integration can be considered a success: most operations showed a good amount of speedup. As for the find() and unique() operations, it should be considered to change the returned data type from ArrayList to Matrix, so that the bottleneck of the data transfer from the GPU to the host would be removed.
Beyond that, the current implementation of the Groovy-Matrix lacks many matrix operations, like a real matrix multiplication, matrix inversion or the calculation of the determinant. Furthermore, the GPU implementation currently has multiple limitations. The only data type supported is float. The extension to other data types can easily be done by moving the .cl files for the kernels directly into the source code, in which the static data type is replaced with a placeholder for each type. Also, the matrix is currently stored as a flat array, which is fine for dense matrices. For sparse matrices, however, this is inefficient, since most entries are zero and storing them is not necessary. Supporting them would require a complete rewrite of the existing kernels, because entirely different data structures and access patterns would be required.
Finally, during the benchmarks a general problem with the JavaCL API surfaced: the cleanup of the GPU memory depends on the Java garbage collector, yet the memory occupation of the GPU is not tracked by the JVM. Unless this is fixed in a future release of JavaCL, it is advisable to enhance the GPU implementation with its own memory manager.
Bibliography
[1] Universal Java Matrix Package. http://sourceforge.net/projects/ujmp/, May 2013.
[2] JavaCL, UJMP Matrices. http://code.google.com/p/javacl/wiki/Goodies, May 2013.
[3] JCuda. http://www.jcuda.org/jcuda/JCuda.html, May 2013.
[4] CUBLAS. https://developer.nvidia.com/cublas, May 2013.
[5] KALDI, The CUDA Matrix library. http://kaldi.sourceforge.net/cudamatrix.html, May 2013.
[6] MAGMA, Matrix Algebra on GPU and Multicore Architectures. http://icl.cs.utk.edu/magma/index.html, May 2013.
[7] CUSP: A C++ Templated Sparse Matrix Library. http://cusplibrary.github.io/, May 2013.
[8] Khronos OpenCL Working Group. The OpenCL Specification. 2009.
[9] Olivier Chafik. Getting Started with JavaCL. http://code.google.com/p/javacl/wiki/GettingStarted, April 2012.
[10] Nvidia. OpenCL Programming Guide for the CUDA Architecture. 2012.
[11] Advanced Micro Devices (AMD). ATI Stream Computing OpenCL. 2010.
[12] Mark Harris. Optimizing Parallel Reduction in CUDA. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf, November 2007.
[13] Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne. High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs. 2010.
[14] Kliang Zhang, Jiajia Li, Gang Chen, and Baifen Wu. GPU Accelerate Parallel Odd-Even Merge Sort: An OpenCL Method. 2011.
[15] Nadathur Satish, Mark Harris, and Michael Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. IEEE, 2009.
[16] Erik Sintorn and Ulf Assarsson. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, pages 1381-1387, 2008.