Math Performance on a GPU

Christian Balkenius

Introduction

With the introduction of the CUDA environment for graphics cards from NVIDIA, it has become possible to run numerical calculations of the GPU much faster than is possible on the main processor. It would obviously be useful if Ikaros could use such functions and a number of tests were performed to test what could be gained from using the GPU for math operations.

The sgemm function is the work horse of many numerical algorithms and it was selected for the tests. The sgemm function is one of the subroutines of the BLAS library and performs multiplication between two matrices.

Using CUDA on OS X

The tests were run on OS X and since there were a few things that I did not immediately realize after installing CUDA on OS X, this section describes the few steps that are necessary to get CUDA to work.

1. Make sure the CUDA drivers are installed.

Since I did not have an NVIDIA graphics card when I installed Leopard, it did not seem that the drivers were installed. I reinstalled the 10.5.3 combo update of OS X and this appears to have corrected this problem. (It is possible that this was not necessary and that I simply misuderstood the error messages I got.)

2. Install CUDA.

The files are available at the NVIDA site.

3. Locate the libraries

The CUDA libraries are installed in /usr/local/cuda, but for some reason, this is not mentioned anywhere.

4. Set the path correctly to run the examples

The examples installed as part of CUDA do not run out of the box. First it is necessary to set the path to the libraries. This involved finding the hidden file .profile in your home directly and adding the following two lines (using for example TextWrangler):

   export PATH=/usr/local/cuda/bin:$PATH
   export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH

5. Run an example in XCode

Apart from adding the files to an XCode project, this involves four steps.

Drag the relevant CUDA library from /usr/local/cuda/lib to the project
Drag the relevant CUDA headers from /usr/local/cuda/include to the project
In the build settings, add the Header Search Path "/usr/local/cuda/include" and the Library Search Path /usr/local/cuda/lib. Make sure that you have select "All Configurations" before these values are set.
The last and least obvious step is to set the variable DYLD_LIBRARY_PATH to /usr/local/cuda/lib in the environment for the executable. This is done by selecting the executable in XCode and pressing the info icon and then selecting Arguments in the window that appears.

Testing CUDA

All test were run on a Mac Pro 2 x Dual-Core Intel Xeon at 2.66 GHz with 1.33 GHz bus speed. The GPU used was a NVIDIA GeForce 8800 GT.

A matrix multiplication function that multiplies two NxN matrices was tested with different N (128, 256, 512, 1024 and 2048). The time required for the multiplication was tested under five conditions:

CUBLAS: This implementation uses CUBLAS. The timing did not include the time required to move the data to and from the GPU.
CUBLAS+MEM: This second implementation was identical to above but the time needed to move data to and from the GPU is included. This is the execution time that would result if the CUBLAS functions are simply used to replace the standard BLAS functions.
CBLAS: This implementation uses the Accelerate framework from Apple. This framework is optimized to use the SSE instructions of the four processor cores simultaneously.
ikaros: This is the standard multiplication used in Ikaros when BLAS in not available.
simple: The final implementation is the simple version included with CUBLAS as a test of the performance without the GPU. This version was mainly included to test how well it would compare to the optimized Ikaros version of the function.

Results

The table below shows the results of the test. The test was not very scientific so the results should be taken with a grain of salt, but the figures still indicate something about the performance of the different methods. It is clear from the table that the execution time is dramatically different for the different cases. In the case with N = 2048, the cublas functions performed 3000 times faster than a straight forward implementation of the matrix multiplication.

N	cublas	cublas+mem	cblas	ikaros	simple
128	0.16	1	2	5	6
256	0.44	2	2	38	34
512	2.62	9	10	310	281
1024	19	39	54	8920	5923
2048	152	219	387	494116	497435

Time for the matrix operation (ms)

Discussion

Even if the performance gain was not as large when the memory operations were included, the CUBLAS functions were still faster than all the other alternatives in all cases. This shows that it is really worth the effort to use the GPU for these types of operations.

Although it was reasonably straight forward to use the cublas functions, there were also a few problems. One is that it was not possible to use the functions with a matrix larger than 2048 x 2048 since the GPU memory would otherwise be too small. This may be a problem with very large calculations.

An additional conclusion to draw from the results is that it may be worth the extra effort to implement whole algorithms on the GPU. If a sequence of operations can be performed without copying data into main memory, then it is possible to produce very fast implementations. For example, a neural network could keep its state and weight matrix on the GPU and only the input and output would have to be copied to it and from it.

This suggests that it may be a good idea to try to implement functions in Ikaros that supports this type of programming although it is unclear exactly how this should work. However, it would be necessary to first test how much the execution time decreases when non-linear operations or conditions are included in the calculations.

In conclusion, the simple experiments have shows that it can be useful to use the GPU for numerical calculations. Furthermore, the results suggest that it will probably be necessary to look at GPU processing in the future to keep Ikaros competitive. If the operations that copy data to and from the GPU are sufficient, then it would be easy to use functions instead of cblas.