= A fast GEMM implementation on a Cypress GPU =
by N. Nakasato (University of Aizu), submitted September 7, 2010.

This paper will be presented at the 1st International Workshop on
Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10),
held as part of SC10, New Orleans, November 13-19, 2010.

== abstract ==
We present benchmark results of optimized dense matrix multiplication
kernels for a Cypress GPU. We have written general matrix multiply (GEMM) kernels
for single (SP), double (DP), and double-double (DDP) precision.
Our SGEMM and DGEMM kernels reach 73% and 87% of
the theoretical peak performance of the GPU, respectively.
To our knowledge, they are currently the fastest
SGEMM and DGEMM kernels on a single GPU chip.
Furthermore, our matrix multiply kernel in DDP reaches 31 Gflop/s,
which is more than 200 times faster than the corresponding
result on a single core of a recent CPU (with mpack version 0.6.5).
We describe our GEMM kernels with the main focus on the SGEMM implementation,
since all GEMM kernels share common programming and optimization techniques.
While conventional wisdom in GPU programming recommends
heavy use of shared memory,
we show that the texture cache is very effective on the Cypress architecture.

== preliminary results ==
 * [wiki:"GEMM_Performance_Cypress"]
 * [wiki:"MatrixMultiply"]

== preprint ==
Posted later.

== Sample program for DGEMM ==

See [wiki:"Fast_GEMM_implementation_On_Cypress"] since we slightly changed the title ;-).