| 2 | | by N.Nakasato (University of Aizu) |
| 3 | | == abstrac == |
| | 2 | by N.Nakasato (University of Aizu), submitted September 7, 2010 |
| | 3 | == abstract == |
| | 4 | We present benchmark results of optimized dense matrix multiplication |
| | 5 | kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels |
| | 6 | for single (SP), double (DP) and double-double (DDP) precision. |
| | 7 | Our SGEMM and DGEMM kernels achieve 73% and 87% of |
| | 8 | the theoretical peak performance of the GPU, respectively. |
| | 9 | To our knowledge, our SGEMM and DGEMM kernels are currently |
| | 10 | the fastest on a single GPU chip. |
| | 11 | Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s. |
| | 12 | This is more than 200 times faster than the performance |
| | 13 | measured on a single core of a recent CPU (with mpack version 0.6.5). |
| | 14 | We describe our GEMM kernels with a main focus on the SGEMM implementation |
| | 15 | since all GEMM kernels share common programming and optimization techniques. |
| | 16 | While conventional wisdom in GPU programming recommends |
| | 17 | heavy use of shared memory on GPUs, |
| | 18 | we show that the texture cache is very effective on the Cypress architecture. |
| | 19 | |
| | 20 | == Results == |