= A Fast GEMM Implementation on a Cypress GPU =
by N. Nakasato (University of Aizu); submitted September 7, 2010; accepted for presentation October 10, 2010.
| 3 | |
We will present the results of this paper at the 1st International Workshop on
Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10, http://www.dcs.warwick.ac.uk/~sdh/pmbs10/pmbs10/Workshop_Home.html),
held as part of SC10, New Orleans, November 13-19, 2010.
| 7 | |
| 8 | == abstract == |
We present benchmark results of optimized dense matrix multiplication
kernels for the Cypress GPU.
We write general matrix multiply (GEMM) kernels
for single (SP), double (DP), and double-double (DDP) precision.
Our SGEMM and DGEMM kernels achieve ~2 Tflop/s and ~470 Gflop/s, respectively.
These results correspond to 73% and 87% of
the theoretical peak performance of the GPU in SP and DP, respectively.
To our knowledge, our SGEMM and DGEMM kernels are currently the fastest reported for a single GPU chip.
Furthermore, our matrix multiply kernel in DDP reaches 31 Gflop/s.
This is more than 200 times faster than the DDP performance
on a single core of a recent CPU (measured with mpack version 0.6.5).
We describe our GEMM kernels with a main focus on the SGEMM implementation,
since all GEMM kernels share common programming and optimization techniques.
While conventional wisdom in GPU programming recommends heavy use of shared memory,
we show that the texture cache is very effective on the Cypress architecture.
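
For reference, all three GEMM kernels compute the standard BLAS operation C = alpha*A*B + beta*C. The routine below is only a minimal plain-C correctness reference for that operation (the function name, column-major layout, and the non-transposed case are assumed for illustration); it is not the GPU kernel itself.

{{{
#!c
#include <stddef.h>

/* Reference DGEMM (illustration only): C = alpha*A*B + beta*C,
   with column-major A (m x k), B (k x n), and C (m x n). */
void dgemm_ref(size_t m, size_t n, size_t k,
               double alpha, const double *A, size_t lda,
               const double *B, size_t ldb,
               double beta, double *C, size_t ldc)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            double acc = 0.0;
            for (size_t l = 0; l < k; l++)
                acc += A[i + l*lda] * B[l + j*ldb];
            C[i + j*ldc] = alpha*acc + beta*C[i + j*ldc];
        }
    }
}
}}}

Double-double precision represents each number as an unevaluated sum of two doubles, giving roughly a 106-bit significand. The sketch below shows the standard error-free transformations (Knuth's two-sum and an FMA-based two-prod) on which double-double arithmetic is built; it illustrates the general technique only and is not necessarily the exact formulation used in our DDP kernel.

{{{
#!c
#include <math.h>

/* A double-double value: hi + lo with |lo| <= 0.5 ulp(hi). */
typedef struct { double hi, lo; } dd_t;

/* Error-free addition of two doubles (Knuth's two-sum). */
static dd_t two_sum(double a, double b)
{
    double s  = a + b;
    double bv = s - a;
    double e  = (a - (s - bv)) + (b - bv);
    return (dd_t){ s, e };
}

/* Error-free multiplication of two doubles using fused multiply-add. */
static dd_t two_prod(double a, double b)
{
    double p = a * b;
    double e = fma(a, b, -p);
    return (dd_t){ p, e };
}

/* Example building block: add a plain double to a double-double value. */
static dd_t dd_add_d(dd_t x, double y)
{
    dd_t s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo);  /* renormalize */
}
}}}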
| 22 | |
| 23 | |
| 24 | == Sample program for DGEMM == |
| 25 | |
| 26 | http://github.com/dadeba/dgemm_cypress/ |
| 27 | |
| 28 | == preliminary results == |
| 29 | * [wiki:"GEMM_Performance_Cypress"] |
| 30 | * [wiki:"MatrixMultiply"] |
| 31 | |
| 32 | == preprint == |
| 33 | |
| 34 | [attachment:Nakasato_PBMS2010.pdf] |