Changes between Version 8 and Version 9 of Fast_GEMM_implementation_On_Cypress
- Timestamp: Nov 2, 2010 12:50:19 AM
Fast_GEMM_implementation_On_Cypress
== abstract ==
We present benchmark results of optimized dense matrix multiplication kernels for the Cypress GPU.
We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision.
Our SGEMM and DGEMM kernels achieve ~ 2 Tflop/s and ~ 470 Gflop/s, respectively.
These results for SP and DP correspond to 73% and 87% of the theoretical performance of the GPU, respectively.
To our knowledge, our SGEMM and DGEMM kernels are currently the fastest on a single GPU chip.
Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s.
This is more than 200 times faster than DDP results on a single core of a recent CPU (with mpack version 0.6.5).

We describe our GEMM kernels with the main focus on the SGEMM implementation, since all GEMM kernels share common programming and optimization techniques.

Although conventional wisdom in GPU programming recommends heavy use of shared memory, we show that the texture cache is very effective on the Cypress architecture.

== Sample program for DGEMM ==
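As a point of reference only (this is not the authors' GPU kernel), the following minimal sketch shows what a DGEMM call computes, C = alpha*A*B + beta*C, and how a Gflop/s figure and a fraction of peak can be derived from the usual 2*m*n*k operation count. The function names are hypothetical, and the peak values of 2720 Gflop/s (SP) and 544 Gflop/s (DP) are assumptions: they are the commonly quoted figures for a Cypress-based Radeon HD 5870 at 850 MHz and are not stated on this page.

```python
# Reference sketch only: plain-Python GEMM plus flop-rate bookkeeping.
# All names here are illustrative and not taken from the original sample program.

def dgemm_ref(alpha, a, b, beta, c):
    """Return alpha*A*B + beta*C for row-major lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = alpha * acc + beta * c[i][j]
    return out


def gemm_flops(m, n, k):
    """Conventional GEMM operation count: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Achieved rate in Gflop/s for an m x n x k GEMM that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9


def efficiency(measured_gflops, peak_gflops):
    """Fraction of theoretical peak reached by a measured rate."""
    return measured_gflops / peak_gflops


# ASSUMED peaks (Radeon HD 5870, 850 MHz); not given in the text above.
PEAK_SP_GFLOPS = 2720.0
PEAK_DP_GFLOPS = 544.0

if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(dgemm_ref(1.0, a, b, 1.0, c))
    # Rounded rates from the abstract against the assumed peaks.
    print(efficiency(2000.0, PEAK_SP_GFLOPS), efficiency(470.0, PEAK_DP_GFLOPS))
```

With the rounded rates from the abstract, these assumed peaks give fractions close to the quoted 73% and 87%; the small differences come from the rounding of "~ 2 Tflop/s" and "~ 470 Gflop/s".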