Changes between Version 8 and Version 9 of Fast_GEMM_implementation_On_Cypress

Timestamp: Nov 2, 2010 12:50:19 AM
Author: nakasato
Comment: --

Fast_GEMM_implementation_On_Cypress

-                      v8
+                      v9
 == abstract ==
 We present benchmark results of optimized dense matrix multiplication
-kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels
+kernels for Cypress GPU.
+We write general matrix multiply (GEMM) kernels
 for single (SP), double (DP) and double-double (DDP) precision.
-Our SGEMM and DGEMM kernels show 73% and 87% of
+Our SGEMM and DGEMM kernels show ~ 2 Tflop/s and ~ 470 Gflop/s, respectively.
+These results for SP and DP correspond to 73% and 87% of
 the theoretical performance of the GPU, respectively.
-Currently, our SGEMM and DGEMM kernels are fastest
-with one GPU chip to our knowledge.
+Currently, our SGEMM and DGEMM kernels are fastest with one GPU chip to our knowledge.
 Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s.
 This performance in DDP is more than 200 times faster than the performance
-in DDP on single core of a recent CPU (with mpack version 0.6.5).
-We describe our GEMM kernels with main focus on the SGEMM implementation
-since all GEMM kernels share common programming and optimization techniques.
-While a conventional wisdom of GPU programming recommends us
-to heavily use shared memory on GPUs,
-we show that texture cache is very effective on the Cypress architecture.
+results in DDP on single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with main focus on the SGEMM implementation
+since all GEMM kernels share common programming and optimization techniques. While a conventional wisdom of GPU programming recommends us
+to heavily use shared memory on GPUs, we show that texture cache is very effective on the Cypress architecture.
 == Sample program for DGEMM ==