Changes between Version 1 and Version 2 of Fastest_GEMM_implementation_On_Cypress

-                      v1
+                      v2
 = A fastest GEMM implementation on Cypress GPU =
-by N.Nakasato (University of Aizu)
-== abstrac ==
+by N.Nakasato (University of Aizu), submitted September 7, 2010
+== abstract ==
+We present benchmark results of optimized dense matrix multiplication
+kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels
+for single (SP), double (DP) and double-double (DDP) precision.
+Our SGEMM and DGEMM kernels reach 73% and 87% of
+the theoretical peak performance of the GPU, respectively.
+To our knowledge, our SGEMM and DGEMM kernels are currently
+the fastest on a single GPU chip.
+Furthermore, our matrix multiply kernel in DDP reaches 31 Gflop/s,
+more than 200 times faster than the same computation
+on a single core of a recent CPU (with mpack version 0.6.5).
+We describe our GEMM kernels with the main focus on the SGEMM implementation,
+since all GEMM kernels share common programming and optimization techniques.
+While conventional wisdom in GPU programming recommends
+heavy use of shared memory on GPUs,
+we show that the texture cache is very effective on the Cypress architecture.
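For readers unfamiliar with the operation being benchmarked, the following is a minimal CPU reference sketch of single-precision GEMM, not the GPU kernel described on this page. The function names `sgemm_ref` and `gemm_gflops`, the row-major layout, and the 2*m*n*k flop count are assumptions made for illustration. The flop count is the usual convention for reporting GEMM rates, but the page does not state which counting it uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference SGEMM (row-major): C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n. This is a plain triple loop for
 * checking results; it makes no attempt at the optimizations discussed here. */
void sgemm_ref(size_t m, size_t n, size_t k, float alpha,
               const float *a, const float *b, float beta, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (size_t p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}

/* Assumed convention: one multiply and one add per inner-loop step,
 * i.e. 2*m*n*k floating-point operations in total. */
double gemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}
```

A reference like this is typically used to validate an optimized kernel on small matrices, while `gemm_gflops` converts a measured run time into a rate in Gflop/s.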
+== Results ==