Changes between Initial Version and Version 1 of Fast_GEMM_implementation_On_Cypress

                       v1
+= A fast GEMM implementation on a Cypress GPU =
+by N.Nakasato (University of Aizu), submitted September 7, 2010.
+We will present the results of this paper at the 1st International Workshop on
+Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10)
+held as part of SC10, New Orleans, November 13-19, 2010.
+== abstract ==
+We present benchmark results of optimized dense matrix multiplication
+kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels
+for single (SP), double (DP) and double-double (DDP) precision.
+Our SGEMM and DGEMM kernels achieve 73% and 87% of
+the theoretical peak performance of the GPU, respectively.
+To our knowledge, our SGEMM and DGEMM kernels are currently
+the fastest on a single GPU chip.
+Furthermore, our matrix multiply kernel in DDP achieves 31 Gflop/s.
+This is more than 200 times faster than the performance
+measured on a single core of a recent CPU (with mpack version 0.6.5).
+We describe our GEMM kernels with a main focus on the SGEMM implementation,
+since all GEMM kernels share common programming and optimization techniques.
+While the conventional wisdom of GPU programming recommends
+heavy use of shared memory on GPUs,
+we show that the texture cache is very effective on the Cypress architecture.
+== preliminary results ==
+ * [wiki:"GEMM_Performance_Cypress"]
+ * [wiki:"MatrixMultiply"]
+== preprint ==
+To be posted later.
+== Sample program for DGEMM ==