= A Fast GEMM Implementation on a Cypress GPU =
by N. Nakasato (University of Aizu); submitted September 7, 2010; accepted for presentation October 10, 2010.
| 3 | |
We will present the results of this paper at the 1st International Workshop on
Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10, http://www.dcs.warwick.ac.uk/~sdh/pmbs10/pmbs10/Workshop_Home.html),
held as part of SC10, New Orleans, November 13-19, 2010.
| 7 | |
| 8 | == abstract == |
We present benchmark results of optimized dense matrix multiplication
kernels for the Cypress GPU.
We write general matrix multiply (GEMM) kernels
for single (SP), double (DP), and double-double (DDP) precision.
Our SGEMM and DGEMM kernels achieve ~2 Tflop/s and ~470 Gflop/s, respectively.
These results correspond to 73% and 87% of
the theoretical peak performance of the GPU in SP and DP, respectively.
To our knowledge, our SGEMM and DGEMM kernels are currently the fastest reported for a single GPU chip.
Furthermore, our matrix multiply kernel in DDP reaches 31 Gflop/s.
This is more than 200 times faster than the DDP performance
on a single core of a recent CPU (measured with mpack version 0.6.5).
We describe our GEMM kernels with a main focus on the SGEMM implementation,
since all GEMM kernels share common programming and optimization techniques.
While conventional wisdom in GPU programming recommends heavy use of shared memory,
we show that the texture cache is very effective on the Cypress architecture.
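
For reference, all three GEMM kernels compute the standard BLAS operation C = alpha*A*B + beta*C. The routine below is only a minimal plain-C correctness reference for that operation (the function name, column-major layout, and the non-transposed case are assumed for illustration); it is not the GPU kernel itself.

{{{
#!c
#include <stddef.h>

/* Reference DGEMM (illustration only): C = alpha*A*B + beta*C,
   with column-major A (m x k), B (k x n), and C (m x n). */
void dgemm_ref(size_t m, size_t n, size_t k,
               double alpha, const double *A, size_t lda,
               const double *B, size_t ldb,
               double beta, double *C, size_t ldc)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            double acc = 0.0;
            for (size_t l = 0; l < k; l++)
                acc += A[i + l*lda] * B[l + j*ldb];
            C[i + j*ldc] = alpha*acc + beta*C[i + j*ldc];
        }
    }
}
}}}

Double-double precision represents each number as an unevaluated sum of two doubles, giving roughly a 106-bit significand. The sketch below shows the standard error-free transformations (Knuth's two-sum and an FMA-based two-prod) on which double-double arithmetic is built; it illustrates the general technique only and is not necessarily the exact formulation used in our DDP kernel.

{{{
#!c
#include <math.h>

/* A double-double value: hi + lo with |lo| <= 0.5 ulp(hi). */
typedef struct { double hi, lo; } dd_t;

/* Error-free addition of two doubles (Knuth's two-sum). */
static dd_t two_sum(double a, double b)
{
    double s  = a + b;
    double bv = s - a;
    double e  = (a - (s - bv)) + (b - bv);
    return (dd_t){ s, e };
}

/* Error-free multiplication of two doubles using fused multiply-add. */
static dd_t two_prod(double a, double b)
{
    double p = a * b;
    double e = fma(a, b, -p);
    return (dd_t){ p, e };
}

/* Example building block: add a plain double to a double-double value. */
static dd_t dd_add_d(dd_t x, double y)
{
    dd_t s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo);  /* renormalize */
}
}}}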
| 22 | |
| 23 | |
| 24 | == Sample program for DGEMM == |
| 25 | |
| 26 | http://github.com/dadeba/dgemm_cypress/ |
| 27 | |
| 28 | == preliminary results == |
| 29 | * [wiki:"GEMM_Performance_Cypress"] |
| 30 | * [wiki:"MatrixMultiply"] |
| 31 | |
| 32 | == preprint == |
| 33 | |
| 34 | [attachment:Nakasato_PBMS2010.pdf] |