Changes between Version 12 and Version 13 of Fastest_GEMM_implementation_On_Cypress

Timestamp: Oct 11, 2010 8:35:57 AM
Author: nakasato
Comment: --

Fastest_GEMM_implementation_On_Cypress

-                      v12
+                      v13
+= A fast GEMM implementation on a Cypress GPU =
+by N.Nakasato (University of Aizu), submitted September 7, 2010.
+This paper will be presented at the 1st International Workshop on
+Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10),
+held as part of SC10, New Orleans, November 13-19, 2010.
+== abstract ==
+We present benchmark results of optimized dense matrix multiplication
+kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels
+for single (SP), double (DP) and double-double (DDP) precision.
+Our SGEMM and DGEMM kernels achieve 73% and 87% of
+the theoretical peak performance of the GPU, respectively.
+To our knowledge, our SGEMM and DGEMM kernels are currently
+the fastest on a single GPU chip.
+Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s.
+This is more than 200 times faster than the results obtained
+on a single core of a recent CPU (with mpack version 0.6.5).
+We describe our GEMM kernels with the main focus on the SGEMM implementation
+since all GEMM kernels share common programming and optimization techniques.
+While the conventional wisdom of GPU programming recommends
+heavy use of shared memory on GPUs,
+we show that the texture cache is very effective on the Cypress architecture.
+== preliminary results ==
+ * [wiki:"GEMM_Performance_Cypress"]
+ * [wiki:"MatrixMultiply"]
+== preprint ==
+To be posted later.
+== Sample program for DGEMM ==
+See [wiki:"Fast_GEMM_implementation_On_Cypress"], since we slightly changed the title ;-).