A Fast GEMM Implementation on a Cypress GPU

by N. Nakasato (University of Aizu), submitted September 7, 2010.

We will present the results of this paper at the 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), held as part of SC10, New Orleans, November 13-19, 2010.

abstract

We present benchmark results for optimized dense matrix multiplication kernels on a Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP), and double-double (DDP) precision. Our SGEMM and DGEMM kernels achieve 73% and 87% of the theoretical performance of the GPU, respectively. To our knowledge, our SGEMM and DGEMM kernels are currently the fastest on a single GPU chip. Furthermore, our matrix multiply kernel in DDP reaches 31 Gflop/s, which is more than 200 times the DDP performance on a single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with the main focus on the SGEMM implementation, since all of the GEMM kernels share common programming and optimization techniques. While the conventional wisdom of GPU programming recommends heavy use of shared memory on GPUs, we show that the texture cache is very effective on the Cypress architecture.
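
For readers unfamiliar with how GEMM figures such as "Gflop/s" and "percent of theoretical performance" are usually derived, the following is a minimal sketch. It is not the kernel described in the paper: it is a plain CPU reference of the standard operation C = alpha*A*B + beta*C, together with the conventional 2*M*N*K flop count. The function names (`gemm`, `gemm_gflops`, `efficiency`) are hypothetical, and the paper's benchmarks may count operations differently.

```python
def gemm(alpha, a, b, beta, c):
    """Reference GEMM: return alpha * A*B + beta * C.

    a is M x K, b is K x N, c is M x N, all as row-major lists of lists.
    This is an unoptimized illustration, not the GPU kernel from the paper.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    assert len(c) == m and all(len(row) == n for row in c), "C must be M x N"

    # Start from beta * C, then accumulate alpha * A*B into it.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            aip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += aip * b[p][j]
    return out


def gemm_gflops(m, n, k, seconds):
    """Flop rate in Gflop/s, assuming the conventional 2*M*N*K flop count
    (one multiply and one add per inner-product term)."""
    return 2.0 * m * n * k / seconds / 1e9


def efficiency(measured_gflops, peak_gflops):
    """Fraction of a theoretical peak that a measured rate represents."""
    return measured_gflops / peak_gflops


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(gemm(2.0, a, b, 1.0, c))
    # Illustrative inputs only, not measurements from the paper.
    rate = gemm_gflops(1000, 1000, 1000, 1.0)
    print(rate, efficiency(rate, 4.0))
```

The timing and peak values passed in the usage lines are placeholders chosen for illustration; substitute a measured kernel time and the peak rate of the target device.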

preliminary results

preprint

Posted later.

Sample program for DGEMM

http://github.com/dadeba/dgemm_cypress/
