Changes between Initial Version and Version 1 of Fast_GEMM_implementation_On_Cypress


Timestamp:
Oct 11, 2010 8:33:13 AM (14 years ago)
Author:
nakasato
Comment:

--

  • Fast_GEMM_implementation_On_Cypress

    v1 v1  
     1= A fast GEMM implementation on a Cypress GPU = 
     2by N.Nakasato (University of Aizu), submitted September 7, 2010. 
     3 
     4We will present our results from this paper at the 1st International Workshop on  
     5Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10) 
     6held as part of SC10, New Orleans, November 13-19, 2010. 
     7 
     8== abstract == 
     9We present benchmark results of optimized dense matrix multiplication 
     10kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels 
     11for single (SP), double (DP) and double-double (DDP) precision.  
     12Our SGEMM and DGEMM kernels achieve 73% and 87% of  
     13the theoretical peak performance of the GPU, respectively. 
     14To our knowledge, our SGEMM and DGEMM kernels are currently  
     15the fastest on a single GPU chip. 
     16Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s. 
     17It is more than 200 times faster than the performance  
     18measured on a single core of a recent CPU (with mpack version 0.6.5). 
     19We describe our GEMM kernels with a main focus on the SGEMM implementation 
     20since all GEMM kernels share common programming and optimization techniques. 
     21While conventional wisdom in GPU programming recommends  
     22heavy use of shared memory on GPUs,   
     23we show that the texture cache is very effective on the Cypress architecture.  
     24 
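The double-double (DDP) format mentioned in the abstract represents one value as an unevaluated sum of two doubles. As a rough illustration of the arithmetic a DDP matrix multiply has to perform in its inner loop, here is a minimal CPU sketch in C. It uses the standard error-free transformations (Knuth's TwoSum and Dekker's split-based TwoProd); the names `dd_t`, `two_sum`, `two_prod`, `dd_add`, `dd_mul` and `dd_mac` are hypothetical, and nothing here is taken from the paper's GPU kernel or from mpack. It assumes IEEE 754 doubles and a compiler that does not reorder floating-point operations (no `-ffast-math`).

```c
/* A double-double value: the unevaluated sum hi + lo, with lo much smaller than hi. */
typedef struct { double hi, lo; } dd_t;

/* Knuth's TwoSum: returns s = fl(a + b) and the exact rounding error e,
   so that s + e == a + b exactly. */
static inline dd_t two_sum(double a, double b) {
    double s  = a + b;
    double bb = s - a;
    double e  = (a - (s - bb)) + (b - bb);
    return (dd_t){ s, e };
}

/* Dekker's TwoProd: returns p = fl(a * b) and the rounding error e,
   so that p + e == a * b (barring overflow and underflow).
   Each operand is split into two 26-bit halves with the constant 2^27 + 1. */
static inline dd_t two_prod(double a, double b) {
    const double split = 134217729.0;   /* 2^27 + 1 */
    double p  = a * b;
    double t  = split * a;
    double ah = t - (t - a), al = a - ah;
    t = split * b;
    double bh = t - (t - b), bl = b - bh;
    double e  = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
    return (dd_t){ p, e };
}

/* Double-double addition (the simple, slightly less accurate variant). */
static inline dd_t dd_add(dd_t x, dd_t y) {
    dd_t s    = two_sum(x.hi, y.hi);
    double lo = s.lo + (x.lo + y.lo);
    double hi = s.hi + lo;              /* renormalize so that |lo| stays small */
    lo = lo - (hi - s.hi);
    return (dd_t){ hi, lo };
}

/* Double-double multiplication; the lo*lo term is dropped as negligible. */
static inline dd_t dd_mul(dd_t x, dd_t y) {
    dd_t p    = two_prod(x.hi, y.hi);
    double lo = p.lo + (x.hi * y.lo + x.lo * y.hi);
    double hi = p.hi + lo;
    lo = lo - (hi - p.hi);
    return (dd_t){ hi, lo };
}

/* The multiply-accumulate step c + a*b that a DDP GEMM repeats for every term. */
static inline dd_t dd_mac(dd_t c, dd_t a, dd_t b) {
    return dd_add(c, dd_mul(a, b));
}
```

Each `dd_mac` costs a few dozen double-precision operations, which is consistent with a DDP kernel reaching far fewer Gflop/s than the DGEMM kernel on the same hardware.
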
     25== preliminary results == 
     26 * [wiki:"GEMM_Performance_Cypress"] 
     27 * [wiki:"MatrixMultiply"] 
     28 
     29== preprint == 
     30Posted later. 
     31 
     32== Sample program for DGEMM == 
     33
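
For orientation only, and not the Cypress kernel described above: the operation that a DGEMM routine computes, together with the flop count conventionally used when reporting Gflop/s, can be sketched as a naive CPU reference in C. The names `dgemm_ref` and `gemm_gflops`, the row-major layout and the absence of transpose options are assumptions made for this illustration. A reference of this kind is typically used to check the output of an optimized kernel on small matrices.

```c
#include <stddef.h>

/* Naive reference DGEMM: C = alpha * A * B + beta * C.
   A is m x k, B is k x n, C is m x n, all stored row-major without padding. */
static inline void dgemm_ref(size_t m, size_t n, size_t k,
                             double alpha, const double *A, const double *B,
                             double beta, double *C)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];   /* one multiply and one add */
            }
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}

/* Gflop/s for an m x n x k GEMM that took `seconds`, counting the
   conventional 2*m*n*k floating-point operations. */
static inline double gemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    return (2.0 * (double)m * (double)n * (double)k) / (seconds * 1e9);
}
```

Dividing the value returned by `gemm_gflops` by the theoretical peak of the device gives an efficiency figure of the kind quoted in the abstract.
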