Changes between Version 1 and Version 2 of Fastest_GEMM_implementation_On_Cypress


Timestamp:
Sep 9, 2010 1:17:51 AM
Author:
nakasato
Comment:

--

  • Fastest_GEMM_implementation_On_Cypress

    v1  v2
     1   1  = A fastest GEMM implementation on Cypress GPU =
     2      by N.Nakasato (University of Aizu)
     3      == abstrac ==
         2  by N.Nakasato (University of Aizu), submitted September 7, 2010
         3  == abstract ==
         4  We present benchmark results of optimized dense matrix multiplication
         5  kernels for Cypress GPU. We write general matrix multiply (GEMM) kernels
         6  for single (SP), double (DP) and double-double (DDP) precision.
         7  Our SGEMM and DGEMM kernels show 73% and 87% of
         8  the theoretical performance of the GPU, respectively.
         9  Currently, our SGEMM and DGEMM kernels are fastest
        10  with one GPU chip to our knowledge.
        11  Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s.
        12  It is more than 200 times faster than the performance
        13  results on single core of a recent CPU (with mpack version 0.6.5).
        14  We describe our GEMM kernels with main focus on the SGEMM implementation
        15  since all GEMM kernels share common programming and optimization techniques.
        16  While a conventional wisdom of GPU programming recommends us
        17  to heavily use shared memory on GPUs,
        18  we show that texture cache is very effective on the Cypress architecture.
        19
        20  == Results ==