Changes between Version 8 and Version 9 of Fast_GEMM_implementation_On_Cypress


Timestamp: Nov 2, 2010 12:50:19 AM
Author: nakasato
Comment: --

  • Fast_GEMM_implementation_On_Cypress

    v8  v9
     8   8  == abstract ==
     9   9  We present benchmark results of optimized dense matrix multiplication
    10      kernels for a Cypress GPU. We write general matrix multiply (GEMM) kernels
        10  kernels for Cypress GPU.
        11  We write general matrix multiply (GEMM) kernels
    11  12  for single (SP), double (DP) and double-double (DDP) precision.
    12      Our SGEMM and DGEMM kernels show 73% and 87% of
        13  Our SGEMM and DGEMM kernels show ~ 2 Tflop/s and ~ 470 Gflop/s, respectively.
        14  These results for SP and DP correspond to 73% and 87% of
    13  15  the theoretical performance of the GPU, respectively.
    14      Currently, our SGEMM and DGEMM kernels are fastest
    15      with one GPU chip to our knowledge.
        16  Currently, our SGEMM and DGEMM kernels are fastest with one GPU chip to our knowledge.
    16  17  Furthermore, the performance of our matrix multiply kernel in DDP is 31 Gflop/s.
    17  18  This performance in DDP is more than 200 times faster than the performance
    18      in DDP on single core of a recent CPU (with mpack version 0.6.5).
    19      We describe our GEMM kernels with main focus on the SGEMM implementation
    20      since all GEMM kernels share common programming and optimization techniques.
    21      While a conventional wisdom of GPU programming recommends us
    22      to heavily use shared memory on GPUs,
    23      we show that texture cache is very effective on the Cypress architecture.
        19  results in DDP on single core of a recent CPU (with mpack version 0.6.5). We describe our GEMM kernels with main focus on the SGEMM implementation
        20  since all GEMM kernels share common programming and optimization techniques. While a conventional wisdom of GPU programming recommends us
        21  to heavily use shared memory on GPUs, we show that texture cache is very effective on the Cypress architecture.
        22
    24  23
    25  24  == Sample program for DGEMM ==