Version 14 (modified by nakasato, 16 years ago) (diff)

Matrix Multiply on GPU

We have implemented single- and double-precision matrix multiply programs for RV770/Cypress. Our implementation uses two input streams: one is the input matrix A in transposed form, and the other is the input matrix B in normal (non-transposed) format. The output matrix C is also not transposed. We adopted an 8x8 block for single precision and a 4x4 block for double precision. The benchmark results for each case are summarized below. Note that only the kernel execution time is measured.
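The data layout described above can be illustrated with a small CPU-side sketch. This is not the GPU kernel itself (which is written in IL), only a reference model under stated assumptions: the "8x8 block" is taken to mean the tile of C that one thread accumulates, and the function name `matmul_blocked` and its parameters are hypothetical. It shows why storing A transposed is convenient: at each step k, the tile needs a contiguous segment of row k of A^T and a contiguous segment of row k of B, and it updates the tile with their outer product.

```python
def matmul_blocked(a_t, b, block=8):
    """Compute C = A * B given A transposed (K x M) and B (K x N).

    Hypothetical reference model: each (i0, j0) pair stands for one
    thread that owns a block x block tile of C.
    """
    k_dim = len(a_t)
    m = len(a_t[0])
    n = len(b[0])
    if len(b) != k_dim:
        raise ValueError("inner dimensions of A and B must match")
    if m % block or n % block:
        raise ValueError("this sketch assumes sizes divisible by the block size")

    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            # Accumulators for one tile (registers on the GPU, by assumption).
            acc = [[0.0] * block for _ in range(block)]
            for k in range(k_dim):
                # Both reads are contiguous row segments because A is transposed.
                a_seg = a_t[k][i0:i0 + block]
                b_seg = b[k][j0:j0 + block]
                # Rank-1 (outer product) update of the tile.
                for i in range(block):
                    for j in range(block):
                        acc[i][j] += a_seg[i] * b_seg[j]
            # Write the finished tile to C, which is stored non-transposed.
            for i in range(block):
                c[i0 + i][j0:j0 + block] = acc[i]
    return c
```

Calling `matmul_blocked(a_t, b, block=4)` models the double-precision configuration under the same assumption. In this model each step k reads 2 x block values and performs block x block multiply-adds, which is one plausible reason a larger tile pays off when registers allow it.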

Performance Summary

board	Pmax (GFlops)	Nmax	precision	GPRs	MADD peak (GFlops)
HD4850	736	3328	SP	25	1040
HD5870	2140	7424	SP	25	2720
HD4850	160	1920	DP	17	208
HD5870	431	2432	DP	17	544
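To put the table in perspective, the short sketch below computes the fraction of the MADD peak that each measurement reaches. It assumes that Pmax and the MADD peak are both in GFlops, and the helper `gflops` uses the conventional count of 2*N^3 floating-point operations for an N x N matrix multiply; that convention is an assumption here, since the page does not state how the rate was derived from the kernel time.

```python
def gflops(n, seconds):
    """Rate for an n x n matrix multiply, assuming 2*n^3 operations."""
    return 2.0 * n ** 3 / seconds / 1e9


def efficiency(pmax, peak):
    """Fraction of the theoretical MADD peak that was achieved."""
    return pmax / peak


# (board, precision, Pmax, MADD peak), copied from the table above.
RESULTS = [
    ("HD4850", "SP", 736, 1040),
    ("HD5870", "SP", 2140, 2720),
    ("HD4850", "DP", 160, 208),
    ("HD5870", "DP", 431, 544),
]

for board, prec, pmax, peak in RESULTS:
    print(f"{board} {prec}: {100 * efficiency(pmax, peak):.1f}% of peak")
```

Under these assumptions the kernels reach roughly 71% (HD4850, SP), 79% (HD5870, SP), 77% (HD4850, DP), and 79% (HD5870, DP) of the peak.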

Source code

Will be posted later.

Single precision

Double precision

Useful forum discussions

Discussion on a highly optimized MM kernel

http://forum.beyond3d.com/showthread.php?t=54842

Discussion on MM kernels in OpenCL

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963

IL code generator in C++

CAL++ http://sourceforge.net/projects/calpp/

Meta-programming works in practice.

Attachments (6)

MM1.png (12.7 KB) - added by nakasato 16 years ago.
DMM.png (5.0 KB) - added by nakasato 16 years ago.
SMM.png (5.3 KB) - added by nakasato 16 years ago.
DDMM.png (5.2 KB) - added by nakasato 16 years ago.
dis.txt (19.7 KB) - added by nakasato 16 years ago.
kernel_single.il (2.8 KB) - added by nakasato 16 years ago.