Version 25 (modified by nakasato, 16 years ago)

Matrix Multiply on GPU

We have implemented single- and double-precision matrix multiply programs for RV770 and Cypress GPUs. Our implementation uses two input streams to compute C = AB. One is the input matrix A stored transposed (i.e. column major), and the other is the input matrix B in normal format (i.e. row major). The output matrix C is also row major. We adopted an 8x8 block for single precision and a 4x4 block for double precision. Benchmark results for each case are given below. Note that only the kernel execution time is measured.
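The data layout described above can be illustrated with a small CPU-side sketch in Python. This is not the GPU kernel itself. It assumes that "block" means a tile of C accumulated per thread, which this page does not spell out, and the names `matmul_at_b`, `at`, and `acc` are made up for illustration. The point it shows is that, with A stored transposed, the elements of A and B needed for step k of a tile are both contiguous slices of row k of their streams.

```python
def matmul_at_b(at, b, n, block=8):
    """Compute C = A*B for n x n matrices stored as lists of lists.

    at : A transposed, so at[k][i] == A[i][k] (A in column-major order)
    b  : B in row-major order, so b[k][j] == B[k][j]
    Returns C in row-major order.
    """
    assert n % block == 0, "sketch assumes n is a multiple of the block size"
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            # Accumulator tile; on a GPU this would live in registers.
            acc = [[0.0] * block for _ in range(block)]
            for k in range(n):
                # Both operands are contiguous slices of row k.
                a_col = at[k][i0:i0 + block]
                b_row = b[k][j0:j0 + block]
                # Rank-1 update of the tile: acc += a_col (outer) b_row.
                for i in range(block):
                    a_ik = a_col[i]
                    row = acc[i]
                    for j in range(block):
                        row[j] += a_ik * b_row[j]
            # Write the finished tile back to C.
            for i in range(block):
                c[i0 + i][j0:j0 + block] = acc[i]
    return c
```

Calling `matmul_at_b(at, b, n, block=4)` corresponds to the 4x4 blocking used for double precision. A larger tile reuses each loaded element more often, at the cost of more accumulator registers.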

We have also added double-double (DD) precision results, using a 2x2 block in this case. On Cypress architecture GPUs, we take advantage of the FMA_64 instruction.
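For readers unfamiliar with double-double arithmetic, the following is a minimal Python sketch of the standard error-free transformations it is built on. It is not the kernel used here, and all function names are illustrative. A double-double value is a pair (hi, lo) of doubles whose sum represents the number. The sketch computes the rounding error of a product with Dekker's splitting, which needs no FMA. With a hardware fused multiply-add, the same error term can be obtained in one step as fma(a, b, -p), which is presumably where an instruction such as FMA_64 helps.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b == s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def quick_two_sum(a, b):
    """Like two_sum, but requires |a| >= |b|."""
    s = a + b
    return s, b - (s - a)

def split(a):
    """Split a double into two halves with at most 26 significant bits each."""
    t = 134217729.0 * a  # 2**27 + 1
    hi = t - (t - a)
    return hi, a - hi

def two_prod(a, b):
    """Return (p, err) with p = fl(a * b) and a * b == p + err exactly.

    With an FMA instruction, err would simply be fma(a, b, -p).
    """
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    err = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, err

def dd_add(x, y):
    """Add two double-double numbers given as (hi, lo) pairs (simple variant)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return quick_two_sum(s, e)

def dd_mul(x, y):
    """Multiply two double-double numbers given as (hi, lo) pairs."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    return quick_two_sum(p, e)

def dd_mad(acc, x, y):
    """Multiply-add acc + x*y, the inner operation of a DD matrix multiply."""
    return dd_add(acc, dd_mul(x, y))
```

Each double-double multiply-add costs many ordinary double-precision operations, which is consistent with the DD figures being far below the DP figures.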

Performance Summary

board	Pmax (GFLOPS)	Nmax	precision	register usage	MAD peak (GFLOPS)
HD4850	736	3328	SP	25	1040
HD5870	2140	7424	SP	25	2720
HD4850	177	1408	DP	19	208
HD5870	475	2048	DP	19	544
HD4850	7.5	768	DD	21
HD5870	20	1024	DD	21
HD5870 FMA	31	1024	DD	18
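As a quick reading aid, the rows that list a MAD peak can be turned into a fraction of peak. The sketch below assumes that Pmax and MAD peak are in the same units (the ratio is unit-free in any case). The names `results` and `efficiency_percent` are illustrative only. The DD rows are left out because the table gives no peak for them.

```python
# (board, precision, Pmax, MAD peak), copied from the table above.
results = [
    ("HD4850", "SP", 736.0, 1040.0),
    ("HD5870", "SP", 2140.0, 2720.0),
    ("HD4850", "DP", 177.0, 208.0),
    ("HD5870", "DP", 475.0, 544.0),
]

def efficiency_percent(pmax, peak):
    """Measured performance as a percentage of the MAD peak."""
    return 100.0 * pmax / peak

for board, prec, pmax, peak in results:
    print(f"{board} {prec}: {efficiency_percent(pmax, peak):.1f}% of MAD peak")
```

This gives roughly 70.8% and 78.7% for single precision on HD4850 and HD5870, and 85.1% and 87.3% for double precision.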