Changes between Version 26 and Version 27 of MatrixMultiply

v26	v27
2	2	We have implemented a single/double precision matrix multiply program for RV770/Cypress. Our implementation uses two input streams to compute C=AB: one is the transposed input matrix A (i.e. column major) and the other is the input matrix B in normal format (i.e. row major). The output matrix C is also row major. We adopted an 8x8 block for single precision and a 4x4 block for double precision. Here are the benchmark results for each case. Note that only kernel execution time is measured.
3	3
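As a rough CPU-side illustration of the data layout described above (not the GPU kernel itself), the following C sketch computes C=AB from a transposed A and a row-major B using an 8x8 accumulator block. The function name `matmul_at_b`, the `BLK` constant, and the restriction to square matrices whose size is a multiple of the block edge are assumptions made for this example only.

```c
#include <stddef.h>

#define BLK 8  /* block edge; 8x8 is the single precision block size named in the text */

/* Computes C = A * B for n x n matrices, where n is a multiple of BLK.
 * `at` holds A transposed, so at[k*n + i] == A[i][k] (A in column major).
 * `b` and `c` are row major. All names here are hypothetical. */
void matmul_at_b(size_t n, const float *at, const float *b, float *c)
{
    for (size_t i0 = 0; i0 < n; i0 += BLK) {
        for (size_t j0 = 0; j0 < n; j0 += BLK) {
            float acc[BLK][BLK] = {{0.0f}};  /* one output block kept in local storage */

            for (size_t k = 0; k < n; ++k) {
                /* Because A is transposed, both slices are contiguous in memory. */
                const float *a_col = &at[k * n + i0];  /* A[i0 .. i0+BLK-1][k] */
                const float *b_row = &b[k * n + j0];   /* B[k][j0 .. j0+BLK-1] */

                /* Rank-1 update of the block: acc += a_col * b_row^T */
                for (size_t i = 0; i < BLK; ++i)
                    for (size_t j = 0; j < BLK; ++j)
                        acc[i][j] += a_col[i] * b_row[j];
            }

            /* Write the finished block to the row-major output. */
            for (size_t i = 0; i < BLK; ++i)
                for (size_t j = 0; j < BLK; ++j)
                    c[(i0 + i) * n + (j0 + j)] = acc[i][j];
        }
    }
}
```

With A transposed, each step of the inner product loop reads one contiguous slice from each input, which is presumably why the two input streams are laid out this way.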
4		We also add double-double performance, using a 2x2 block in this case. On Cypress architecture GPUs, we take advantage of the FMA_64 instruction.
	4	We also add double-double precision performance, using a 2x2 block in this case. On Cypress architecture GPUs, we take advantage of the FMA_64 instruction.
5	5
6	6	== Performance Summary ==