Changes between Version 28 and Version 29 of MatrixMultiply

v28	v29
2	2	We have implemented a single/double precision matrix multiply program for RV770/Cypress. In our implementation, we use two input streams for computing C=AB. One is the transposed input matrix A (i.e. column major) and the other is the input matrix B in normal format (i.e. row major). The output matrix C is also row major. We adopted an 8x8 block for single precision and a 4x4 block for double precision. Here are the benchmark results for each case. Note that only the kernel execution time is measured.
3	3
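The data layout described above can be sketched on the CPU as follows. This is only an illustration in plain C, not our GPU kernel: the function name `matmul_at_b`, the macro `BS`, and the assumption that the matrix size `n` is a multiple of the block size are all hypothetical. It shows why storing A transposed helps: for a fixed k, both input streams are read as contiguous runs, and each 8x8 block of C is accumulated from outer products.

```c
#include <stddef.h>

#define BS 8  /* block size; 8x8 as in the single precision case */

/* Computes C = A * B for n x n matrices (n assumed to be a multiple of BS).
 * At holds A transposed: At[k * n + i] == A[i][k].
 * B and C are row major. All names here are illustrative. */
void matmul_at_b(const float *At, const float *B, float *C, size_t n)
{
    for (size_t i0 = 0; i0 < n; i0 += BS) {
        for (size_t j0 = 0; j0 < n; j0 += BS) {
            float acc[BS][BS] = {{0.0f}};  /* one BS x BS block of C */

            for (size_t k = 0; k < n; ++k) {
                /* BS contiguous elements of column k of A and row k of B */
                const float *a = &At[k * n + i0];
                const float *b = &B[k * n + j0];

                /* rank-1 update of the block */
                for (size_t i = 0; i < BS; ++i)
                    for (size_t j = 0; j < BS; ++j)
                        acc[i][j] += a[i] * b[j];
            }

            /* write the finished block back in row major order */
            for (size_t i = 0; i < BS; ++i)
                for (size_t j = 0; j < BS; ++j)
                    C[(i0 + i) * n + (j0 + j)] = acc[i][j];
        }
    }
}
```

Each step of the k loop loads 2 x BS values and performs BS x BS multiply-adds, which is the reason a larger block raises the ratio of arithmetic to memory traffic.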
4		Add double-double (DD) precision performance. We used 2x2 block in this case. On Cypress architecture GPU, we take advantage of FMA_64 instruction. For MAD peak in DD, we assume one DD operation takes 20 DP operations(ops) without FMA and 15 ops with FMA. Precicely, DD add ~~takes ~ 20 ops and DD mul wihtou FMA takes ~ 20. DD mul with FMA~~ takes ~ 8 ops.
	4	Added double-double (DD) precision performance. We used a 2x2 block in this case. On Cypress architecture GPUs, we take advantage of the FMA_64 instruction. For the MAD peak in DD, we assume one DD operation takes 20 DP operations (ops) without FMA and 15 ops with FMA. More precisely, DD add and DD mul without FMA each take ~20 ops, while DD mul with FMA takes only ~8 ops.
5	5
6	6	== Performance Summary ==