Changes between Version 24 and Version 25 of MatrixMultiply

-                      v24
+                      v25
 = Matrix Multiply on GPU =
 We have implemented a single/double precision matrix multiply program for RV770/Cypress. Our implementation uses two input streams to compute C=AB: one is the input matrix A in transposed form (i.e. column major), and the other is the input matrix B in normal form (i.e. row major). The output matrix C is also row major. We adopted an 8x8 block for single precision and a 4x4 block for double precision. Benchmark results for each case are shown below. Note that only kernel execution time is measured.
+We have added double-double performance results, using a 2x2 block in this case. On Cypress architecture GPUs, we take advantage of the FMA_64 instruction.
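To illustrate the data layout described above, here is a minimal CPU reference sketch in C++. It is not the GPU kernel itself. The function name `matmul_at_b`, the constant `BLK`, and the assumption that M and N are multiples of the block size are all hypothetical choices made for this example. The point it tries to show is that with A stored transposed and B stored row major, both operands of each rank-1 update are read from contiguous memory.

```cpp
#include <cstddef>

// Hypothetical block size: 8x8 output block, as in the single-precision case.
constexpr std::size_t BLK = 8;

// CPU reference sketch: C (M x N, row major) = A * B.
// At holds A transposed (K x M, row major), i.e. At[k*M + i] == A(i, k).
// B is K x N, row major.
// Assumption for this sketch: M and N are multiples of BLK.
void matmul_at_b(const float* At, const float* B, float* C,
                 std::size_t M, std::size_t N, std::size_t K)
{
    for (std::size_t i0 = 0; i0 < M; i0 += BLK) {
        for (std::size_t j0 = 0; j0 < N; j0 += BLK) {
            float acc[BLK][BLK] = {};  // one output block, zero-initialised

            for (std::size_t k = 0; k < K; ++k) {
                // Both slices are contiguous thanks to the chosen layouts.
                const float* a = &At[k * M + i0];  // BLK elements of column k of A
                const float* b = &B[k * N + j0];   // BLK elements of row k of B

                // Rank-1 update of the block: acc += a * b^T
                for (std::size_t i = 0; i < BLK; ++i)
                    for (std::size_t j = 0; j < BLK; ++j)
                        acc[i][j] += a[i] * b[j];
            }

            // Write the finished block to the row-major output.
            for (std::size_t i = 0; i < BLK; ++i)
                for (std::size_t j = 0; j < BLK; ++j)
                    C[(i0 + i) * N + (j0 + j)] = acc[i][j];
        }
    }
}
```

Calling `matmul_at_b(At, B, C, M, N, K)` with `At` of size K*M, `B` of size K*N and `C` of size M*N fills `C` with the product. Changing `BLK` to 4 or 2 and the element type to `double` would mirror the block sizes quoted for the double and double-double cases.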
 == Performance Summary ==
 …
 || HD4850 || 177    || 1408 || DP || 19 || 208 ||
 || HD5870 || 475    || 2048 || DP || 19 || 544 ||
+|| HD4850 || 7.5    || 768  || DD || 21 ||  ||
+|| HD5870 || 20     || 1024 || DD || 21 ||  ||
+|| HD5870 FMA || 31     || 1024 || DD || 18 ||   ||
 Pmax & MAD in GFLOPS
 …
 == Double precision ==
 [[Image(DMM.png)]]
+== Double-Double ==
+[[Image(DDMM.png)]]
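For readers unfamiliar with double-double arithmetic, the following C++ sketch shows one common way a fused multiply-add helps: it yields the exact rounding error of a product in a single extra operation. This is a generic textbook formulation (TwoSum and an FMA-based TwoProd), not the actual GPU kernel, and it is an assumption that FMA_64 is used for this purpose in the implementation. The type `dd_t` and all function names are hypothetical.

```cpp
#include <cmath>

// A double-double value: the unevaluated sum hi + lo, with |lo| much smaller than |hi|.
struct dd_t { double hi, lo; };

// Knuth's TwoSum: returns (s, e) such that s + e == a + b exactly.
inline dd_t two_sum(double a, double b)
{
    double s  = a + b;
    double bb = s - a;
    double e  = (a - (s - bb)) + (b - bb);
    return dd_t{ s, e };
}

// TwoProd using FMA: returns (p, e) such that p + e == a * b exactly
// (barring overflow/underflow). fma computes a*b - p with a single rounding.
inline dd_t two_prod(double a, double b)
{
    double p = a * b;
    double e = std::fma(a, b, -p);
    return dd_t{ p, e };
}

// Double-double product (the common "sloppy" variant that drops a.lo * b.lo).
inline dd_t dd_mul(dd_t a, dd_t b)
{
    dd_t p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;
    // Renormalise so that hi carries the leading bits.
    double s = p.hi + p.lo;
    double e = p.lo - (s - p.hi);
    return dd_t{ s, e };
}

// Double-double sum (again a simplified, "sloppy" variant).
inline dd_t dd_add(dd_t a, dd_t b)
{
    dd_t s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;
    double hi = s.hi + s.lo;
    double lo = s.lo - (hi - s.hi);
    return dd_t{ hi, lo };
}
```

An inner-product step of a double-double matrix multiply would then look like `acc = dd_add(acc, dd_mul(a, b))`. Without an FMA, the error term in `two_prod` has to be recovered by splitting each operand into high and low halves, which costs considerably more floating-point operations.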
 = Useful forum discussions =