= Matrix Multiply on GPU =
We have implemented single- and double-precision matrix multiply programs for RV770/Cypress. The implementation uses two input streams: one is the input matrix A, stored transposed, and the other is the input matrix B in normal (non-transposed) layout. The output matrix C is also not transposed. We adopted an 8x8 block for single precision and a 4x4 block for double precision. The benchmark results for each case are given below. Note that only kernel execution time is measured.
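
The actual kernel source is not posted here, so the following is only a minimal CPU-side sketch of the data layout described above, not the GPU kernel itself. It assumes that "block" means a tile of C whose partial sums are accumulated together (8x8 for single precision, 4x4 for double precision), and that A is stored transposed so that, for a fixed k, the needed elements of both A and B are contiguous. The function name `matmul_at_b_blocked` and its parameters are hypothetical.

```python
def matmul_at_b_blocked(a_t, b, block=8):
    """Return C = A * B, where a_t holds A transposed (a_t[k][i] == A[i][k]).

    b is stored in normal layout (b[k][j] == B[k][j]); C is not transposed.
    'block' is the edge length of the C tile accumulated at once
    (assumption: 8 for single precision, 4 for double precision).
    """
    k_dim = len(a_t)
    n_rows = len(a_t[0])
    n_cols = len(b[0])
    if len(b) != k_dim:
        raise ValueError("inner dimensions of A and B do not match")

    c = [[0.0] * n_cols for _ in range(n_rows)]
    for i0 in range(0, n_rows, block):
        i1 = min(i0 + block, n_rows)
        for j0 in range(0, n_cols, block):
            j1 = min(j0 + block, n_cols)
            # Accumulators for one tile of C (the analogue of registers).
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k in range(k_dim):
                # With A transposed, both slices are contiguous rows.
                a_slice = a_t[k][i0:i1]
                b_slice = b[k][j0:j1]
                # Rank-1 update of the tile: acc += outer(a_slice, b_slice).
                for i, a_val in enumerate(a_slice):
                    acc_row = acc[i]
                    for j, b_val in enumerate(b_slice):
                        acc_row[j] += a_val * b_val
            # Write the finished tile back to C.
            for i in range(i1 - i0):
                c[i0 + i][j0:j1] = acc[i]
    return c
```

For example, with A = [[1, 2], [3, 4], [5, 6]] the caller passes `a_t = [[1, 3, 5], [2, 4, 6]]` together with B unchanged. Each tile performs `block * block` multiply-adds per `2 * block` loaded elements, which is one likely motivation for using a larger block where registers allow it.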

Performance summary (Pmax and MADD peak are in GFLOPS)
|| board  || Pmax || Nmax || prec  || GPR || MADD peak ||
|| HD4850 || 736    || 3328 || SP || 25 || 1040 ||
|| HD5870 || 2140   || 7424 || SP || 25 || 2720 ||
|| HD4850 || 160    || 1920 || DP || 17 || 208 ||
|| HD5870 || 431    || 2432 || DP || 17 || 544 ||
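
To make the table easier to interpret, the short script below computes the fraction of the MADD peak reached in each case. The numbers are copied from the table above. The kernel-time estimate additionally assumes that an N x N multiply costs 2*N^3 floating-point operations and that Pmax was measured at N = Nmax; neither assumption is stated in the table, so treat those times as rough estimates. The names `RESULTS`, `efficiency`, and `kernel_time_seconds` are illustrative only.

```python
# (board, precision, Pmax [GFLOPS], Nmax, MADD peak [GFLOPS]) from the table.
RESULTS = [
    ("HD4850", "SP", 736, 3328, 1040),
    ("HD5870", "SP", 2140, 7424, 2720),
    ("HD4850", "DP", 160, 1920, 208),
    ("HD5870", "DP", 431, 2432, 544),
]


def efficiency(pmax, peak):
    """Fraction of the MADD peak actually achieved."""
    return pmax / peak


def kernel_time_seconds(n, gflops):
    """Estimated kernel time, assuming 2*n^3 flops at the given rate."""
    return 2.0 * n ** 3 / (gflops * 1e9)


if __name__ == "__main__":
    for board, prec, pmax, nmax, peak in RESULTS:
        eff = 100.0 * efficiency(pmax, peak)
        t = kernel_time_seconds(nmax, pmax)
        print(f"{board} {prec}: {eff:.1f}% of peak, ~{t:.3f} s at N={nmax}")
```

By this measure all four cases reach roughly 70 to 80 percent of the MADD peak.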

== Source code ==
Will be posted later.

== Single precision ==
[[Image(SMM.png)]]

== Double precision ==
[[Image(DMM.png)]]

= Useful forum discussions =
== Discussion on a highly optimized MM kernel ==
http://forum.beyond3d.com/showthread.php?t=54842

== Discussion on MM kernels in OpenCL ==
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963

=== IL code generator in C++ ===
CAL++ http://sourceforge.net/projects/calpp/

Meta-programming works in practice.


[[Image(MM1.png)]]