= Matrix Multiply on GPU =
We have implemented single- and double-precision matrix multiply programs for RV770/Cypress. The implementation uses two input streams: one is the input matrix A, stored transposed, and the other is the input matrix B in normal (non-transposed) format. The output matrix C is also not transposed. We adopted 8x8 blocks for single precision and 4x4 blocks for double precision. The benchmark results for each case are shown below. Note that only kernel execution time is measured.
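
The data layout described above can be illustrated with a small CPU-side reference. This is only a sketch in plain Python, not the GPU kernel itself: the function name `matmul_at_b` and its `block` parameter are hypothetical, and the assumption that each block of C is accumulated as a sum of outer products is ours. It shows one reason a transposed A can be convenient: for each k, the slice of A and the slice of B needed by a block are both contiguous parts of a row.

```python
def matmul_at_b(at, b, block=8):
    """Compute C = A * B given A transposed (K x M) and B (K x N).

    C is produced one block x block tile at a time, mimicking a kernel
    where each thread owns one tile (8x8 for SP, 4x4 for DP in the text).
    """
    k_dim = len(at)
    m_dim = len(at[0])
    n_dim = len(b[0])
    if len(b) != k_dim:
        raise ValueError("inner dimensions of A^T and B do not match")
    if m_dim % block or n_dim % block:
        raise ValueError("matrix sizes must be multiples of the block size")

    c = [[0.0] * n_dim for _ in range(m_dim)]
    for i0 in range(0, m_dim, block):
        for j0 in range(0, n_dim, block):
            # Tile accumulators; on a GPU these would live in registers.
            acc = [[0.0] * block for _ in range(block)]
            for k in range(k_dim):
                # Both slices are contiguous within row k of their input.
                a_slice = at[k][i0:i0 + block]
                b_slice = b[k][j0:j0 + block]
                # Rank-1 (outer product) update of the tile.
                for i in range(block):
                    for j in range(block):
                        acc[i][j] += a_slice[i] * b_slice[j]
            # Write the finished tile to C in normal (non-transposed) layout.
            for i in range(block):
                c[i0 + i][j0:j0 + block] = acc[i]
    return c
```

Calling `matmul_at_b(at, b, block=4)` would correspond to the 4x4 blocking used for double precision.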

== Performance Summary ==
|| board  || Pmax || Nmax || prec  || GPR || MADD peak ||
|| HD4850 || 736    || 3328 || SP || 25 || 1040 ||
|| HD5870 || 2140   || 7424 || SP || 25 || 2720 ||
|| HD4850 || 160    || 1920 || DP || 17 || 208 ||
|| HD5870 || 431    || 2432 || DP || 17 || 544 ||
Pmax and MADD peak are given in GFLOPS.
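
To put the table in perspective, the fraction of MADD peak reached in each case can be computed directly from the numbers above. The sketch below is plain Python; the names `RESULTS`, `efficiency`, and `gflops` are hypothetical. The `gflops` helper assumes the usual convention of counting 2*N^3 floating-point operations for an N x N matrix multiply, which the text does not state explicitly.

```python
# (board, precision, Pmax in GFLOPS, MADD peak in GFLOPS), copied from the table.
RESULTS = [
    ("HD4850", "SP", 736, 1040),
    ("HD5870", "SP", 2140, 2720),
    ("HD4850", "DP", 160, 208),
    ("HD5870", "DP", 431, 544),
]


def efficiency(pmax, peak):
    """Fraction of the MADD peak that the measured Pmax represents."""
    return pmax / peak


def gflops(n, seconds):
    """GFLOPS for an n x n multiply, assuming 2*n^3 operations (our assumption)."""
    return 2.0 * n ** 3 / seconds / 1e9


for board, prec, pmax, peak in RESULTS:
    print(f"{board} {prec}: {100 * efficiency(pmax, peak):.1f}% of MADD peak")
```

With these figures, all four cases land between roughly 70% and 80% of the MADD peak.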

== Source code ==
Will be posted later.

== Single precision ==
[[Image(SMM.png)]]

== Double precision ==
[[Image(DMM.png)]]

= Useful forum discussions =
== Discussion on a highly optimized MM kernel ==
http://forum.beyond3d.com/showthread.php?t=54842

== Discussion on MM kernels in OpenCL ==
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963

=== IL code generator in C++ ===
CAL++ http://sourceforge.net/projects/calpp/

Meta-programming works in practice.


[[Image(MM1.png)]]