= Matrix Multiply on GPU =
We have implemented single- and double-precision matrix multiply programs for the RV770 and Cypress GPUs. Our implementation uses two input streams: one is the transposed input matrix A, and the other is the input matrix B in normal (non-transposed) layout. The output matrix C is also not transposed. We adopted 8x8 blocks for single precision and 4x4 blocks for double precision. The benchmark results for each case are shown in the sections that follow. Note that only kernel execution time is measured.
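As a rough illustration of the data layout described above, here is a minimal CPU reference sketch in C++. It is not our GPU kernel: the function name `matmul_transposed_a`, the row-major storage, and the requirement that M and N be multiples of the block size are all assumptions made for this example. It shows why A is passed transposed: for a fixed k, the slice of A and the slice of B needed for one output block are both contiguous in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the GPU kernel benchmarked on this page.
// At holds A transposed: At[k * M + i] == A(i, k).
// B is K x N and C is M x N, both row-major.
// BS is the output block size (8 for single, 4 for double on this page).
template <typename T, int BS>
void matmul_transposed_a(const std::vector<T>& At, const std::vector<T>& B,
                         std::vector<T>& C, int M, int N, int K) {
    // Assumption for this sketch: dimensions are multiples of the block size.
    assert(M % BS == 0 && N % BS == 0);
    assert(At.size() == static_cast<std::size_t>(K) * M);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    assert(C.size() == static_cast<std::size_t>(M) * N);

    for (int i0 = 0; i0 < M; i0 += BS) {
        for (int j0 = 0; j0 < N; j0 += BS) {
            // One BS x BS block of C is accumulated locally,
            // much as a GPU thread would keep it in registers.
            T acc[BS][BS] = {};
            for (int k = 0; k < K; ++k) {
                // Both slices are contiguous because A is stored transposed.
                const T* a = At.data() + static_cast<std::size_t>(k) * M + i0;
                const T* b = B.data() + static_cast<std::size_t>(k) * N + j0;
                for (int i = 0; i < BS; ++i)
                    for (int j = 0; j < BS; ++j)
                        acc[i][j] += a[i] * b[j];  // rank-1 update of the block
            }
            // Write the finished block to C in normal (non-transposed) layout.
            for (int i = 0; i < BS; ++i)
                for (int j = 0; j < BS; ++j)
                    C[static_cast<std::size_t>(i0 + i) * N + (j0 + j)] = acc[i][j];
        }
    }
}
```

Calling `matmul_transposed_a<float, 8>(At, B, C, M, N, K)` mirrors the single-precision block size, and `matmul_transposed_a<double, 4>(...)` the double-precision one.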

== Source code ==
The source code will be posted later.

== Single precision ==
[[Image(SMM.png)]]

== Double precision ==
[[Image(DMM.png)]]

= Useful forum discussions =
== Discussion on a highly optimized MM kernel ==
http://forum.beyond3d.com/showthread.php?t=54842

== Discussion on MM kernels in OpenCL ==
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963

=== IL code generator in C++ ===
CAL++ http://sourceforge.net/projects/calpp/

Metaprogramming works in practice.


[[Image(MM1.png)]]