= Matrix Multiply on GPU =

We have implemented single- and double-precision matrix multiply programs for RV770/Cypress. Our implementation uses two input streams to compute C=AB: one is the input matrix A stored transposed (i.e. column major), and the other is the input matrix B in normal format (i.e. row major). The output matrix C is also row major. We adopted 8x8 blocks for single precision and 4x4 blocks for double precision.

Here are the benchmark results for each case. Note that only the kernel execution time is measured.

== Performance Summary ==

|| board || Pmax || Nmax || prec || reg. usage || MADD peak ||
|| HD4850 || 736 || 3328 || SP || 25 || 1040 ||
|| HD5870 || 2140 || 7424 || SP || 25 || 2720 ||
|| HD4850 || 177 || 1408 || DP || 17 || 208 ||
|| HD5870 || 475 || 2048 || DP || 17 || 544 ||

Pmax and MADD peak are in GFLOPS.

== Source code ==

Will be posted later.

== Single precision ==

[[Image(SMM.png)]]

== Double precision ==

[[Image(DMM.png)]]

= Useful forum discussions =

== Discussion on a highly optimized MM kernel ==

http://forum.beyond3d.com/showthread.php?t=54842

== Discussion on MM kernels in OpenCL ==

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963

=== IL code generator in C++ ===

CAL++ http://sourceforge.net/projects/calpp/

Metaprogramming works in reality. Great work!

[[Image(MM1.png)]]
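
= Appendix: CPU sketch of the input layout =

The matrix multiply section at the top of this page describes the layout only in words: A is supplied transposed, B and C are row major, and the work is split into 8x8 (SP) or 4x4 (DP) blocks. The following is a minimal CPU-side sketch of that layout, not the GPU kernel itself. It assumes the block is a tile of C that is accumulated by one rank-1 update per step k, which is one common way to use a transposed A; the page does not confirm that the real kernel works this way. The function name `matmul_transposed_a` and its parameters are hypothetical.

```python
def matmul_transposed_a(a_t, b, n, block=8):
    """Compute C = A*B for n x n matrices stored as flat lists.

    a_t holds A transposed in row-major order: a_t[k*n + i] == A[i][k].
    b and the returned c are row major:        b[k*n + j]   == B[k][j].
    `block` is the tile edge (8 for SP, 4 for DP on this page).
    """
    if n % block != 0:
        raise ValueError("n must be a multiple of the block size")
    c = [0.0] * (n * n)
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            # Accumulators for one block x block tile of C
            # (assumption: these would live in registers on the GPU).
            acc = [[0.0] * block for _ in range(block)]
            for k in range(n):
                # With A transposed, both strips for step k are contiguous.
                a_strip = a_t[k * n + i0 : k * n + i0 + block]
                b_strip = b[k * n + j0 : k * n + j0 + block]
                # Rank-1 update of the tile: block*block multiply-adds.
                for i in range(block):
                    for j in range(block):
                        acc[i][j] += a_strip[i] * b_strip[j]
            # Write the finished tile to the row-major output.
            for i in range(block):
                for j in range(block):
                    c[(i0 + i) * n + (j0 + j)] = acc[i][j]
    return c
```

In this sketch each step k reads 2*block input values and performs block*block multiply-adds, so a larger tile does more arithmetic per value loaded, at the cost of more accumulators.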