= Performance comparison of GPU boards from AMD as of Oct 2009 =
We have tested GPU boards from AMD with our test program, which implements a simple N^2^ force evaluation algorithm.
The nominal performance of each board is shown in the following table.

|| board  || arch  || core clock || memory || No. of SCs || SP mul-add || DP add || DP mul || memory BW ||
|| HD4850 || RV770 || 625 MHz || DDR3 662 MHz 256bit || 800  || 1040 GFLOPS || 208 GFLOPS || 104 GFLOPS || 63.6 GB/sec ||
|| HD4870 || RV770 || 750 MHz || DDR5 900 MHz 256bit || 800  || 1200 GFLOPS || 240 GFLOPS || 120 GFLOPS || 115.2 GB/sec ||
|| HD4770 || RV740 || 750 MHz || DDR5 800 MHz 128bit || 640  || 960  GFLOPS || 192 GFLOPS || 96  GFLOPS || 51.2 GB/sec ||
|| HD5870 || RV870 || 850 MHz || DDR5 1.2 GHz 256bit || 1600 || 2720 GFLOPS || 544 GFLOPS || 272 GFLOPS || 153.6 GB/sec ||
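
The nominal figures can be reproduced from the clock and the number of SCs. The following is a minimal sketch, assuming that one SP mul-add counts as 2 flops per SC per cycle, that the DP add and DP mul rates are 1/5 and 1/10 of the SP rate (ratios read off the table), and that DDR5 memory transfers 4 bits per pin per memory clock. The function names are hypothetical. The HD4850 row is left out: its memory is DDR3, and the same formula with 625 MHz gives 1000 GFLOPS, whereas the tabulated 1040 GFLOPS would correspond to 650 MHz.

```python
# Sketch only: formulas are assumptions inferred from the table, not vendor data.

def peak_gflops(n_sc, clock_mhz):
    """Return nominal (SP mul-add, DP add, DP mul) rates in GFLOPS."""
    sp = 2 * n_sc * clock_mhz / 1000.0   # mul-add = 2 flops per SC per cycle
    return sp, sp / 5.0, sp / 10.0       # DP ratios as they appear in the table


def ddr5_bandwidth_gbs(mem_clock_mhz, bus_bits):
    """Nominal bandwidth in GB/sec, assuming 4 transfers per memory clock."""
    return 4 * mem_clock_mhz * (bus_bits / 8) / 1000.0


# (No. of SCs, core clock MHz, memory clock MHz, bus width in bits) from the table
boards = {
    "HD4870": (800, 750, 900, 256),
    "HD4770": (640, 750, 800, 128),
    "HD5870": (1600, 850, 1200, 256),
}

for name, (n_sc, clock, mem_clock, bus) in boards.items():
    sp, dp_add, dp_mul = peak_gflops(n_sc, clock)
    bw = ddr5_bandwidth_gbs(mem_clock, bus)
    print(f"{name}: SP {sp:.0f}, DP add {dp_add:.0f}, DP mul {dp_mul:.0f} GFLOPS, "
          f"BW {bw:.1f} GB/sec")
```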

The program is basically the same as our demo program posted [http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770#DemoProgram here]. The demo program should work with the 5870, but I have not tested it.
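
For readers unfamiliar with the algorithm, an N^2^ force evaluation can be sketched in plain Python as below. This is only an illustration of direct summation, not the IL kernel used for the benchmark. The function name, the choice G = 1, and the softening length `eps` are assumptions made for this sketch.

```python
import math


def accelerations(pos, mass, eps=0.01):
    """Direct-summation gravitational acceleration on every particle (G = 1).

    pos  : list of (x, y, z) tuples
    mass : list of masses, same length as pos
    eps  : softening length (assumed, must be > 0 so the i == j term is finite)
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = pos[i]
        for j in range(n):  # every particle interacts with all N sources
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            m_over_r3 = mass[j] / (r2 * math.sqrt(r2))
            # The i == j term adds exactly zero because dx = dy = dz = 0.
            acc[i][0] += m_over_r3 * dx
            acc[i][1] += m_over_r3 * dy
            acc[i][2] += m_over_r3 * dz
    return acc
```

The double loop makes the cost scale as N^2^, which is why the arithmetic rate rather than memory traffic dominates at large N.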

== Result == 
[[Image(GFLOPS.png)]] 

== Note ==
 * We count one force interaction as 38 floating-point operations. This is the conventional flop count in this field.
 * At large N, the 5870 shows 2.2x higher performance than the 4870. It reaches ~ 2.2 Tflops.
 * The 4850 and 4770 show identical performance, which indicates that memory bandwidth is not critical at all in this test program.
 * I have added a new result, "5870_opt", obtained with a newly optimized kernel. It boosts the performance by ~ 18 %.
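
The 38-operation convention turns a measured time per force evaluation into a flop rate. A minimal sketch follows, assuming that all N^2^ pairs are counted (some codes count N(N-1) instead); the function name and the timing in the usage line are hypothetical, not measured values.

```python
FLOPS_PER_INTERACTION = 38  # convention stated in the notes above


def nbody_gflops(n, seconds_per_evaluation):
    """GFLOPS of one full N^2 force evaluation under the 38-flop convention."""
    interactions = n * n  # assumption: every (i, j) pair is counted
    return FLOPS_PER_INTERACTION * interactions / seconds_per_evaluation / 1e9


# Hypothetical timing, for illustration only:
print(f"{nbody_gflops(n=100000, seconds_per_evaluation=0.5):.1f} GFLOPS")
```

Because the count is fixed per interaction, the reported rate depends only on N and the wall-clock time, not on how many hardware operations the kernel actually issues.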

== Analysis of VLIW instructions ==
Our kernel is written directly in IL. After compiling it into machine code, I analyzed the generated VLIW instructions. The results were obtained with fglrx 8.65.4 (Catalyst 9.9) and CAL 1.4beta.

Both the RV770 and RV870 architectures have 5-way VLIW units at their heart. Depending on the details of the computation, the device driver (or its internal compiler?) presumably tries to fill the 5 slots as fully as possible. In the tables below, the first column shows the number of occupied slots, the second column shows the number of VLIW instructions in the kernel with that many occupied slots, and the third column gives the corresponding fraction of all VLIW instructions. The last row gives the total number of VLIW instructions in the kernel.

=== 5870_opt kernel ===
|| #slots || #inst || fraction ||
|| 1 || 1 || 2 % ||
|| 2 || 2 || 3 % ||
|| 3 || 4 || 6 % ||
|| 4 || 16 || 24 % ||
|| 5 || 43 || 65 % ||
||total|| 66 || ||

=== old kernel ===
|| #slots || #inst || fraction ||
|| 1 || 12 || 15 % ||
|| 2 || 6 || 7 % ||
|| 3 || 4 || 5 % ||
|| 4 || 16 || 20 % ||
|| 5 || 43 || 53 % ||
||total|| 81|| ||

Almost 90 % of the VLIW instructions in the "5870_opt" kernel run at maximum efficiency (here, I mean that 4 or more slots are occupied by some operations). This compares with 73 % for our old kernel.
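
The percentages follow directly from the two tables. The sketch below recomputes them and also derives a mean slot utilization, which is an extra figure not given in the tables. The function and variable names are hypothetical, and the counts are static instruction counts, so they do not account for how often each instruction executes.

```python
SLOTS_PER_VLIW = 5


def occupancy_stats(hist):
    """hist maps occupied-slot count -> number of VLIW instructions."""
    total = sum(hist.values())
    # Share of instructions with 4 or more occupied slots
    high = sum(c for s, c in hist.items() if s >= 4) / total
    # Occupied slots divided by all available slots
    mean_util = sum(s * c for s, c in hist.items()) / (SLOTS_PER_VLIW * total)
    return total, high, mean_util


opt_kernel = {1: 1, 2: 2, 3: 4, 4: 16, 5: 43}
old_kernel = {1: 12, 2: 6, 3: 4, 4: 16, 5: 43}

for name, hist in (("5870_opt", opt_kernel), ("old", old_kernel)):
    total, high, mean_util = occupancy_stats(hist)
    print(f"{name}: {total} instructions, "
          f"{100 * high:.0f} % with >= 4 slots, "
          f"{100 * mean_util:.0f} % mean slot utilization")
```

Both kernels contain the same number of 4-slot and 5-slot instructions; the difference comes from the poorly filled instructions that the optimized kernel removes.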

== Comparison with 5870_opt kernel ==
[[Image(GFLOPS_opt.png)]] 

== Benchmark system ==
|| CPU || Core2 E8400 3.0 GHz ||
|| MB  || Asus P5EWS ||
|| Memory || DDR2 800 1 GB x 4 ||
|| Power unit || Scythe CorePower3 600W ||
|| OS || Ubuntu 8.04 x86_64 2.6.24-23-generic ||
|| Catalyst || 9.9 ||
|| CAL version || 1.4beta ||

This system looks old, but the PCIe bus on the P5EWS is faster than on any other system we have so far.