= Performance comparison of GPU boards from AMD as of Oct 2009 =

We have tested GPU boards from AMD with our test program, which implements a simple N^2^ force evaluation algorithm. The nominal performance of each board is shown in the following table.

|| board || arch || clock || memory || No.SC || SPmul-add || DPadd || DPmul || BW ||
|| HD4850 || RV770 || 625 MHz || DDR3 662 MHz 256bit || 800 || 1040 GFLOPS || 208 GFLOPS || 104 GFLOPS || 63.6 GB/sec ||
|| HD4870 || RV770 || 750 MHz || DDR5 900 MHz 256bit || 800 || 1200 GFLOPS || 240 GFLOPS || 120 GFLOPS || 115.2 GB/sec ||
|| HD4770 || RV740 || 750 MHz || DDR5 800 MHz 128bit || 640 || 960 GFLOPS || 192 GFLOPS || 96 GFLOPS || 51.2 GB/sec ||
|| HD5870 || RV870 || 850 MHz || DDR5 1.2 GHz 256bit || 1600 || 2720 GFLOPS || 544 GFLOPS || 272 GFLOPS || 153.6 GB/sec ||

The program is basically the same as our demo program posted [http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770#DemoProgram here]. The demo program should work with the 5870, but I did not test it.

== Result ==

[[Image(GFLOPS.png)]]

== Note ==

 * We count one force interaction as 38 FP operations. This is the traditional flop count in this field.
 * At large N, the 5870 shows 2.2x better performance than the 4870. It reaches ~2.2 Tflops.
 * The 4850 and 4770 show identical performance. Memory BW is not critical at all in this test program.
 * I added a new result, "5870_opt". It is obtained with a new optimized kernel, which boosts the performance by ~18 %.

== Analysis of VLIW instructions ==

Our kernel is written directly in IL. After compiling it into machine code, I analyzed the generated VLIW instructions. Results were obtained with fglrx 8.65.4 (Catalyst 9.9) and CAL 1.4beta.

Both the RV770 and RV870 architectures have 5-way VLIW units at their heart. Depending on the details of the computation, the device driver (or internal compiler?) is supposed to fill the 5 slots as much as possible.
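
As a side check on the nominal figures in the board table, the peak values can be recomputed from the No.SC and clock columns. The sketch below assumes that the SP mul-add peak is No.SC x clock x 2 (a multiply-add counted as two operations per clock) and that the DPadd and DPmul columns are 1/5 and 1/10 of the SP figure. These ratios are read off the table itself, not taken from AMD documentation, and the function name `nominal_gflops` is hypothetical. Under these assumptions the HD4870, HD4770 and HD5870 rows are reproduced exactly. The HD4850 row is left out because its listed 1040 GFLOPS would correspond to a 650 MHz clock under this formula rather than the listed 625 MHz.

```python
def nominal_gflops(n_sc, clock_mhz):
    """Return (SP mul-add, DP add, DP mul) nominal peaks in GFLOPS.

    Assumed model (inferred from the board table, not from vendor docs):
      SP mul-add = No.SC * clock * 2
      DP add     = SP / 5
      DP mul     = SP / 10
    """
    sp = n_sc * clock_mhz * 2 / 1000.0  # MHz -> GFLOPS
    return sp, sp / 5.0, sp / 10.0


# (No.SC, clock in MHz) copied from the board table
boards = {
    "HD4870": (800, 750),
    "HD4770": (640, 750),
    "HD5870": (1600, 850),
}

for name, (n_sc, clock_mhz) in boards.items():
    sp, dp_add, dp_mul = nominal_gflops(n_sc, clock_mhz)
    print(f"{name}: SP {sp:.0f}, DPadd {dp_add:.0f}, DPmul {dp_mul:.0f} GFLOPS")
```
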

In the tables below, the first column shows the number of occupied slots, and the second column shows the number of VLIW instructions in the kernel with that many occupied slots. The third column gives the corresponding fraction of all VLIW instructions in the kernel. The last row gives the total number of VLIW instructions in the kernel.

=== 5870_opt kernel ===

|| #slots || #inst || fraction ||
|| 1 || 1 || 2 % ||
|| 2 || 2 || 3 % ||
|| 3 || 4 || 6 % ||
|| 4 || 16 || 24 % ||
|| 5 || 43 || 65 % ||
|| total || 66 || ||

=== old kernel ===

|| #slots || #inst || fraction ||
|| 1 || 12 || 15 % ||
|| 2 || 6 || 7 % ||
|| 3 || 4 || 5 % ||
|| 4 || 16 || 20 % ||
|| 5 || 43 || 53 % ||
|| total || 81 || ||

In the "5870_opt" kernel, almost 90% of the VLIW instructions run at near-maximum efficiency (here, I mean that 4 or more slots are occupied by some operation). This compares to 73% for our old kernel.

== Comparison with 5870_opt kernel ==

[[Image(GFLOPS_opt.png)]]

== Benchmark system ==

|| CPU || Core2 E8400 3.0 GHz ||
|| MB || Asus P5EWS ||
|| Memory || DDR2 800 1 GB x 4 ||
|| Power unit || Scythe CorePower3 600W ||
|| OS || Ubuntu 8.04 x86_64 2.6.24-23-generic ||
|| Catalyst || 9.9 ||
|| CAL version || 1.4beta ||

This system looks old, but the PCIe bus on the P5EWS is faster than that of any other system we have so far.
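
The fractions in the slot-occupancy tables of the VLIW analysis, and the roughly 90% versus 73% comparison between the two kernels, can be reproduced from the instruction counts alone. The following is a minimal sketch in Python. The function names `slot_fractions` and `share_at_least` are hypothetical, and the threshold of 4 slots follows the definition of "maximum efficiency" used in that analysis.

```python
def slot_fractions(counts):
    """Map each slot count to its fraction of all VLIW instructions."""
    total = sum(counts.values())
    return {slots: n / total for slots, n in counts.items()}


def share_at_least(counts, min_slots=4):
    """Fraction of VLIW instructions with at least min_slots occupied slots."""
    total = sum(counts.values())
    busy = sum(n for slots, n in counts.items() if slots >= min_slots)
    return busy / total


# {number of occupied slots: number of VLIW instructions}, from the tables
opt_kernel = {1: 1, 2: 2, 3: 4, 4: 16, 5: 43}
old_kernel = {1: 12, 2: 6, 3: 4, 4: 16, 5: 43}

for name, counts in (("5870_opt", opt_kernel), ("old", old_kernel)):
    print(f"{name} kernel, total {sum(counts.values())} instructions")
    for slots, frac in sorted(slot_fractions(counts).items()):
        print(f"  {slots} slots: {frac:.0%}")
    print(f"  >= 4 slots: {share_at_least(counts):.0%}")
```

The two kernels have the same number of instructions with 4 and 5 occupied slots (16 and 43). The difference comes from the sparsely filled instructions: 7 instructions with 3 or fewer occupied slots in the optimized kernel against 22 in the old one.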