
Performance comparison of GPU boards from AMD as of Oct 2009

We have tested GPU boards from AMD with our test program, which implements a simple N^2 force evaluation algorithm. The nominal performance of each board is shown in the following table.

board   arch   clock    memory               No.SC  SP mul-add   DP add      DP mul      BW
HD4850  RV770  625 MHz  DDR3 662 MHz 256bit  800    1040 GFLOPS  208 GFLOPS  104 GFLOPS  63.6 GB/sec
HD4870  RV770  750 MHz  DDR5 900 MHz 256bit  800    1200 GFLOPS  240 GFLOPS  120 GFLOPS  115.2 GB/sec
HD4770  RV740  750 MHz  DDR5 800 MHz 128bit  640    960 GFLOPS   192 GFLOPS  96 GFLOPS   51.2 GB/sec
HD5870  RV870  850 MHz  DDR5 1.2 GHz 256bit  1600   2720 GFLOPS  544 GFLOPS  272 GFLOPS  153.6 GB/sec
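The nominal figures in the table can be cross-checked from the clock, memory and No.SC columns. The sketch below is only a sanity check under assumptions that are not stated on this page: one single-precision multiply-add (2 flops) per SC per cycle, DP add and DP mul rates of 1/5 and 1/10 of the SP figure (the ratios the table itself shows), and 4 transfers per memory clock for DDR5. The function names are hypothetical. The HD4850 row is left out because its listed figures do not follow from its listed clocks under these assumptions.

```python
def sp_muladd_gflops(n_sc, clock_mhz):
    """Nominal SP rate, assuming one multiply-add (2 flops) per SC per cycle."""
    return n_sc * clock_mhz * 2 / 1000.0


def memory_bw_gbs(bus_bits, mem_clock_mhz, transfers_per_clock=4):
    """Nominal memory bandwidth, assuming 4 transfers per clock (DDR5 rows)."""
    return bus_bits / 8 * mem_clock_mhz * transfers_per_clock / 1000.0


# (No.SC, core clock in MHz, bus width in bits, memory clock in MHz), from the table
boards = {
    "HD4870": (800, 750, 256, 900),
    "HD4770": (640, 750, 128, 800),
    "HD5870": (1600, 850, 256, 1200),
}

for name, (n_sc, clock, bus, mem_clock) in boards.items():
    sp = sp_muladd_gflops(n_sc, clock)
    # DP add = SP / 5 and DP mul = SP / 10 are ratios read off the table
    print(name, sp, sp / 5, sp / 10, memory_bw_gbs(bus, mem_clock))
```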

The program is basically the same as our demo program posted here. The demo program should work with the 5870, but I have not tested it.
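For readers unfamiliar with the algorithm, an N^2 force evaluation can be sketched in a few lines. This is not the IL kernel used for the measurements, only a plain reference version. It assumes a softened gravitational force with G = 1, and the names `direct_forces` and `eps` are hypothetical.

```python
import math


def direct_forces(pos, mass, eps=1.0e-3):
    """Acceleration on every particle by direct summation over all N*N pairs.

    pos  : list of (x, y, z) tuples
    mass : list of masses
    eps  : softening length (assumed); it keeps the i == j term finite
    """
    n = len(pos)
    acc = []
    for i in range(n):
        xi, yi, zi = pos[i]
        ax = ay = az = 0.0
        for j in range(n):  # includes j == i, which contributes exactly zero
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            m_over_r3 = mass[j] / (r2 * math.sqrt(r2))
            ax += m_over_r3 * dx
            ay += m_over_r3 * dy
            az += m_over_r3 * dz
        acc.append((ax, ay, az))
    return acc


# Two equal masses one unit apart attract each other along x
acc = direct_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
```

Because the self-interaction term vanishes with softening, the inner loop needs no branch, which is one reason the work per force evaluation is usually counted as N^2 interactions.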

Result

Note

  • We count one force interaction as 38 FP operations. This is the traditional flop count in this field.
  • At large N, the 5870 shows 2.2x better performance than the 4870. It reaches ~ 2.2 Tflops.
  • The 4850 and 4770 show identical performance despite their different memory bandwidths, so memory BW is not critical at all in this test program.
  • I added a new result, "5870_opt". It is obtained with a new optimized kernel, which boosts the performance by ~ 18 %.
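The flop-count convention in the notes above turns a measured wall-clock time into a performance figure. A minimal sketch, assuming that one force evaluation over N particles is counted as N*N interactions; the function name is hypothetical and the numbers in the usage line are illustrative, not measurements.

```python
FLOPS_PER_INTERACTION = 38  # convention stated in the notes


def measured_gflops(n, seconds):
    """Performance in GFLOPS for one N*N force evaluation taking `seconds`."""
    interactions = n * n  # assumption: all pairs counted, including i == j
    return FLOPS_PER_INTERACTION * interactions / seconds / 1.0e9


# Illustrative input only: 65536 particles evaluated in 0.1 s
print(measured_gflops(65536, 0.1))
```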

Analysis of VLIW instructions

Our kernel is written directly in IL. After compiling it into machine code, I analyzed the generated VLIW instructions. The results were obtained with fglrx 8.65.4 (Catalyst 9.9) and CAL 1.4beta.

Both the RV770 and RV870 architectures have 5-way VLIW units at their heart. Depending on the details of the computation, the device driver (or its internal compiler?) presumably tries to fill the 5 slots as much as possible. In each table, the first column shows the number of occupied slots, the second column shows the number of VLIW instructions in the kernel with that many occupied slots, and the third column shows the fraction of all VLIW instructions that they represent. The last row gives the total number of VLIW instructions in the kernel.

5870_opt kernel

#slots  #inst  fraction
1       1      2 %
2       2      3 %
3       4      6 %
4       16     24 %
5       43     65 %
total   66

old kernel

#slots  #inst  fraction
1       12     15 %
2       6      7 %
3       4      5 %
4       16     20 %
5       43     53 %
total   81

In the "5870_opt kernel", almost 90% of the VLIW instructions run at or near maximum efficiency (here, I mean that 4 or more slots are occupied by some operation). This compares with 73% for our old kernel.
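The percentages quoted above follow directly from the two slot tables. The sketch below recomputes them and also derives an average slot utilisation, which is an extra figure not given on this page. The function and variable names are hypothetical.

```python
# Number of VLIW instructions per occupied-slot count, copied from the tables
opt_kernel = {1: 1, 2: 2, 3: 4, 4: 16, 5: 43}
old_kernel = {1: 12, 2: 6, 3: 4, 4: 16, 5: 43}


def occupancy_stats(hist, width=5, threshold=4):
    """Return (total instructions, fraction with >= threshold slots filled,
    average fraction of the `width` slots that are occupied)."""
    total = sum(hist.values())
    well_filled = sum(n for slots, n in hist.items() if slots >= threshold)
    occupied = sum(slots * n for slots, n in hist.items())
    return total, well_filled / total, occupied / (total * width)


for name, hist in (("5870_opt", opt_kernel), ("old", old_kernel)):
    total, frac, util = occupancy_stats(hist)
    print(f"{name}: {total} inst, {frac:.0%} with >= 4 slots, {util:.0%} utilisation")
```

The two kernels contain the same number of 4-slot and 5-slot instructions, so the difference comes entirely from the 1-slot and 2-slot instructions that the optimized kernel eliminates.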

Comparison with 5870_opt kernel

Benchmark system

CPU           Core2 E8400 3.0 GHz
MB            Asus P5EWS
Memory        DDR2 800 1 GB x 4
Power unit    Scythe CorePower3 600W
OS            Ubuntu 8.04 x86_64 2.6.24-23-generic
Catalyst      9.9
CAL version   1.4beta

This system looks old, but the PCIe bus on the P5EWS is faster than on any other system we have so far.

Last modified on Nov 30, 2009 5:59:11 PM
