Performance comparison of GPU boards from AMD as of Oct 2009
We have tested GPU boards from AMD with our test program, which implements a simple N^2 force evaluation algorithm. The nominal performance of each board is shown in the following table.
board | arch | clock | memory | No.SC | SPmul-add | DPadd | DPmul | BW |
HD4850 | RV770 | 625 MHz | DDR3 662 MHz 256bit | 800 | 1040 GFLOPS | 208 GFLOPS | 104 GFLOPS | 63.6 GB/sec |
HD4870 | RV770 | 750 MHz | DDR5 900 MHz 256bit | 800 | 1200 GFLOPS | 240 GFLOPS | 120 GFLOPS | 115.2 GB/sec |
HD4770 | RV740 | 750 MHz | DDR5 800 MHz 128bit | 640 | 960 GFLOPS | 192 GFLOPS | 96 GFLOPS | 51.2 GB/sec |
HD5870 | RV870 | 850 MHz | DDR5 1.2 GHz 256bit | 1600 | 2720 GFLOPS | 544 GFLOPS | 272 GFLOPS | 153.6 GB/sec |
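The nominal figures for the three DDR5 boards can be reproduced from the clock, the number of stream cores, and the memory configuration. The following is a sanity-check sketch, not the method used to build the table. It assumes one single-precision multiply-add (2 FP operations) per stream core per clock, DP add and DP mul rates of 1/5 and 1/10 of the SP rate (ratios read off the table), and four transfers per memory clock for DDR5. The function names are hypothetical.

```python
def sp_muladd_gflops(stream_cores, clock_mhz):
    """Nominal SP rate, assuming one multiply-add (2 flops) per core per clock."""
    return 2 * stream_cores * clock_mhz / 1000.0


def dp_rates_gflops(sp_gflops):
    """DP add and DP mul rates, assuming the 1/5 and 1/10 ratios seen in the table."""
    return sp_gflops / 5.0, sp_gflops / 10.0


def ddr5_bandwidth_gb(mem_clock_mhz, bus_bits):
    """Nominal bandwidth in GB/sec, assuming 4 transfers per memory clock."""
    return mem_clock_mhz * 4 * (bus_bits / 8) / 1000.0


# (stream cores, core clock in MHz, memory clock in MHz, bus width in bits)
boards = {
    "HD4870": (800, 750, 900, 256),
    "HD4770": (640, 750, 800, 128),
    "HD5870": (1600, 850, 1200, 256),
}

for name, (cores, clock, mem_clock, bus) in boards.items():
    sp = sp_muladd_gflops(cores, clock)
    dp_add, dp_mul = dp_rates_gflops(sp)
    bw = ddr5_bandwidth_gb(mem_clock, bus)
    print(f"{name}: SP {sp:.0f}  DPadd {dp_add:.0f}  DPmul {dp_mul:.0f} GFLOPS  BW {bw:.1f} GB/sec")
```

The HD4850 is left out because it uses DDR3, and the number of transfers per clock for that memory is not stated here.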
The program is basically the same as our demo program posted here. The demo program should work with the 5870, but I have not tested it.
Result
Note
- We count one force interaction as 38 FP operations. This is the conventional flop count in this field.
- At large N, the 5870 shows 2.2x better performance than the 4870. It reaches ~ 2.2 Tflops.
- The 4850 and 4770 show identical performance. Memory BW is not critical at all in this test program.
- I added a new result, "5870_opt". It is obtained with a new optimized kernel, which boosts the performance by ~ 18 %.
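To make the flop-count convention concrete, here is a minimal CPU sketch of an N^2 force evaluation and the conversion from elapsed time to GFLOPS. It is not our IL kernel. It assumes softened gravity as the force law and counts all N x N pairs as interactions, neither of which is spelled out on this page, and the names `accelerations`, `gflops` and `eps2` are hypothetical.

```python
import math
import random
import time

FLOPS_PER_INTERACTION = 38  # the convention stated in the notes above


def accelerations(pos, mass, eps2=1.0e-4):
    """Direct O(N^2) summation of softened gravitational accelerations.

    The j == i term contributes zero because the separation is zero and
    eps2 > 0 keeps the denominator finite, so no branch is needed.
    """
    n = len(pos)
    acc = []
    for i in range(n):
        xi, yi, zi = pos[i]
        ax = ay = az = 0.0
        for j in range(n):
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            m_over_r3 = mass[j] / (r2 * math.sqrt(r2))
            ax += dx * m_over_r3
            ay += dy * m_over_r3
            az += dz * m_over_r3
        acc.append([ax, ay, az])
    return acc


def gflops(n, seconds):
    """Convert one full N^2 force evaluation into GFLOPS (38 flops per pair)."""
    return FLOPS_PER_INTERACTION * n * n / seconds / 1.0e9


# Small timing run on random particles; the GPU tests use far larger N.
n = 200
random.seed(1)
pos = [[random.random() for _ in range(3)] for _ in range(n)]
mass = [1.0 / n] * n
start = time.perf_counter()
acc = accelerations(pos, mass)
elapsed = time.perf_counter() - start
print(f"N = {n}: {gflops(n, elapsed):.4f} GFLOPS (pure Python, CPU)")
```

With this convention, a board that sustains 2.2 Tflops evaluates roughly 5.8 x 10^10 pairwise interactions per second.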
Analysis of VLIW instructions
Our kernel is written directly in IL. After compiling it into machine code, I analyzed the generated VLIW instructions. The results were obtained with fglrx 8.65.4 (Catalyst 9.9) and CAL 1.4beta.
Both the RV770 and RV870 architectures have 5-way VLIW units at their heart. Depending on the details of the computation, the device driver (or its internal compiler?) presumably tries to fill the 5 slots as fully as possible. In the tables below, the first column shows the number of occupied slots, and the second column shows the number of VLIW instructions in the kernel with that many slots occupied. The third column gives the fraction of VLIW instructions with that number of occupied slots. The last row gives the total number of VLIW instructions in the kernel.
5870_opt kernel
#slots | #inst | fraction |
1 | 1 | 2 % |
2 | 2 | 3 % |
3 | 4 | 6 % |
4 | 16 | 24 % |
5 | 43 | 65 % |
total | 66 |
old kernel
#slots | #inst | fraction |
1 | 12 | 15 % |
2 | 6 | 7 % |
3 | 4 | 5 % |
4 | 16 | 20 % |
5 | 43 | 53 % |
total | 81 |
Almost 90% of the VLIW instructions in the "5870_opt" kernel are issued with 4 or more of the 5 slots occupied, which is what I mean here by running at maximum efficiency. The corresponding figure for our old kernel is 73%.
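The percentages quoted above follow directly from the two slot-occupancy tables. The sketch below recomputes them and also derives the mean number of occupied slots per instruction, which is not reported on this page. It is a static count over the kernel listing: it implicitly assumes every VLIW instruction is executed equally often, which may not hold for the real kernel. The helper name `slot_stats` is hypothetical.

```python
def slot_stats(hist):
    """Summarize a histogram {occupied slots: number of VLIW instructions}."""
    total_inst = sum(hist.values())
    total_ops = sum(slots * count for slots, count in hist.items())
    well_packed = sum(count for slots, count in hist.items() if slots >= 4)
    return {
        "instructions": total_inst,
        "frac_ge4": well_packed / total_inst,   # 4 or more slots occupied
        "mean_slots": total_ops / total_inst,   # average occupancy out of 5
    }


# Histograms copied from the two tables above.
opt_kernel = {1: 1, 2: 2, 3: 4, 4: 16, 5: 43}
old_kernel = {1: 12, 2: 6, 3: 4, 4: 16, 5: 43}

for label, hist in (("5870_opt kernel", opt_kernel), ("old kernel", old_kernel)):
    s = slot_stats(hist)
    print(f"{label}: {s['instructions']} instructions, "
          f"{100 * s['frac_ge4']:.0f} % with >= 4 slots, "
          f"mean occupancy {s['mean_slots']:.2f} of 5")
```

Both kernels contain the same number of 3-, 4- and 5-slot instructions. The optimized kernel differs only in having far fewer 1- and 2-slot instructions, which reduces the total from 81 to 66.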
Comparison with 5870_opt kernel
Benchmark system
CPU | Core2 E8400 3.0 GHz |
MB | Asus P5EWS |
Memory | DDR2 800 1 GB x 4 |
Power unit | Scythe CorePower3 600W |
OS | Ubuntu 8.04 x86_64 2.6.24-23-generic |
Catalyst | 9.9 |
CAL version | 1.4beta |
This system looks old, but the PCIe bus on the P5EWS is faster than on any other system we have so far.
Attachments (2)
- GFLOPS.png (36.5 KB) - added by nakasato 15 years ago.
- GFLOPS_opt.png (28.3 KB) - added by nakasato 15 years ago.