
Performance comparison of GPU boards from AMD as of Oct 2009

We have tested GPU boards from AMD with our test program, which implements a simple N² force evaluation algorithm. The nominal performance of each board is shown in the following table.

board	arch	clock	memory	No.SC	SPmul-add	DPadd	DPmul	BW
HD4850	RV770	625 MHz	DDR3 662 MHz 256bit	800	1040 GFLOPS	208 GFLOPS	104 GFLOPS	63.6 GB/sec
HD4870	RV770	750 MHz	DDR5 900 MHz 256bit	800	1200 GFLOPS	240 GFLOPS	120 GFLOPS	115.2 GB/sec
HD4770	RV740	750 MHz	DDR5 800 MHz 128bit	640	960 GFLOPS	192 GFLOPS	96 GFLOPS	51.2 GB/sec
HD5870	RV870	850 MHz	DDR5 1.2 GHz 256bit	1600	2720 GFLOPS	544 GFLOPS	272 GFLOPS	153.6 GB/sec
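The nominal figures in the table can be reproduced from the clock, the number of stream cores, and the memory configuration. The sketch below is an illustration, not the author's calculation: the relations in it (2 single-precision flops per stream core per cycle for mul-add, DP add at 1/5 and DP mul at 1/10 of that rate, 4 transfers per memory clock for the DDR5 boards) are assumptions inferred from the table itself, and the function name `nominal_peaks` is made up for this example. The HD4850 row is left out: under the same relation its listed 625 MHz clock would give 1000 rather than 1040 GFLOPS, and its memory is DDR3 rather than DDR5.

```python
# Nominal peak figures derived from clock and stream-core count.
# Assumptions (inferred from the table, not stated on this page):
#   SP mul-add = 2 flops per stream core per cycle
#   DP add     = SP / 5,  DP mul = SP / 10
#   DDR5 (GDDR5) bandwidth = memory clock * 4 transfers * bus width in bytes

def nominal_peaks(core_mhz, stream_cores, mem_mhz, bus_bits):
    """Return nominal GFLOPS and GB/sec figures for one board."""
    sp = 2 * stream_cores * core_mhz / 1000.0      # GFLOPS
    bw = mem_mhz * 4 * (bus_bits / 8) / 1000.0     # GB/sec
    return {"sp": sp, "dp_add": sp / 5, "dp_mul": sp / 10, "bw": bw}

# (core MHz, stream cores, memory MHz, bus width in bits), copied from the table
boards = {
    "HD4870": (750, 800, 900, 256),
    "HD4770": (750, 640, 800, 128),
    "HD5870": (850, 1600, 1200, 256),
}

for name, spec in boards.items():
    p = nominal_peaks(*spec)
    print(f"{name}: SP {p['sp']:.0f}, DP add {p['dp_add']:.0f}, "
          f"DP mul {p['dp_mul']:.0f} GFLOPS, BW {p['bw']:.1f} GB/sec")
```

For the three boards listed, the computed values agree with the corresponding table rows.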

The program is basically the same as our demo program posted here. The demo program should work with the 5870, but I have not tested it.
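For readers unfamiliar with the algorithm, an N² force evaluation sums the contribution of every particle j on every particle i. The following is a minimal CPU sketch in Python, not the IL kernel used for the measurements. It assumes a softened gravitational interaction with G = 1, which this page does not spell out, and the function name `direct_forces` and the softening length `eps` are hypothetical.

```python
import math

def direct_forces(pos, mass, eps=1.0e-2):
    """Return accelerations from an all-pairs (N^2) summation.

    Assumes softened Newtonian gravity with G = 1. `eps` is a
    hypothetical softening length chosen for illustration; it must be
    non-zero because the i == j pair is not skipped.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = pos[i]
        for j in range(n):  # every j acts on every i: N * N interactions
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            m_over_r3 = mass[j] / (r2 * math.sqrt(r2))
            # For i == j the separation is zero, so nothing is added.
            acc[i][0] += m_over_r3 * dx
            acc[i][1] += m_over_r3 * dy
            acc[i][2] += m_over_r3 * dz
    return acc

# Two equal masses one unit apart attract each other along x.
print(direct_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))
```

The inner loop has no data-dependent branches and reuses each loaded particle for many interactions, which is one plausible reason why such kernels are compute-bound rather than memory-bound.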

Result

Note

We count one force interaction as 38 floating-point operations. This is the traditional flop count in this field.
At large N, the 5870 shows 2.2x better performance than the 4870. It reaches ~2.2 Tflops.
The 4850 and 4770 show identical performance. Memory BW is not critical at all in this test program.
I have added a new result, "5870_opt", obtained with a new optimized kernel. It boosts the performance by ~18%.
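The conversion from a measured time to a flop rate under the 38-operation convention can be sketched as follows. This is an illustration only: it assumes that one force evaluation is counted as N × N interactions, the function name `gflops` is made up, and the timing in the example is hypothetical, not a measured value.

```python
FLOPS_PER_INTERACTION = 38  # convention stated in the notes above

def gflops(n_particles, seconds):
    """Convert the wall-clock time of one N^2 force evaluation to GFLOPS.

    Assumes N * N interactions per evaluation (self-interactions included).
    """
    interactions = n_particles * n_particles
    return FLOPS_PER_INTERACTION * interactions / seconds / 1.0e9

# Hypothetical timing, for illustration only (not a measured value):
# N = 100000 evaluated in 0.19 s corresponds to roughly 2000 GFLOPS.
print(gflops(100_000, 0.19))
```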

Analysis of VLIW instructions

Our kernel is written directly in IL. After compiling it into machine code, I analyzed the generated VLIW instructions. The results were obtained with fglrx 8.65.4 (Catalyst 9.9) and CAL 1.4beta.

Both the RV770 and RV870 architectures have 5-way VLIW units at their heart. Depending on the details of the computation, the device driver (or internal compiler?) presumably tries to fill the 5 slots as much as possible. In the tables below, the first column shows the number of occupied slots, and the second column shows the number of VLIW instructions in the kernel with that many occupied slots. The third column gives the fraction of VLIW instructions with a given number of occupied slots. The last row presents the total number of VLIW instructions in the kernel.

5870_opt kernel

#slots	#inst	fraction
1	1	2 %
2	2	3 %
3	4	6 %
4	16	24 %
5	43	65 %
total	66

old kernel

#slots	#inst	fraction
1	12	15 %
2	6	7 %
3	4	5 %
4	16	20 %
5	43	53 %
total	81

Almost 90% of the VLIW instructions in the "5870_opt kernel" run at high efficiency (here, I mean that 4 or more slots are occupied by some operation), compared with 73% for our old kernel.
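The percentages quoted above follow directly from the two slot tables. The short sketch below recomputes them and also reports the mean number of occupied slots per instruction, an extra summary figure that is not part of the original analysis. The function name `slot_stats` is made up for this example.

```python
def slot_stats(hist):
    """Summarize a {occupied_slots: instruction_count} histogram.

    Returns (total instructions, fraction with >= 4 slots occupied,
    mean occupied slots per instruction).
    """
    total = sum(hist.values())
    well_packed = sum(c for s, c in hist.items() if s >= 4) / total
    mean_slots = sum(s * c for s, c in hist.items()) / total
    return total, well_packed, mean_slots

# Instruction counts copied from the two tables above.
opt_kernel = {1: 1, 2: 2, 3: 4, 4: 16, 5: 43}
old_kernel = {1: 12, 2: 6, 3: 4, 4: 16, 5: 43}

for name, hist in (("5870_opt", opt_kernel), ("old", old_kernel)):
    total, frac, mean = slot_stats(hist)
    print(f"{name}: {total} instructions, "
          f"{100 * frac:.0f} % with 4 or more slots, "
          f"{mean:.2f} slots per instruction on average")
```

Both kernels contain the same number of instructions with 4 and 5 occupied slots (16 and 43); the difference lies in the poorly packed instructions, which drop from 22 to 7.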

Comparison with 5870_opt kernel

Benchmark system

CPU	Core2 E8400 3.0 GHz
MB	Asus P5EWS
Memory	DDR2 800 1 GB x 4
Power unit	Scythe CorePower3 600W
OS	Ubuntu 8.04 x86_64 2.6.24-23-generic
Catalyst	9.9
CAL version	1.4beta

This system looks old, but the PCIe bus on the P5EWS is faster than on any other system we have so far.

Last modified on Nov 30, 2009 5:59:11 PM

Attachments (2)

GFLOPS.png (36.5 KB) - added by nakasato 17 years ago.
GFLOPS_opt.png (28.3 KB) - added by nakasato 17 years ago.
