= Implementation of a Parallel Tree Method on a GPU = 
by N.Nakasato (Journal of Computational Science)
== abstract ==
The kd-tree is a fundamental tool in computer science. 
Among other applications, the application of kd-tree search (by the
tree method) to the fast evaluation of particle interactions and
neighbor search is highly important, since the computational
complexity of these problems is reduced from O(N^2^) for a brute
force method to O(N log N) for the tree method, where N is
the number of particles. 
In this paper, we present a parallel
implementation of the tree method running on a graphics processing
unit (GPU). We present a detailed description of how we have
implemented the tree method on a Cypress GPU. 
An optimization that we found important is localized particle ordering to
effectively utilize cache memory. 
We present a number of test results
and performance measurements. 
Our results show that the execution of the tree traversal 
in a force calculation on a GPU is practical and efficient.

== Link ==
http://dx.doi.org/10.1016/j.jocs.2011.01.006

== Recent Update (201112) ==
Here is a recent result with AMD APP OpenCL SDK verstion 2.5 on Cayman (Radeon 6970).
Thanks for much better cache treatment in AMD OpenSDK, our OpenCL version (see Figure 11 in our paper) outperforms our IL version.
"VL" refers to the explicit vector length in our new tree walk kernel in OpenCL.

[[Image(tree_cayman.png)]]