= OpenCL Notes =
== Local OpenCL path ==
{{{
export OPENCL_VENDOR_PATH=$HOME/lib/OpenCL
}}}
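
To check which vendor ICD files are present in that directory, a small listing helper can be handy. This is only a minimal sketch: "list_icds" is a hypothetical helper name, and it assumes the usual ICD layout in which each "*.icd" file is a short text file naming the vendor's OpenCL library.

{{{
# list_icds DIR: hypothetical helper that prints each ICD file in DIR
# together with the library name stored inside it.
list_icds() {
    for icd in "$1"/*.icd; do
        [ -e "$icd" ] || continue   # the glob matched nothing
        printf '%s -> %s\n' "$(basename "$icd")" "$(cat "$icd")"
    done
}

# Fall back to the path from this note when the variable is unset.
list_icds "${OPENCL_VENDOR_PATH:-$HOME/lib/OpenCL}"
}}}

An empty listing suggests that the directory is missing or contains no ICD files.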

== Configuring multi GPU setup ==
{{{
sudo aticonfig --adapter=all --initial 
}}}

Then set the following environment variable:
{{{
export DISPLAY=:0
}}}

== clang to parse and compile OpenCL kernels ==

http://steckdenis.wordpress.com/2011/05/02/using-clang-to-compile-opencl-kernels/

http://people.freedesktop.org/~steckdenis/clover/index.html

http://www.khronos.org/message_boards/viewtopic.php?f=28&t=3531

== Disable auto vectorization ==
Section 6.7.2 (page 187) of the OpenCL Specification version 1.1 (revision 33) describes "__attribute__((vec_type_hint(<typen>)))". This hint controls the autovectorizer in the OpenCL C compiler. I tested it by dumping the assembly that the Intel SDK (version 1.5) generates for a kernel that makes heavy use of "float8" and targets AVX instructions.

|| attribute     || lines of assembly || comment ||
|| no hint       || 1567 || vectorized (inner loop is further unrolled) ||
|| with the hint || 333  || simple translation of the input kernel ||

The hint dramatically reduces the size of the generated assembly code.
Without the attribute, the generated code contains two functions: (1) a version without unrolling and (2) a heavily unrolled version (8 stages, perhaps to fully utilize L1?).
This makes the generated assembly file large, and it is not clear which version is actually executed.

In both cases the core part is computed with AVX instructions, but the performance of the two differs slightly. I am still investigating this.

Intel's vectorizer was presented in a talk at the LLVM Developers' Meeting 2011: http://llvm.org/devmtg/2011-11/
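
For reference, a kernel carrying the hint might look like the sketch below. The kernel name "scale", the file name "kernel_hint.cl" and the choice of float8 as the hinted type are assumptions made for illustration; this is not the kernel that was measured above.

{{{
# Write a small test kernel; the file name "kernel_hint.cl" is hypothetical.
cat > kernel_hint.cl <<'EOF'
/* vec_type_hint(float8) tells the compiler that this kernel is already
   written in terms of float8 (an assumed choice for this sketch). */
__kernel __attribute__((vec_type_hint(float8)))
void scale(__global float8 *x, const float a)
{
    size_t i = get_global_id(0);
    x[i] = a * x[i];   /* the scalar a is widened to all eight lanes */
}
EOF
}}}

Removing the attribute line gives the "no hint" variant, so the two assembly dumps can be compared.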


== How to use the "ioc" command shipped with the Intel SDK ==
We need to set the environment variable INTELOCLSDKROOT:

{{{
export INTELOCLSDKROOT=/usr/lib64/OpenCL/vendors/intel  
}}}

To dump the assembly code:
{{{
ioc -input=kernel_file.cl -asm 
}}}

== Dump IL/ISA with AMD SDK == 
Set the following environment variable (APP Programming Guide, August 2011, section 4.2, page 63). A value of 1 dumps the IL, 2 dumps the ISA, and 3 dumps both.
{{{
export GPU_DUMP_DEVICE_KERNEL=3         
}}}


== SDK and driver == 

=== Latest SDK ===
AMD http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx

Intel http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/

Nvidia 

Apple's SDK comes with Mac OS X only

=== Latest Driver for AMD ===
http://support.amd.com/us/gpudownload/linux/Pages/radeon_linux.aspx?type=2.4.1&product=2.4.1.3.42&lang=English

== Using AMD APP on Scientific Linux ==
I have tested Scientific Linux 6.1 x86_64.

 1. Install the required packages (freeglut etc.)
 2. Install the "X Window System" package group and xdm
{{{
sudo yum groupinstall "X Window System"
sudo yum install xdm
}}}
 3. Edit /etc/inittab to change the default runlevel to 5
 4. Edit /etc/X11/xdm/Xservers to add the "-ac" option to the X server command line
 5. Edit /etc/rc.local to add an "xdm" line so that xdm starts at boot
 6. Install the driver and the SDK
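
Steps 3 and 5 can be scripted roughly as follows. This is a sketch rather than a tested installer: the function names are hypothetical, it assumes GNU sed (for "-i") and the classic "id:N:initdefault:" line in /etc/inittab, and each function takes the file to edit as its argument so that it can be tried on a copy first.

{{{
# Step 3: set the default runlevel to 5 in the given inittab file.
set_runlevel5() {
    sed -i 's/^id:[0-9]:initdefault:/id:5:initdefault:/' "$1"
}

# Step 5: append an "xdm" line to the given rc.local file,
# unless such a line is already present.
enable_xdm_at_boot() {
    grep -qx 'xdm' "$1" || echo 'xdm' >> "$1"
}
}}}

On the real system these would be run as root with /etc/inittab and /etc/rc.local as arguments.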

= Random notes =


== icc ==
http://software.intel.com/en-us/articles/using-intel-compilers-for-linux-with-ubuntu/

== packages ==
Ubuntu 10.04
{{{
sudo aptitude install xdm ia32-libs subversion zsh libgsl0-dev gfortran libnetcdf-dev g++ freeglut3-dev xserver-xorg rake emacs lv xterm autofs nfs-client vim libblas-dev ssh git-core libnuma1
}}}


http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=147002

== process affinity ==
From command line: http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html

http://www.open-mpi.org/projects/hwloc/

== DOUBLE ==
http://developer.amd.com/support/KnowledgeBase/Lists/KnowledgeBase/DispForm.aspx?ID=92




= old info =
== Standard Compute Layer Library ==
http://www.browndeertechnology.com/stdcl.html

A wrapper library for the OpenCL API, used in the tutorial below.
It seems that libstdcl greatly simplifies
the sample program ($OPENCLDIR/samples/opencl/cl/app/NBody) supplied with Stream SDK 2.0beta.

== OpenCL Tutorial: N-Body Simulation == 
http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody.html

A tutorial that modifies the sample NBody program, which is written with the OpenCL API.

== local & global ==
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=123350&enterthread=y
{{{
It's a tricky question . On 4xxx __local mem is really __global mem ( ATI thinks it's too much work to optimize compiler to use 48xx  LDS - although it's possible ). 
On 5xxx __local is LDS - so it's located in simd core.
}}}


== Catalyst 10.1 with Ubuntu 9.10 workarounds ==
=== Change the kernel boot option ===
Edit "/etc/default/grub" and run update-grub. I added "nopat" to GRUB_CMDLINE_LINUX_DEFAULT for Catalyst 10.1.

The GRUB configuration file is /boot/grub/grub.cfg.
It is generated automatically by the update-grub command, so make changes in /etc/default/grub rather than in grub.cfg.

This workaround is not necessary for Catalyst 10.2.
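
The edit to GRUB_CMDLINE_LINUX_DEFAULT can be sketched as a small shell function. The name "add_nopat" is hypothetical, and the sketch assumes GNU sed and a double-quoted value on a single line, as in the stock Ubuntu file.

{{{
# add_nopat FILE: append "nopat" to GRUB_CMDLINE_LINUX_DEFAULT in FILE,
# doing nothing if the option is already there.
add_nopat() {
    grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*nopat' "$1" && return 0
    sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 nopat"/' "$1"
}
}}}

On the real system this would be called as root on /etc/default/grub, followed by update-grub.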

=== X server setting ===
The default login manager, gdm, is difficult to configure properly, so I installed xdm instead.
The configuration files for xdm are in the /etc/X11/xdm directory.

Edit the "Xservers" file as follows:
{{{
:0 local /usr/bin/X :0 vt7 -nolisten tcp -ac
}}}
The "-ac" option disables access control, which allows remote applications to access the local X server.
Note that this option is generally regarded as bad for security. Be careful.