OpenCL Programming by Example

Memory model

The OpenCL memory model guarantees only relaxed memory consistency between devices. This means that different work items may see a different view of global memory as the computation progresses. This makes it more challenging for developers to partition data and split the computation into different work items. Synchronization is required to ensure data consistency among the work items of a work group; the developer must make sure that the data a work item accesses is always up to date. Writing applications under a relaxed consistency model is therefore somewhat more complicated, and explicit synchronization mechanisms are required.
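As a concrete illustration, the following minimal kernel sketch stages data in local memory (discussed later in this section) and uses the barrier built-in so that every work item in the work group sees its neighbours' writes before reading them. The kernel name and arguments are illustrative, not taken from the book's examples:

__kernel
void reverse_block_kernel(__global const int* in,
  __global int* out,
  __local int* tile)
{
  size_t lid  = get_local_id(0);
  size_t gid  = get_global_id(0);
  size_t last = get_local_size(0) - 1;

  /* Each work item stages one element in local memory. */
  tile[lid] = in[gid];

  /* Without this barrier there is no guarantee that the writes made
     by the other work items of this work group are visible yet. */
  barrier(CLK_LOCAL_MEM_FENCE);

  /* Read the element written by a different work item. */
  out[gid] = tile[last - lid];
}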

The x86/x86_64 CPU cache coherent architecture is different from the OpenCL relaxed memory architecture. In cache coherent systems, data that resides in the local processor caches is guaranteed to be consistent across processors, so the programmer need not worry about data partitioning on such architectures. This provides plenty of memory bandwidth behind the caches and makes the application programmer's task easier. The OpenCL memory model also scales well on cache coherent memory architectures. An OpenCL programmer must know how to partition the data across the application's workload to achieve the highest performance on massively parallel heterogeneous systems. The standard defines four distinct memory regions, each of which can be accessed by the work items executing a kernel. The following are the different types of memory.

Global memory

Every OpenCL device has an associated global memory. This is the largest memory subsystem on the device. Memory accesses can be coalesced so that contiguous memory locations are read together, thereby increasing performance. All the work items in all the work groups can read from or write to this memory, and it may be cached depending on the OpenCL device. Take a look at the following OpenCL kernel prototype:

__kernel
void histogram_kernel(__global const uint* data, 
  __local uchar* sharedArray, 
  __global uint* binResultR, 
  __global uint* binResultG, 
  __global uint* binResultB) 

The __global or global keyword identifies this buffer region. This memory region is device wide and changes made in this region are visible to all the work items in the NDRange.
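For instance, a simple copy kernel such as the sketch below (the name is illustrative, not from the book) achieves coalesced access because consecutive work items touch consecutive global memory addresses:

__kernel
void copy_kernel(__global const uint* in, __global uint* out)
{
  /* Consecutive work items read and write consecutive addresses,
     allowing the hardware to combine them into wide transactions. */
  size_t gid = get_global_id(0);
  out[gid] = in[gid];
}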

Constant memory

An OpenCL device has a region of memory called the constant memory, which is initialized by the host. This is similar to creating an OpenCL buffer with the CL_MEM_READ_ONLY flag. This region of memory remains constant throughout the execution of the kernel. In kernel code, the __constant or constant keyword identifies this memory region.
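The following sketch shows one common way to use constant memory. The kernel and the host-side names (context, hostCoefficients, kernel) are placeholders assumed to be set up elsewhere, not code from the book:

__kernel
void scale_kernel(__global float* data, __constant float* coefficients)
{
  size_t gid = get_global_id(0);
  data[gid] *= coefficients[gid % 16];   /* read-only table lookup */
}

On the host, the buffer backing the __constant argument is created as read-only and initialized from host memory before the kernel is enqueued:

/* context, hostCoefficients and kernel are assumed to exist already. */
cl_int err;
cl_mem coeffBuf = clCreateBuffer(context,
  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
  16 * sizeof(cl_float), hostCoefficients, &err);
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &coeffBuf);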

Local memory

For high performance, every OpenCL device has an associated local memory. This is the memory closest to the OpenCL processing elements. Every work item in a work group can use this buffer, and it is shared amongst them; that is, if one work item modifies local memory, the change is visible to all the work items in the work group. As shown in the diagram, local memory is associated with one OpenCL compute unit. This means that all the work items of a work group run on one compute unit. The __local or local keyword identifies this memory region.
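A minimal sketch of this sharing is shown below; it computes one partial sum per work group, with names chosen for illustration rather than taken from the book:

__kernel
void partial_sum_kernel(__global const int* in,
  __global int* groupSums,
  __local int* scratch)
{
  size_t lid = get_local_id(0);

  /* Stage this work group's block of data in local memory. */
  scratch[lid] = in[get_global_id(0)];

  /* Make every work item's write visible to the whole work group. */
  barrier(CLK_LOCAL_MEM_FENCE);

  /* One work item reads the data written by all the others. */
  if (lid == 0)
  {
    int sum = 0;
    for (size_t i = 0; i < get_local_size(0); ++i)
      sum += scratch[i];
    groupSums[get_group_id(0)] = sum;
  }
}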

Private memory

The processing element scratch registers and per-work-item memory regions are referred to as private memory. This region of memory is used by the OpenCL device compiler to allocate all the local variables declared in the kernel code. Any modifications made to this memory region are not visible to other work items. As shown in the following diagram, every processing element has its own private memory. This is the default memory attribute in an OpenCL kernel:

OpenCL Memory Model, Courtesy Khronos
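As a small illustrative sketch (not from the book), the variables declared inside the kernel below reside in private memory and are typically mapped to registers:

__kernel
void dot_kernel(__global const float4* a,
  __global const float4* b,
  __global float* result)
{
  size_t gid = get_global_id(0);

  /* va, vb and d are private to this work item; no address space
     qualifier is needed because private is the default. */
  float4 va = a[gid];
  float4 vb = b[gid];
  float  d  = dot(va, vb);

  result[gid] = d;
}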

Based on the underlying architecture, the work items in a given work group execute concurrently on the processing elements of a single compute unit. This means that, in OpenCL, one work group is associated with one compute unit of the hardware. This is because most hardware architectures have high speed memory local to the compute unit. In the context of OpenCL, this high speed memory is what we refer to as private memory.

The physical resource backing private memory is shared among all the work items in a work group. For example, in some graphics architectures, every compute unit has a large private memory, say of size 64 KB. When all the work items of a work group run on the device, this 64 KB is divided among them; a work group of 64 work items leaves 1 KB of private memory for each work item. The application programmer should therefore write OpenCL kernels that use a small number of registers, so that the hardware scheduler can keep many work items (or wavefronts) in flight at a time.
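To see how much private memory a compiled kernel actually consumes per work item, the host can query the kernel object. The sketch below assumes a kernel and a device have already been created:

size_t   maxWorkGroupSize;
cl_ulong privateMemPerWorkItem;

/* Largest work group the device can launch for this kernel. */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
  sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);

/* Private memory used by each work item of this kernel. */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
  sizeof(privateMemPerWorkItem), &privateMemPerWorkItem, NULL);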