OpenCL Programming by Example

Execution model

The two main execution units in OpenCL are the kernels and the host program. The kernels execute on the so-called OpenCL device, and the host program runs on the host computer. The main purpose of the host program is to create and query the platform and device attributes, define a context for the kernels, build the kernels, and manage their execution.

On submission of the kernel by the host to the device, an N-dimensional index space is created, where N is at least 1 and not greater than 3. An instance of the kernel is created at each coordinate of this index space. This instance is called a work-item, and the index space is called the NDRange. The following figure shows the three scenarios for a 1, 2, and 3 dimensional NDRange:

OpenCL NDRange

In the saxpy example which we discussed in the previous chapter, we used a global size of 1024 and a local size of 64. Each work-item computes the corresponding element:

C[global id] = alpha * A[global id] + B[global id];

A total of sixteen work-groups are spawned. When the clEnqueueNDRangeKernel function is executed, a one-dimensional NDRange is created for the saxpy_kernel function. The clEnqueueNDRangeKernel function is explained in the next section. Since in saxpy every element C[…] can be computed independently, all the work-items can run in parallel. We divide the 1024-element saxpy problem into work-groups so that each group of contiguous elements can run on a separate OpenCL-capable compute unit.

NDRange

An NDRange is the index space over which a kernel executes, where N can be 1, 2, or 3. An NDRange is given by an array of N integers specifying the extent of the index space in each dimension. Starting with OpenCL 1.1, an offset can also be specified for each dimension, which gives the starting index of the NDRange; if this offset is not specified, it defaults to 0 in each dimension. The extent of a work-group is specified by the local_work_size parameter of the clEnqueueNDRangeKernel function shown later in this section. Global IDs and local IDs are N-tuples. The global_work_size parameter defines the total number of work-items spawned for the OpenCL kernel. The global ID components are values in the range from the offset up to, but not including, the offset plus global_work_size in the corresponding dimension.

Work-items are organized into OpenCL work-groups. Take a look at the following diagram of a 2D NDRange. Work-groups provide a coarse-grained decomposition of the index space. Each work-group is assigned a unique ID with the same number of dimensions as the global index space used to define the work-items. Every work-item in a work-group is assigned a unique local ID, so that every work-item can be uniquely addressed within the scope of its work-group during kernel execution. Similarly, every work-item is assigned a unique global ID in the NDRange and can be uniquely addressed during kernel execution.

Work-groups are also assigned IDs. A work-group ID is an N-tuple of integers identifying the group's position among the work-groups in each dimension. Work-group IDs are computed from global IDs and the work-group extents; see equation 3 later in this section. Every work-item has an associated work-group ID and a local ID, and it is easy to compute the global ID of a work-item given its work-group ID and local ID; see equation 1 later in this section. Each work-item can therefore be identified in two ways: by its global index, or by its work-group index plus a local index in each of its respective dimensions.

Let's make this concrete with an N=2 NDRange and a few equations.

We will be using the following terms for defining the Execution model:

  • work-item: It is the individual kernel execution instance
  • work-group: A group of work items forms a work group
  • global-id: A unique global ID given to each work item in the global NDRange
  • local-id: A unique local ID given to each work item within a work group

Consider a (12,12) NDRange as shown in the following figure. Each of the smallest boxes is a work item. As you can see, there are twelve of them in each row, and there are twelve such rows.

Execution model, Courtesy Khronos

In the preceding diagram the global size is defined by (12, 12) ~ (Gx, Gy). The global indices in each dimension range from 0 to 11. The total number of work items is given by the product of Gx and Gy, which amounts to a total of 144 work items.

The size of each work group is (4, 4) ~ (Sx, Sy). The local indices in each dimension range from 0 to 3. The total number of work items in a work group is given by the product of Sx and Sy; in this example there are sixteen work items in each work group.

From the extent of the global work items (Gx, Gy) and the extent of the local work items (Sx, Sy), we can compute the number of work groups (Wx, Wy) in the NDRange.

Each work item is identified either by its global ID (gx, gy), or by its work-group ID (wx, wy), defined in equation 3 as follows, together with its local ID (sx, sy). The global ID can be computed from the local ID (sx, sy) and the work-group ID (wx, wy), as shown in the equation:

(gx , gy) = (wx * Sx + sx, wy * Sy + sy) (1)

The number of work groups can be computed using the equation:

(Wx, Wy) = (Gx / Sx, Gy / Sy) (2)

The work-group ID for a work item is computed using the equation:

(wx, wy) = ( (gx - sx) / Sx, (gy - sy) / Sy ) (3)

So far we have discussed the work item, work group, local ID, and global ID. All of these values can be determined inside a kernel at runtime using built-in functions, some of which are listed as follows:

  • get_global_id(int dim);
  • get_local_id(int dim);
  • get_num_groups(int dim);
  • get_local_size(int dim);
  • get_group_id(int dim);
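For instance, a kernel can use these built-ins as shown in the following device-side OpenCL C sketch (the kernel name write_ids is our own, for illustration; this fragment runs on the device, not the host):

```
__kernel void write_ids(__global int *out)
{
    size_t gid = get_global_id(0);   /* position in the whole NDRange  */
    size_t lid = get_local_id(0);    /* position within the work-group */
    size_t grp = get_group_id(0);    /* which work-group this is       */

    /* Recompute the global ID from the group and local IDs
       (equation 1 from the previous section). */
    out[gid] = (int)(grp * get_local_size(0) + lid);
}
```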

The NDRange execution space is specified through the following OpenCL API. All of the associated objects must have been created in the same OpenCL context:

cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
  cl_kernel        kernel,
  cl_uint          work_dim,
  const size_t   * global_work_offset,
  const size_t   * global_work_size,
  const size_t   * local_work_size,
  cl_uint          num_events_in_wait_list,
  const cl_event * event_wait_list,
  cl_event       * event)

This function enqueues a command to execute a kernel on the device associated with command_queue. Of all the OpenCL functions that run on the host, clEnqueueNDRangeKernel is the most important to understand. Not only does it deploy kernels to devices, it also specifies the total number of work items to be spawned (global_work_size) and the number of work items in each work group (local_work_size). The parameters are described as follows:

  • command_queue: Every command_queue is associated with one device. kernel will be enqueued for execution on this device. The command_queue object is created using the clCreateCommandQueue function.
  • kernel: It refers to an OpenCL kernel object. This kernel object would have been created using the OpenCL program object.
  • work_dim: It specifies the dimension of the NDRange (index space). The value can be 1, 2 or 3.
  • global_work_offset: A pointer to an array of work_dim size_t values. If set to NULL, the offset in each dimension defaults to 0. Otherwise it is used in calculating the global ID of each work item.
  • global_work_size: A pointer to an array of work_dim size_t values, which specifies the extent of the global work items in each dimension.
  • local_work_size: A pointer to an array of work_dim size_t values, which specifies the extent of the local work items (the work-group size) in each dimension.
  • event_wait_list and num_events_in_wait_list: The event_wait_list object contains handles to events, which an OpenCL implementation will wait for before enqueuing this command.
  • event: Every enqueued command returns an OpenCL event object that is the reference to the command in the queue. Here the kernel's execution handle is returned in the event pointer. This cl_event object can be used later on for reference to the execution status.

OpenCL supports two execution models: the data parallel programming model and the task parallel programming model. The clEnqueueNDRangeKernel function implements the data parallel execution model; the task parallel programming model will be discussed in Chapter 5, OpenCL Program and Kernel Objects.

We just used the term "enqueues a command"; let's explain what a queue has to do with OpenCL. But before that, let's discuss the OpenCL context.

OpenCL context

A context defines the entire OpenCL environment, including the devices, the program objects, the OpenCL kernels, memory objects, command queues, and so on. A context can be associated with one or multiple devices. The OpenCL context associated with a command queue and a kernel must be the same; they cannot be from different contexts.

Before we can create a context, we must first query the OpenCL runtime to determine which vendor platforms are available in the system. After selecting a vendor platform, the first step is to initialize the OpenCL implementation in order to create a context. The rest of the OpenCL work, such as creating devices and memory objects, and compiling and running programs, is performed within this context. A context can have a number of associated devices, which can be CPUs or GPUs or both. Contexts in OpenCL are referenced by a cl_context object, which must be initialized using the following OpenCL API:

cl_context clCreateContext (const cl_context_properties *properties,
                     cl_uint num_devices,
                     const cl_device_id *devices,
                     void (CL_CALLBACK *pfn_notify)
                            (const char *errinfo, 
                             const void *private_info, 
                             size_t cb, void *user_data),
                     void *user_data,
                     cl_int *errcode_ret)

The parameters of clCreateContext are described as follows:

  • properties: It is a zero-terminated list of property names, each followed by its corresponding value. A property name such as CL_CONTEXT_PLATFORM is followed by the property value. An example of the same is as follows:
    cl_context_properties props[3] =
    {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)platforms,
        0
    }; 

    One can add more property values based on the requirements of the application.

  • num_devices: It is the number of devices one wants to associate with the context. The devices pointer should point to at least num_devices cl_device_id instances.
  • devices: It is a pointer to a num_devices list of cl_device_id instances, which will be associated with the context.
  • errcode_ret: The error code returned by the OpenCL implementation associated with a call to this function.
  • pfn_notify: It is a function pointer to the callback function, which an application can register. The underlying OpenCL implementation will call this function to asynchronously report errors for context creation. If set to NULL then no callback function is registered. The prototype of a callback function is as follows:
    void OpenCL_Context_Callback(const char *errinfo,const void *private_info,size_t cb, void *user_data);
  • user_data: This is the pointer to the data, which will be passed to the callback function if registered by the application. If no callback function is registered this should be set to NULL.

OpenCL command queue

The OpenCL command queue is an object in which OpenCL commands are queued to be executed by the device. A command queue is created for every usable OpenCL device on which kernels are to be executed. One can have multiple command queues for different tasks in an application; this way an application developer can run tasks independently on different command queues. We will discuss the various synchronization mechanisms using multiple command queues in Chapter 6, Events and Synchronization. In the following code snippet, a command queue is created, and then a buffer write (clEnqueueWriteBuffer) and an NDRange kernel execution command are enqueued to the device:

cl_command_queue command_queue = clCreateCommandQueue(context,
    device_list[0], 0, &clStatus);
clStatus = clEnqueueWriteBuffer(command_queue, A_clmem, CL_TRUE, 0,
    VECTOR_SIZE * sizeof(float), A, 0, NULL, NULL);
clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
    &global_size, &local_size, 0, NULL, NULL);

The host program creates this command queue. A snapshot of the queue at any time gives the list of enqueued commands, which can be data transfer commands, kernel execution commands, or barriers within the command queue. The host enqueues these commands to the command queue. Each command or task is associated with an OpenCL event; these events can be used as a synchronization mechanism to coordinate execution between the host and the device.

There can be multiple queues associated with a context. They can dispatch commands independently and concurrently, with no explicit mechanism to synchronize between them.

By default, queues are in-order execution queues: commands are dequeued in first in, first out (FIFO) order. Hence an application can send commands to the queue and be assured that they execute in order.

Out-of-order command queues are also supported by OpenCL: commands are issued in order, but the next command does not wait for the previous command to complete before it executes. We will discuss more about this in Chapter 5, OpenCL Program and Kernel Objects.