
Platform model
In heterogeneous computing, knowledge of the architecture of the targeted device is critical to reap the full benefits of the hardware. We discussed the hardware architectures from AMD, Intel, and NVIDIA in Chapter 1, Hello OpenCL. Though we will briefly discuss the hardware from different vendors here, we suggest you take a deeper look at the underlying platform on which you will be working. In this section we will describe the OpenCL Platform model and map the AMD, NVIDIA, and Intel hardware architectures to the OpenCL Platform definitions.

OpenCL platform model, Courtesy Khronos
An OpenCL Platform model consists of a host connected to one or more devices, such as CPUs, GPUs, or hardware accelerators such as DSPs. Each OpenCL device consists of one or more compute units, which in turn are further divided into one or more processing elements. Computation on a device, that is, the actual kernel (work-item) execution, occurs within these processing elements. We have just coined the term work-item; we will discuss it later in this chapter when we cover the OpenCL Execution model.
We will now discuss four architectures from different device vendors and try to map each to the OpenCL Platform model. The following diagrams show the four OpenCL architectures and their mappings to the Platform model.
AMD A10 5800K APUs
The A10 5800K APU has four AMD x86_64 processor cores, which form the host. Its graphics processor includes as many as six SIMD engines, each with four texture units and sixteen thread processors. There are four ALUs in each thread processor, adding up to 384 shader cores or processing elements in total. The following diagram shows the relation of the Trinity APU to the OpenCL Platform model:

APU Showing the Platform Model and the Graphics Core. Courtesy AMD
This platform has two devices, a CPU device and a GPU device. The x86 CPU device is also the host. In the OpenCL Platform model, the CPU device maps to four compute units, each with one processing element. The graphics processor connected to the host CPU forms an OpenCL device of type GPU. The six SIMD engines form the GPU device's six compute units in the platform. Each of the six compute units has sixteen thread processors, each with four processing elements. In all, the GPU device in this platform has 384 processing elements, or shader cores.
AMD Radeon™ HD 7870 Graphics Processor
The HD 7870 discrete card is a graphics processor based on the AMD Graphics Core Next (GCN) architecture. This compute device can be connected to any x86/x86_64 platform. The CPU forms the host and the GPU forms the device in the OpenCL platform. The AMD Radeon HD 7870 GPU has a total of twenty compute units. With each compute unit having 64 shader cores, there are 1280 processing elements in total.

AMD Radeon™ HD 7870 Architecture diagram, © Advanced Micro Devices, Inc.
NVIDIA® GeForce® GTX 680 GPU
The NVIDIA GTX 680 graphics card architecture, also referred to as the Kepler architecture, is shown in the following diagram. There are eight blocks of compute units in this graphics processor. Its compute units are called Streaming Multiprocessors-X (SMX). The SMX compute unit is an advance over previous architectures, and each one contains 192 CUDA cores, or processing elements:

NVIDIA GeForce® GTX 680 Architecture. © NVIDIA
Intel® Ivy Bridge
The Ivy Bridge architecture is very similar to the Sandy Bridge architecture discussed in Chapter 1, Hello OpenCL. The CPU device maps to the Platform model like any x86 CPU, as discussed in the AMD A10 5800K APUs section. In the case of Intel hardware, the GPU device offers what are called Execution Units (EUs). Their number varies across the different SoC solutions Intel provides. The Intel HD Graphics 4000 has sixteen EUs. These sixteen EUs form sixteen compute units; that is, each execution unit is a compute unit.
For all the OpenCL hardware architectures discussed so far, an OpenCL application consists of a host program that runs according to the models native to the host platform. The host application submits commands to the device, which executes the OpenCL kernels on the processing elements of a compute device. The OpenCL specification describes the functions to create memory objects, called buffers, and to run OpenCL kernels on an OpenCL device. The host enqueues the kernel launch; before processing, the host application writes the input data to the device, and after processing it reads the results back. Ideally, the data transfer bandwidth between the host and the device should be high enough that the transfer cost can be hidden by the highly parallel computing power of the device. Some computers may use a shared memory architecture between the host (CPU) and the OpenCL device (say, a GPU); in such cases the memory transfer bottleneck may be minimal.