GPU programming concepts

From GPU

Jump to: navigation, search

Contents

[edit] Computational resources

There are a variety of computational resources available on the GPU:

  • Programmable processors - Vertex, primitive and fragment pipelines allow programmer to perform kernel on streams of data
  • Rasterizer - creates fragments and interpolates per-vertex constants such as texure coordinates and color
  • Texture Unit - read only memory interface
  • Framebuffer - write only memory interface

In fact, the programmer can substitute a write only texture for output instead of the framebuffer. This is accomplished either through Render-To-Texture (RTT), Render-To-Backbuffer-Copy-To-Texture(RTBCTT), or the more recent stream-out.

[edit] Textures as stream

The most common form for a stream to take in GPGPU is a 2D grid because this fits naturally with the rendering model built into GPUs. Many computations naturally map into grids: matrix algebra, image processing, physically based simulation, and so on.

Since textures are used as memory, texture lookups are then used as memory reads. Certain operations can be done automatically by the GPU because of this.

[edit] Kernels

Kernels can be thought of as the body of loops. For example, if the programmer was operating on a grid on the CPU he might have code that looked like this:

/* Pseudocode */

x = 1e8
y = 1e8

make array x by y

for each "x" { // Loop this block 1e8 times
  for each "y" { // Loop this block 1e8 times
    do_some_hard_work(x, y) // This is done 1e16 times (10 000 000 000 000 000)
  }
} 

On the GPU, the programmer only specifies the body of the loop as the kernel and what data to loop over by invoking geometry processing.

[edit] Flow control

In regular programs it is possible to control the flow of the program using if-then-else statements and various forms of loops. Such flow control structures have been indirectly added to GPUs. Branching could be zeroed out even on Nvidia's NV20, which gives roughly the same result. Conditional writes could be accomplished using a series of simpler instructions, but looping and direct conditional branching were not possible.

Recent GPUs allow branching, but usually with a performance penalty. Branching should generally be avoided in inner loops, whether in CPU or GPU code, and various techniques, such as static branch resolution, pre-computation, and Z-cull can be used to achieve branching when hardware support does not exist.

[edit] GPU techniques

[edit] Map

The map operation simply applies the given function (the kernel) to every element in the stream. A simple example is multiplying each value in the stream by a constant (increasing the brightness of an image). The map operation is simple to implement on the GPU. The programmer generates a fragment for each pixel on screen and applies a fragment program to each one. The result stream of the same size is stored in the output buffer.

[edit] Reduce

Some computations require calculating a smaller stream (possibly a stream of only 1 element) from a larger stream. This is called a reduction of the stream. Generally a reduction can be accomplished in multiple steps. The results from the previous step are used as the input for the current step and the range over which the operation is applied is reduced until only one stream element remains.

[edit] Stream filtering

Stream filtering is essentially a non-uniform reduction. Filtering involves removing items from the stream based on some criteria.

[edit] Scatter

The scatter operation is most naturally defined on the vertex processor. The vertex processor is able to adjust the position of the vertex, which allows the programmer to control where information is deposited on the grid. Other extensions are also possible, such as controlling how large an area the vertex affects.

The fragment processor cannot perform a direct scatter operation because the location of each fragment on the grid is fixed at the time of the fragment's creation and cannot be altered by the programmer. However, a logical scatter operation may sometimes be recast or implemented with an additional gather step. A scatter implementation would first emit both an output value and an output address. An immediately following gather operation uses address comparisons to see whether the output value maps to the current output slot.

[edit] Gather

The fragment processor is able to read textures in a random access fashion, so it can gather information from any grid cell, or multiple grid cells, as desired.

[edit] Sort

The sort operation transforms an unordered set of elements into an ordered set of elements. The most common implementation on GPUs is using sorting networks.

[edit] Search

The search operation allows the programmer to find a particular element within the stream, or possibly find neighbors of a specified element. The GPU is not used to speed up the search for an individual element, but instead is used to run multiple searches in parallel.

[edit] Data structures

A variety of data structures can be represented on the GPU:

  • Dense arrays
  • Sparse arrays - static or dynamic
  • Adaptive structures
Personal tools