GPUVerify: a Verifier for GPU Kernels
©

Documentation

Basic Usage

OpenCL: To invoke GPUVerify on an OpenCL kernel, do:

gpuverify --local_size=<work-group-dimensions> --num_groups=<grid-dimensions> <OpenCL file>

Here, <work-group-dimensions> is a vector specifying the dimensionality of each work group, <grid-dimensions> is a vector specifying the dimensionality of the grid of work groups, and <OpenCL file> is an OpenCL file with extension .cl.

For example, to check kernel.cl with respect to a (16x16) grid of work groups, each of which is a (32x32) arrangement of work-items, do:

gpuverify --num_groups=[16,16] --local_size=[32,32] kernel.cl

For brevity, a 1D vector (x) can be written as simply x. So, for instance, to check kernel.cl with respect to a single work group which is a (32x32) arrangement of work-items, you can specify:

gpuverify --num_groups=1 --local_size=[32,32] kernel.cl
which is equivalent to:
gpuverify --num_groups=[1] --local_size=[32,32] kernel.cl

CUDA: To invoke GPUVerify on a CUDA kernel, do:
gpuverify --blockDim=<block-dimensions> --gridDim=<grid-dimensions> <CUDA file>

Here, <block-dimensions> is a vector specifying the dimensionality of each thread block, <grid-dimensions> is a vector specifying the dimensionality of the grid of thread blocks, and <CUDA file> is a CUDA file with extension .cu.

For example, to check kernel.cu with respect to a (16x16) grid of thread blocks, each of which is a (32x32) arrangement of threads, do:

gpuverify --gridDim=[16,16] --blockDim=[32,32] kernel.cu

For brevity, a 1D vector (x) can be written as simply x. So, for instance, to check kernel.cl with respect to a single work group which is a (32x32) arrangement of work-items, you can specify:

gpuverify --gridDim=1 --blockDim=[32,32] kernel.cu
which is equivalent to:
gpuverify --gridDim=[1] --blockDim=[32,32] kernel.cu

Usage modes

There are two main usage modes:

  • verify mode (default)
  • bug-finding mode, specified via the --findbugs flag

Verify mode

By default, GPUVerify runs in verify mode. In this mode, the tool will attempt to verify that the input kernel is free from defects. If verification succeeds, the tool will report that the kernel has been verified as free from the types of defects which GPUVerify can check for. This verification result can be trusted, modulo bugs in GPUVerify and known sources of unsoundness in the tool. In due course this documentation will be updated to describe potential sources of unsoundness.

In verify mode, any defects reported by GPUVerify are possible defects, but if the kernel contains loops or multiple (non-inlined) procedures there is a high chance that they may be false positives arising due to limitations of the tool's invariant and contract inference procedures.

Bug-finding mode

The --findbugs flag causes GPUVerify to run in bug-finding mode. In this mode, all loops are unwound by a fixed (default=2) number of iterations; executions that go beyond this unwinding depth are not checked.

In bug-finding mode, defects reported by GPUVerify are much more likely to be genuine than when the tool runs in verify mode, because loops are not abstracted using invariants. However, GPUVerify can still report false positives because:

  • shared arrays are modelled abstractly
  • floating point calculations are represented abstractly using uninterpreted functions
If GPUVerify reports that no defects were found when running in bug-finding mode this does not mean that the kernel is free from defects.

General properties checked by GPUVerify

Intra-group data races

GPUVerify checks for intra-group data races. In OpenCL terms, an intra-group data race is a race between work items in the same work group. In CUDA terms an intra-group data race is a race between threads in the same thread block.

Suppose the following OpenCL kernel is executed by a single work group consisting of 1024 work items:

1  __kernel void foo(__global int *p) {
2    p[get_local_id(0)] = get_local_id(0);
3    p[get_local_id(0) + get_local_size(0) - 1] = get_local_id(0);
4  }

An intra-group data race can occur between work items 0 and 1023. If we run GPUVerify on the example:

gpuverify --local_size=1024 --num_groups=1 intra-group.cl
then this intra-group race is detected:
intra-group.cl: error: possible write-write race on ((char*)p)[4092]:

intra-group.cl:3:23: write by thread (0, 0, 0) group (0, 0, 0)
 p[get_local_id(0) + get_local_size(0) - 1] = get_local_id(0);

intra-group.cl:2:5:  write by thread (1023, 0, 0) group (0, 0, 0)
 p[get_local_id(0)] = get_local_id(0);

Inter-group data races

GPUVerify checks for inter-group data races. In OpenCL terms, an inter-group data race is a race between work items in differentwork groups. In CUDA terms an inter-group data race is a race between threads in different thread blocks.

Suppose the following CUDA kernel is executed by 8 thread blocks each consisting of 64 work items:

1  #include <cuda.h>
2  
3  __global__ void foo(int *p) {
4    p[threadIdx.x] = threadIdx.x;
5  }

The kernel is free from intra-group data races, but inter-group data race can occur between threads in different blocks that have identical intra-block thread indices. If we run GPUVerify on the example:

gpuverify --blockDim=64 --gridDim=8 inter-group.cu
then an inter-group race is detected:
inter-group.cu: error: possible write-write race on ((char*)p)[0]:

inter-group.cu:4:3: write by thread (0, 0, 0) group (0, 0, 0)
 p[threadIdx.x] = threadIdx.x;

inter-group.cu:4:3: write by thread (0, 0, 0) group (1, 0, 0)
 p[threadIdx.x] = threadIdx.x;
Barrier divergence

GPUVerify detects cases where a kernel breaks the rules for barrier synchronization in conditional code defined in the CUDA and OpenCL documentation. In particular, the tool checks that if a barrier occurs in a conditional statement then all threads must evaluate the condition uniformly, and if a barrier occurs inside a loop then all threads must execute the same number of loop iterations before synchronizing at the barrier.

GPUVerify rejects the following OpenCL kernel, executed by a single work group of 1024 work items, because work items will execute different numbers of loop iterations, breaking the barrier synchronization rules:

1  __kernel void foo(__global int *p) {
2    for(int i = 0; i < get_global_id(0); i++) {
3      p[i + get_global_id(0)] = get_global_id(0);
4      barrier(CLK_GLOBAL_MEM_FENCE);
5    }
6  }
gpuverify --local_size=1024 --num_groups=1 barrier-div-opencl.cl
barrier-div.cl:4:5: error: barrier may be reached by non-uniform control flow
   barrier(CLK_GLOBAL_MEM_FENCE);

GPUVerify rejects the following CUDA kernel when, say, executed by a 32x32 grid of 16x16 thread blocks. The reason is that the tool assumes the contents of array p are arbitrary, so there is no guarantee that all threads in a thread block will reach the same barrier:

1  #include <cuda.h>
2  
3  __global__ void foo(int *p) {
4    if(p[threadIdx.x]) {
5      // May be reached by some threads but not others depending on contents of p
6      __syncthreads();
7    }  
8  }
gpuverify --blockDim=[16,16] --gridDim=[32,32] barrier-div-cuda.cu
barrier-div-cuda.cu:6:5: error: barrier may be reached by non-uniform control flow
   __syncthreads();

Command Line Options

-h, --help

Display brief help messages for command line options.

-I <value>

Add directory to include search path.

-D <value>

Define symbol.

--findbugs

Run tool in bug-finding mode.

--loop-unwind=X

Explore traces that pass through at most X loop heads

--no-benign

Do not tolerate benign data races.

--no-infer

Turn off invariant inference.

--only-divergence

Only check for barrier divergence, not for races.

For example,

--only-intra-group

Do not check for inter-group races

--verify

Run tool in verify mode.

--verbose

Show commands to run and use verbose output.

Annotation Language

Assertions

You can write assertions in GPU kernels using the special function __assert. Such assertions will be checked by GPUVerify. For example, consider the following CUDA kernel:

1  #include <cuda.h>
2  
3  __global__ void foo() {
4    __assert(threadIdx.x + blockIdx.x != 27);
5  }

If we run GPUVerify specifying blockDim=8 and gridDim=8 then the tool verifies that the assertion cannot fail. However, for large block and grid sizes it can fail:

gpuverify --blockDim=64 --gridDim=64 failing-assert.cu
failing-assert.cu:4:3: error: this assertion might not hold for thread (5, 0, 0) group (22, 0, 0)
 __assert(threadIdx.x + blockIdx.x != 27);
failing-assert.cu:4:3: error: this assertion might not hold for thread (16, 0, 0) group (11, 0, 0)
 __assert(threadIdx.x + blockIdx.x != 27);

Loop invariants

GPUVerify may be unable to verify a kernel due to limitations in the tool's invariant inference engine. In this case the user can supply invariants manually. However, any manually supplied invariant is checked by the tool.

If an assertion is placed exactly at the head of a loop, it is treated by GPUVerify as a user-supplied loop invariant. For example, in the following (dumb) OpenCL kernel:

1   __kernel void foo() {
2     int i = 0;
3     while(
4       __assert(i <= 100) // Assertion at loop head treated as invariant
5       , // Comma operator separates invariant from loop guard
6       i < get_local_id(0) // This is the loop guard
7     ) {
8       i++;
9     }
10  }

the assertion __assert(i <= 100) is taken to be a loop invariant because it appears directly at the head of the loop. The comma operator is used to allow invariant assertions to precede the loop guard. In fact, invariant assertions are actually supplied as part of the loop guard expression, but they do not affect the truth value of the loop guard.

If the above example is analysed with a sufficiently small local size, the invariant is verified as holding. However, for a local size larger than 128 the invariant is not maintained by the loop, and GPUVerify detects this potential problem:

gpuverify --local_size=128 --num_groups=16 assert-as-invariant.cl
assert-as-invariant.cl:4:5: error: loop invariant might not be maintained by the loop for thread (102, 0, 0) group (8, 0, 0)
   __assert(i <= 100) // Assertion at loop head treated as invariant
assert-as-invariant.cl:4:5: error: loop invariant might not be maintained by the loop for thread (101, 0, 0) group (0, 0, 0)
   __assert(i <= 100) // Assertion at loop head treated as invariant

For readability, one can write __invariant as a synonym for __assert. This is recommended to state the intention that a given assertion is an invariant rather than a plain assertion, but currently GPUVerify does not check that this intention is satisifed. In particular, if __invariant is used somewhere other than a loop head then it is treated as a plain assertion.

Multiple invariants for a loop can be specified using multiple __invariant or __assert commands at the loop head. The following example illustrates this, and also shows how an invariant can be specified for a for loop:

1   __kernel void foo() {
2     int j = get_local_id(0) - 1;
3     for(int i = get_local_id(0);
4         __invariant(i < 200), // First invariant
5         __assert(i >= 0),  // Second invariant
6         __invariant(j == i - 1), // Third invariant
7         i > 0; // Loop guard
8         i--
9     ) {
10      j--;
11    }
12  }

Notice that three invariants have been specified, two using __invariant and one using __assert. Notice also that with a for loop, the invariants come immediately before the guard, which is the middle component of the for loop's specifier. Again, the comma operator is used to separate the invariants from the guard.

For sufficiently small local sizes these invariants hold. However, for a large local size, the first invariant will fail on loop entry, as reported by GPUVerify:

gpuverify --local_size=1024 --num_groups=16 invariant-for-loop.cl
invariant-for-loop.cl:4:7: error: loop invariant might not hold on entry for thread (512, 0, 0) group (0, 0, 0)
     __invariant(i < 200), // First invariant
invariant-for-loop.cl:4:7: error: loop invariant might not hold on entry for thread (576, 0, 0) group (8, 0, 0)
     __invariant(i < 200), // First invariant

Invariants are easy to specify for do-while loops: the invariants must simply appear as the first commands of the loop body:

1   __kernel void foo() {
2     int j = get_local_id(0) - 1;
3     int i = get_local_id(0);
4     do {
5       __invariant(i < 200); // First invariant
6       __assert(i >= 0); // Second invariant
7       __invariant(j == i - 1); // Third invariant
8       i--;
9       j--;
10    } while(i > 0);
11  }

While programming using goto is not recommended, the goto keyword may be used in OpenCL. If you write a loop using goto and wish to specify invariants for this loop then the rule is the same as usual: place invariants at the loop head. This is illustrated for a goto version of the do-while example as follows:

1   __kernel void foo() {
2     int j = get_local_id(0) - 1;
3     int i = get_local_id(0);
4   head:
5       __invariant(i < 200); // First invariant
6       __assert(i >= 0); // Second invariant
7       __invariant(j == i - 1); // Third invariant
8       i--;
9       j--;
10      if(i > 0) goto head;
11  }
Specifying pre-conditions using requires

Sometimes a kernel is correct only for certain input parameter values. For example, the following CUDA kernel is not correct when executed by a single block of 1024 threads for arbitrary values of parameter sz:

1  #include <cuda.h>
2  
3  __global__ void foo(float *A, int sz) {
4    for(int i = 0; i < 100; i++) {
5      A[sz*i + threadIdx.x] *= 2.0f;
6    }
7  }

There will be data races if the absolute value of sz is less than the size of a thread block, or if the absolute value of sz is so large that overflow can occur. GPUVerify therefore rejects the kernel:

needs_requires.cu: error: possible write-read race on ((char*)A)[-62043744]:

needs_requires.cu:5:5: read by thread (512, 0, 0) group (0, 0, 0)
 A[sz*i + threadIdx.x] *= 2.0f;

needs_requires.cu:5:5: write by thread (520, 0, 0) group (0, 0, 0)
 A[sz*i + threadIdx.x] *= 2.0f;

needs_requires.cu: error: possible read-write race on ((char*)A)[37438500]:

needs_requires.cu:5:5: write by thread (0, 0, 0) group (0, 0, 0)
 A[sz*i + threadIdx.x] *= 2.0f;

needs_requires.cu:5:5: read by thread (1, 0, 0) group (0, 0, 0)
 A[sz*i + threadIdx.x] *= 2.0f;

If the programmer knows that sz should be equal to the x dimension of a thread block, they can specify this via a pre-condition on the kernel, using he __requires annotation:

1  #include <cuda.h>
2  
3  __global__ void foo(float *A, int sz) {
4    __requires(sz == blockDim.x);
5    for(int i = 0; i < 100; i++) {
6      A[sz*i + threadIdx.x] *= 2.0f;
7    }
8  }

With this annotation, GPUVerify is able to verify race-freedom.

Rules for using __requires:

Specifying post-conditions using ensures
  • __return_val
  • __old
  • Global assertions and pre-/post-conditions
    • __global_requires
    • __global_ensures
    • __global_assert
    Constructs for reasoning about memory accesses