OpenCL: To invoke GPUVerify on an OpenCL kernel, do:
Here, <work-group-dimensions> is a vector specifying the dimensionality of each work group, <grid-dimensions> is a vector specifying the dimensionality of the grid of work groups, and <OpenCL file> is an OpenCL file with extension .cl.
For example, to check kernel.cl with respect to a (16x16) grid of work groups, each of which is a (32x32) arrangement of work-items, do:
For brevity, a 1D vector (x) can be written as simply x. So, for instance, to check kernel.cl with respect to a single work group which is a (32x32) arrangement of work-items, you can specify:
Here, <block-dimensions> is a vector specifying the dimensionality of each thread block, <grid-dimensions> is a vector specifying the dimensionality of the grid of thread blocks, and <CUDA file> is a CUDA file with extension .cu.
For example, to check kernel.cu with respect to a (16x16) grid of thread blocks, each of which is a (32x32) arrangement of threads, do:
For brevity, a 1D vector (x) can be written as simply x. So, for instance, to check kernel.cl with respect to a single work group which is a (32x32) arrangement of work-items, you can specify:
There are two main usage modes:
By default, GPUVerify runs in verify mode. In this mode, the tool will attempt to verify that the input kernel is free from defects. If verification succeeds, the tool will report that the kernel has been verified as free from the types of defects which GPUVerify can check for. This verification result can be trusted, modulo bugs in GPUVerify and known sources of unsoundness in the tool. In due course this documentation will be updated to describe potential sources of unsoundness.
In verify mode, any defects reported by GPUVerify are possible defects, but if the kernel contains loops or multiple (non-inlined) procedures there is a high chance that they may be false positives arising due to limitations of the tool's invariant and contract inference procedures.
The --findbugs flag causes GPUVerify to run in bug-finding mode. In this mode, all loops are unwound by a fixed (default=2) number of iterations; executions that go beyond this unwinding depth are not checked.
In bug-finding mode, defects reported by GPUVerify are much more likely to be genuine than when the tool runs in verify mode, because loops are not abstracted using invariants. However, GPUVerify can still report false positives because:
GPUVerify checks for intra-group data races. In OpenCL terms, an intra-group data race is a race between work items in the same work group. In CUDA terms an intra-group data race is a race between threads in the same thread block.
Suppose the following OpenCL kernel is executed by a single work group consisting of 1024 work items:
An intra-group data race can occur between work items 0 and 1023. If we run GPUVerify on the example:
GPUVerify checks for inter-group data races. In OpenCL terms, an inter-group data race is a race between work items in differentwork groups. In CUDA terms an inter-group data race is a race between threads in different thread blocks.
Suppose the following CUDA kernel is executed by 8 thread blocks each consisting of 64 work items:
The kernel is free from intra-group data races, but inter-group data race can occur between threads in different blocks that have identical intra-block thread indices. If we run GPUVerify on the example:
GPUVerify detects cases where a kernel breaks the rules for barrier synchronization in conditional code defined in the CUDA and OpenCL documentation. In particular, the tool checks that if a barrier occurs in a conditional statement then all threads must evaluate the condition uniformly, and if a barrier occurs inside a loop then all threads must execute the same number of loop iterations before synchronizing at the barrier.
GPUVerify rejects the following OpenCL kernel, executed by a single work group of 1024 work items, because work items will execute different numbers of loop iterations, breaking the barrier synchronization rules:
GPUVerify rejects the following CUDA kernel when, say, executed by a 32x32 grid of 16x16 thread blocks. The reason is that the tool assumes the contents of array p are arbitrary, so there is no guarantee that all threads in a thread block will reach the same barrier:
Display brief help messages for command line options.
Add directory to include search path.
Define symbol.
Run tool in bug-finding mode.
Explore traces that pass through at most X loop heads
Do not tolerate benign data races.
Turn off invariant inference.
Only check for barrier divergence, not for races.
For example,
Do not check for inter-group races
Run tool in verify mode.
Show commands to run and use verbose output.
You can write assertions in GPU kernels using the special function __assert. Such assertions will be checked by GPUVerify. For example, consider the following CUDA kernel:
If we run GPUVerify specifying blockDim=8 and gridDim=8 then the tool verifies that the assertion cannot fail. However, for large block and grid sizes it can fail:
GPUVerify may be unable to verify a kernel due to limitations in the tool's invariant inference engine. In this case the user can supply invariants manually. However, any manually supplied invariant is checked by the tool.
If an assertion is placed exactly at the head of a loop, it is treated by GPUVerify as a user-supplied loop invariant. For example, in the following (dumb) OpenCL kernel:
the assertion __assert(i <= 100) is taken to be a loop invariant because it appears directly at the head of the loop. The comma operator is used to allow invariant assertions to precede the loop guard. In fact, invariant assertions are actually supplied as part of the loop guard expression, but they do not affect the truth value of the loop guard.
If the above example is analysed with a sufficiently small local size, the invariant is verified as holding. However, for a local size larger than 128 the invariant is not maintained by the loop, and GPUVerify detects this potential problem:
For readability, one can write __invariant as a synonym for __assert. This is recommended to state the intention that a given assertion is an invariant rather than a plain assertion, but currently GPUVerify does not check that this intention is satisifed. In particular, if __invariant is used somewhere other than a loop head then it is treated as a plain assertion.
Multiple invariants for a loop can be specified using multiple __invariant or __assert commands at the loop head. The following example illustrates this, and also shows how an invariant can be specified for a for loop:
Notice that three invariants have been specified, two using __invariant and one using __assert. Notice also that with a for loop, the invariants come immediately before the guard, which is the middle component of the for loop's specifier. Again, the comma operator is used to separate the invariants from the guard.
For sufficiently small local sizes these invariants hold. However, for a large local size, the first invariant will fail on loop entry, as reported by GPUVerify:
Invariants are easy to specify for do-while loops: the invariants must simply appear as the first commands of the loop body:
While programming using goto is not recommended, the goto keyword may be used in OpenCL. If you write a loop using goto and wish to specify invariants for this loop then the rule is the same as usual: place invariants at the loop head. This is illustrated for a goto version of the do-while example as follows:
Sometimes a kernel is correct only for certain input parameter values. For example, the following CUDA kernel is not correct when executed by a single block of 1024 threads for arbitrary values of parameter sz:
There will be data races if the absolute value of sz is less than the size of a thread block, or if the absolute value of sz is so large that overflow can occur. GPUVerify therefore rejects the kernel:
If the programmer knows that sz should be equal to the x dimension of a thread block, they can specify this via a pre-condition on the kernel, using he __requires annotation:
With this annotation, GPUVerify is able to verify race-freedom.
Rules for using __requires: