Advanced Features

Annotation language

Expert users can make use of a variety of builtin functions to annotate their code to aid verification. We now describe these annotations.

Assertions

You can write assertions in GPU kernels using the special function __assert. Such assertions will be checked by GPUVerify. For example, consider the following CUDA kernel:

#include <cuda.h>

__global__ void foo() {
  __assert(threadIdx.x + blockIdx.x != 27);
}

If we run GPUVerify specifying blockDim=8 and gridDim=8 then the tool verifies that the assertion cannot fail. However, for large block and grid sizes it can fail:

gpuverify --blockDim=64 --gridDim=64 failing-assert.cu
failing-assert.cu:4:3: error: this assertion might not hold for thread (5, 0, 0) group (22, 0, 0)

  __assert(threadIdx.x + blockIdx.x != 27);

failing-assert.cu:4:3: error: this assertion might not hold for thread (16, 0, 0) group (11, 0, 0)

  __assert(threadIdx.x + blockIdx.x != 27);

Loop invariants

GPUVerify may be unable to verify a kernel due to limitations in the tool’s invariant inference engine. In this case the user can supply invariants manually. However, any manually supplied invariant is checked by the tool.

If an assertion is placed exactly at the head of a loop, it is treated by GPUVerify as a user-supplied loop invariant. For example, in the following (dumb) OpenCL kernel:

__kernel void foo() {
  int i = 0;
  while(
    __assert(i <= 100) // Assertion at loop head treated as invariant
    , // Comma operator separates invariant from loop guard
    i < get_local_id(0) // This is the loop guard
  ) {
    i++;
  }
}

the assertion __assert(i <= 100) is taken to be a loop invariant because it appears directly at the head of the loop. The comma operator is used to allow invariant assertions to precede the loop guard. In fact, invariant assertions are actually supplied as part of the loop guard expression, but they do not affect the truth value of the loop guard.

If the above example is analysed with a sufficiently small local size, the invariant is verified as holding. However, for a local size larger than 101 the invariant is not maintained by the loop, and GPUVerify detects this potential problem:

gpuverify --local_size=128 --num_groups=16 assert-as-invariant.cl
assert-as-invariant.cl:4:5: error: loop invariant might not be maintained by the loop for thread (102, 0, 0) group (8, 0, 0)

    __assert(i <= 100) // Assertion at loop head treated as invariant

assert-as-invariant.cl:4:5: error: loop invariant might not be maintained by the loop for thread (101, 0, 0) group (0, 0, 0)

    __assert(i <= 100) // Assertion at loop head treated as invariant

For readability, one can write __invariant as a synonym for __assert. This is recommended to state the intention that a given assertion is an invariant rather than a plain assertion, but currently GPUVerify does not check that this intention is satisifed. In particular, if __invariant is used somewhere other than a loop head then it is treated as a plain assertion.

Multiple invariants for a loop can be specified using multiple __invariant or __assert commands at the loop head. The following example illustrates this, and also shows how an invariant can be specified for a for loop:

__kernel void foo() {
  int j = get_local_id(0) - 1;
  for(int i = get_local_id(0);
      __invariant(i < 200), // First invariant
      __assert(i >= 0),  // Second invariant
      __invariant(j == i - 1), // Third invariant
      i > 0; // Loop guard
      i--
  ) {
    j--;
  }
}

Notice that three invariants have been specified, two using __invariant and one using __assert. Notice also that with a for loop, the invariants come immediately before the guard, which is the middle component of the for loop’s specifier. Again, the comma operator is used to separate the invariants from the guard.

For sufficiently small local sizes these invariants hold. However, for a large local size, the first invariant will fail on loop entry, as reported by GPUVerify:

gpuverify --local_size=1024 --num_groups=16 invariant-for-loop.cl
invariant-for-loop.cl:4:7: error: loop invariant might not hold on entry for thread (512, 0, 0) group (0, 0, 0)

      __invariant(i < 200), // First invariant

invariant-for-loop.cl:4:7: error: loop invariant might not hold on entry for thread (576, 0, 0) group (8, 0, 0)

      __invariant(i < 200), // First invariant

Invariants are easy to specify for do-while loops: the invariants must simply appear as the first commands of the loop body:

__kernel void foo() {
  int j = get_local_id(0) - 1;
  int i = get_local_id(0);
  do {
    __invariant(i < 200); // First invariant
    __assert(i >= 0); // Second invariant
    __invariant(j == i - 1); // Third invariant
    i--;
    j--;
  } while(i > 0);
}

While programming using goto is not recommended, the goto keyword may be used in OpenCL. If you write a loop using goto and wish to specify invariants for this loop then the rule is the same as usual: place invariants at the loop head. This is illustrated for a goto version of the do-while example as follows:

__kernel void foo() {
  int j = get_local_id(0) - 1;
  int i = get_local_id(0);
head:
    __invariant(i < 200); // First invariant
    __assert(i >= 0); // Second invariant
    __invariant(j == i - 1); // Third invariant
    i--;
    j--;
    if(i > 0) goto head;
}

Pre-conditions

Sometimes a kernel is correct only for certain input parameter values. For example, the following CUDA kernel is not correct when executed by a single block of 1024 threads for arbitrary values of parameter sz:

#include <cuda.h>

__global__ void foo(float *A, int sz) {
  for(int i = 0; i < 100; i++) {
    A[sz*i + threadIdx.x] *= 2.0f;
  }
}

There will be data races if the absolute value of sz is less than the size of a thread block, or if the absolute value of sz is so large that overflow can occur. GPUVerify therefore rejects the kernel:

needs_requires.cu: error: possible write-read race on ((char*)A)[-62043744]:


needs_requires.cu:5:5: read by thread (512, 0, 0) group (0, 0, 0)

  A[sz*i + threadIdx.x] *= 2.0f;


needs_requires.cu:5:5: write by thread (520, 0, 0) group (0, 0, 0)

  A[sz*i + threadIdx.x] *= 2.0f;


needs_requires.cu: error: possible read-write race on ((char*)A)[37438500]:


needs_requires.cu:5:5: write by thread (0, 0, 0) group (0, 0, 0)

  A[sz*i + threadIdx.x] *= 2.0f;


needs_requires.cu:5:5: read by thread (1, 0, 0) group (0, 0, 0)

  A[sz*i + threadIdx.x] *= 2.0f;

If the programmer knows that sz should be equal to the x dimension of a thread block, they can specify this via a pre-condition on the kernel, using he __requires annotation:

#include <cuda.h>

__global__ void foo(float *A, int sz) {
  __requires(sz == blockDim.x);
  for(int i = 0; i < 100; i++) {
    A[sz*i + threadIdx.x] *= 2.0f;
  }
}

With this annotation, GPUVerify is able to verify race-freedom.

Rules for using ``__requires``: .. todo:: Be careful with short-circuit evaluation

Post-conditions

Todo

Need to describe __return_val and __old

Assume statements

Todo

Say that this should be used with extreme care

Global assertions and pre-/post-conditions

Todo

This will explain:

  • __global_requires
  • __global_ensures
  • __global_assert

Reasoning about memory accesses

Todo

__write_implies, etc. Be sure to comment on byte-level reasoning issue.

Looking into automatic invariant generation

Inspecting invariants generated by GPUVerify

If you are looking to extend GPUVerify’s invariant inference capabilities, it can be useful to look at the invariants the tool is generating.

To do this, run GPUVerify with the –keep-temps option. If you just want to see the candidate invariants the tool generates, and do not wish for verification to be attempted, you can use –stop-at-bpl to tell GPUVerify to stop once is has generated the Boogie program that contains candidate invariants.

Assuming that your kernel was called kernel.cl or kernel.cu, the –keep-temps option will lead to a file called kernel.bpl being created.

If you look in this file and search for :tag, you may find some assertions that have the :tag attribute. This attribute is a string indicating which invariant generation rule inside the GPUVerifyVCGen component of GPUVerify led to the candidate invariant being generated.

For example, if you try this on:

testsuite/OpenCL/test_mod_invariants/global_direct/kernel.cl

then in the .bpl file you will see something like:

$1:
  assert {:tag "accessedOffsetsSatisfyPredicates"} _b2 ==> _WRITE_HAS_OCCURRED_$$A ==> BV32_AND(BV32_SUB(256bv32, 1bv32), _WRITE_OFFSET_$$A) == BV32_AND(BV32_SUB(256bv32, 1bv32), BV32_ADD(BV32_MUL(group_size_x, group_id_x$1), local_id_x$1));
  assert {:tag "guardNonNeg"} {:thread 1} p0$1 ==> _b1 ==> BV32_SLE(0bv32, $i.0$1);
  assert {:tag "guardNonNeg"} {:thread 2} p0$2 ==> _b1 ==> BV32_SLE(0bv32, $i.0$2);
  assert {:tag "loopCounterIsStrided"} {:thread 1} p0$1 ==> _b0 ==> BV32_AND(BV32_SUB(256bv32, 1bv32), $i.0$1) == BV32_AND(BV32_SUB(256bv32, 1bv32), BV32_ADD(BV32_MUL(group_size_x, group_id_x$1), local_id_x$1));
  assert {:tag "loopCounterIsStrided"} {:thread 2} p0$2 ==> _b0 ==> BV32_AND(BV32_SUB(256bv32, 1bv32), $i.0$2) == BV32_AND(BV32_SUB(256bv32, 1bv32), BV32_ADD(BV32_MUL(group_size_x, group_id_x$2), local_id_x$2));
  assert {:tag "user"} {:originated_from_invariant} {:line 12} {:col 7} {:fname "kernel.cl"} {:dir "/Users/nafe/work/autobuild/mac/gpuverify/testsuite/OpenCL/test_mod_invariants/global_direct"} {:thread 1} p0$1 ==> _c0 ==> (if $i.0$1 == v0$1 then 1bv1 else 0bv1) != 0bv1;
  assert {:tag "user"} {:originated_from_invariant} {:line 12} {:col 7} {:fname "kernel.cl"} {:dir "/Users/nafe/work/autobuild/mac/gpuverify/testsuite/OpenCL/test_mod_invariants/global_direct"} {:thread 2} p0$2 ==> _c0 ==> (if $i.0$2 == v0$2 then 1bv1 else 0bv1) != 0bv1;

(Of course, the exact form and number of invariants generated may change, as it is sensitive to the test case in question and the current state of the candidate generation engine.)

The candidate invariants can be pretty difficult to read! They have not been designed for human consumption.

Watching GPUVerify eliminate invariants

Each invariant candidate has the form _bX ==> e, where _bX is a Boolean constant marked with the :existential attribute. This tells Boogie that _bX should be treated specially: it should be used by the Houdini algorithm to turn the given invariant on or off. Clearly if _bX is set to true, it is as if the invariant e appeared directly, while if _bX is set to false it is as if no invariant were present at all (because false ==> e is equivalent to the trivial invariant true).

Houdini starts by setting all of the _b variables to true and trying to prove that they form an inductive invariant for the program. If they do not, Houdini will kick out at least one candidate, by setting its _b variable to false, and the process continues.

To see this kicking out of candidates in action, you can run GPUVerify with –verbose and with --boogie-opt=/trace (see –boogie-opt=...). If you try this on:

testsuite/OpenCL/test_mod_invariants/global_direct/kernel.cl

You’ll see something like this in the output:

Verifying $foo
Houdini assignment axiom: (And true _b0 _b1 _b2 _b3 _b4 _b5 _b6 _b7 _b8)
Time taken = 0.156258
Removing _b6
Removing _b2
Verifying $foo
Houdini assignment axiom: (And true _b0 _b1 (! _b2) _b3 _b4 _b5 (! _b6) _b7 _b8)

Time taken = 0.0781293

The first Houdini assignment axiom sets all of the _b variables to true. Boogie then tries to verify the kernel, and it finds that _b6 and _b2 (i.e. the access lower bound candidates shown in Inspecting invariants generated by GPUVerify) do not hold. In the next Houdini iteration, these candidates are flipped to false. The remaining invariants are then found to hold. With this invariant, the tool then proceeds to do race analysis (not shown in the above snippet).

Barrier invariants

Todo

Simple example of barrier invariant usage

Uninterpreted functions

Todo

Show how UFs can be used to model operations abstractly

Procedure specifications

Todo

Show (currently hacky) way of providing a procedure spec without body

Passing arguments to Boogie

Todo

Give example use of --boogie-opt

Linking with Boogie files

Todo

Describe how an external .bpl file can be used.