Taming the complexities of the C11 and OpenCL memory models ‐ Additional companion material

This webpage contains the appendices and additional material for the paper entitled Taming the complexities of the C11 and OpenCL memory models by John Wickerson and Mark Batty.

OpenCL 2.0 specification documents

We provide copies of the OpenCL 2.0 specification documents for the convenience of reviewers. The documents in the Khronos registry have been slightly revised recently, which may invalidate the line and page numbers that we refer to in our paper.

Obtaining and using the herd simulator

The source code of herd is available to download from a github repository.

To build the herd executable, simply run make in the herd directory. To instruct herd to exhaustively simulate a given litmus test (say, foo.litmus) according to a given model definition (say, bar.cat), issue the following command:
  ./herd -model path/to/bar.cat path/to/foo.litmus
  
To display the executions in graphical form, issue the following command, which launches Ghostview in an X11 session and sets some reasonable configuration options:
  ./herd -model path/to/bar.cat -show all -gv -graph columns ↵
  -initwrites true -showinitwrites false -squished true ↵
  -scale 2 -symetric dr path/to/foo.litmus
  
The sections below provide links to download our .cat models and some .litmus tests for experimenting with our models.

Two formalisations of the C11 memory model

The following .cat file describes the C11 memory model and uses a partial order over SC operations:

The following .cat file describes a simpler and slightly stronger version of the C11 memory model:

Those definitions that are common to both c11_partialSC.cat and c11_simp.cat are factored out into a separate .cat file that is then imported by both:

We provide several litmus tests for exploring the model. Those prefixed with 'cppmem' are ported directly from examples distributed with the CppMem tool.
  1. mp_relacq.litmus
    This illustrates the basic release/acquire message-passing idiom.
  2. mp_relaxed.litmus
    This uses relaxed accesses rather than release/acquire. This is insufficient to prevent a race on x.
  3. mp_fences.litmus
    This, like the previous test, uses relaxed accesses, but adds fences to restore the release/acquire synchronisation.
  4. mp_relacq_rs.litmus
    This extends the basic release/acquire message-passing idiom with an extra (relaxed) write to y that appears in the release sequence.
  5. mp_sc.litmus
    This strengthens the basic release/acquire message-passing idiom to use SC accesses, thus showing how these accesses can be used to perform release/acquire synchronisation.
  6. cppmem_iriw_relacq.litmus
    This shows that the two 'reading' threads (P2) and (P3) can observe the writes to x and y in different orders.
  7. iriw_sc.litmus
    This, in contrast to the previous test, uses the SC memory order for all accesses. This prevents the 'reading' threads observing the writes in different orders.
  8. cppmem_partial_sb.litmus
    This provides an example of the partiality of the sequenced-before relation.
  9. cppmem_sc_atomics.litmus
    This provides a simple example of conflicting memory accesses that do not race because they are atomic.
  10. cppmem_unsequencedrace.litmus
    This illustrates an unsequenced race: a data race within a single thread.

Proof of equivalence to Batty's original formalisation

We provide the HOL4 proof script that describes our mechanised proof of equivalence between c11_partialSC.cat and Batty's original formalisation in Lem.

Formalisation of OpenCL 2.0 memory model

The following .cat file describes the OpenCL 2.0 memory model: We provide several litmus tests for exploring the model:
  1. MP_ra_dev.litmus
    This test corresponds to Example 2 in the paper. Release/acquire synchronisation on the atomic flag y prevents reading a stale value of x. The use of device scope is sufficient here because the two threads are executing on the same device.
  2. MP_ra_wg.litmus
    This test is the same as the previous one, but uses work-group scope instead of device scope when accessing the atomic flag y. This is insufficient because the two threads are in different work-groups, and causes the program to be faulty.
  3. MP_ra_dev_broken.litmus
    This test corresponds to Example 3 in the paper. It is the same as the MP_ra_dev.litmus test above, but the two threads are in different devices. This means that the release/acquire synchronisation at device scope is no longer sufficient. Hence, the program is faulty.
  4. MP_sc_dev.litmus
    This test is the same as the MP_ra_dev.litmus test above, but it uses SC atomics to access the atomic flag y. This is sufficient to induce synchronisation because the SC memory order subsumes both the release and the acquire memory orders.
  5. MP_ra_dev_localflag.litmus
    This test corresponds to Example 4 in the paper. It is the same as the MP_ra_dev.litmus test above, but puts the atomic flag y in local rather than global memory. This inhibits the global happens-before edges that are necessary to separate the non-atomic accesses to x, thus causing a data race, which renders the whole program faulty.
  6. MP_ra_sc_global_local.litmus
    This test corresponds to Example 5 in the paper. It is the same as the faulty MP_ra_dev_localflag.litmus test above, but it inserts global-and-local fences in order to avoid the data race on x. Thanks to the global-and-local fences, the release/acquire synchronisation on the local flag y induces a global happens-before edge.
  7. ISA2.litmus
    This test corresponds to Example 6 in the paper. The objective is to transfer the message at non-atomic location x from thread P0 to thread P2. This is achieved by release/acquire synchronisation between P0 and thread P1 in the same work-group, on the atomic flag y using work-group scope, followed by release/acquire synchronisation between P1 and P2 in a different work-group, on the atomic flag z using device scope.
  8. ISA2_broken.litmus
    This test is the same as the previous one, but puts the atomic flag y in local rather than global memory. This inhibits the global happens-before edges that are necessary to separate the non-atomic accesses to x, thus causing a data race, which renders the whole program faulty.
  9. IRIW_sc_wg.litmus
    This test corresponds to Example 7 in the paper. It is the same as the iriw_sc.litmus test given in C11 above, but the SC operations are annotated with work-group scope. This prevents the SC operations from being chained into a total order, and thus allows the reading threads to see the writes to x and y in two different orders.
  10. IRIW_sc_dev.litmus
    This test is the same as the IRIW_sc_wg.litmus test above, except that the SC operations are now upgraded to device scope. This means that the SC operations are now chained into a total order, and the relaxed behaviour exhibited in the previous test is no longer permitted.