Overhauling SC atomics in C11 and OpenCL

This webpage contains additional material for the paper Overhauling SC atomics in C11 and OpenCL by Mark Batty, Alastair F. Donaldson, and John Wickerson.

Contents

1. Virtual machine

We provide a VirtualBox virtual machine pre-installed with Herd (for simulating our various formalisations of the C11 and OpenCL memory models) and the HOL theorem prover (for reproducing our proof that our Herd formalisation of the C11 memory model is equivalent to Batty et al.'s earlier formalisation in Lem). The username is herd and the password is letmein.

Alternative: building Herd from source. Herd is provided on the virtual machine, but can alternatively be built from source. The source code is available to download from a Github repository.

To build the Herd executable, simply run make in the herd directory.

2. Using the Herd simulator

To instruct Herd to exhaustively simulate a given litmus test (say, foo.litmus) according to a given model definition (say, bar.cat), issue the following command (while in the same directory as bar.cat):
  ~/herdtools/herd -initwrites true -model bar.cat path/to/foo.litmus
To display the executions in graphical form, issue the following command, which launches Ghostview in an X11 session and sets some reasonable configuration options:
  ~/herdtools/herd -initwrites true -show all -gv -graph columns ↵
   -showinitwrites false -squished true -scale 2 -symetric dr ↵
  -model bar.cat path/to/foo.litmus
The various .cat files are located in the home directory on our virtual machine; they are also available to download via the links in the following two sections. The various .litmus files are located in the ~/litmustests directory on our virtual machine; they are also available to download via the links in the following two sections.

3. Three formalisations of the C11 memory model

The following .cat file describes the original C11 memory model:

The following .cat file describes the C11 memory model and uses a partial order over SC operations:

The following .cat file describes a simpler and slightly stronger version of the C11 memory model:

Those definitions that are common to the three files above are factored out into a separate .cat file that is then imported by those files:

We provide several litmus tests for exploring the model. Those prefixed with 'cppmem' are ported directly from examples distributed with the CppMem tool.
  1. example1.litmus
    This test corresponds to Example 1 in the paper.
  2. mp_relacq.litmus
    This illustrates the basic release/acquire message-passing idiom.
  3. mp_relaxed.litmus
    This uses relaxed accesses rather than release/acquire. This is insufficient to prevent a race on x.
  4. mp_fences.litmus
    This, like the previous test, uses relaxed accesses, but adds fences to restore the release/acquire synchronisation.
  5. mp_relacq_rs.litmus
    This extends the basic release/acquire message-passing idiom with an extra (relaxed) write to y that appears in the release sequence.
  6. mp_sc.litmus
    This strengthens the basic release/acquire message-passing idiom to use SC accesses, thus showing how these accesses can be used to perform release/acquire synchronisation.
  7. cppmem_iriw_relacq.litmus
    This shows that the two 'reading' threads can observe the writes to x and y in different orders.
  8. iriw_sc.litmus
    This, in contrast to the previous test, uses the SC memory order for all accesses. This prevents the 'reading' threads observing the writes in different orders.
  9. cppmem_partial_sb.litmus
    This provides an example of the partiality of the sequenced-before relation.
  10. cppmem_sc_atomics.litmus
    This provides a simple example of conflicting memory accesses that do not race because they are atomic.
  11. cppmem_unsequencedrace.litmus
    This illustrates an unsequenced race: a data race within a single thread.

4. Two formalisations of the OpenCL memory model

The following .cat file describes the OpenCL 2.0 memory model: The following .cat file describes the OpenCL 2.0 memory model after our scoped SC axiom has been applied:

Those definitions that are common to both files above are factored out into a separate .cat file that is then imported by both files:

We provide several litmus tests for exploring the model:
  1. example4.litmus
    This test corresponds to Example 4 in the paper.
  2. example5.litmus
    This test corresponds to Example 5 in the paper.
  3. example6.litmus
    This test corresponds to Example 6 in the paper.
  4. example7a.litmus, example7b.litmus
    These tests corresponds to Example 7 in the paper.
  5. example8.litmus
    This test corresponds to Example 8 in the paper.
  6. example9a.litmus, example9b.litmus
    These tests corresponds to Example 9 in the paper.
  7. example10.litmus
    This test corresponds to Example 10 in the paper.
  8. MP_ra_dev.litmus
    Release/acquire synchronisation on the atomic flag y prevents reading a stale value of x. The use of device scope is sufficient here because the two threads are executing on the same device.
  9. MP_ra_wg.litmus
    This test is the same as the previous one, but uses work-group scope instead of device scope when accessing the atomic flag y. This is insufficient because the two threads are in different work-groups, and causes the program to be faulty.
  10. MP_ra_dev_broken.litmus
    This test is the same as the MP_ra_dev.litmus test above, but the two threads are in different devices. This means that the release/acquire synchronisation at device scope is no longer sufficient. Hence, the program is faulty.
  11. MP_sc_dev.litmus
    This test is the same as the MP_ra_dev.litmus test above, but it uses SC atomics to access the atomic flag y. This is sufficient to induce synchronisation because the SC memory order subsumes both the release and the acquire memory orders.
  12. ISA2.litmus
    This test demonstrates cumulativity in OpenCL. The objective is to transfer the message at non-atomic location x from thread P0 to thread P2. This is achieved by release/acquire synchronisation between P0 and thread P1 in the same work-group (on the atomic flag y, using work-group scope), followed by release/acquire synchronisation between P1 and P2 in a different work-group (on the atomic flag z, using device scope).
  13. ISA2_broken.litmus
    This test is the same as the previous one, but puts the atomic flag y in local rather than global memory. This inhibits the global happens-before edges that are necessary to separate the non-atomic accesses to x, thus causing a data race, which renders the whole program faulty.
  14. IRIW_sc_wg.litmus
    This is the same as the iriw_sc.litmus test given in C11 above, but the SC operations are annotated with work-group scope. In the original opencl.cat model, the SC ordering constraints do not apply, so the reading threads can see the writes to x and y in two different orders. In the revised opencl_scopedsc.cat model, the SC ordering constraints do apply, and the relaxed behaviour is forbidden.
  15. IRIW_sc_dev.litmus
    This test is the same as the IRIW_sc_wg.litmus test above, except that the SC operations are now upgraded to device scope. Both OpenCL models now disallow the relaxed behaviour.

5. Equivalence to Batty et al.

We provide a proof in HOL 4 that our c11_partialSC.cat formalisation of the C11 memory model is equivalent to Batty et al.'s original Lem formalisation. The proof is located in the ~/equivalence_proof directory on our virtual machine, and is also available to download directly:

To read the statement of the proof, refer to lemHerdEquivalent_paper in the file cpp11soundness_paper.lem.

Remark. We point out that this proof of equivalence relates the Lem formalisation not to c11_orig.cat, but to c11_partialSC.cat. The essential difference between those two .cat files, which enables the 'partial SC' innovation, is that the former uses axiom S4 [see Defn. 11 of the paper] where the latter uses S4a [see Sec 3.1 of the paper]. In a slight departure from the text of the standard, Batty et al.'s original Lem formalisation actually uses axiom S4a. This explains why the equivalence is with respect to c11_partialSC.cat.

6. Suggested textual changes

6a. Suggested changes to the C11 specification

Our paper presents a proposal for simplifying the sequential consistency axioms in the C11 model. We give here our suggestion for how the specification document can be rephrased to accommodate our proposal, while maintaining the prose style used throughout the rest of the document.

The following text is reproduced verbatim from the C11 standard (Section 7.17.3, paragraphs 6 and 9–11):

6. There shall be a single total order S on all memory_order_seq_cst operations, consistent with the "happens before" order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values:

[...]

9. For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.

10. For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

11. For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order

We propose that the paragraphs quoted above be removed and replaced with the following three:

1. A value computation A of an object M "reads before" a side effect B on M if B follows, in the modification order of M, the side effect that A observes.

If X reads before Y, or happens before Y, or precedes Y in modification order, then X (and any fences sequenced before X) is "SC-before" Y (and any fences sequenced after Y).

3. There shall be a single total order S on all memory_order_seq_cst operations, consistent with the SC-before order.

6b. Suggested changes to the OpenCL specification

Our paper presents a proposal for simplifying the sequential consistency axioms in the OpenCL model. We give here our suggestion for how the specification document can be rephrased to accommodate our proposal, while maintaining the prose style used throughout the rest of the document.

The following text is reproduced verbatim from the OpenCL 2.1 standard (51/14–52/13):

If one of the following two conditions holds:

then there shall exist a single total order S for all memory_order_seq_cst operations that is consistent with the modification orders for all affected locations, as well as the appropriate global-happens-before and local-happens-before orders for those locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M in global or local memory observes one of the following values:

[...]

If the total order S exists, the following rules hold:

We propose that the paragraphs quoted above be removed and replaced with the following three:

1. A value computation A of an object M "reads before" a side effect B on M if B follows, in the modification order of M, the side effect that A observes.

If X reads before Y, or global or local happens before Y, or precedes Y in modification order, then X (and any fences sequenced before X) is "SC-before" Y (and any fences sequenced after Y).

3. There shall be a single total order S on all memory_order_seq_cst operations, consistent with the SC-before order restricted to operations with inclusive scopes.