Overhauling SC atomics in C11 and OpenCL (Additional companion material)

We have formalised three versions of the C11 memory model in the .cat format: the original model, one with a partial order over SC operations, and one that is simpler and a little stronger (see Sec 3).
We have formalised two versions of the OpenCL 2.0 memory model in the .cat format: the original model, and one that uses a 'scoped SC' axiom (see Sec 4).
All of these memory models can be explored using the Herd memory model simulator (see Sec 2). We provide several litmus tests for exploring the models (see Sec 3 and Sec 4).
Our c11_partialSC.cat formalisation of the C11 memory model is equivalent to Batty et al.'s original Lem formalisation. We justify this claim by providing a proof mechanised in the HOL theorem prover (see Sec 5).

1. Virtual machine

We provide a VirtualBox virtual machine pre-installed with Herd (for simulating our various formalisations of the C11 and OpenCL memory models) and the HOL theorem prover (for reproducing our proof that our Herd formalisation of the C11 memory model is equivalent to Batty et al.'s earlier formalisation in Lem). The username is herd and the password is letmein.

Virtual machine

Alternative: building Herd from source. Herd is provided on the virtual machine, but can alternatively be built from source. The source code is available to download from a Github repository.

Herd source code

To build the Herd executable, simply run make in the herd directory.

2. Using the Herd simulator

To instruct Herd to exhaustively simulate a given litmus test (say, foo.litmus) according to a given model definition (say, bar.cat), issue the following command (while in the same directory as bar.cat):

  ~/herdtools/herd -initwrites true -model bar.cat path/to/foo.litmus

To display the executions in graphical form, issue the following command, which launches Ghostview in an X11 session and sets some reasonable configuration options:

  ~/herdtools/herd -initwrites true -show all -gv -graph columns ↵
   -showinitwrites false -squished true -scale 2 -symetric dr ↵
  -model bar.cat path/to/foo.litmus

The various .cat files are located in the home directory on our virtual machine; they are also available to download via the links in the following two sections. The various .litmus files are located in the ~/litmustests directory on our virtual machine; they are also available to download via the links in the following two sections.

3. Three formalisations of the C11 memory model

The following .cat file describes the original C11 memory model:

c11_orig.cat

The following .cat file describes the C11 memory model and uses a partial order over SC operations:

c11_partialSC.cat

The following .cat file describes a simpler and slightly stronger version of the C11 memory model:

c11_simp.cat

Those definitions that are common to the three files above are factored out into a separate .cat file that is then imported by those files:

c11_base.cat

We provide several litmus tests for exploring the model. Those prefixed with 'cppmem' are ported directly from examples distributed with the CppMem tool.

example1.litmus
This test corresponds to Example 1 in the paper.
mp_relacq.litmus
This illustrates the basic release/acquire message-passing idiom.
mp_relaxed.litmus
This uses relaxed accesses rather than release/acquire. This is insufficient to prevent a race on x.
mp_fences.litmus
This, like the previous test, uses relaxed accesses, but adds fences to restore the release/acquire synchronisation.
mp_relacq_rs.litmus
This extends the basic release/acquire message-passing idiom with an extra (relaxed) write to y that appears in the release sequence.
mp_sc.litmus
This strengthens the basic release/acquire message-passing idiom to use SC accesses, thus showing how these accesses can be used to perform release/acquire synchronisation.
cppmem_iriw_relacq.litmus
This shows that the two 'reading' threads can observe the writes to x and y in different orders.
iriw_sc.litmus
This, in contrast to the previous test, uses the SC memory order for all accesses. This prevents the 'reading' threads observing the writes in different orders.
cppmem_partial_sb.litmus
This provides an example of the partiality of the sequenced-before relation.
cppmem_sc_atomics.litmus
This provides a simple example of conflicting memory accesses that do not race because they are atomic.
cppmem_unsequencedrace.litmus
This illustrates an unsequenced race: a data race within a single thread.

4. Two formalisations of the OpenCL memory model

The following .cat file describes the OpenCL 2.0 memory model:

opencl.cat

The following .cat file describes the OpenCL 2.0 memory model after our scoped SC axiom has been applied:

opencl_scopedsc.cat

Those definitions that are common to both files above are factored out into a separate .cat file that is then imported by both files:

opencl_base.cat

We provide several litmus tests for exploring the model:

example4.litmus
This test corresponds to Example 4 in the paper.
example5.litmus
This test corresponds to Example 5 in the paper.
example6.litmus
This test corresponds to Example 6 in the paper.
example7a.litmus, example7b.litmus
These tests corresponds to Example 7 in the paper.
example8.litmus
This test corresponds to Example 8 in the paper.
example9a.litmus, example9b.litmus
These tests corresponds to Example 9 in the paper.
example10.litmus
This test corresponds to Example 10 in the paper.
MP_ra_dev.litmus
Release/acquire synchronisation on the atomic flag y prevents reading a stale value of x. The use of device scope is sufficient here because the two threads are executing on the same device.
MP_ra_wg.litmus
This test is the same as the previous one, but uses work-group scope instead of device scope when accessing the atomic flag y. This is insufficient because the two threads are in different work-groups, and causes the program to be faulty.
MP_ra_dev_broken.litmus
This test is the same as the MP_ra_dev.litmus test above, but the two threads are in different devices. This means that the release/acquire synchronisation at device scope is no longer sufficient. Hence, the program is faulty.
MP_sc_dev.litmus
This test is the same as the MP_ra_dev.litmus test above, but it uses SC atomics to access the atomic flag y. This is sufficient to induce synchronisation because the SC memory order subsumes both the release and the acquire memory orders.
ISA2.litmus
This test demonstrates cumulativity in OpenCL. The objective is to transfer the message at non-atomic location x from thread P0 to thread P2. This is achieved by release/acquire synchronisation between P0 and thread P1 in the same work-group (on the atomic flag y, using work-group scope), followed by release/acquire synchronisation between P1 and P2 in a different work-group (on the atomic flag z, using device scope).
ISA2_broken.litmus
This test is the same as the previous one, but puts the atomic flag y in local rather than global memory. This inhibits the global happens-before edges that are necessary to separate the non-atomic accesses to x, thus causing a data race, which renders the whole program faulty.
IRIW_sc_wg.litmus
This is the same as the iriw_sc.litmus test given in C11 above, but the SC operations are annotated with work-group scope. In the original opencl.cat model, the SC ordering constraints do not apply, so the reading threads can see the writes to x and y in two different orders. In the revised opencl_scopedsc.cat model, the SC ordering constraints do apply, and the relaxed behaviour is forbidden.
IRIW_sc_dev.litmus
This test is the same as the IRIW_sc_wg.litmus test above, except that the SC operations are now upgraded to device scope. Both OpenCL models now disallow the relaxed behaviour.

5. Equivalence to Batty et al.

We provide a proof in HOL 4 that our c11_partialSC.cat formalisation of the C11 memory model is equivalent to Batty et al.'s original Lem formalisation. The proof is located in the ~/equivalence_proof directory on our virtual machine, and is also available to download directly:

HOL 4 proof script

To read the statement of the proof, refer to lemHerdEquivalent_paper in the file cpp11soundness_paper.lem.

Remark. We point out that this proof of equivalence relates the Lem formalisation not to c11_orig.cat, but to c11_partialSC.cat. The essential difference between those two .cat files, which enables the 'partial SC' innovation, is that the former uses axiom S4 [see Defn. 11 of the paper] where the latter uses S4a [see Sec 3.1 of the paper]. In a slight departure from the text of the standard, Batty et al.'s original Lem formalisation actually uses axiom S4a. This explains why the equivalence is with respect to c11_partialSC.cat.

6. Suggested textual changes

6a. Suggested changes to the C11 specification

Our paper presents a proposal for simplifying the sequential consistency axioms in the C11 model. We give here our suggestion for how the specification document can be rephrased to accommodate our proposal, while maintaining the prose style used throughout the rest of the document.

The following text is reproduced verbatim from the C11 standard (Section 7.17.3, paragraphs 6 and 9–11):

6. There shall be a single total order S on all memory_order_seq_cst operations, consistent with the "happens before" order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values:

the result of the last modification A of M that precedes B in S, if it exists, or

if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst and that does not happen before A, or

if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst.

[...]

9. For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.

10. For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

11. For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order

We propose that the paragraphs quoted above be removed and replaced with the following three:

1. A value computation A of an object M "reads before" a side effect B on M if B follows, in the modification order of M, the side effect that A observes.

If X reads before Y, or happens before Y, or precedes Y in modification order, then X (and any fences sequenced before X) is "SC-before" Y (and any fences sequenced after Y).

3. There shall be a single total order S on all memory_order_seq_cst operations, consistent with the SC-before order.

6b. Suggested changes to the OpenCL specification

Our paper presents a proposal for simplifying the sequential consistency axioms in the OpenCL model. We give here our suggestion for how the specification document can be rephrased to accommodate our proposal, while maintaining the prose style used throughout the rest of the document.

The following text is reproduced verbatim from the OpenCL 2.1 standard (51/14–52/13):

If one of the following two conditions holds:

All memory_order_seq_cst operations have the scope memory_scope_all_svm_devices and all affected memory locations are contained in system allocations or fine grain SVM buffers with atomics support

All memory_order_seq_cst operations have the scope memory_scope_device and all affected memory locations are not located in system allocated regions or fine-grain SVM buffers with atomics support

then there shall exist a single total order S for all memory_order_seq_cst operations that is consistent with the modification orders for all affected locations, as well as the appropriate global-happens-before and local-happens-before orders for those locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M in global or local memory observes one of the following values:

the result of the last modification A of M that precedes B in S, if it exists, or

if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst and that does not happen before A, or

if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst.

[...]

If the total order S exists, the following rules hold:

For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced-before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.

For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced-before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced-before X, Y is sequenced-before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.

For atomic operations A and B on an atomic object M, if there are memory_order_seq_cst fences X and Y such that A is sequenced-before X, Y is sequenced-before B, and X precedes Y in S, then B occurs later than A in the modification order of M.