This webpage contains additional material for the paper Overhauling SC atomics in C11 and OpenCL by Mark Batty, Alastair F. Donaldson, and John Wickerson.
.cat
format: the original model, one with a partial
order over SC operations, and one that is simpler and a little
stronger (see Sec 3). .cat
format: the original model, and one that uses a
'scoped SC' axiom (see Sec 4).c11_partialSC.cat
formalisation of the C11
memory model is equivalent to Batty et al.'s original Lem
formalisation. We justify this claim by providing a proof mechanised
in the HOL theorem prover (see Sec 5).
We provide a VirtualBox virtual machine pre-installed with Herd (for
simulating our various formalisations of the C11 and OpenCL memory
models) and the HOL theorem prover (for reproducing our proof that
our Herd formalisation of the C11 memory model is equivalent to
Batty et al.'s earlier formalisation in Lem). The username is
herd
and the password is letmein
.
Alternative: building Herd from source. Herd is provided on the virtual machine, but can alternatively be built from source. The source code is available to download from a Github repository.
To build the Herd executable, simply runmake
in the
herd
directory.
foo.litmus
)
according to a given model definition (say, bar.cat
),
issue the following command (while in the same directory as bar.cat
):
~/herdtools/herd -initwrites true -model bar.cat path/to/foo.litmusTo display the executions in graphical form, issue the following command, which launches Ghostview in an X11 session and sets some reasonable configuration options:
~/herdtools/herd -initwrites true -show all -gv -graph columns ↵ -showinitwrites false -squished true -scale 2 -symetric dr ↵ -model bar.cat path/to/foo.litmusThe various
.cat
files are located in the home
directory on our virtual machine; they are also available to
download via the links in the following two sections. The various
.litmus
files are located in the
~/litmustests
directory
on our virtual machine; they are also available to download via the
links in the following two sections.
The following .cat
file describes the original C11 memory model:
The following .cat
file describes the C11 memory
model and uses a partial order over SC operations:
The following .cat
file describes a simpler and
slightly stronger version of the C11 memory model:
Those definitions that are common to the three files above are
factored out into a separate .cat
file that is then
imported by those files:
cppmem
' are ported directly from examples
distributed with the CppMem
tool.
example1.litmus
mp_relacq.litmus
mp_relaxed.litmus
x
.mp_fences.litmus
mp_relacq_rs.litmus
y
that appears in the
release sequence.mp_sc.litmus
cppmem_iriw_relacq.litmus
x
and
y
in different orders.iriw_sc.litmus
cppmem_partial_sb.litmus
cppmem_sc_atomics.litmus
cppmem_unsequencedrace.litmus
.cat
file describes the OpenCL 2.0 memory
model:
The following .cat
file describes the OpenCL 2.0 memory
model after our scoped SC axiom has been applied:
Those definitions that are common to both files above are
factored out into a separate .cat
file that is then
imported by both files:
example4.litmus
example5.litmus
example6.litmus
example7a.litmus
, example7b.litmus
example8.litmus
example9a.litmus
, example9b.litmus
example10.litmus
MP_ra_dev.litmus
y
prevents reading a
stale value of x
. The use of device scope is sufficient
here because the two threads are executing on the same device.MP_ra_wg.litmus
y
. This is insufficient because the two threads are in
different work-groups, and causes the program to be faulty.MP_ra_dev_broken.litmus
MP_ra_dev.litmus
test above, but the two threads are in different devices. This means
that the release/acquire
synchronisation at device scope is no longer sufficient. Hence, the
program is faulty.MP_sc_dev.litmus
MP_ra_dev.litmus
test above,
but it uses SC atomics to access the atomic flag
y
. This is sufficient to induce synchronisation because
the SC memory order subsumes both the release and the acquire memory
orders.ISA2.litmus
x
from
thread P0
to thread P2
. This is achieved
by release/acquire synchronisation between P0
and
thread P1
in the same work-group (on the atomic flag
y
, using work-group scope), followed by release/acquire
synchronisation between P1
and P2
in a
different work-group (on the atomic flag z
, using device
scope). ISA2_broken.litmus
y
in local rather than global memory. This inhibits the
global happens-before edges that are necessary to separate the
non-atomic accesses to x
, thus causing a data race,
which renders the whole program faulty.
IRIW_sc_wg.litmus
iriw_sc.litmus
test given in C11 above, but the SC
operations are annotated with work-group scope. In the original
opencl.cat
model, the SC ordering constraints do not
apply, so the reading threads can see the writes to x
and
y
in two different orders. In the revised
opencl_scopedsc.cat
model, the SC ordering constraints
do apply, and the relaxed behaviour is forbidden.IRIW_sc_dev.litmus
IRIW_sc_wg.litmus
test above, except
that the SC operations are now upgraded to device scope. Both OpenCL
models now disallow the relaxed behaviour.c11_partialSC.cat
formalisation of the C11 memory model is equivalent to Batty et al.'s original Lem
formalisation. The proof is located in the
~/equivalence_proof
directory on our virtual machine, and
is also available to download directly:
To read the statement of the proof, refer to
lemHerdEquivalent_paper
in the file cpp11soundness_paper.lem
.
Remark. We point out that this proof of equivalence relates the Lem
formalisation not to c11_orig.cat
, but to
c11_partialSC.cat
. The essential difference between those
two .cat
files, which enables the 'partial SC' innovation, is that the former uses axiom S4 [see
Defn. 11 of the paper] where the latter uses S4a [see Sec 3.1 of the
paper]. In a slight departure from the text of the standard, Batty et al.'s original Lem formalisation
actually uses axiom S4a. This explains why the equivalence is with
respect to c11_partialSC.cat
.
Our paper presents a proposal for simplifying the sequential consistency axioms in the C11 model. We give here our suggestion for how the specification document can be rephrased to accommodate our proposal, while maintaining the prose style used throughout the rest of the document.
The following text is reproduced verbatim from the C11 standard (Section 7.17.3, paragraphs 6 and 9–11):
We propose that the paragraphs quoted above be removed and replaced with the following three:6. There shall be a single total order S on all
memory_order_seq_cst
operations, consistent with the "happens before" order and modification orders for all affected locations, such that eachmemory_order_seq_cst
operation B that loads a value from an atomic object M observes one of the following values:
- the result of the last modification A of M that precedes B in S, if it exists, or
- if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not
memory_order_seq_cst
and that does not happen before A, or- if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not
memory_order_seq_cst
.[...]
9. For an atomic operation B that reads the value of an atomic object M, if there is a
memory_order_seq_cst
fence X sequenced before B, then B observes either the lastmemory_order_seq_cst
modification of M preceding X in the total order S or a later modification of M in its modification order.10. For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a
memory_order_seq_cst
fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.11. For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are
memory_order_seq_cst
fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order
1. A value computation A of an object M "reads before" a side effect B on M if B follows, in the modification order of M, the side effect that A observes.
If X reads before Y, or happens before Y, or precedes Y in modification order, then X (and any fences sequenced before X) is "SC-before" Y (and any fences sequenced after Y).
3. There shall be a single total order S on all
memory_order_seq_cst
operations, consistent with the SC-before order.
Our paper presents a proposal for simplifying the sequential consistency axioms in the OpenCL model. We give here our suggestion for how the specification document can be rephrased to accommodate our proposal, while maintaining the prose style used throughout the rest of the document.
The following text is reproduced verbatim from the OpenCL 2.1 standard (51/14–52/13):
If one of the following two conditions holds:
then there shall exist a single total order S for all
- All
memory_order_seq_cst
operations have the scopememory_scope_all_svm_devices
and all affected memory locations are contained in system allocations or fine grain SVM buffers with atomics support- All
memory_order_seq_cst
operations have the scopememory_scope_device
and all affected memory locations are not located in system allocated regions or fine-grain SVM buffers with atomics supportmemory_order_seq_cst
operations that is consistent with the modification orders for all affected locations, as well as the appropriate global-happens-before and local-happens-before orders for those locations, such that eachmemory_order_seq_cst
operation B that loads a value from an atomic object M in global or local memory observes one of the following values:
- the result of the last modification A of M that precedes B in S, if it exists, or
- if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not
memory_order_seq_cst
and that does not happen before A, or- if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not
memory_order_seq_cst
.
[...]
If the total order S exists, the following rules hold:
- For an atomic operation B that reads the value of an atomic object M, if there is a
memory_order_seq_cst
fence X sequenced-before B, then B observes either the lastmemory_order_seq_cst
modification of M preceding X in the total order S or a later modification of M in its modification order.- For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a
memory_order_seq_cst
fence X such that A is sequenced-before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.- For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are
memory_order_seq_cst
fences X and Y such that A is sequenced-before X, Y is sequenced-before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.- For atomic operations A and B on an atomic object M, if there are
memory_order_seq_cst
fences X and Y such that A is sequenced-before X, Y is sequenced-before B, and X precedes Y in S, then B occurs later than A in the modification order of M.
We propose that the paragraphs quoted above be removed and replaced with the following three:
1. A value computation A of an object M "reads before" a side effect B on M if B follows, in the modification order of M, the side effect that A observes.
If X reads before Y, or global or local happens before Y, or precedes Y in modification order, then X (and any fences sequenced before X) is "SC-before" Y (and any fences sequenced after Y).
3. There shall be a single total order S on all
memory_order_seq_cst
operations, consistent with the SC-before order restricted to operations with inclusive scopes.