[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140218164723.GM4250@linux.vnet.ibm.com>
Date: Tue, 18 Feb 2014 08:47:23 -0800
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Peter Sewell <Peter.Sewell@...cam.ac.uk>
Cc: "mark.batty@...cam.ac.uk" <Mark.Batty@...cam.ac.uk>,
peterz@...radead.org, Torvald Riegel <triegel@...hat.com>,
torvalds@...ux-foundation.org, Will Deacon <will.deacon@....com>,
Ramana.Radhakrishnan@....com, dhowells@...hat.com,
linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org, mingo@...nel.org, gcc@....gnu.org
Subject: Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 03:33:35PM +0000, Peter Sewell wrote:
> Hi Paul,
>
> On 18 February 2014 14:56, Paul E. McKenney <paulmck@...ux.vnet.ibm.com> wrote:
> > On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
> >> Several of you have said that the standard and compiler should not
> >> permit speculative writes of atomics, or (effectively) that the
> >> compiler should preserve dependencies. In simple examples it's easy
> >> to see what that means, but in general it's not so clear what the
> >> language should guarantee, because dependencies may go via non-atomic
> >> code in other compilation units, and we have to consider the extent to
> >> which it's desirable to limit optimisation there.
> >>
> >> For example, suppose we have, in one compilation unit:
> >>
> >> void f(int ra, int*rb) {
> >> if (ra==42)
> >> *rb=42;
> >> else
> >> *rb=42;
> >> }
> >
> > Hello, Peter!
> >
> > Nice example!
> >
> > The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
> > says the following about the control dependency in the above construct:
> >
> > ------------------------------------------------------------------------
> >
> > q = ACCESS_ONCE(a);
> > if (q) {
> > barrier();
> > ACCESS_ONCE(b) = p;
> > do_something();
> > } else {
> > barrier();
> > ACCESS_ONCE(b) = p;
> > do_something_else();
> > }
> >
> > The initial ACCESS_ONCE() is required to prevent the compiler from
> > proving the value of 'a', and the pair of barrier() invocations are
> > required to prevent the compiler from pulling the two identical stores
> > to 'b' out from the legs of the "if" statement.
>
> thanks
>
> > ------------------------------------------------------------------------
> >
> > So yes, current compilers need significant help if it is necessary to
> > maintain dependencies in that sort of code.
> >
> > Similar examples came up in the data-dependency discussions in the
> > standards committee, which led to the [[carries_dependency]] attribute for
> > C11 and C++11. Of course, current compilers don't have this attribute,
> > and the current Linux kernel code doesn't have any other marking for
> > data dependencies passing across function boundaries. (Maybe some time
> > as an assist for detecting pointer leaks out of RCU read-side critical
> > sections, but efforts along those lines are a bit stalled at the moment.)
> >
> > More on data dependencies below...
> >
> >> and in another compilation unit the bodies of two threads:
> >>
> >> // Thread 0
> >> r1 = x;
> >> f(r1,&r2);
> >> y = r2;
> >>
> >> // Thread 1
> >> r3 = y;
> >> f(r3,&r4);
> >> x = r4;
> >>
> >> where accesses to x and y are annotated C11 atomic
> >> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> >> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
> >>
> >> (Of course, this is an artificial example, to make the point below as
> >> simply as possible - in real code the branches of the conditional
> >> might not be syntactically identical, just equivalent after macro
> >> expansion and other optimisation.)
> >>
> >> In the source program there's a dependency from the read of x to the
> >> write of y in Thread 0, and from the read of y to the write of x on
> >> Thread 1. Dependency-respecting compilation would preserve those and
> >> the ARM and POWER architectures both respect them, so the reads of x
> >> and y could not give 42.
> >>
> >> But a compiler might well optimise the (non-atomic) body of f() to
> >> just *rb=42, making the threads effectively
> >>
> >> // Thread 0
> >> r1 = x;
> >> y = 42;
> >>
> >> // Thread 1
> >> r3 = y;
> >> x = 42;
> >>
> >> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
> >> permit those two reads to see 42. That is moreover actually observable
> >> on current ARM hardware.
> >
> > I do agree that this could happen on current compilers and hardware.
> >
> > Agreed, but as Peter Zijlstra noted in this thread, this optimization
> > is to a control dependency, not a data dependency.
>
> Indeed. In principle (again as Hans has observed) a compiler might
> well convert between the two, e.g. if operating on single-bit values,
> or where value-range analysis has shown that a variable can only
> contain one of a small set of values. I don't know whether that
> happens in practice? Then there are also cases where a compiler is
> very likely to remove data/address dependencies, eg if some constant C
> is #define'd to be 0 then an array access indexed by x * C will have
> the dependency on x removed. The runtime and compiler development
> costs of preventing that are also unclear to me.
>
> Given that, whether it's reasonable to treat control and data
> dependencies differently seems to be an open question.
Here is another (admittedly fanciful and probably buggy) implementation
of f() that relies on data dependencies (according to C11 and C++11),
but which could not be relied on to preserve thosse data dependencies
given current pre-C11 compilers:
int arr[2] = { 42, 43 };
int *bigarr;
int f(int ra)
{
return arr[ra != 42];
}
// Thread 0
r1 = atomic_load_explicit(&gidx, memory_order_consume);
r2 = bigarr[f(r1)];
// Thread 1
r3 = random() % BIGARR_SIZE;
bigarr[r3] = some_integer();
atomic_store_explicit(&gidx, r3, memory_order_release);
// Mainprogram
bigarr = kmalloc(BIGARR_SIZE * sizeof(*bigarr), ...);
// Note: bigarr currently contains pre-initialization garbage
// Spawn threads 1 and 2
Many compilers would be happy to convert f() into something like the
following:
int f(int ra)
{
if (ra == 42)
return arr[0];
else
return arr[1];
}
And many would argue that this is a perfectly reasonable conversion.
However, this breaks the data dependency, and allows Thread 0's load
from bigarr[] to be speculated, so that r2 might end up containing
pre-initialization garbage. This is why the consume.2014.02.16c.pdf
document advises against attempting to carry dependencies through
relational operators and booleans (&& and ||) when using current compilers
(hmmm... I need to make that advice more strongly stated). And again,
this is one of the reasons for the [[carries_dependency]] attribute in
C11 -- to signal the compiler to be careful in a given function.
Again, this example is fanciful. It is intended to illustrate a data
dependency that could be broken given current compilers and hardware.
It is -not- intended as an example of good code for the Linux kernel,
much the opposite, in fact.
That said, I would very much welcome a more realistic example.
> >> So as far as we can see, either:
> >>
> >> 1) if you can accept the latter behaviour (if the Linux codebase does
> >> not rely on its absence), the language definition should permit it,
> >> and current compiler optimisations can be used,
> >>
> >> or
> >>
> >> 2) otherwise, the language definition should prohibit it but the
> >> compiler would have to preserve dependencies even in compilation
> >> units that have no mention of atomics. It's unclear what the
> >> (runtime and compiler development) cost of that would be in
> >> practice - perhaps Torvald could comment?
> >
> > For current compilers, we have to rely on coding conventions within
> > the Linux kernel in combination with non-standard extentions to gcc
> > and specified compiler flags to disable undesirable behavior. I have a
> > start on specifying this in a document I am preparing for the standards
> > committee, a very early draft of which may be found here:
> >
> > http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf
> >
> > Section 3 shows the results of a manual scan through the Linux kernel's
> > dependency chains, and Section 4.1 lists a probably incomplete (and no
> > doubt erroneous) list of coding standards required to make dependency
> > chains work on current compilers. Any comments and suggestions are more
> > than welcome!
>
> Thanks, that's very interesting (especially the non-local dependency chains).
>
> At a first glance, the "4.1 Rules for C-Language RCU Users" seem
> pretty fragile - they're basically trying to guess the limits of
> compiler optimisation smartness.
Agreed, but that is the world we currently must live in, given pre-C11
compilers and the tepid implementations of memory_order_consume in
the current C11 implementations that I am aware of. As long as the
Linux kernel must live in this world for some time to come, I might as
well document the limitations, fragile though they might be.
> >> For more context, this example is taken from a summary of the thin-air
> >> problem by Mark Batty and myself,
> >> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
> >> dependencies via other compilation units was AFAIK first pointed out
> >> by Hans Boehm.
> >
> > Nice document!
> >
> > One point of confusion for me... Example 4 says "language must allow".
> > Shouldn't that be "language is permitted to allow"? Seems like an
> > implementation is always within its rights to avoid an optimization if
> > its implementation prevents it from safely detecting the oppportunity
> > for that optimization. Or am I missing something here?
>
> We're saying that the language definition must allow it, not that any
> particular implementation must be able to exhibit it.
Ah, got it. You had me worried there for a bit! ;-)
Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists