Message-ID: <1394039718.28840.15041.camel@triegel.csb>
Date:	Wed, 05 Mar 2014 18:15:18 +0100
From:	Torvald Riegel <triegel@...hat.com>
To:	Peter.Sewell@...cam.ac.uk
Cc:	Paul McKenney <paulmck@...ux.vnet.ibm.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Will Deacon <will.deacon@....com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ramana Radhakrishnan <Ramana.Radhakrishnan@....com>,
	David Howells <dhowells@...hat.com>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"mingo@...nel.org" <mingo@...nel.org>,
	"gcc@....gnu.org" <gcc@....gnu.org>
Subject: Re: [RFC][PATCH 0/5] arch: atomic rework

On Tue, 2014-03-04 at 22:11 +0000, Peter Sewell wrote:
> On 3 March 2014 20:44, Torvald Riegel <triegel@...hat.com> wrote:
> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote:
> >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@...ux.vnet.ibm.com> wrote:
> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
> >> >> Hi Paul,
> >> >>
> >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@...ux.vnet.ibm.com> wrote:
> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> >> >> >> > <paulmck@...ux.vnet.ibm.com> wrote:
> >> >> >> > >
> >> >> >> > > 3.      The comparison was against another RCU-protected pointer,
> >> >> >> > >         where that other pointer was properly fetched using one
> >> >> >> > >         of the RCU primitives.  Here it doesn't matter which pointer
> >> >> >> > >         you use.  At least as long as the rcu_assign_pointer() for
> >> >> >> > >         that other pointer happened after the last update to the
> >> >> >> > >         pointed-to structure.
> >> >> >> > >
> >> >> >> > > I am a bit nervous about #3.  Any thoughts on it?
> >> >> >> >
> >> >> >> > I think that it might be worth pointing out as an example, and saying
> >> >> >> > that code like
> >> >> >> >
> >> >> >> >    p = atomic_read(consume);
> >> >> >> >    X;
> >> >> >> >    q = atomic_read(consume);
> >> >> >> >    Y;
> >> >> >> >    if (p == q)
> >> >> >> >         data = p->val;
> >> >> >> >
> >> >> >> > then the access of "p->val" is constrained to be data-dependent on
> >> >> >> > *either* p or q, but you can't really tell which, since the compiler
> >> >> >> > can decide that the values are interchangeable.
> >> >> >> >
> >> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> >> > stronger ordering than anything the consume through "p" would
> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> >> > irrelevant. No?
> >> >> >>
> >> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> >> come from either pointer.
> >> >> >>
> >> >> >> For whatever it is worth, here is the litmus test:
> >> >> >>
> >> >> >> T1:   p = kmalloc(...);
> >> >> >>       if (p == NULL)
> >> >> >>               deal_with_it();
> >> >> >>       p->a = 42;  /* Each field in its own cache line. */
> >> >> >>       p->b = 43;
> >> >> >>       p->c = 44;
> >> >> >>       atomic_store_explicit(&gp1, p, memory_order_release);
> >> >> >>       p->b = 143;
> >> >> >>       p->c = 144;
> >> >> >>       atomic_store_explicit(&gp2, p, memory_order_release);
> >> >> >>
> >> >> >> T2:   p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >> >>       r1 = p->b;  /* Guaranteed to get 143. */
> >> >> >>       q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >> >>       if (p == q) {
> >> >> >>               /* The compiler decides that q->c is the same as p->c. */
> >> >> >>               r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >> >> >>       }
> >> >> >>
> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> >> you get.
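
For anyone who wants to experiment, a self-contained C11 rendering of
that litmus test might look as follows (a sketch only: struct and
variable names are assumed, it needs C11 <threads.h> support, and the
interesting outcome additionally needs a weakly ordered machine plus a
compiler that merges p and q):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <threads.h>

    struct foo { int a; int b; int c; };

    static _Atomic(struct foo *) gp1, gp2;   /* start out NULL */

    static int t1(void *arg)
    {
        struct foo *p = malloc(sizeof(*p));

        (void)arg;
        if (p == NULL)
            return 1;                        /* "deal_with_it()" */
        p->a = 42;
        p->b = 43;
        p->c = 44;
        atomic_store_explicit(&gp1, p, memory_order_release);
        p->b = 143;
        p->c = 144;
        atomic_store_explicit(&gp2, p, memory_order_release);
        return 0;
    }

    static int t2(void *arg)
    {
        struct foo *p, *q;
        int r1, r2 = -1;

        (void)arg;
        do {            /* spin until t1 has published via gp2 */
            p = atomic_load_explicit(&gp2, memory_order_consume);
        } while (p == NULL);
        r1 = p->b;      /* 143: that store precedes gp2's release */
        q = atomic_load_explicit(&gp1, memory_order_consume);
        if (p == q)
            r2 = p->c;  /* may be 44 if the compiler uses q for p */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }

    int main(void)
    {
        thrd_t a, b;

        thrd_create(&a, t1, NULL);
        thrd_create(&b, t2, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        return 0;
    }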
> >> >> >>
> >> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> >> trouble anyway.  If you really want to do something like that and still
> >> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> >> in the structure and use it to guard updates and accesses to those fields.
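
(For concreteness, the lock-based pattern suggested here might look
roughly like the sketch below -- C11 <threads.h> is used just for the
sketch, and all names are made up:)

    struct foo {
        mtx_t lock;                /* guards a, b and c */
        int a, b, c;
    };

    struct foo *alloc_foo(void)
    {
        struct foo *p = malloc(sizeof(*p));

        if (p != NULL)
            mtx_init(&p->lock, mtx_plain);
        return p;
    }

    void update_foo(struct foo *p)
    {
        mtx_lock(&p->lock);
        p->b = 143;                /* readers now see b and c change */
        p->c = 144;                /* together, never one without    */
        mtx_unlock(&p->lock);      /* the other                      */
    }

    int read_foo_c(struct foo *p)
    {
        int c;

        mtx_lock(&p->lock);
        c = p->c;
        mtx_unlock(&p->lock);
        return c;
    }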
> >> >> >
> >> >> > And here is a patch documenting the restrictions for the current Linux
> >> >> > kernel.  The rules change a bit due to rcu_dereference() acting a bit
> >> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >> >
> >> >> > Thoughts?
> >> >>
> >> >> That might serve as informal documentation for linux kernel
> >> >> programmers about the bounds on the optimisations that you expect
> >> >> compilers to do for common-case RCU code - and I guess that's what you
> >> >> intend it to be for.   But I don't see how one can make it precise
> >> >> enough to serve as a language definition, so that compiler people
> >> >> could confidently say "yes, we respect that", which I guess is what
> >> >> you really need.  As a useful criterion, we should aim for something
> >> >> precise enough that in a verified-compiler context you can
> >> >> mathematically prove that the compiler will satisfy it  (even though
> >> >> that won't happen anytime soon for GCC), and that analysis tool
> >> >> authors can actually know what they're working with.   All this stuff
> >> >> about "you should avoid cancellation", and "avoid masking with just a
> >> >> small number of bits" is just too vague.
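
(To make the "masking" worry concrete, a made-up fragment -- gp, table
and the pointer's layout are hypothetical:)

    struct foo *p = rcu_dereference(gp);
    uintptr_t bit = (uintptr_t)p & 0x1;  /* only one bit survives...     */
    int r = table[bit];                  /* ...so the compiler may lower
                                            this to a branch on bit, and
                                            the address dependency on
                                            the load of gp is gone       */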
> >> >
> >> > Understood, and yes, this is intended to document current compiler
> >> > behavior for the Linux kernel community.  It would not make sense to show
> >> > it to the C11 or C++11 communities, except perhaps as an informational
> >> > piece on current practice.
> >> >
> >> >> The basic problem is that the compiler may be doing sophisticated
> >> >> reasoning with a bunch of non-local knowledge that it's deduced from
> >> >> the code, neither of which are well-understood, and here we have to
> >> >> identify some envelope, expressive enough for RCU idioms, in which
> >> >> that reasoning doesn't allow data/address dependencies to be removed
> >> >> (and hence the hardware guarantee about them will be maintained at the
> >> >> source level).
> >> >>
> >> >> The C11 syntactic notion of dependency, whatever its faults, was at
> >> >> least precise, could be reasoned about locally (just looking at the
> >> >> syntactic code in question), and did do that.  The fact that current
> >> >> compilers do optimisations that remove dependencies and will likely
> >> >> have many bugs at present is beside the point - this was surely
> >> >> intended as a *new* constraint on what they are allowed to do.  The
> >> >> interesting question is really whether the compiler writers think that
> >> >> they *could* implement it in a reasonable way - I'd like to hear
> >> >> Torvald and his colleagues' opinion on that.
> >> >>
> >> >> What you're doing above seems to be basically a very cut-down version
> >> >> of that, but with a fuzzy boundary.   If you want it to be precise,
> >> >> maybe it needs to be much simpler (which might force you into ruling
> >> >> out some current code idioms).
> >> >
> >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806)
> >> > can be developed to serve this purpose.
> >>
> >> (I missed that mail when it first came past, sorry)
> >>
> >> That's also going to be tricky, I'm afraid.  The key condition there is:
> >>
> >> "* at the time of execution of E, L       [PS: I assume that L is a
> >> typo and should be E]
> >
> > No, L was intended. (But see below.)
> 
> then I misunderstood...
> 
> >>         can possibly have returned at
> >>         least two different values under the assumption that L itself
> >>         could have returned any value allowed by L's type."
> >>
> >> First, the evaluation of E might be nondeterministic  - e.g., for an
> >> artificial example, if it's just a nondeterministic value obtained
> >> from the result of a race on SC atomics.  The above doesn't
> >> distinguish between that (which doesn't have a real dependency on L)
> >> and that XOR'd with L (which does).  And it does so in the wrong
> >> direction: it'll say that the former has a dependency on L.
> >
> > I'm not quite sure I understand the examples you want to point out
> > (could you add brief code snippets, perhaps?) -- but the informal
> > definition I proposed also says that E must have used L in some way to
> > make its computation.  That's pretty vague, and the way I phrased the
> > requirement is probably not optimal.
> 
> ...so this example isn't so relevant, but it's maybe interesting
> anyway. In free-wheeling pseudocode:
> 
> x = read_consume(L)  // a memory_order_consume read of a 1-bit value from L
> y = nondet()         // some code that nondeterministically produces another
>                      // 1-bit value, e.g. by spawning two threads that each
>                      // do an SC-atomic write to some other location,
>                      // returning 0 or 1 depending on which wins
> 
> then evaluate either just
> 
> z=y
> 
> or
> 
> z = x XOR y
> 
> In my misinterpretation of what you wrote, your definition would say
> there's a dependency from the load of L to the evaluation of y, even
> though there isn't.

The evaluations "z == y" or "z = y" wouldn't depend on L, assuming
nondet() doesn't depend on L (e.g., by using x): they don't use x's
value.  (And I don't have a precise definition for this, unfortunately,
but I hope the intent is clear.)

"x XOR y" would depend on L, because this does take x into account, and
x's value remains relevant irrespective of which value y has.  y would
be non-vdp, and the compiler could have specialized the code into two
branches for both 1 and 0 values of y; but it would still need x to
compute z in the last evaluation.
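
A sketch in C11 syntax, with names as in your pseudocode:

    int x = atomic_load_explicit(&L, memory_order_consume);
    int y = nondet();       /* does not use x */

    int z1 = y;             /* no value dependency on L: x is unused  */
    int z2 = x ^ y;         /* value dependency on L: for either value
                               of y, z2 still varies with x           */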

> 
> > So let me expand a bit on the background first.  What I tried to capture
> > with the rules is that an evaluation (ie, E) really uses the value
> > returned by L, and not just, for example, a constant value that can be
> > inferred from whatever was executed between L and E.  This (is intended
> > to) prevent value prediction and such by programmers.  The compiler in
> > turn knows what a real program is allowed to do that still wants to rely
> > on the ordering by the consume.  Basing all of this on the values seemed
> > to be helpful because basing it *purely* on syntax didn't seem
> > implementable; for the latter, we'd need to disallow cases such as x-x
> > or require compilers to include artificial dependencies if a program
> > indeed does value speculation.  IOW, it didn't seem possible to draw a
> > clear distinction between a data dependency defined purely based on
> > syntax and control dependencies.
> >
> > Another way to perhaps phrase the requirement might be to construct it
> > inductively, so that "E uses the value returned by L" becomes clearer.
> > However, I currently don't really know how to do that without any holes.
> >
> > For example, let's say that an atomic mo_consume load has a value
> > dependency to itself; the dependency has an associated set of values,
> > which is equal to all values allowed by the type (but see also below).
> > Then, an evaluation A that uses B as operand has a value dependency on B
> > if the associated set of values can still have more than one element.
> > Whether that's the case depends on the operation, and the other
> > operands, including their values at the time of execution.
> > However, it's not just results of evaluations that effectively constrain
> > the set of values.  Conditional execution does as well, because the
> > condition being true might establish a constraint on L, which might
> > remove any dependency when L (or a result of a computation involving L)
> > is used.  So I'm not sure the inductive approach would work.
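
(A sketch of that conditional-execution case, with made-up names --
inside the branch, the condition pins x down to a single value:)

    int x = atomic_load_explicit(&L, memory_order_consume);
    if (x == 42) {
        /* Here the compiler knows x == 42, so it may rewrite a[x] as
           a[42] -- a constant address with no remaining dependency on
           the load. */
        r = a[x];
    }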
> >
> > You said you thought I might have wanted to say that:
> > "E  really-depends on L if there exist two different values that (just
> > according to typing) might be read for L that give rise to two
> > different values for E".
> >
> > Which should follow from the above, or at least should not conflict with
> > it.  My "definition" tried to capture that the program must not
> > establish so many constraints between E and L that the value of L is
> > clear in the sense of having exactly one value.  If we dereference such
> > an E whose execution in fact constrains L to one value, there is no real
> > dependency anymore.
> > However, by itself, this doesn't cover that E must use L in its
> > computation -- which your formulation does, I think.
> 
> unfortunately not - that's what the example above shows.

Now I think I understand.

> > Another way might be to try to define which constraints established by
> > an execution (based on the executed code's semantics) actually remove
> > value dependencies on L.
> >
> > Do you have any suggestions for how to define this (ie, true value
> > dependencies) in a better way?
> 
> not at the moment.  I need to think some more about your vdps, though.
>  As I understand it, you're agreeing with the general intent of the
> original C11 design that some new mechanism to tell the compiler not
> to optimise in certain places is required, but differing in the way
> you identify those?

Yes, we still need such a mechanism because we need to prevent the
compiler from doing value prediction or something similar that uses
control dependencies to avoid data dependencies.  For example, it could
use a large switch statement to add code specialized for each possible
value returned from an mo_consume load.  Thus, at least for the
standard, we need some mechanism that prevents certain compiler
optimizations.
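
A sketch of the transformation that needs to be outlawed, assuming x is
known to be either 0 or 1:

    int x = atomic_load_explicit(&gp_idx, memory_order_consume);
    r = a[x];                  /* address dependency on the load */

    /* ...must not be rewritten by the compiler into: */
    switch (x) {
    case 0: r = a[0]; break;   /* constant addresses: the hardware now  */
    case 1: r = a[1]; break;   /* sees only a control dependency, which */
    }                          /* does not order these loads on, e.g.,
                                  ARM or PowerPC                        */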

Where it differs from C11 is that we're not trying to use a mechanism
built purely on syntax to define when we actually have a real value
dependency.

> 
> >> Second, it involves reasoning about counterfactual executions.  That
> >> doesn't necessarily make it wrong, per se, but probably makes it hard
> >> to work with.  For example, suppose that in all the actual
> >> whole-program executions, a runtime occurrence of L only ever returns
> >> one particular value (perhaps because of some simple #define'd
> >> configuration)
> >
> > Right, or just because it's a deterministic program.
> >
> > This is a problem for the definition of value dependencies, AFAICT,
> > because where do you draw the line for what the compiler is allowed to
> > speculate about?  When does the program indeed "reveal" the value of a
> > load, and when does it not?  Is a compiler to do out-of-band tracking
> > for certain memory locations, and thus predict the value?
> >
> > Adding the assumption that L, at least conceptually, might return any
> > value seemed to be a simple way to remove that uncertainty regarding
> > what the compiler might know about.
> 
> ...but it brings in a lot of baggage.

Which baggage do you mean?  In terms of verification, or definitions in
the standard, or effects on compilers?

> >> , and that the code used in the evaluation of E depends
> >> on some invariant which is related to that configuration.
> >
> > A related issue that I've been thinking about is whether things like
> > divide-by-zero (ie, operations that give rise to undefined behavior)
> > should be considered as constraints on the values or not.  I guess they
> > should, but then this seems closer to what you mention: invariants
> > established by the whole program.
> >
> >> The
> >> hypothetical execution used above in which a different value is used
> >> is one in which the code is being run in a situation with broken invariants.
> >>    Then there will be technical difficulties in using the definition:
> >> I don't see how one would persuade oneself that a compiler always
> >> satisfies it, because these hypothetical executions are far removed
> >> from what it's actually working on.
> >
> > I think the above is easy to implement for a compiler.  At the
> > mo_consume load, you simply forget any information you have about this
> > value / memory location; for operations on value_dep_preserving types,
> 
> are all the subexpressions likewise forced to be of vdp types?

I'd like to avoid that if possible.  Things like "vdp-pointer + int"
should Just Work in the sense of still being vdp-typed without requiring
an explicit cast for the int operand.  Operators ||, &&, ?: might
justify more fine-grained rules, but I guess it's better to just make
more stuff vdp if that's easier.  vdp is not a sufficient condition for
there being a value-dependency anyway, and the few optimizations it
prevents on expressions containing some vdp-typed operands probably
don't matter much.  Also, as I mentioned in the reply to Paul, I believe
that if the compiler can prove that there's no value-dependency, it can
ignore the vdp-annotation (IOW, as-if is still possible).
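
To illustrate, with a made-up __vdp spelling for the qualifier (nothing
that exists today):

    struct foo * __vdp p = atomic_load_explicit(&gp, memory_order_consume);
    int __vdp *q = &p->a + i;   /* vdp-pointer + plain int stays vdp,
                                   no cast needed on i                  */
    int r = *q;                 /* dereference preserves the dependency
                                   ordering from the consume load       */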
