Message-Id: <20181216185130.GB4170@linux.ibm.com>
Date: Sun, 16 Dec 2018 10:51:30 -0800
From: "Paul E. McKenney" <paulmck@...ux.ibm.com>
To: Alan Stern <stern@...land.harvard.edu>
Cc: David Goldblatt <davidtgoldblatt@...il.com>,
mathieu.desnoyers@...icios.com,
Florian Weimer <fweimer@...hat.com>, triegel@...hat.com,
libc-alpha@...rceware.org, andrea.parri@...rulasolutions.com,
will.deacon@....com, peterz@...radead.org, boqun.feng@...il.com,
npiggin@...il.com, dhowells@...hat.com, j.alglave@....ac.uk,
luc.maranget@...ia.fr, akiyks@...il.com, dlustig@...dia.com,
linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] Linux: Implement membarrier function
On Fri, Dec 14, 2018 at 04:39:34PM -0500, Alan Stern wrote:
> On Fri, 14 Dec 2018, Paul E. McKenney wrote:
>
> > I would say that sys_membarrier() has zero-sized read-side critical
> > sections, either comprising a single instruction (as is the case for
> > synchronize_sched(), actually), preempt-disable regions of code
> > (which are irrelevant to userspace execution), or the spaces between
> > consecutive pairs of instructions (as is the case for the newer
> > IPI-based implementation).
> >
> > The model picks the single-instruction option, and I haven't yet found
> > a problem with this -- which is no surprise given that, as you say,
> > an actual implementation makes this same choice.
>
> I believe that for RCU tests the LKMM gives the same results for
> length-zero critical sections interspersed between all the instructions
> and length-one critical sections surrounding all instructions (except
> synchronize_rcu). But the proof is tricky and I haven't checked it
> carefully.
That assertion is completely consistent with my implementation experience,
give or take the usual caveats about idle and offline execution.
> > > > The other thing that took some time to get used to is the possibility
> > > > of long delays during sys_membarrier() execution, allowing significant
> > > > execution and reordering between different CPUs' IPIs. This was key
> > > > to my understanding of the six-process example, and probably needs to
> > > > be clearly called out, including in an example or two.
> > >
> > > In all the examples I'm aware of, no more than one of the IPIs
> > > generated by each sys_membarrier call really matters. (Of course,
> > > there's no way to know in advance which one it will be, so you have to
> > > send an IPI to every CPU.) The execution delays and reordering
> > > between different CPUs' IPIs don't appear to be significant.
> >
> > Well, there are litmus tests that are allowed in which the allowed
> > execution is more easily explained in terms of delays between different
> > CPUs' IPIs, so it seems worth keeping track of.
> >
> > There might be a litmus test that can tell the difference between
> > simultaneous and non-simultaneous IPIs, but I cannot immediately think of
> > one that matters. Might be a failure of imagination on my part, though.
>
> P0          P1          P2
> Wc=1        [mb01]      Rb=1
> memb        Wa=1        Rc=0
> Ra=0        Wb=1        [mb02]
>
> The IPIs have to appear in the positions shown, which means they cannot
> be simultaneous. The test is allowed because P2's reads can be
> reordered.
OK, so "simultaneous" IPIs could be emulated in a real implementation by
having sys_membarrier() send each IPI (but not wait for a response), then
execute a full memory barrier and set a shared variable.  Each IPI handler
would spin waiting for that shared variable to be set, then execute a full
memory barrier, atomically increment a second shared variable, and return
from interrupt.  Once that second variable's value reached the number of
IPIs sent, sys_membarrier() would execute its final (already existing)
full memory barrier and return.  Horribly expensive and definitely not
recommended, but eminently doable.
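Something like the following untested sketch, in which memb_release,
memb_count, memb_ipi_handler(), and simultaneous_membarrier() are
illustrative names rather than existing kernel interfaces:

#include <linux/smp.h>
#include <linux/atomic.h>

static atomic_t memb_release;	/* Set once all IPIs have been sent. */
static atomic_t memb_count;	/* Number of handlers that have finished. */

/* Runs on each target CPU in IPI context. */
static void memb_ipi_handler(void *info)
{
	/* Spin until the sender has finished sending every IPI. */
	while (!atomic_read(&memb_release))
		cpu_relax();
	smp_mb();			/* Order against the release store. */
	atomic_inc(&memb_count);	/* Report this handler's completion. */
}

static void simultaneous_membarrier(void)
{
	int nr_targets = num_online_cpus() - 1;

	atomic_set(&memb_count, 0);
	atomic_set(&memb_release, 0);
	smp_mb();

	/* Send each IPI, but do not wait for the handlers (wait == 0). */
	smp_call_function(memb_ipi_handler, NULL, 0);

	smp_mb();			/* Full barrier before the release store. */
	atomic_set(&memb_release, 1);

	/* Wait for every handler to execute its memory barrier. */
	while (atomic_read(&memb_count) < nr_targets)
		cpu_relax();

	smp_mb();	/* The final (already existing) full memory barrier. */
}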
The difference between the current sys_membarrier() and the "simultaneous"
variant described above is similar to the difference between
non-multicopy-atomic and multicopy-atomic memory ordering.  So, after
thinking it through, my guess is that pretty much any litmus test that
can distinguish multicopy-atomic from non-multicopy-atomic ordering should
be transformable into one that can distinguish the current from the
"simultaneous" sys_membarrier() implementation.
Seem reasonable?
Or alternatively, may I please apply your Signed-off-by to your earlier
sys_membarrier() patch so that I can queue it? I will probably also
change smp_memb() to membarrier() or some such. Again, within the
Linux kernel, membarrier() can be emulated with smp_call_function()
invoking a handler that does smp_mb().
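For example, something like this untested sketch, in which
kernel_membarrier() is an illustrative name rather than an existing
kernel interface:

#include <linux/smp.h>

/* IPI handler: execute a full memory barrier on the interrupted CPU. */
static void membarrier_ipi(void *info)
{
	smp_mb();
}

static void kernel_membarrier(void)
{
	smp_mb();	/* Order the caller's prior accesses. */
	/* Run the handler on all other CPUs, waiting for completion. */
	smp_call_function(membarrier_ipi, NULL, 1);
	smp_mb();	/* Order the caller's subsequent accesses. */
}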
Thanx, Paul