Date:	Fri, 26 Jul 2013 15:52:12 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Frederic Weisbecker <fweisbec@...il.com>
Cc:	linux-kernel@...r.kernel.org, mingo@...e.hu, laijs@...fujitsu.com,
	dipankar@...ibm.com, akpm@...ux-foundation.org,
	mathieu.desnoyers@...ymtl.ca, josh@...htriplett.org,
	niv@...ibm.com, tglx@...utronix.de, peterz@...radead.org,
	rostedt@...dmis.org, dhowells@...hat.com, edumazet@...gle.com,
	darren@...art.com, sbw@....edu
Subject: Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state
 machine

On Thu, Jul 25, 2013 at 01:26:44AM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 24, 2013 at 03:09:02PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 24, 2013 at 08:09:04PM +0200, Frederic Weisbecker wrote:
> > > On Thu, Jul 18, 2013 at 10:06:25PM -0700, Paul E. McKenney wrote:
> > > > > Lets summarize the last sequence, the following happens ordered by time:
> > > > > 
> > > > >         CPU 0                          CPU 1
> > > > > 
> > > > >      cmpxchg(&full_sysidle_state,
> > > > >              RCU_SYSIDLE_SHORT,
> > > > >              RCU_SYSIDLE_LONG);
> > > > > 
> > > > >      smp_mb() //cmpxchg
> > > > > 
> > > > >      atomic_read(rdtp(1)->dynticks_idle)
> > > > > 
> > > > >      //CPU 0 goes to sleep
> > > > >                                        //CPU 1 wakes up
> > > > >                                        atomic_inc(rdtp(1)->dynticks_idle)
> > > > > 
> > > > >                                        smp_mb()
> > > > > 
> > > > >                                        ACCESS_ONCE(full_sysidle_state)
> > > > > 
> > > > > 
> > > > > Are you suggesting that because CPU 1 executes its atomic_inc() _after_ (in terms
> > > > > of absolute time) the atomic_read of CPU 0, the ordering settled on both sides guarantees
> > > > > that the value read by CPU 1 is the one from the cmpxchg that precedes the atomic_read,
> > > > > or FULL or FULL_NOTED that happen later?
> > > > > 
> > > > > If so that's a big lesson for me.                                     
> > > > 
> > > > It is not absolute time that matters.  Instead, it is the fact that
> > > > CPU 0, when reading from ->dynticks_idle, read the old value before the
> > > > atomic_inc().  Therefore, anything CPU 0 did before that memory barrier
> > > > preceding CPU 0's read must come before anything CPU 1 did after that
> > > > memory barrier following the atomic_inc().  For this to work, there
> > > > must be some access to the same variable on each CPU.
> > > 
> > > Aren't we in the following situation?
> > > 
> > >     CPU 0                          CPU 1
> > > 
> > >     STORE A                        STORE B
> > >     LOAD B                         LOAD A
> > > 
> > > 
> > > If so and referring to your perfbook, this is an "ears to mouth" situation.
> > > And it seems to describe there is no strong guarantee in that situation.
> > 
> > "Yes" to the first, but on modern hardware, "no" to the second.  The key
> > paragraph is Section 12.2.4.5:
> > 
> > 	The following pairings from Table 12.1 can be used on modern
> > 	hardware, but might fail on some systems that were produced in
> > 	the 1990s. However, these can safely be used on all mainstream
> > 	hardware introduced since the year 2000.
> 
> Right I missed that!

Nor are you alone.  ;-)

> > That said, you are not the first to be confused by this, so I might need
> > to rework this section to make it clear that each can in fact be used on
> > modern hardware.
> > 
> > If you happen to have an old Sequent NUMA-Q or Symmetry box lying around,
> > things are a bit different.  On the other hand, I don't believe that any
> > of these old boxes are still running Linux.  (Hey, I am as sentimental as
> > the next guy, but there are limits!)
> > 
> > I updated this section and pushed it, please let me know if this helps!
> 
> I don't know, because I ran into some trouble building it; I'm seeing thousands
> of lines like this:
> 
> Name "main::opt_help" used only once: possible typo at /usr/bin/a2ping line 534.
> /usr/bin/a2ping: not a GS output from gs -dSAFER
> ./cartoons/whippersnapper300.eps -> ./cartoons/whippersnapper300.pdf
> Name "main::opt_extra" used only once: possible typo at /usr/bin/a2ping line 546.
> Name "main::opt_help" used only once: possible typo at /usr/bin/a2ping line 534.
> /usr/bin/a2ping: not a GS output from gs -dSAFER
> make: *** [embedfonts] Error 1

Strange.  My version of a2ping is Ubuntu 12.04's:

a2ping.pl 2.77p, 2006-11-15 -- Written by <pts@...ekas.hu> from April 2003.

You have something more recent?

> Anyway I looked at the diff and it looks indeed clearer, thanks!

Glad it helped!

> So back to the issue, I think we made nice progress with my rusty brain ;-)
> But just to be clear, I'm pasting that again for a few points of clarification:
> 
>         CPU 0                                CPU 1
> 
>        cmpxchg(&full_sysidle_state,          //CPU 1 wakes up
>                 RCU_SYSIDLE_SHORT,           atomic_inc(rdtp(1)->dynticks_idle)
>                 RCU_SYSIDLE_LONG);
> 
>        smp_mb() //cmpxchg                    smp_mb()
>        atomic_read(rdtp(1)->dynticks_idle)   ACCESS_ONCE(full_sysidle_state)
>       //CPU 0 goes to sleep
> 
> 
> 
> 1) If CPU 0 sets RCU_SYSIDLE_LONG and sees dynticks_idle as even, do we have the _guarantee_
> that later CPU 1 sees full_sysidle_state == RCU_SYSIDLE_LONG (or any later full_sysidle_state value)
> due to the connection between atomic_read / atomic_inc and the barriers that come along?

No, we do not have this guarantee.

What happens instead is that CPU 0 later sets RCU_SYSIDLE_FULL after
having again seen CPU 1's ->dynticks_idle having an even (idle) value.
If CPU 1 later increments its ->dynticks_idle to an odd (non-idle) value,
then does a memory barrier, and then reads full_sysidle_state, then CPU
1 is guaranteed to see RCU_SYSIDLE_LONG.  Please note that CPU 1 is -not-
obligated to see RCU_SYSIDLE_FULL, much less RCU_SYSIDLE_NOTED.

However, when CPU 1 does an atomic operation on full_sysidle_state that
returns the old value, CPU 1 is guaranteed to get the most recent state.
(Without this guarantee, we could have a cactus state of values for
full_sysidle_state, which might make alternate-universe fans happy,
but which would be really hard to program.)

This is why we need an RCU_SYSIDLE_LONG state in addition to
RCU_SYSIDLE_SHORT and RCU_SYSIDLE_FULL.

> 2) You taught me once that barrier != memory committed, and it has been one of the hardest trauma
> in my life. How can we be sure that CPU 1 sees memory as committed from CPU 0? The only fact that
> we read an even value from CPU 0 is enough for the connection between the atomic_read() and atomic_inc()
> and all the barriers that come along?

My condolences on the trauma!  Now, if you will look directly into the
light...  ;-)

We cannot be sure.  All we can be sure of is the memory-barrier
relationships and coherence order of each individual memory location.

The basic memory-barrier relationships are of the following form:

	CPU 0		CPU 1
	ref A		ref C
	smp_mb()	smp_mb()
	ref B		ref D

If the results of these references show that ref C saw the results of
ref B, then you know that ref D is guaranteed to see the results of
ref A.  There are no guarantees on time delay.  At least there are no
guarantees on time delay that the various hardware vendors are willing
to tell us about.

For the coherence order of each individual memory location, let's focus
on full_sysidle_state.  Suppose that CPU 0 uses cmpxchg() to advance
it from RCU_SYSIDLE_NOT to RCU_SYSIDLE_SHORT to RCU_SYSIDLE_LONG to
RCU_SYSIDLE_FULL and finally to RCU_SYSIDLE_NOTED.  When CPU 1 uses
a normal load to access full_sysidle_state, it is only guaranteed to
see RCU_SYSIDLE_LONG, because RCU_SYSIDLE_LONG is the value that
CPU 0 last wrote prior to an access of CPU 1's ->dynticks_idle counter.

However, CPU 1 then uses a cmpxchg() loop to change the value of
full_sysidle_state back to RCU_SYSIDLE_NOT in rcu_sysidle_force_exit().
At this point, CPU 1 cannot be allowed to see an old value of
full_sysidle_state, because this would contradict the cmpxchg() return
values that CPU 0 saw in rcu_sysidle().  Therefore, the cmpxchg()
in rcu_sysidle_force_exit() is obligated to return the most recent
value.

How does the hardware do this?  Well, for loads, the hardware can look
just in its own cache, which might well contain an outdated value.
In contrast, for cmpxchg(), the CPU must invalidate full_sysidle_state
from all other caches, which ensures that the most recent value becomes
available.

So one way to guarantee most-recent results is to always
use atomic instructions to access the variables, for example,
atomic_add(&full_sysidle_state, 0) instead of a simple load instruction.
Might be some performance degradation, but you know what they say about
free lunches.  ;-)

> 3) In your book it says: "recent hardware would guarantee that at least one of the loads saw the value
> stored by the corresponding store".
> 
> At least one? So in our example, CPU 0 could see dynticks_idle as even (success to see some prior store done in
> CPU 1) but following the above statement reasoning, CPU 1 might not see the corresponding store and see, for example
> RCU_SYSIDLE_SHORT?

Ah, yes, the kernel of Dekker's algorithm.  ;-)

	CPU 0		CPU 1

	A=1		B=1
	smp_mb()	smp_mb()
	r1=B		r2=A

The initial values of both A and B are zero.

And yes, this does have the same form as the manipulations of
full_sysidle_state and ->dynticks_idle.

Let's suppose that neither CPU saw the results of either CPU's store.
Then after both CPUs finish, we would have r1==0&&r2==0.  Let's start
with the r1==0.

The r1==0 means that CPU 0 loaded from B before CPU 1 stored to B:
After all, otherwise, we would have r1==1.  But because of memory-barrier
pairing, this means that CPU 1's load from A must see the results of
CPU 0's store to A, and therefore r2==1.

The procedure starting with r2==0 follows the same line of reasoning.
Therefore, r1==0&&r2==0 cannot happen, which implies that at least
one of the CPU's loads must see the other CPU's store.

> I'm really sorry to bother you with that... :-(

Absolutely not a problem!  In fact, the above discussion is now yet
another infamous Quick Quiz in perfbook.  ;-)

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
