Date:	Fri, 17 Jul 2015 15:55:31 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	Sasha Levin <sasha.levin@...cle.com>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>, X86 ML <x86@...nel.org>,
	Rik van Riel <riel@...hat.com>
Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 03:13:36PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney
> <paulmck@...ux.vnet.ibm.com> wrote:
> > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
> >> True.  But context tracking wouldn't object to being exact.  And I
> >> think we need context tracking to treat user mode as quiescent, so
> >> they're at least related.
> >
> > And RCU would be happy to be able to always detect usermode execution.
> > But there are configurations and architectures that exclude context
> > tracking, which means that RCU has to roll its own in those cases.
> 
> We could slowly fix them, perhaps.  I suspect that I'm half-way done
> with accidentally enabling it for x86_32 :)

If there were an appropriate Kconfig variable, I could do things one way
or the other, depending on what the architecture was doing.
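
Roughly this shape, say.  (A minimal sketch: CONFIG_CONTEXT_TRACKING is a
real Kconfig symbol, but the helper below is purely hypothetical.)

#ifdef CONFIG_CONTEXT_TRACKING
/* The architecture reports kernel/user transitions; RCU can rely on them. */
static inline bool rcu_can_rely_on_user_hooks(void)
{
        return true;
}
#else
/* No context tracking here, so RCU must detect usermode execution itself. */
static inline bool rcu_can_rely_on_user_hooks(void)
{
        return false;
}
#endif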

So you are -unconditionally- enabling context tracking for x86_32?
Doesn't that increase kernel-user transition overhead?

> >> >> > 3.      In some configurations, RCU needs to be able to block entry into
> >> >> >         nohz state, both for idle and userspace.
> >> >>
> >> >> Hmm.  I suppose we could be CONTEXT_USER but still have RCU awake,
> >> >> although the tick would have to stay on.
> >> >
> >> > Right, there are situations where RCU needs a given CPU to keep the tick
> >> > going, for example, when there are RCU callbacks queued on that CPU.
> >> > Failing to keep the tick going could result in a system hang, because
> >> > that callback might never be invoked.
> >>
> >> Can't we just fire the callbacks right away?
> >
> > Absolutely not!!!
> >
> > At least not on multi-CPU systems.  There might be an RCU read-side
> > critical section on some other CPU that we still have to wait for.
> 
> Oh, right, obviously.  Could you kick them over to a different CPU, though?

For CONFIG_RCU_NOCB_CPU=y kernels, no problem: they will execute on
whatever CPU the scheduler or the sysadmin chooses, as the case may be.
Otherwise, it gets really hard to make sure that a given CPU's callbacks
execute in order, which is required for rcu_barrier() to work properly.

There was some thought of making CONFIG_RCU_NOCB_CPU=y the only
way that RCU callbacks are invoked, but the overhead is higher and
turns out to be all too noticeable on some workloads.  :-(
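
To illustrate the ordering point with a standalone sketch (this is not the
kernel's callback machinery, just the property rcu_barrier() leans on):
rcu_barrier() queues a marker callback behind whatever is already on each
CPU's list, which only works if each list is invoked strictly in FIFO order.

#include <stddef.h>

/* Toy per-CPU FIFO callback list.  A barrier-style "marker" callback
 * enqueued at the tail runs only after every callback enqueued before
 * it on that CPU -- the property rcu_barrier() depends on. */
struct cb {
        struct cb *next;
        void (*func)(struct cb *);
};

struct cpu_cbs {
        struct cb *head;
        struct cb **tail;               /* enqueue at *tail, invoke from head */
};

static void init_cbs(struct cpu_cbs *c)
{
        c->head = NULL;
        c->tail = &c->head;
}

static void enqueue_cb(struct cpu_cbs *c, struct cb *cb)
{
        cb->next = NULL;
        *c->tail = cb;
        c->tail = &cb->next;
}

static void invoke_cbs(struct cpu_cbs *c)
{
        struct cb *cb = c->head;

        init_cbs(c);                    /* detach the list */
        while (cb) {                    /* strictly in enqueue order */
                struct cb *next = cb->next;

                cb->func(cb);
                cb = next;
        }
}

Bouncing a callback over to some other CPU drops it into a different FIFO,
and the marker then says nothing about the callbacks that preceded it.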

> >>                                               We should only go
> >> RCU-idle from reasonable contexts.  In fact, the nohz crowd likes the
> >> idea of nohz userspace being absolutely no hz.
> >
> > Guilty to charges as read, and this is why RCU needs to be informed
> > about userspace execution for nohz userspace.
> >
> >> NMIs can't queue RCU callbacks, right?  (I hope!)  As of a couple
> >> releases ago, on x86, we *always* have a clean state when
> >> transitioning from any non-NMI kernel context to user mode on x86.
> >
> > You are correct, NMIs cannot queue RCU callbacks.  That would be possible,
> > but it would always be a bit painful and not so good for energy efficiency.
> > So if someone wants that, they need to have an extremely good reason.  ;-)
> 
> Eww, please no.  NMI is already a terrifying scary disaster, and
> keeping it simple would be for the best.

;-) ;-) ;-)

> >> >  Of course, something or another
> >> > will normally eventually disturb the CPU, but the resulting huge delay
> >> > would not be good.  And on deep embedded systems, it is quite possible
> >> > that the CPU would go for a good long time without being disturbed.
> >> > (This is not just a theoretical possibility, and I have the scars to
> >> > prove it.)
> >> >
> >> > And there is this one as well:
> >> >
> >> > 4.      In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace
> >> >         context differently than idle context, and still needs to be
> >> >         able to take two samples and determine if the CPU ever went idle
> >> >         (and only idle, not userspace) betweentimes.
> >>
> >> If the context tracking code, or whatever the hook is, tracked the
> >> number of transitions out of user mode, that would do it, right?
> >> We're talking literally a single per-cpu increment on user entry
> >> and/or exit, I think.
> >
> > With memory barriers, because RCU has to accurately sample the counters
> > remotely.  I currently use full-up atomic operations, possibly only due
> > to paranoia, but I need to further intensify testing to trust moving away
> > from full-up atomic operations.  In addition, RCU currently relies on a
> > single counter counting idle-to-RCU transitions, both userspace and idle.
> > It might be possible to wean RCU of this habit, or maybe have the two
> > counters be combined into a single word.  The checks would of course be
> > more complex in that case, but should be doable.
> 
> Why is idle-to-RCU different from user-to-RCU?

Because of full-system-idle checking, which is supposed to someday allow
CPU 0's scheduling-clock interrupt to be turned off on CONFIG_NO_HZ_FULL
systems.  In this case, RCU treats user as non-idle for the purpose
of determining whether or not CPU 0's scheduling-clock interrupt can
be stopped.  But idle is still idle.  And for purposes of determining
whether the grace period has ended, idle and user are both extended
quiescent states, regardless.
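
The counter scheme, as a standalone sketch with invented names (not the
actual rcu_dynticks code; the even/odd convention here is just an assumption
for illustration):

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <linux/types.h>

/* One per-CPU counter, incremented at every transition into or out of
 * an idle-like state.  Full ordering around the increment lets a remote
 * CPU take two samples and decide whether this CPU passed through (or
 * is sitting in) an extended quiescent state in between. */
static DEFINE_PER_CPU(atomic_t, eqs_seq);

static void eqs_transition(void)
{
        smp_mb__before_atomic();        /* order prior accesses before the flip */
        atomic_inc(this_cpu_ptr(&eqs_seq));
        smp_mb__after_atomic();         /* order later accesses after the flip */
}

/* Remote snapshot of @cpu's counter. */
static int eqs_snap(int cpu)
{
        return atomic_read(per_cpu_ptr(&eqs_seq, cpu));
}

/* Assumed convention: even means "in an extended quiescent state". */
static bool eqs_passed_since(int cpu, int snap)
{
        return !(snap & 0x1) || eqs_snap(cpu) != snap;
}

The sysidle wrinkle is that user and idle would need to be distinguishable
from such samples, hence the thought of a second counter or of combining
the two counters into a single word.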

> I feel like RCU and context tracking are implementing more or less the
> same thing, and the fact that they're not shared makes life
> complicated.
> 
> From my perspective, I want to be able to say "I'm transitioning to
> user mode right now" and "I'm transitioning out of user mode right
> now" and have it Just Work.  In current -tip, on x86_64, we do that
> for literally every non-NMI entry with one stupid racy exception, and
> that racy exception is very much fixable.  I'd prefer not to think
> about whether I'm informing RCU about exiting user mode, informing
> context tracking about exiting user mode, or both.

RCU's tracking was in place for many years before context tracking
appeared.  If we can converge them, well and good, but it really does
have to fully work.

> > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
> > rcu_prepare_for_idle() just before incrementing the counter when
> > transitioning to an idle-like state.  Similarly, in CONFIG_RCU_NOCB_CPU
> > kernels, RCU needs a call to the following in the same place:
> >
> >         for_each_rcu_flavor(rsp) {
> >                 rdp = this_cpu_ptr(rsp->rda);
> >                 do_nocb_deferred_wakeup(rdp);
> >         }
> >
> > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
> > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
> > transition from an idle-like state.
> 
> If my hypothetical "I'm going to userspace now" function did that,
> great!  I call it from a context where percpu variable work and IRQs
> are off.  I further promise not to run any RCU code or enable IRQs
> between calling that function and actually entering user mode.

We would probably need a pair of nonspecific RCU hook functions that do
whatever RCU eventually needs done, but that could hopefully work.

Of course, just having a pair of hook functions assumes that RCU can
be convinced not to care about the difference between irq and process
transitions, or about the difference between idle and user at transition
time (sysidle will continue to care about the difference between idle
and user when remotely checking a given CPU's state).
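
Concretely, something like the following, with the caveat that every name
here is hypothetical and the helper bodies merely stand in for the items
listed earlier (rcu_prepare_for_idle(), the nocb deferred wakeups,
rcu_cleanup_after_idle(), the counter increment):

/* Hypothetical sketch only; none of these functions exist. */
static void rcu_pre_eqs_work(void)
{
        /* rcu_prepare_for_idle(), nocb deferred wakeups, debug, tracing */
}

static void rcu_post_eqs_work(void)
{
        /* rcu_cleanup_after_idle(), debug, tracing */
}

static void rcu_eqs_counter_flip(void)
{
        /* increment the per-CPU transition counter with full ordering */
}

/* Called just before returning to user mode, IRQs off, per-CPU data usable. */
static void rcu_user_enter_hook(void)
{
        rcu_pre_eqs_work();
        rcu_eqs_counter_flip();         /* now in an extended quiescent state */
}

/* Called right after entering the kernel from user mode, IRQs off. */
static void rcu_user_exit_hook(void)
{
        rcu_eqs_counter_flip();         /* no longer in an extended quiescent state */
        rcu_post_eqs_work();
}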

> > There is also some debug that complains if something transitions
> > to/from an idle-like state that shouldn't be doing so, but that could be
> > pulled into context tracking.  (Might already be there, for all I know.)
> > And there is event tracing, which might be subsumed into context tracking.
> > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
> > for the full story.
> >
> > This could all be handled by an RCU hook being invoked just before
> > incrementing the counter on entry to an idle-like state and just after
> > incrementing the counter on exit from an idle-like state.
> >
> > Ah, yes, and interrupts to/from idle.  Some architectures have
> > half-interrupts that never return.
> 
> WTF?  Some architectures are clearly nuts.

Heh.  That was -exactly- my reaction when I first ran into it.

> x86 gets this right, at least in the intel_idle and acpi_idle cases.
> There are no interrupts from RCU idle unless I badly misread the code.

These half-interrupts happen when running non-idle kernel code, for
example when simulating exceptions and system calls from within the
kernel.  I would not be sad to see them go, but while they are here, RCU
must handle them correctly.  If they really truly cannot happen on x86,
then x86 arch code of course need not worry about them.
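
The way RCU forgives a half-interrupt, in sketch form (invented names; the
real mechanism is the DYNTICK_TASK_NEST compound counter mentioned below):
the irq-nesting count is simply reset at process-level entry to an idle-like
state, so an irq entry that never saw a matching exit cannot wedge the count.

#include <linux/percpu.h>

static DEFINE_PER_CPU(long, eqs_irq_nesting);

static void eqs_irq_enter(void)         /* irq-entry bookkeeping */
{
        this_cpu_inc(eqs_irq_nesting);
}

static void eqs_irq_exit(void)          /* may never run for a half-interrupt */
{
        this_cpu_dec(eqs_irq_nesting);
}

/* Process-level entry to an idle-like state: forget unbalanced irq entries. */
static void eqs_process_level_entry(void)
{
        this_cpu_write(eqs_irq_nesting, 0);
}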

> >  RCU uses a compound counter
> > that is zeroed upon process-level entry to an idle-like state to
> > deal with this.  See kernel/rcu/rcu.h, the definitions starting with
> > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
> > the associated comment block.  But maybe context tracking has some
> > other way of handling these beasts?
> >
> > And transitions to idle-like states are not atomic.  In some cases
> > in some configurations, rcu_needs_cpu() says "OK" when asked about
> > stopping the tick, but by the time we get to rcu_prepare_for_idle(),
> > it is no longer OK.  RCU raises softirq to force a replay in these
> > sorts of cases.
> 
> Hmm.  If pending callbacks got kicked to another CPU, would that help?

Well, for CONFIG_RCU_NOCB_CPU_ALL kernels, rcu_needs_cpu() unconditionally
says "OK", so for workloads where that is a reasonable strategy, it works
quite well.  Otherwise, kicking callbacks to other CPUs is extremely
painful at best, and perhaps even impossible, at least if rcu_barrier()
is to work correctly.

>  If that's impossible, when RCU finally detects that the grace period
> is over, could it send an IPI rather than relying on the timer tick?

In CONFIG_RCU_FAST_NO_HZ kernels, when a CPU is going idle, it sets
a timer to catch the end of the grace period in the common case.  That is
the point of the time value passed back from rcu_needs_cpu().  Making the
end-of-grace-period code send IPIs to CPUs that might (or might not)
be in this mode involves too much rummaging through other CPU's states.
Plus people are not complaining that the grace-period kthread is using
too little CPU.
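
The shape of that idle-entry interaction, again as a sketch with invented
names (the real interface is the time value that rcu_needs_cpu() hands back
to the tick-stop path):

/* Invented names; illustrates the decision, not the real nohz code. */
static bool my_rcu_needs_cpu(unsigned long *timeout_jiffies)
{
        /* Return true if the tick must stay on; otherwise optionally set
         * *timeout_jiffies to when RCU would like this CPU poked. */
        *timeout_jiffies = 0;
        return false;
}

static void my_tick_stop_decision(void)
{
        unsigned long tmo;

        if (my_rcu_needs_cpu(&tmo)) {
                /* keep the scheduling-clock tick running */
        } else if (tmo) {
                /* stop the tick, but arm a one-shot timer roughly @tmo
                 * from now so the end of the grace period gets noticed
                 * without anyone having to send an IPI */
        } else {
                /* stop the tick entirely */
        }
}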

Besides, you have to tolerate a CPU catching an interrupt that does a
wakeup just as that CPU was trying to go idle, so the current approach
is not introducing any additional pain from what I can see.

							Thanx, Paul
