Message-ID: <20150717211943.GK3717@linux.vnet.ibm.com>
Date:	Fri, 17 Jul 2015 14:19:43 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	Sasha Levin <sasha.levin@...cle.com>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>, X86 ML <x86@...nel.org>,
	Rik van Riel <riel@...hat.com>
Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 17, 2015 at 1:12 PM, Paul E. McKenney
> <paulmck@...ux.vnet.ibm.com> wrote:
> > On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote:
> >> On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney
> >> <paulmck@...ux.vnet.ibm.com> wrote:
> >> > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote:
> >> >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
> >> >> > For reasons that mystify me a bit, we currently track context tracking
> >> >> > state separately from rcu's watching state.  This results in strange
> >> >> > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
> >> >> > can nest exceptions inside the IRQ handler (an example would be
> >> >> > wrmsr_safe failing), and, in -next, we splat a warning:
> >> >> >
> >> >> > https://gist.github.com/sashalevin/a006a44989312f6835e7
> >> >> >
> >> >> > I'm trying to make context tracking more exact, which will fix this
> >> >> > issue (the particular splat that Sasha hit shouldn't be possible when
> >> >> > I'm done), but I think it would be nice to unify all of this stuff.
> >> >> > Would it be plausible for us to guarantee that RCU state is always in
> >> >> > sync with context tracking state?  If so, we could maybe simplify
> >> >> > things and have fewer state variables.
> >> >>
> >> >> A noble goal.  Might even be possible, and maybe even advantageous.
> >> >>
> >> >> But it is usually easier to say than to do.  RCU really does need to make
> >> >> some adjustments when the state changes, as do the other subsystems.
> >> >> It might or might not be possible to do the transitions atomically.
> >> >> And if the transitions are not atomic, there will still be weird code
> >> >> paths where (say) the processor is considered non-idle, but RCU doesn't
> >> >> realize it yet.  Such a code path could not safely use rcu_read_lock(),
> >> >> so you still need RCU to be able to scream if someone tries it.
> >> >> Contrariwise, if there is a code path where the processor is considered
> >> >> idle, but RCU thinks it is non-idle, that code path can stall
> >> >> grace periods.  (Yes, not a problem if the code path is short enough.
> >> >> At least if the underlying VCPU is making progress...)
> >> >>
> >> >> Still, I cannot prove that it is impossible, and if it is possible,
> >> >> then as you say, there might well be benefits.
> >> >>
> >> >> > Doing this for NMIs might be weird.  Would it make sense to have a
> >> >> > CONTEXT_NMI that's somehow valid even if the NMI happened while
> >> >> > changing context tracking state?
> >> >>
> >> >> Face it, NMIs are weird.  ;-)
> >> >>
> >> >> > Thoughts?  As it stands, I think we might already be broken for real:
> >> >> >
> >> >> > Syscall -> user_exit.  Perf NMI hits *during* user_exit.  Perf does
> >> >> > copy_from_user_nmi, which can fault, causing do_page_fault to get
> >> >> > called, which calls exception_enter(), which can't be a good thing.
> >> >> >
> >> >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
> >> >>
> >> >> Actually, I see more cases where people forget irq_enter() than
> >> >> rcu_nmi_enter().  "We will just nip in quickly and do something without
> >> >> actually letting the irq system know.  Oh, and we want some event tracing
> >> >> in that code path."  Boom!
> >> >>
> >> >> > Thoughts?  As it stands, I need to do something because -tip and thus
> >> >> > -next spews occasional warnings.
> >> >>
> >> >> Tell me more?
> >> >
> >> > And for completeness, RCU also has the following requirements on the
> >> > state-transition mechanism:
> >> >
> >> > 1.      It must be possible to reliably sample some other CPU's state.
> >> >         This is an energy-efficiency requirement, as RCU is not normally
> >> >         permitted to wake up idle CPUs.  Nor nohz CPUs, for that matter.
> >>
> >> NOHZ needs this for vtime accounting, too.  I think Rik might be
> >> thinking about this.  Maybe the underlying state could be shared?
> >
> > From what I understand, what Rik is looking at is accounting information,
> > which is a different type of state.  And a type of state where some
> > approximation is just fine.  Try that with RCU, and you will approximate
> > yourself into a segfault.
> 
> True.  But context tracking wouldn't object to being exact.  And I
> think we need context tracking to treat user mode as quiescent, so
> they're at least related.

And RCU would be happy to be able to always detect usermode execution.
But there are configurations and architectures that exclude context
tracking, which means that RCU has to roll its own in those cases.

> >> > 2.      RCU must be able to track passage through idle and nohz states.
> >> >         In other words, if RCU samples at t=0 and finds that the CPU
> >> >         is executing (say) in kernel mode, and RCU samples again at
> >> >         t=10 and again finds that the CPU is executing in kernel mode,
> >> >         RCU needs to be able to determine whether or not that CPU passed
> >> >         through idle or nohz betweentimes.
> >>
> >> And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the
> >> context tracking stuff notifies RCU.  The thing I'm less than happy
> >> with is that we can currently be CONTEXT_USER but still rcu-awake.
> >> This is manageable, but it seems messy.
> >
> > Well, if you don't have CONFIG_NO_HZ_FULL, there normally isn't context
> > tracking, so RCU cannot see CONTEXT_USER.  Or are you thinking of making
> > context tracking unconditional?  (The tinification guys might have some
> > opinions on this.)
> 
> Without context tracking, user mode is not RCU idle, right?  Instead
> we have timer ticks.  We could get away with a very minimal context
> tracking implementation that just tracked CONTEXT_IDLE and
> CONTEXT_KERNEL.  (Hmm, there is no CONTEXT_IDLE right now.  Further
> grumbling.)

Without context tracking, usermode is still RCU idle.  However, in
that case, RCU detects idle using a hook in the timer tick handler.

> >> > 3.      In some configurations, RCU needs to be able to block entry into
> >> >         nohz state, both for idle and userspace.
> >>
> >> Hmm.  I suppose we could be CONTEXT_USER but still have RCU awake,
> >> although the tick would have to stay on.
> >
> > Right, there are situations where RCU needs a given CPU to keep the tick
> > going, for example, when there are RCU callbacks queued on that CPU.
> > Failing to keep the tick going could result in a system hang, because
> > that callback might never be invoked.
> 
> Can't we just fire the callbacks right away?

Absolutely not!!!

At least not on multi-CPU systems.  There might be an RCU read-side
critical section on some other CPU that we still have to wait for.

>                                               We should only go
> RCU-idle from reasonable contexts.  In fact, the nohz crowd likes the
> idea of nohz userspace being absolutely no hz.

Guilty to charges as read, and this is why RCU needs to be informed
about userspace execution for nohz userspace.

> NMIs can't queue RCU callbacks, right?  (I hope!)  As of a couple
> releases ago, on x86, we *always* have a clean state when
> transitioning from any non-NMI kernel context to user mode on x86.

You are correct, NMIs cannot queue RCU callbacks.  That would be possible,
but it also would be a bit painful and not so good for energy efficiency.
So if someone wants that, they need to have an extremely good reason.  ;-)

> >  Of course, something or another
> > will normally eventually disturb the CPU, but the resulting huge delay
> > would not be good.  And on deep embedded systems, it is quite possible
> > that the CPU would go for a good long time without being disturbed.
> > (This is not just a theoretical possibility, and I have the scars to
> > prove it.)
> >
> > And there is this one as well:
> >
> > 4.      In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace
> >         context differently than idle context, and still needs to be
> >         able to take two samples and determine if the CPU ever went idle
> >         (and only idle, not userspace) betweentimes.
> 
> If the context tracking code, or whatever the hook is, tracked the
> number of transitions out of user mode, that would do it, right?
> We're talking literally a single per-cpu increment on user entry
> and/or exit, I think.

With memory barriers, because RCU has to accurately sample the counters
remotely.  I currently use full-up atomic operations, possibly only due
to paranoia, but I need to further intensify testing to trust moving away
from full-up atomic operations.  In addition, RCU currently relies on a
single counter counting idle-to-RCU transitions, both userspace and idle.
It might be possible to wean RCU of this habit, or maybe have the two
counters be combined into a single word.  The checks would of course be
more complex in that case, but should be doable.
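
To make the sampling requirement concrete, here is a minimal userspace
sketch, not the code in kernel/rcu/tree.c.  It assumes a single counter
that is odd while RCU is watching and even in an idle-like state, and it
uses C11 sequentially consistent atomics to stand in for the kernel's
fully ordered atomic operations.  The names eqs_state, eqs_enter(),
eqs_exit(), eqs_snapshot() and eqs_passed_since() are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state, not the kernel's own data structure. */
struct eqs_state {
	atomic_uint count;	/* odd: RCU watching, even: idle-like */
};

/* Enter an idle-like state (idle or userspace): counter becomes even. */
static void eqs_enter(struct eqs_state *s)
{
	/* Sequentially consistent read-modify-write, so fully ordered. */
	unsigned int old = atomic_fetch_add(&s->count, 1);

	assert(old & 1);	/* complain if we were already idle-like */
}

/* Leave an idle-like state: counter becomes odd. */
static void eqs_exit(struct eqs_state *s)
{
	unsigned int old = atomic_fetch_add(&s->count, 1);

	assert(!(old & 1));	/* complain if RCU was already watching */
}

/* Remote sample: only reads the counter, so the target CPU is not disturbed. */
static unsigned int eqs_snapshot(struct eqs_state *s)
{
	return atomic_load(&s->count);
}

/*
 * True if the CPU was idle-like at the time of the snapshot, or has
 * passed through an idle-like state since then.
 */
static bool eqs_passed_since(struct eqs_state *s, unsigned int snap)
{
	unsigned int cur = atomic_load(&s->count);

	return !(snap & 1) || cur != snap;
}
```

Two samples that both find the CPU non-idle can still be told apart:
the counter has advanced if and only if the CPU went through an
idle-like state betweentimes.  Telling idle apart from userspace, as
the SYSIDLE case requires, would need a second counter or a split word,
which this sketch does not attempt.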

In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
rcu_prepare_for_idle() just before incrementing the counter when
transitioning to an idle-like state.  Similarly, in CONFIG_RCU_NOCB_CPU
kernels, RCU needs a call to the following in the same place:

	for_each_rcu_flavor(rsp) {
		rdp = this_cpu_ptr(rsp->rda);
		do_nocb_deferred_wakeup(rdp);
	}

On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
transition from an idle-like state.

There is also some debug code that complains if something transitions
to/from an idle-like state that shouldn't be doing so, but that could be
pulled into context tracking.  (Might already be there, for all I know.)
And there is event tracing, which might be subsumed into context tracking.
See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
for the full story.

This could all be handled by an RCU hook being invoked just before
incrementing the counter on entry to an idle-like state and just after
incrementing the counter on exit from an idle-like state.
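
The ordering of such a hook relative to the counter can be sketched as
follows.  This is an illustration under assumptions, not an existing
interface: struct eqs_hooks, struct eqs_cpu, the two *_hooked()
functions and demo_hook() are hypothetical names, and the odd/even
counter convention is assumed for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>

struct eqs_cpu;

/* Hypothetical hook table; the names are illustrative only. */
struct eqs_hooks {
	void (*before_enter)(struct eqs_cpu *c);	/* before the entry increment */
	void (*after_exit)(struct eqs_cpu *c);		/* after the exit increment */
};

struct eqs_cpu {
	atomic_uint count;		/* odd: RCU watching, even: idle-like */
	const struct eqs_hooks *hooks;
};

/* Entry to an idle-like state: hook first, then the increment. */
static void eqs_enter_hooked(struct eqs_cpu *c)
{
	if (c->hooks && c->hooks->before_enter)
		c->hooks->before_enter(c);	/* RCU is still watching here */

	unsigned int old = atomic_fetch_add(&c->count, 1);

	assert(old & 1);
}

/* Exit from an idle-like state: increment first, then the hook. */
static void eqs_exit_hooked(struct eqs_cpu *c)
{
	unsigned int old = atomic_fetch_add(&c->count, 1);

	assert(!(old & 1));

	if (c->hooks && c->hooks->after_exit)
		c->hooks->after_exit(c);	/* RCU is watching again */
}

/* Demo hook: counts the calls that ran while the counter was odd. */
static int hook_saw_watching;

static void demo_hook(struct eqs_cpu *c)
{
	if (atomic_load(&c->count) & 1)
		hook_saw_watching++;
}
```

With this ordering, both hooks run while the counter is odd, so
anything they do (including using RCU readers) happens while RCU is
still paying attention to the CPU.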

Ah, yes, and interrupts to/from idle.  Some architectures have
half-interrupts that never return.  RCU uses a compound counter
that is zeroed upon process-level entry to an idle-like state to
deal with this.  See kernel/rcu/rcu.h, the definitions starting with
DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
the associated comment block.  But maybe context tracking has some
other way of handling these beasts?
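
A much-simplified sketch of the zero-on-entry idea follows.  It is not
the encoding used in kernel/rcu/rcu.h: struct nest_state, TASK_LEVEL
and the *_sketch() functions are hypothetical, and a plain bool stands
in for the counter that remote CPUs would sample.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value only: large, so stray irq exits cannot reach zero. */
#define TASK_LEVEL (1L << 20)

/* Hypothetical per-CPU nesting state. */
struct nest_state {
	long nesting;	/* 0: idle-like; otherwise task level plus irqs */
	bool watching;	/* stands in for the remotely sampled counter */
};

static void irq_enter_sketch(struct nest_state *s)
{
	if (s->nesting++ == 0)
		s->watching = true;	/* irq from idle: RCU starts watching */
}

static void irq_exit_sketch(struct nest_state *s)
{
	assert(s->nesting > 0);
	if (--s->nesting == 0)
		s->watching = false;	/* back to the idle-like state */
}

/*
 * Process-level entry to an idle-like state: zero the count instead of
 * decrementing it, which discards any half-interrupts that were entered
 * but never exited.
 */
static void task_idle_enter_sketch(struct nest_state *s)
{
	s->nesting = 0;
	s->watching = false;
}

/* Process-level exit from an idle-like state: restart from a known value. */
static void task_idle_exit_sketch(struct nest_state *s)
{
	s->nesting = TASK_LEVEL;
	s->watching = true;
}
```

An unbalanced irq_enter_sketch() at task level leaves the count one too
high, but the next task_idle_enter_sketch() wipes that out, so the
miscount cannot accumulate across idle periods.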

And transitions to idle-like states are not atomic.  In some cases
in some configurations, rcu_needs_cpu() says "OK" when asked about
stopping the tick, but by the time we get to rcu_prepare_for_idle(),
it is no longer OK.  RCU raises softirq to force a replay in these
sorts of cases.

Hey, you asked!!!  ;-)

							Thanx, Paul
