Message-ID: <CALCETrW8P2T1OMeqAbXGhwW19NewF_kVf_V5m9Tano1Jhi1C2g@mail.gmail.com>
Date:	Fri, 17 Jul 2015 15:13:36 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Paul McKenney <paulmck@...ux.vnet.ibm.com>
Cc:	Sasha Levin <sasha.levin@...cle.com>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>, X86 ML <x86@...nel.org>,
	Rik van Riel <riel@...hat.com>
Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney
<paulmck@...ux.vnet.ibm.com> wrote:
> On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
>> True.  But context tracking wouldn't object to being exact.  And I
>> think we need context tracking to treat user mode as quiescent, so
>> they're at least related.
>
> And RCU would be happy to be able to always detect usermode execution.
> But there are configurations and architectures that exclude context
> tracking, which means that RCU has to roll its own in those cases.
>

We could slowly fix them, perhaps.  I suspect that I'm half-way done
with accidentally enabling it for x86_32 :)

>
>> >> > 3.      In some configurations, RCU needs to be able to block entry into
>> >> >         nohz state, both for idle and userspace.
>> >>
>> >> Hmm.  I suppose we could be CONTEXT_USER but still have RCU awake,
>> >> although the tick would have to stay on.
>> >
>> > Right, there are situations where RCU needs a given CPU to keep the tick
>> > going, for example, when there are RCU callbacks queued on that CPU.
>> > Failing to keep the tick going could result in a system hang, because
>> > that callback might never be invoked.
>>
>> Can't we just fire the callbacks right away?
>
> Absolutely not!!!
>
> At least not on multi-CPU systems.  There might be an RCU read-side
> critical section on some other CPU that we still have to wait for.

Oh, right, obviously.  Could you kick them over to a different CPU, though?

>
>>                                               We should only go
>> RCU-idle from reasonable contexts.  In fact, the nohz crowd likes the
>> idea of nohz userspace being absolutely no hz.
>
> Guilty to charges as read, and this is why RCU needs to be informed
> about userspace execution for nohz userspace.
>
>> NMIs can't queue RCU callbacks, right?  (I hope!)  As of a couple
>> releases ago, on x86, we *always* have a clean state when
>> transitioning from any non-NMI kernel context to user mode.
>
> You are correct, NMIs cannot queue RCU callbacks.  That would be possible,
> but it also would be a bit painful and not so good for energy efficiency.
> So if someone wants that, they need to have an extremely good reason.  ;-)

Eww, please no.  NMI is already a terrifying scary disaster, and
keeping it simple would be for the best.

>
>> >  Of course, something or another
>> > will normally eventually disturb the CPU, but the resulting huge delay
>> > would not be good.  And on deep embedded systems, it is quite possible
>> > that the CPU would go for a good long time without being disturbed.
>> > (This is not just a theoretical possibility, and I have the scars to
>> > prove it.)
>> >
>> > And there is this one as well:
>> >
>> > 4.      In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace
>> >         context differently than idle context, and still needs to be
>> >         able to take two samples and determine if the CPU ever went idle
>> >         (and only idle, not userspace) betweentimes.
>>
>> If the context tracking code, or whatever the hook is, tracked the
>> number of transitions out of user mode, that would do it, right?
>> We're talking literally a single per-cpu increment on user entry
>> and/or exit, I think.
>
> With memory barriers, because RCU has to accurately sample the counters
> remotely.  I currently use full-up atomic operations, possibly only due
> to paranoia, but I need to further intensify testing to trust moving away
> from full-up atomic operations.  In addition, RCU currently relies on a
> single counter counting idle-to-RCU transitions, both userspace and idle.
> It might be possible to wean RCU of this habit, or maybe have the two
> counters be combined into a single word.  The checks would of course be
> more complex in that case, but should be doable.
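
For illustration only, the "two counters combined into a single word" idea
could look roughly like the userspace C11 model below.  This is a sketch,
not kernel code: struct eqs_word, the eqs_*() helpers and the 32/32 bit
split are all made-up names and assumptions, and the default (sequentially
consistent) C11 atomics merely stand in for the full-up atomic operations
and memory barriers mentioned above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical layout (not an existing kernel structure):
 *   bits 63..32: number of entries into idle
 *   bits 31..0:  number of entries into user mode
 */
#define EQS_IDLE_SHIFT 32
#define EQS_USER_MASK  UINT64_C(0xffffffff)

struct eqs_word {
	_Atomic uint64_t word;
};

/* Idle entry: adding 1 << 32 only touches the upper half. */
static void eqs_note_idle_entry(struct eqs_word *w)
{
	atomic_fetch_add(&w->word, UINT64_C(1) << EQS_IDLE_SHIFT);
}

/* User entry: CAS loop so a wrap of the lower half never carries upward. */
static void eqs_note_user_entry(struct eqs_word *w)
{
	uint64_t old = atomic_load(&w->word);
	uint64_t next;

	do {
		uint32_t user = (uint32_t)(old & EQS_USER_MASK) + 1;

		next = (old & ~EQS_USER_MASK) | user;
	} while (!atomic_compare_exchange_weak(&w->word, &old, next));
}

/* Remote sample of both counters in one atomic load. */
static uint64_t eqs_sample(struct eqs_word *w)
{
	return atomic_load(&w->word);
}

/* Did the CPU enter idle (and specifically idle) between two samples? */
static bool eqs_went_idle_between(uint64_t a, uint64_t b)
{
	return (uint32_t)(a >> EQS_IDLE_SHIFT) != (uint32_t)(b >> EQS_IDLE_SHIFT);
}

/* Did the CPU enter either idle-like state between two samples? */
static bool eqs_any_transition_between(uint64_t a, uint64_t b)
{
	return a != b;
}
```

The checks are indeed a bit more complex than comparing one counter, and a
half that wraps all the way around between two samples would go unnoticed,
so a real version would need to bound the time between samples.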

Why is idle-to-RCU different from user-to-RCU?

I feel like RCU and context tracking are implementing more or less the
same thing, and the fact that they're not shared makes life
complicated.

From my perspective, I want to be able to say "I'm transitioning to
user mode right now" and "I'm transitioning out of user mode right
now" and have it Just Work.  In current -tip, on x86_64, we do that
for literally every non-NMI entry with one stupid racy exception, and
that racy exception is very much fixable.  I'd prefer not to think
about whether I'm informing RCU about exiting user mode, informing
context tracking about exiting user mode, or both.
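
To make the "one call, Just Works" idea concrete, here is a rough userspace
C11 model.  Everything in it is hypothetical: struct cpu_ctx, the ctx_*()
names and the convention that an odd counter value means "in the kernel"
are assumptions of this sketch, not existing kernel interfaces.  The point
is only that a single pair of entry/exit calls can update both the
context-tracking state and a counter that a remote CPU can sample.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the context tracking states. */
enum ctx_state { CTX_KERNEL, CTX_USER };

struct cpu_ctx {
	enum ctx_state state;	/* what context tracking would record */
	atomic_ulong seq;	/* odd: in kernel, even: in user mode (sketch convention) */
};

/* "I'm transitioning to user mode right now." */
static void ctx_enter_user_mode(struct cpu_ctx *c)
{
	c->state = CTX_USER;
	atomic_fetch_add(&c->seq, 1);	/* counter becomes even */
}

/* "I'm transitioning out of user mode right now." */
static void ctx_exit_user_mode(struct cpu_ctx *c)
{
	atomic_fetch_add(&c->seq, 1);	/* counter becomes odd */
	c->state = CTX_KERNEL;
}

/* Remote sample, as a grace-period detector on another CPU might take. */
static unsigned long ctx_snapshot(struct cpu_ctx *c)
{
	return atomic_load(&c->seq);
}

/*
 * The CPU has passed through a quiescent state if it was in user mode
 * when the snapshot was taken, or if the counter has moved since.
 */
static bool ctx_passed_quiescent_state(unsigned long snap, unsigned long now)
{
	return !(snap & 1) || snap != now;
}
```

The entry code then makes one call on each side and does not need to know
which subsystem consumes the result.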

>
> In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
> rcu_prepare_for_idle() just before incrementing the counter when
> transitioning to an idle-like state.  Similarly, in CONFIG_RCU_NOCB_CPU
> kernels, RCU needs a call to the following in the same place:
>
>         for_each_rcu_flavor(rsp) {
>                 rdp = this_cpu_ptr(rsp->rda);
>                 do_nocb_deferred_wakeup(rdp);
>         }
>
> On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
> in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
> transition from an idle-like state.

If my hypothetical "I'm going to userspace now" function did that,
great!  I call it from a context where percpu variables work and IRQs
are off.  I further promise not to run any RCU code or enable IRQs
between calling that function and actually entering user mode.
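
Roughly what I have in mind, as a userspace C11 model rather than kernel
code.  The names (struct eqs_cpu, going_to_userspace(),
back_from_userspace(), the probe hooks) are invented for this sketch; the
prepare hook stands in for rcu_prepare_for_idle() plus the deferred no-CBs
wakeup, and the cleanup hook for rcu_cleanup_after_idle().  The only thing
it tries to show is the ordering: prepare runs just before the counter
increment on entry, cleanup just after the increment on exit.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical hook table an RCU-like subsystem would register. */
struct eqs_hooks {
	void (*prepare)(void *arg);	/* run just before the entry increment */
	void (*cleanup)(void *arg);	/* run just after the exit increment */
	void *arg;
};

struct eqs_cpu {
	atomic_ulong seq;		/* transition counter sampled remotely */
	struct eqs_hooks hooks;
};

/* Caller promises: IRQs off, no RCU use between here and user mode. */
static void going_to_userspace(struct eqs_cpu *cpu)
{
	if (cpu->hooks.prepare)
		cpu->hooks.prepare(cpu->hooks.arg);
	atomic_fetch_add(&cpu->seq, 1);
}

static void back_from_userspace(struct eqs_cpu *cpu)
{
	atomic_fetch_add(&cpu->seq, 1);
	if (cpu->hooks.cleanup)
		cpu->hooks.cleanup(cpu->hooks.arg);
}

/* Example hooks that record the counter value each one observes. */
struct eqs_probe {
	struct eqs_cpu *cpu;
	unsigned long at_prepare;
	unsigned long at_cleanup;
};

static void probe_prepare(void *arg)
{
	struct eqs_probe *p = arg;

	p->at_prepare = atomic_load(&p->cpu->seq);
}

static void probe_cleanup(void *arg)
{
	struct eqs_probe *p = arg;

	p->at_cleanup = atomic_load(&p->cpu->seq);
}
```

With the probe hooks installed, the prepare hook sees the counter before
the entry increment and the cleanup hook sees it after the exit increment.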

>
> There is also some debug that complains if something transitions
> to/from an idle-like state that shouldn't be doing so, but that could be
> pulled into context tracking.  (Might already be there, for all I know.)
> And there is event tracing, which might be subsumed into context tracking.
> See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
> for the full story.
>
> This could all be handled by an RCU hook being invoked just before
> incrementing the counter on entry to an idle-like state and just after
> incrementing the counter on exit from an idle-like state.
>
> Ah, yes, and interrupts to/from idle.  Some architectures have
> half-interrupts that never return.

WTF?  Some architectures are clearly nuts.

x86 gets this right, at least in the intel_idle and acpi_idle cases.
There are no interrupts from RCU idle unless I badly misread the code.

>  RCU uses a compound counter
> that is zeroed upon process-level entry to an idle-like state to
> deal with this.  See kernel/rcu/rcu.h, the definitions starting with
> DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
> the associated comment block.  But maybe context tracking has some
> other way of handling these beasts?
>
> And transitions to idle-like states are not atomic.  In some cases
> in some configurations, rcu_needs_cpu() says "OK" when asked about
> stopping the tick, but by the time we get to rcu_prepare_for_idle(),
> it is no longer OK.  RCU raises softirq to force a replay in these
> sorts of cases.

Hmm.  If pending callbacks got kicked to another CPU, would that help?
If that's impossible, when RCU finally detects that the grace period
is over, could it send an IPI rather than relying on the timer tick?

--Andy