Message-ID: <CALCETrV3jaMyMskMiZGVi-wpTCjFkibVBVFohoGDAim8Ky6L6g@mail.gmail.com>
Date:	Fri, 17 Jul 2015 16:20:57 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Paul McKenney <paulmck@...ux.vnet.ibm.com>
Cc:	Sasha Levin <sasha.levin@...cle.com>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>, X86 ML <x86@...nel.org>,
	Rik van Riel <riel@...hat.com>
Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 3:55 PM, Paul E. McKenney
<paulmck@...ux.vnet.ibm.com> wrote:
> On Fri, Jul 17, 2015 at 03:13:36PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney
>> <paulmck@...ux.vnet.ibm.com> wrote:
>> > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
>> >> True.  But context tracking wouldn't object to being exact.  And I
>> >> think we need context tracking to treat user mode as quiescent, so
>> >> they're at least related.
>> >
>> > And RCU would be happy to be able to always detect usermode execution.
>> > But there are configurations and architectures that exclude context
>> > tracking, which means that RCU has to roll its own in those cases.
>>
>> We could slowly fix them, perhaps.  I suspect that I'm half-way done
>> with accidentally enabling it for x86_32 :)
>
> If there was an appropriate Kconfig variable, I could do things one way
> or the other, depending on what the architecture was doing.

There's CONFIG_CONTEXT_TRACKING.

IMO it would be nice if there was a sort of clear spec for what
promises an arch needs to make for proper RCU operation and what
additional promises it needs to make for RCU idle during user mode.

>
> So you are -unconditionally- enabling context tracking for x86_32?
> Doesn't that increase kernel-user transition overhead?

No, I'm just adding the user_enter and user_exit calls.  This might
let us enable HAVE_CONTEXT_TRACKING.  Someone still needs to flip it
on.
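
To illustrate why adding the calls need not cost much by itself, here is a
userspace toy model, not kernel code.  It assumes the hooks are gated on a
runtime enable switch, so that with tracking off each call is just a test
and an early return.  The names ct_model, model_user_enter() and
model_user_exit() are made up for this sketch; only CONTEXT_USER and
CONTEXT_KERNEL mirror the kernel's state names.

```c
#include <assert.h>
#include <stdbool.h>

enum ctx_state { CONTEXT_KERNEL, CONTEXT_USER };

/* Hypothetical per-CPU model of context tracking. */
struct ct_model {
	bool enabled;                 /* stands in for the runtime enable switch */
	enum ctx_state state;         /* what context tracking believes */
	unsigned int slow_path_calls; /* how often real work was done */
};

/* Model of the hook called just before returning to user mode. */
void model_user_enter(struct ct_model *ct)
{
	if (!ct->enabled)
		return;               /* tracking off: nothing but this branch */
	assert(ct->state == CONTEXT_KERNEL);
	ct->state = CONTEXT_USER;
	ct->slow_path_calls++;
}

/* Model of the hook called just after entering the kernel from user mode. */
void model_user_exit(struct ct_model *ct)
{
	if (!ct->enabled)
		return;
	assert(ct->state == CONTEXT_USER);
	ct->state = CONTEXT_KERNEL;
	ct->slow_path_calls++;
}
```

With enabled left false, a round trip through both hooks changes nothing,
which is the "someone still needs to flip it on" case.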

>
>> >> >> > 3.      In some configurations, RCU needs to be able to block entry into
>> >> >> >         nohz state, both for idle and userspace.
>> >> >>
>> >> >> Hmm.  I suppose we could be CONTEXT_USER but still have RCU awake,
>> >> >> although the tick would have to stay on.
>> >> >
>> >> > Right, there are situations where RCU needs a given CPU to keep the tick
>> >> > going, for example, when there are RCU callbacks queued on that CPU.
>> >> > Failing to keep the tick going could result in a system hang, because
>> >> > that callback might never be invoked.
>> >>
>> >> Can't we just fire the callbacks right away?
>> >
>> > Absolutely not!!!
>> >
>> > At least not on multi-CPU systems.  There might be an RCU read-side
>> > critical section on some other CPU that we still have to wait for.
>>
>> Oh, right, obviously.  Could you kick them over to a different CPU, though?
>
> For CONFIG_RCU_NOCB_CPU=y kernels, no problem, they will execute on
> whatever CPU the scheduler or the sysadm chooses, as the case may be.
> Otherwise, it gets really hard to make sure that a given CPU's callbacks
> execute in order, which is required for rcu_barrier() to work properly.
>
> There was some thought of making CONFIG_RCU_NOCB_CPU=y be the only
> way that RCU callbacks were invoked, but the overhead is higher and
> turns out to be all too noticeable on some workloads.  :-(
>

Yuck.  What if we make CONFIG_RCU_NOCB_CPU=y mandatory for NOHZ_FULL?
In any case, the people who want their systems to be really truly
quiescent in user space will have CONFIG_RCU_NOCB_CPU=y.

>>
>> Why is idle-to-RCU different from user-to-RCU?
>
> Full-system-idle checking that is supposed to someday allow CPU 0's
> scheduling-clock interrupt to be turned off on CONFIG_NO_HZ_FULL
> systems.  In this case, RCU treats user as non-idle for the purpose
> of determining whether or not CPU 0's scheduling-clock interrupt can
> be stopped.

I'm not quite sure I understand this.  We can turn off the tick if
truly idle but not if in user mode?  Why?


>
>> I feel like RCU and context tracking are implementing more or less the
>> same thing, and the fact that they're not shared makes life
>> complicated.
>>
>> From my perspective, I want to be able to say "I'm transitioning to
>> user mode right now" and "I'm transitioning out of user mode right
>> now" and have it Just Work.  In current -tip, on x86_64, we do that
>> for literally every non-NMI entry with one stupid racy exception, and
>> that racy exception is very much fixable.  I'd prefer not to think
>> about whether I'm informing RCU about exiting user mode, informing
>> context tracking about exiting user mode, or both.
>
> RCU's tracking was in place for many years before context tracking
> appeared.  If we can converge them, well and good, but it really does
> have to fully work.

True.

One of my goals is to perfect coverage of the entering-usermode and
exiting-usermode callbacks on x86.  What those callbacks are called
and what they do is up for debate, but I intend to call them.  In
fact, I intend to promise that we will never execute non-NMI kernel
code with IRQs on at *any* point where I haven't called the
appropriate callback to tell the kernel that I'm executing kernel
code.

The splat that Sasha got was an assertion I added to help validate
this promise, but I'm asserting sort-of the wrong thing.  Currently
there's a narrow window in which RCU knows we're non-idle but context
tracking still thinks we're in user mode, and Sasha got it to take an
interrupt there.  I can change the assertion (it's currently
harmless), or I can fix it (needs a bit more work, and Ingo is
currently swamped so it'll be a couple weeks).
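
To make the window concrete, here is a userspace toy model, not kernel
code.  It assumes kernel entry happens in two steps, RCU first and context
tracking second, and that an interrupt can land in between.  The struct
and all function names are made up for this sketch; only CONTEXT_USER and
CONTEXT_KERNEL mirror the kernel's state names.

```c
#include <assert.h>
#include <stdbool.h>

enum ctx_state { CONTEXT_KERNEL, CONTEXT_USER };

/* Hypothetical per-CPU view of the two trackers. */
struct entry_model {
	bool rcu_watching;     /* RCU considers this CPU non-idle */
	enum ctx_state ctx;    /* what context tracking believes */
};

/* Step 1 of kernel entry: RCU learns the CPU left its idle-like state. */
void model_rcu_exit_eqs(struct entry_model *cpu)
{
	cpu->rcu_watching = true;
}

/* Step 2 of kernel entry: context tracking catches up. */
void model_ctx_set_kernel(struct entry_model *cpu)
{
	cpu->ctx = CONTEXT_KERNEL;
}

/* The over-strict check: kernel code must see CONTEXT_KERNEL. */
bool strict_check(const struct entry_model *cpu)
{
	return cpu->ctx == CONTEXT_KERNEL;
}

/* A relaxed check: all that matters is that RCU is watching. */
bool relaxed_check(const struct entry_model *cpu)
{
	return cpu->rcu_watching;
}

/* An interrupt handler that only insists on the relaxed condition. */
void model_irq_handler(const struct entry_model *cpu)
{
	assert(relaxed_check(cpu));
}
```

Between the two steps the strict check fails while the relaxed one holds,
which is why the state is harmless even though the assertion fires.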

>
>> > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
>> > rcu_prepare_for_idle() just before incrementing the counter when
>> > transitioning to an idle-like state.  Similarly, in CONFIG_RCU_NOCB_CPU
>> > kernels, RCU needs a call to the following in the same place:
>> >
>> >         for_each_rcu_flavor(rsp) {
>> >                 rdp = this_cpu_ptr(rsp->rda);
>> >                 do_nocb_deferred_wakeup(rdp);
>> >         }
>> >
>> > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
>> > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
>> > transition from an idle-like state.
>>
>> If my hypothetical "I'm going to userspace now" function did that,
great!  I call it from a context where percpu variables work and IRQs
>> are off.  I further promise not to run any RCU code or enable IRQs
>> between calling that function and actually entering user mode.
>
> Probably need a pair of nonspecific RCU hook functions that do whatever
> RCU eventually needs to be done, but that could hopefully work.
>
> Of course, just having a pair of hook functions assumes that RCU can
> be convinced to not care about the difference between irq and process
> transitions, and the difference between idle and user (at transition,
> sysidle will continue to care about the difference between idle and
> user when remotely checking a given CPU's state).

It can be more complex than just a pair of functions.  I could call
idle_to_kernel() and user_to_kernel(), both with IRQs off and exactly
at the right times.

There may be architectures that deliver IRQs directly from idle, which
would mean that IRQ handlers would need care to choose the right hook,
but x86 is not one of those architectures.
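
A rough userspace sketch of what such hooks could look like, under stated
assumptions: IRQs are off at every call, the counter is modeled as even
while idle-like and odd otherwise, and the comments mark where the RCU
work described above would go.  idle_to_kernel() and user_to_kernel() are
the names I floated; kernel_to_idle(), kernel_to_user(), eqs_model and
the two helpers are invented for the model.

```c
#include <assert.h>
#include <stdbool.h>

enum eqs_kind { EQS_NONE, EQS_IDLE, EQS_USER };

/* Hypothetical per-CPU state for the model. */
struct eqs_model {
	bool irqs_off;
	unsigned long dynticks; /* modeled: even = idle-like, odd = not */
	enum eqs_kind eqs;      /* which idle-like state we are in, if any */
};

/* Common exit path: increment first, then run the "after" hook. */
void model_exit_eqs(struct eqs_model *cpu, enum eqs_kind from)
{
	assert(cpu->irqs_off);
	assert(cpu->eqs == from);
	cpu->dynticks++;
	assert(cpu->dynticks & 1);
	cpu->eqs = EQS_NONE;
	/* rcu_cleanup_after_idle()-style work would go here */
}

/* Common entry path: run the "before" hook, then increment. */
void model_enter_eqs(struct eqs_model *cpu, enum eqs_kind to)
{
	assert(cpu->irqs_off);
	assert(cpu->eqs == EQS_NONE);
	/* rcu_prepare_for_idle()-style work would go here */
	cpu->eqs = to;
	cpu->dynticks++;
	assert(!(cpu->dynticks & 1));
}

void idle_to_kernel(struct eqs_model *cpu) { model_exit_eqs(cpu, EQS_IDLE); }
void user_to_kernel(struct eqs_model *cpu) { model_exit_eqs(cpu, EQS_USER); }
void kernel_to_idle(struct eqs_model *cpu) { model_enter_eqs(cpu, EQS_IDLE); }
void kernel_to_user(struct eqs_model *cpu) { model_enter_eqs(cpu, EQS_USER); }
```

Keeping separate idle and user entry points preserves the distinction
that sysidle still cares about, while the arch code only has to call the
right one at the right time.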

>
>> > There is also some debug that complains if something transitions
>> > to/from an idle-like state that shouldn't be doing so, but that could be
>> > pulled into context tracking.  (Might already be there, for all I know.)
>> > And there is event tracing, which might be subsumed into context tracking.
>> > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
>> > for the full story.
>> >
>> > This could all be handled by an RCU hook being invoked just before
>> > incrementing the counter on entry to an idle-like state and just after
>> > incrementing the counter on exit from an idle-like state.
>> >
>> > Ah, yes, and interrupts to/from idle.  Some architectures have
>> > half-interrupts that never return.
>>
>> WTF?  Some architectures are clearly nuts.
>
> Heh.  That was -exactly- my reaction when I first ran into it.
>
>> x86 gets this right, at least in the intel_idle and acpi_idle cases.
>> There are no interrupts from RCU idle unless I badly misread the code.
>
> These half-interrupts happen when running non-idle in kernel code.
> Things like simulating exceptions and system calls from within the kernel.
> I would not be sad to see it go, but while it is here, RCU must handle
> it correctly.  If it really truly cannot happen for x86, then x86
> arch code of course need not worry about it.

On x86, you can simulate a syscall by calling the syscall body.  I
clearly don't understand this weirdness...

>
>> >  RCU uses a compound counter
>> > that is zeroed upon process-level entry to an idle-like state to
>> > deal with this.  See kernel/rcu/rcu.h, the definitions starting with
>> > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
>> > the associated comment block.  But maybe context tracking has some
>> > other way of handling these beasts?
>> >
>> > And transitions to idle-like states are not atomic.  In some cases
>> > in some configurations, rcu_needs_cpu() says "OK" when asked about
>> > stopping the tick, but by the time we get to rcu_prepare_for_idle(),
>> > it is no longer OK.  RCU raises softirq to force a replay in these
>> > sorts of cases.
>>
>> Hmm.  If pending callbacks got kicked to another CPU, would that help?
>
> Well, for CONFIG_RCU_NOCB_CPU_ALL kernels, rcu_needs_cpu() unconditionally
> says "OK", so for workloads where that is a reasonable strategy, it works
> quite well.  Otherwise, kicking callbacks to other CPUs is extremely
> painful at best, and perhaps even impossible, at least if rcu_barrier()
> is to work correctly.
>
>>  If that's impossible, when RCU finally detects that the grace period
>> is over, could it send an IPI rather than relying on the timer tick?
>
> In CONFIG_RCU_FAST_NO_HZ kernels, when the CPU is going idle, it sets
> a timer to catch end of the grace period in the common case.  This is
> the point of the time passed back from rcu_needs_cpu().  Making the
> end-of-grace-period code send IPIs to CPUs that might (or might not)
> be in this mode involves too much rummaging through other CPU's states.
> Plus people are not complaining that the grace-period kthread is using
> too little CPU.
>
> Besides, you have to tolerate a CPU catching an interrupt that does a
> wakeup just as that CPU was trying to go idle, so the current approach
> is not introducing any additional pain from what I can see.

Except that, in that case, we defer idle exactly long enough to handle
the interrupt and then either schedule or go idle for real (or have
another interrupt).  There's no weird state where we have no work to
do right now but still can't become idle.

--Andy