Message-ID: <CALCETrWAVTjsKwih06GeK237w7RLSE2D2+naiunA=VFEJY1meQ@mail.gmail.com>
Date:   Wed, 20 May 2020 08:36:06 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     "Paul E. McKenney" <paulmck@...nel.org>
Cc:     Andy Lutomirski <luto@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Alexandre Chartre <alexandre.chartre@...cle.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <sean.j.christopherson@...el.com>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Petr Mladek <pmladek@...e.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Joel Fernandes <joel@...lfernandes.org>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        Juergen Gross <jgross@...e.com>,
        Brian Gerst <brgerst@...il.com>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Will Deacon <will@...nel.org>,
        Tom Lendacky <thomas.lendacky@....com>,
        Wei Liu <wei.liu@...nel.org>,
        Michael Kelley <mikelley@...rosoft.com>,
        Jason Chen CJ <jason.cj.chen@...el.com>,
        Zhao Yakui <yakui.zhao@...el.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>
Subject: Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()

On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@...utronix.de> wrote:
> > >
> > > Andy Lutomirski <luto@...nel.org> writes:
> > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@...utronix.de> wrote:
> > > >> Thomas Gleixner <tglx@...utronix.de> writes:
> > > >> It's about this:
> > > >>
> > > >> rcu_nmi_enter()
> > > >> {
> > > >>         if (!rcu_is_watching()) {
> > > >>             make it watch;
> > > >>         } else if (!in_nmi()) {
> > > >>             do_magic_nohz_dyntick_muck();
> > > >>         }
> > > >> }
> > > >>
> > > >> So if we make all irq/system vector entries conditional, then
> > > >> do_magic() never gets executed. After that I got lost...
> > > >
> > > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > > to *all* entries -- just the not-super-magic ones that use
> > > > idtentry_enter().
> > > >
> > > > Paul, what is this code actually trying to do?
> > >
> > > Citing Paul from IRC:
> > >
> > >   "The way things are right now, you can leave out the rcu_irq_enter()
> > >    if this is not a nohz_full CPU.
> > >
> > >    Or if this is a nohz_full CPU, and the tick is already
> > >    enabled, in that case you could also leave out the rcu_irq_enter().
> > >
> > >    Or even if this is a nohz_full CPU and it does not have the tick
> > >    enabled, if it has been in the kernel less than a few tens of
> > >    milliseconds, still OK to avoid invoking rcu_irq_enter()
> > >
> > >    But my guess is that it would be a lot simpler to just always call
> > >    it."
> > >
> > > Hope that helps.
> >
> > Maybe?
> >
> > Unless I've missed something, the effect here is that #PF hitting in
> > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > (because you converted them) as well as other faults and traps will
> > call rcu_irq_enter().
> >
> > Once upon a time, we did this horrible thing where, on entry from user
> > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > means we could get an IRQ in an extended quiescent state.  This means
> > that the IRQ code had to end the EQS so that IRQ handlers could use
> > RCU.  But I killed this a few years ago -- x86 Linux now has a rule
> > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > idle code.
> >
> > In my dream world, we would never ever get IRQs while in an EQS -- we
> > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > interrupt.  But I guess we still need to support HLT, which means we
> > have this mess.
> >
> > But I still think we can plausibly get rid of the conditional.
>
> You mean the conditional in rcu_nmi_enter()?  In a NO_HZ_FULL=n system,
> this becomes:

So, I meant the conditional in tglx's patch that makes page faults special.
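
To be concrete, "the conditional" I have in mind is something with
roughly this shape -- pseudocode of my mental model, not a quote of the
actual patch, although enter_from_user_mode(), __rcu_is_watching() and
rcu_irq_enter() are the real helpers I mean:

        if (user_mode(regs)) {
                enter_from_user_mode();  /* context tracking covers RCU */
        } else if (!__rcu_is_watching()) {
                /* came from idle or some other EQS: RCU must be told */
                rcu_irq_enter();
        } else {
                /*
                 * RCU is already watching: skip rcu_irq_enter(), and
                 * with it the nohz_full tick magic in rcu_nmi_enter().
                 */
        }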

>
> >                                                                 If we
> > get an IRQ or (egads!) a fault in idle context, we'll have
> > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
>
> Or we could be early in the kernel-entry code or late in the kernel-exit
> code, but as far as I know, preemption is disabled on those code paths.
> As are interrupts, right?  And interrupts are disabled on the portions
> of the CPU-hotplug code where RCU is not watching, if I recall correctly.

Interrupts are off in the parts of the entry/exit that RCU considers
to be user mode.  We can get various faults, although these should be
either NMI-like or events that genuinely or effectively happened in
user mode.
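Illustratively, the ordering I mean on the return-to-user path is just
this -- a sketch of the sequencing, not the actual code, though
local_irq_disable() and user_enter_irqoff() are the real helpers:

        local_irq_disable();
        user_enter_irqoff();    /* RCU now treats this CPU as in user mode */
        /*
         * IRQs stay disabled from here until the hardware transition to
         * user mode, so no interrupt can arrive while RCU thinks we are
         * in user mode; only NMI-like events, or faults that genuinely
         * or effectively happened in user mode, can hit this window.
         */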

>
> A nohz_full CPU does not enable the scheduling-clock interrupt upon
> entry to the kernel.  Normally, this is fine because that CPU will very
> quickly exit back to nohz_full userspace execution, so that RCU will
> see the quiescent state, either by sampling it directly or by deducing
> the CPU's passage through that quiescent state by comparing with state
> that was captured earlier.  The grace-period kthread notices the lack
> of a quiescent state and will eventually set ->rcu_urgent_qs to
> trigger this code.
>
> But if the nohz_full CPU stays in the kernel for an extended time,
> perhaps due to OOM handling or due to processing of some huge I/O that
> hits in-memory buffers/cache, then RCU needs some way of detecting
> quiescent states on that CPU.  This requires the scheduling-clock
> interrupt to be alive and well.
>
> Are there other ways to get this done?  But of course!  RCU could
> for example use smp_call_function_single() or use workqueues to force
> execution onto that CPU and enable the tick that way.  This gets a
> little involved in order to avoid deadlock, but if the added check
> in rcu_nmi_enter() is causing trouble, something can be arranged.
> Though that something would cause more latency excursions than
> does the current code.
>
> Or did you have something else in mind?
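
For reference, my understanding of the "added check" in rcu_nmi_enter()
that you mention -- the do_magic_nohz_dyntick_muck() bit from Thomas's
sketch -- is roughly the following, reconstructed from memory, so the
exact condition may be off:

        } else if (tick_nohz_full_cpu(rdp->cpu) &&
                   READ_ONCE(rdp->rcu_urgent_qs) &&
                   !READ_ONCE(rdp->rcu_forced_tick)) {
                /*
                 * The grace-period kthread flagged this CPU as holding
                 * up a grace period; force the scheduling-clock tick
                 * back on so RCU can observe quiescent states here.
                 */
                WRITE_ONCE(rdp->rcu_forced_tick, true);
                tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
        }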

I'm trying to understand when we actually need to call the function.
Is the scheduling-clock interrupt the only thing that's supposed to
call rcu_irq_enter()?  But on a nohz_full CPU the scheduling-clock
interrupt is off, so I'm confused.
