linux-kernel - Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1ec4dc29-8868-4d82-8c5e-c17ad025bc22@paulmck-laptop>
Date: Thu, 22 Feb 2024 11:11:34 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Mark Rutland <mark.rutland@....com>
Cc: Steven Rostedt <rostedt@...dmis.org>,
	Ankur Arora <ankur.a.arora@...cle.com>,
	linux-kernel@...r.kernel.org, tglx@...utronix.de,
	peterz@...radead.org, torvalds@...ux-foundation.org,
	akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
	dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	willy@...radead.org, mgorman@...e.de, jpoimboe@...nel.org,
	jgross@...e.com, andrew.cooper3@...rix.com, bristot@...nel.org,
	mathieu.desnoyers@...icios.com, glaubitz@...sik.fu-berlin.de,
	anton.ivanov@...bridgegreys.com, mattst88@...il.com,
	krypton@...ich-teichert.org, David.Laight@...lab.com,
	richard@....at, jon.grimm@....com, bharata@....com,
	boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > "Paul E. McKenney" <paulmck@...nel.org> wrote:
> > > 
> > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > > false, then we need to do more work. Of course that means modules will also
> > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > only done if the __is_kernel() check fails, maybe it's not that bad.  
> > > > 
> > > > I do like very much that idea, but it requires that we be able to identify
> > > > this instruction pointer perfectly, no matter what.  It might also require
> > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > for example, if the preemption was triggered within an interrupt handler.
> > > > And interrupts from softirq environments might require identifying an
> > > > additional level of IRQ return address.  The original IRQ might have
> > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > might have instigated a preemption.
> > > 
> > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > executing code. Otherwise it should never be running from a trampoline.
> > 
> > Yes, the first interrupt interrupted a trampoline.  Then, on return,
> > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > While a softirq handler was executing within a trampoline, we got
> > another interrupt.  We thus have two interrupted trampolines.
> > 
> > Or am I missing something that prevents this?
> 
> Surely the problematic case is where the first interrupt is taken from a
> trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> innermost interrupt context is a trampoline, that's the same as that without
> any nesting.

It depends.  If we wait for each task to not have a trampoline in effect
then yes, we only need to know whether or not a given task has at least
one trampoline in use.  One concern with this approach is that a given
task might have at least one trampoline in effect every time it is
checked, unlikely though that might seem.

If this is a problem, one way around it is to instead ask whether the
current task still has a reference to one of a set of trampolines that
has recently been removed.  This avoids the problem of a task always
being one some trampoline or another, but requires exact identification
of any and all trampolines a given task is currently using.

Either way, we need some way of determining whether or not a given
PC value resides in a trampoline.  This likely requires some data
structure (hash table?  tree?  something else?) that must be traversed
in order to carry out that determination.  Depending on the traversal
overhead, it might (or might not) be necessary to make sure that the
traversal is not on the entry/exit/scheduler fast paths.  It is also
necessary to keep the trampoline-use overhead low and the trampoline
call points small.

> We could handle nesting with a thread flag (e.g. TIF_IN_TRAMPOLINE) and a flag
> in irqentry_state_t (which is on the stack, and so each nested IRQ gets its
> own):
> 
> * At IRQ exception entry, if TIF_IN_TRAMPOLINE is clear and pt_regs::ip is a
>   trampoline, set TIF_IN_TRAMPOLINE and irqentry_state_t::entered_trampoline.
> 
> * At IRQ exception exit, if irqentry_state_t::entered_trampoline is set, clear
>   TIF_IN_TRAMPOLINE.
> 
> That naturally nests since the inner IRQ sees TIF_IN_TRAMPOLINE is already set
> and does nothing on entry or exit, and anything imbetween can inspect
> TIF_IN_TRAMPOLINE and see the right value.

If the overhead of determining whether pt_regs::ip is a trampoline
is sufficiently low, sounds good to me!  I suppose that different
architectures might have different answers to this question, just to
keep things entertaining.  ;-)

> On arm64 we don't dynamically allocate trampolines, *but* we potentially have a
> similar problem when changing the active ftrace_ops for a callsite, as all
> callsites share a common trampoline in the kernel text which reads a pointer to
> an ftrace_ops out of the callsite, then reads ftrace_ops::func from that.
> 
> Since the ops could be dynamically allocated, we want to wait for reads of that
> to complete before reusing the memory, and ideally we wouldn't have new
> entryies into the func after we think we'd completed the transition. So Tasks
> RCU might be preferable as it waits for both the trampoline *and* the func to
> complete.

OK, I am guessing that it is easier to work out whether pt_regs::ip is
a trampoline than cases involving dyanamic allocation of trampolines.

> > > > Are there additional levels or mechanisms requiring identifying
> > > > return addresses?
> > > 
> > > Hmm, could we add to irq_enter_rcu()
> > > 
> > > 	__this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));
> > > 
> > > That is to save off were the ip was when it was interrupted.
> > > 
> > > Hmm, but it looks like the get_irq_regs() is set up outside of
> > > irq_enter_rcu() :-(
> > > 
> > > I wonder how hard it would be to change all the architectures to pass in
> > > pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
> > > available.
> > > 
> > > Either way, it looks like it will be a bit of work around the trampoline or
> > > around RCU to get this efficiently done.
> > 
> > One approach would be to make Tasks RCU be present for PREEMPT_AUTO
> > kernels as well as PREEMPTIBLE kernels, and then, as architectures provide
> > the needed return-address infrastructure, transition those architectures
> > to something more precise.
> 
> FWIW, that sounds good to me.

Thank you, and I will Cc you on the patches.

							Thanx, Paul