Message-ID: <ZddtKszRH5Ak5tZ7@FVFF77S0Q05N>
Date: Thu, 22 Feb 2024 15:50:02 +0000
From: Mark Rutland <mark.rutland@....com>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Steven Rostedt <rostedt@...dmis.org>,
	Ankur Arora <ankur.a.arora@...cle.com>,
	linux-kernel@...r.kernel.org, tglx@...utronix.de,
	peterz@...radead.org, torvalds@...ux-foundation.org,
	akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
	dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	willy@...radead.org, mgorman@...e.de, jpoimboe@...nel.org,
	jgross@...e.com, andrew.cooper3@...rix.com, bristot@...nel.org,
	mathieu.desnoyers@...icios.com, glaubitz@...sik.fu-berlin.de,
	anton.ivanov@...bridgegreys.com, mattst88@...il.com,
	krypton@...ich-teichert.org, David.Laight@...lab.com,
	richard@....at, jon.grimm@....com, bharata@....com,
	boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > On Wed, 21 Feb 2024 11:41:47 -0800
> > "Paul E. McKenney" <paulmck@...nel.org> wrote:
> > 
> > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > something that was allocated? That is, if __is_kernel(addr) returns
> > > > false, then we need to do more work. Of course that means modules will also
> > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > work and may cause too much overhead. But who knows, if the module check is
> > > > only done if the __is_kernel() check fails, maybe it's not that bad.  
> > > 
> > > I do like very much that idea, but it requires that we be able to identify
> > > this instruction pointer perfectly, no matter what.  It might also require
> > > that we be able to perfectly identify any IRQ return addresses as well,
> > > for example, if the preemption was triggered within an interrupt handler.
> > > And interrupts from softirq environments might require identifying an
> > > additional level of IRQ return address.  The original IRQ might have
> > > interrupted a trampoline, and then after transitioning into softirq,
> > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > might have instigated a preemption.
> > 
> > Note, softirqs still require a real interrupt to happen in order to preempt
> > executing code. Otherwise it should never be running from a trampoline.
> 
> Yes, the first interrupt interrupted a trampoline.  Then, on return,
> that interrupt transitioned to softirq (as opposed to ksoftirqd).
> While a softirq handler was executing within a trampoline, we got
> another interrupt.  We thus have two interrupted trampolines.
> 
> Or am I missing something that prevents this?

Surely the problematic case is where the first interrupt is taken from a
trampoline, but the inner interrupt is taken from not-a-trampoline? If the
innermost interrupt context is a trampoline, that's the same as the case
without any nesting.

We could handle nesting with a thread flag (e.g. TIF_IN_TRAMPOLINE) and a flag
in irqentry_state_t (which is on the stack, and so each nested IRQ gets its
own):

* At IRQ exception entry, if TIF_IN_TRAMPOLINE is clear and pt_regs::ip is a
  trampoline, set TIF_IN_TRAMPOLINE and irqentry_state_t::entered_trampoline.

* At IRQ exception exit, if irqentry_state_t::entered_trampoline is set, clear
  TIF_IN_TRAMPOLINE.

That naturally nests since the inner IRQ sees TIF_IN_TRAMPOLINE is already set
and does nothing on entry or exit, and anything in between can inspect
TIF_IN_TRAMPOLINE and see the right value.
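
Roughly, as a sketch only (is_trampoline_text() and these helper names are
made up here, not existing interfaces):

static void irqentry_note_trampoline(struct pt_regs *regs,
				     irqentry_state_t *state)
{
	/* Only the outermost IRQ taken from a trampoline sets the flag. */
	if (!test_thread_flag(TIF_IN_TRAMPOLINE) &&
	    is_trampoline_text(instruction_pointer(regs))) {
		set_thread_flag(TIF_IN_TRAMPOLINE);
		state->entered_trampoline = true;
	}
}

static void irqexit_clear_trampoline(irqentry_state_t *state)
{
	/* Nested IRQs never set entered_trampoline, so this nests safely. */
	if (state->entered_trampoline)
		clear_thread_flag(TIF_IN_TRAMPOLINE);
}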

On arm64 we don't dynamically allocate trampolines, *but* we potentially have a
similar problem when changing the active ftrace_ops for a callsite, as all
callsites share a common trampoline in the kernel text which reads a pointer to
an ftrace_ops out of the callsite, then reads ftrace_ops::func from that.

Since the ops could be dynamically allocated, we want to wait for reads of that
to complete before reusing the memory, and ideally we wouldn't have new
entries into the func after we think we'd completed the transition. So Tasks
RCU might be preferable as it waits for both the trampoline *and* the func to
complete.

> > > Are there additional levels or mechanisms requiring identifying
> > > return addresses?
> > 
> > Hmm, could we add to irq_enter_rcu()
> > 
> > 	__this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));
> > 
> > That is, to save off where the ip was when it was interrupted.
> > 
> > Hmm, but it looks like the get_irq_regs() is set up outside of
> > irq_enter_rcu() :-(
> > 
> > I wonder how hard it would be to change all the architectures to pass in
> > pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
> > available.
> > 
> > Either way, it looks like it will be a bit of work around the trampoline or
> > around RCU to get this efficiently done.
> 
> One approach would be to make Tasks RCU be present for PREEMPT_AUTO
> kernels as well as PREEMPTIBLE kernels, and then, as architectures provide
> the needed return-address infrastructure, transition those architectures
> to something more precise.

FWIW, that sounds good to me.

Mark.
