Message-ID: <3af3a952-43f9-4625-b87c-f45d14d8228e@paulmck-laptop>
Date: Fri, 14 Nov 2025 10:26:37 -0800
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: Steven Rostedt <rostedt@...dmis.org>,
Stephen Rothwell <sfr@...b.auug.org.au>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
Uladzislau Rezki <urezki@...il.com>,
Masami Hiramatsu <mhiramat@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux Next Mailing List <linux-next@...r.kernel.org>,
yonghong.song@...ux.dev
Subject: Re: linux-next: manual merge of the rcu tree with the ftrace tree
On Fri, Nov 14, 2025 at 06:41:59PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-14 09:25:06 [-0800], Paul E. McKenney wrote:
> > On Fri, Nov 14, 2025 at 06:10:52PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2025-11-14 09:00:21 [-0800], Paul E. McKenney wrote:
> > > > > > Where in PREEMPT_RT we do not disable preemption around the tracepoint
> > > > > > callback, but in non-RT we do. Instead it uses SRCU and migrate_disable().
> > > > >
> > > > > I appreciate the effort. I really do. But why can't we have SRCU on both
> > > > > configs?
> > > >
> > > > Due to performance concerns for non-RT kernels and workloads, where we
> > > > really need preemption disabled.
> > >
> > > This means srcu_read_lock_notrace() has much more overhead compared to
> > > rcu_read_lock_sched_notrace()?
> > > I am a bit afraid of different bugs here and there.
> >
> > No, the concern is instead overhead due to any actual preemption. So the
> > goal is to actually disable preemption across the BPF program *except*
> > in PREEMPT_RT kernels.
>
> Overhead of actual preemption while the BPF callback of the trace-event
> is invoked?
> So we get rid of the preempt_disable() in the tracepoint which we had
> due to rcu_read_lock_sched_notrace(), but we need to preserve it because
> preemption while the BPF program is invoked would add overhead?
> This is also something we want for CONFIG_PREEMPT (LAZY)?
>
> Sorry to be verbose but I try to catch up.

No need to apologize, given my tendency to be verbose. ;-)
> The BPF invocation does not disable preemption for a long time. It
> disables migration since some code uses per-CPU variables here.
>
> For XDP-type BPF invocations, preemption is disabled (except for RT)
> because those run in NAPI/softirq context.

Before Steven's pair of patches (one of which Frederic and I are handling
due to it depending on not-yet-mainline SRCU-fast commits), BPF programs
attached to tracepoints ran with preemption disabled. This behavior is
still in mainline. As you reported some time back, this caused problems
for PREEMPT_RT, hence Steven's pair of patches. But although we do want
to fix PREEMPT_RT, we don't want to break other kernel configurations,
hence keeping preemption disabled in non-PREEMPT_RT kernels.

Now perhaps Yonghong will tell us that this has since been shown to not
be a problem for BPF programs attached to tracepoints in non-PREEMPT_RT
kernels. But he has not yet done so, which strongly suggests we keep
the known-to-work preemption-disabled status of BPF programs attached
to tracepoints.
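
To make the intended split concrete, here is a minimal sketch of the
config-dependent critical section around a tracepoint-attached BPF
program.  The helper names and the srcu_struct parameter are hypothetical
illustrations for this thread, not what Steven's actual patches use:

/* Hypothetical sketch, not the actual patch. */
#ifdef CONFIG_PREEMPT_RT
/* RT: stay preemptible, rely on SRCU plus migrate_disable(). */
static inline int trace_bpf_enter(struct srcu_struct *ssp)
{
        int idx = srcu_read_lock_notrace(ssp);

        migrate_disable();
        return idx;
}

static inline void trace_bpf_exit(struct srcu_struct *ssp, int idx)
{
        migrate_enable();
        srcu_read_unlock_notrace(ssp, idx);
}
#else
/* !RT: keep the historical preemption-disabled behavior. */
static inline int trace_bpf_enter(struct srcu_struct *ssp)
{
        preempt_disable_notrace();
        return 0;
}

static inline void trace_bpf_exit(struct srcu_struct *ssp, int idx)
{
        preempt_enable_notrace();
}
#endif
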
> > > > > Also why does tracepoint_guard() need to disable migration? The BPF
> > > > > program already disables migration (see for instance
> > > > > bpf_prog_run_array()).
> > > > > This is true for RT and !RT. So there is no need to do it here.
> > > >
> > > > The addition of migration disabling was in response to failures, which
> > > > this fixed. Or at least greatly reduced the probability of failure! Let's see...
> > > > That migrate_disable() has been there since 2022, so the failures were
> > > > happening despite it. Adding Yonghong on CC for his perspective.
> > >
> > > Okay. In general I would prefer that we know why we do it. BPF had
> > > preempt_disable() which was turned into migrate_disable() for RT reasons
> > > since remaining on the same CPU was enough and preempt_disable() was the
> > > only way to enforce it at the time.
> > > And I think Linus requested migrate_disable() to work regardless of RT
> > > which PeterZ made happen (for different reasons, not BPF related).
> >
> > Yes, migrate_disable() prevents migration either way, but it does not
> > prevent preemption, which is what was needed in non-PREEMPT_RT kernels
> > last I checked.
>
> BPF in general sometimes relies on per-CPU variables. Sometimes it is
> needed to avoid reentrancy, which is what preempt_disable() provides for
> the same context. This is usually handled where it is required, and when
> it is removed, it is added back shortly. See for instance
> https://lore.kernel.org/all/20251114064922.11650-1-chandna.sahil@gmail.com/
>
> :)

Agreed, and that was why I added the migrate_disable() calls earlier,
calls that in Steven's more recent version of this patch just now
conflicted with Steven's other patch in -next. ;-)
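
For reference, the pattern Sebastian is pointing at is roughly the one
bpf_prog_run_array() already uses; the following is a simplified sketch
of that existing pattern (details trimmed, not the actual source), where
migration is disabled around the program run so its per-CPU state stays
on one CPU, while preemption handling is left to the surrounding context:

/* Simplified sketch of the existing pattern, not the actual source. */
static u32 run_one_bpf_prog(const struct bpf_prog *prog, const void *ctx)
{
        u32 ret;

        migrate_disable();      /* Keep per-CPU state on this CPU. */
        rcu_read_lock();        /* Protect the program while it runs. */
        ret = bpf_prog_run(prog, ctx);
        rcu_read_unlock();
        migrate_enable();
        return ret;
}
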
Thanx, Paul