Message-Id: <4f1a7b15-cfdc-478e-8e43-98750608866c@app.fastmail.com>
Date: Fri, 14 Nov 2025 16:29:31 -0800
From: "Andy Lutomirski" <luto@...nel.org>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: "Valentin Schneider" <vschneid@...hat.com>,
 "Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>,
 linux-mm@...ck.org, rcu@...r.kernel.org,
 "the arch/x86 maintainers" <x86@...nel.org>,
 linux-arm-kernel@...ts.infradead.org, loongarch@...ts.linux.dev,
 linux-riscv@...ts.infradead.org, linux-arch@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org,
 "Thomas Gleixner" <tglx@...utronix.de>, "Ingo Molnar" <mingo@...hat.com>,
 "Borislav Petkov" <bp@...en8.de>,
 "Dave Hansen" <dave.hansen@...ux.intel.com>,
 "H. Peter Anvin" <hpa@...or.com>,
 "Peter Zijlstra (Intel)" <peterz@...radead.org>,
 "Arnaldo Carvalho de Melo" <acme@...nel.org>,
 "Josh Poimboeuf" <jpoimboe@...nel.org>,
 "Paolo Bonzini" <pbonzini@...hat.com>, "Arnd Bergmann" <arnd@...db.de>,
 "Frederic Weisbecker" <frederic@...nel.org>,
 "Jason Baron" <jbaron@...mai.com>,
 "Steven Rostedt" <rostedt@...dmis.org>,
 "Ard Biesheuvel" <ardb@...nel.org>,
 "Sami Tolvanen" <samitolvanen@...gle.com>,
 "David S. Miller" <davem@...emloft.net>,
 "Neeraj Upadhyay" <neeraj.upadhyay@...nel.org>,
 "Joel Fernandes" <joelagnelf@...dia.com>,
 "Josh Triplett" <josh@...htriplett.org>,
 "Boqun Feng" <boqun.feng@...il.com>,
 "Uladzislau Rezki" <urezki@...il.com>,
 "Mathieu Desnoyers" <mathieu.desnoyers@...icios.com>,
 "Mel Gorman" <mgorman@...e.de>,
 "Andrew Morton" <akpm@...ux-foundation.org>,
 "Masahiro Yamada" <masahiroy@...nel.org>,
 "Han Shen" <shenhan@...gle.com>, "Rik van Riel" <riel@...riel.com>,
 "Jann Horn" <jannh@...gle.com>,
 "Dan Carpenter" <dan.carpenter@...aro.org>,
 "Oleg Nesterov" <oleg@...hat.com>, "Juri Lelli" <juri.lelli@...hat.com>,
 "Clark Williams" <williams@...hat.com>,
 "Yair Podemsky" <ypodemsk@...hat.com>,
 "Marcelo Tosatti" <mtosatti@...hat.com>,
 "Daniel Wagner" <dwagner@...e.de>, "Petr Tesarik" <ptesarik@...e.com>,
 "Shrikanth Hegde" <sshegde@...ux.ibm.com>
Subject: Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a
 user->kernel transition



On Fri, Nov 14, 2025, at 12:03 PM, Paul E. McKenney wrote:
> On Fri, Nov 14, 2025 at 10:45:08AM -0800, Andy Lutomirski wrote:
>> 
>> 
>> On Fri, Nov 14, 2025, at 10:14 AM, Paul E. McKenney wrote:
>> > On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:
>> >> 
>> 
>> >> > Oh, and another primitive would be possible: one CPU could plausibly 
>> >> > execute another CPU's interrupts or soft-irqs or whatever by taking a 
>> >> > special lock that would effectively pin the remote CPU in user mode -- 
>> >> > you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
>> >> > the transition on that CPU to kernel mode would then spin on entry to 
>> >> > kernel mode and wait for the lock to be released.  This could plausibly 
>> >> > get a lot of the on_each_cpu callers to switch over in one fell swoop: 
>> >> > anything that needs to synchronize to the remote CPU but does not need 
>> >> > to poke its actual architectural state could be executed locally while 
>> >> > the remote CPU is pinned.
>> >
>> > It would be necessary to arrange for the remote CPU to remain pinned
>> > while the local CPU executed on its behalf.  Does the above approach
>> > make that happen without re-introducing our current context-tracking
>> > overhead and complexity?
>> 
>> Using the pseudo-implementation farther down, I think this would be like:
>> 
>> if (atomic_read(&my_state->status) == INDETERMINATE) {
>>    // we were lazy and we never promised to do work atomically.
>>    atomic_set(&my_state->status, KERNEL);
>>    this_entry_work = 0;
>>    /* we are definitely not pinned in this path */
>> } else {
>>    // we were not lazy and we promised we would do work atomically
>>    atomic exchange the entire state to { .work = 0, .status = KERNEL }
>>    this_entry_work = (whatever we just read);
>>    if (this_entry_work & PINNED) {
>>      atomic_t *this_cpu_pin_count = this_cpu_ptr(&pin_count);
>>      while (atomic_read(this_cpu_pin_count)) {
>>        cpu_relax();
>>      }
>>    }
>>  }
>> 
>> and we'd have something like:
>> 
>> bool try_pin_remote_cpu(int cpu)
>> {
>>     atomic_t *remote_pin_count = ...;
>>     struct fancy_cpu_state *remote_state = ...;
>>     atomic_inc(remote_pin_count);  // optimistic
>> 
>>     // Hmm, we do not want that read to get reordered with the inc, so we probably
>>     // need a full barrier or seq_cst.  How does Linux spell that?  C++ has atomic::load
>>     // with seq_cst and maybe the optimizer can do the right thing.  Maybe it's:
>>     smp_mb__after_atomic();
>> 
>>     if (atomic_read(&remote_state->status) == USER) {
>>       // Okay, it's genuinely pinned.
>> 
>>       // egads, if this is some arch with very weak ordering,
>>       // do we need to be concerned that we just took a lock but we
>>       // just did a relaxed read and therefore a subsequent access
>>       // that thinks it's locked might appear to precede the load and therefore
>>       // somehow get surprisingly seen out of order by the target cpu?
>>       // maybe we wanted atomic_read_acquire above instead?
>>       return true;
>>     } else {
>>       // We might not have successfully pinned it
>>       atomic_dec(remote_pin_count);
>>       return false;
>>     }
>> }
>> 
>> void unpin_remote_cpu(int cpu)
>> {
>>     atomic_t *remote_pin_count = ...;
>>     atomic_dec(remote_pin_count);
>> }
>> 
>> and we'd use it like:
>> 
>> if (try_pin_remote_cpu(cpu)) {
>>   // do something useful
>> } else {
>>   send IPI;
>> }
>> 
>> but we'd really accumulate the set of CPUs that need the IPIs and do them all at once.
>> 
>> I ran the theorem prover that lives inside my head on this code using the assumption that the machine is a well-behaved x86 system and it said "yeah, looks like it might be correct".  I trust an actual formalized system or someone like you who is genuinely very good at this stuff much more than I trust my initial impression :)
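
Concretely, the accumulate-then-flush part might look something like this.  (A sketch only: do_remote_work() and do_remote_work_ipi() are hypothetical stand-ins for whatever the caller actually wants to run, and smp_call_function_many() wants preemption disabled.)

    cpumask_var_t ipi_mask;
    int cpu;

    if (!zalloc_cpumask_var(&ipi_mask, GFP_KERNEL))
            return -ENOMEM;

    for_each_cpu(cpu, target_mask) {
            if (try_pin_remote_cpu(cpu)) {
                    do_remote_work(cpu);    /* safe: cpu is pinned in user mode */
                    unpin_remote_cpu(cpu);
            } else {
                    cpumask_set_cpu(cpu, ipi_mask);  /* fall back to an IPI */
            }
    }

    preempt_disable();
    smp_call_function_many(ipi_mask, do_remote_work_ipi, NULL, true);
    preempt_enable();
    free_cpumask_var(ipi_mask);
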
>
> Let's start with requirements, non-traditional though that might be. ;-)

That's ridiculous! :-)

>
> An "RCU idle" CPU is either in deep idle or executing in nohz_full
> userspace.
>
> 1.	If the RCU grace-period kthread sees an RCU-idle CPU, then:
>
> 	a.	Everything that this CPU did before entering RCU-idle
> 		state must be visible to sufficiently later code executed
> 		by the grace-period kthread, and:
>
> 	b.	Everything that the CPU will do after exiting RCU-idle
> 		state must *not* have been visible to the grace-period
> 		kthread sufficiently prior to having sampled this
> 		CPU's state.
>
> 2.	If the RCU grace-period kthread sees an RCU-nonidle CPU, then
> 	it depends on whether this is the same nonidle sojourn as was
> 	initially seen.  (If the kthread initially saw the CPU in an
> 	RCU-idle state, it would not have bothered resampling.)
>
> 	a.	If this is the same nonidle sojourn, then there are no
> 		ordering requirements.	RCU must continue to wait on
> 		this CPU.
>
> 	b.	Otherwise, everything that this CPU did before entering
> 		its last RCU-idle state must be visible to sufficiently
> 		later code executed by the grace-period kthread.
> 		Similar to (1a) above.
>
> 3.	If a given CPU quickly switches into and out of RCU-idle
> 	state, and it is always in RCU-nonidle state whenever the RCU
> 	grace-period kthread looks, RCU must still realize that this
> 	CPU has passed through at least one quiescent state.
>
> 	This is why we have a counter for RCU rather than just a
> 	simple state.

...

> I am not seeing how the third requirement above is met, though.  I have
> not verified the sampling code that the RCU grace-period kthread is
> supposed to use because I am not seeing it right off-hand.

This is ct_rcu_watching_cpu_acquire, right?  Lemme think.  Maybe there's even a way to do everything I'm suggesting without changing that interface.
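
Something like this is the shape that has to keep working, I think (approximate, hypothetical helper names -- counter_indicates_rcu_idle() and report_quiescent_state() are stand-ins, not the real RCU internals):

    /* GP kthread, initial scan: snapshot the per-CPU watching counter. */
    int snap = ct_rcu_watching_cpu_acquire(cpu);

    if (counter_indicates_rcu_idle(snap))
            report_quiescent_state(cpu);  /* requirement 1: idle right now */

    /* GP kthread, later rescan of a CPU that initially looked nonidle: */
    if (ct_rcu_watching_cpu_acquire(cpu) != snap)
            report_quiescent_state(cpu);  /* requirements 2b and 3: the
                                             counter moved, so the CPU passed
                                             through at least one RCU-idle
                                             sojourn even though it looks
                                             busy at every sample */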


>> It wouldn't be outrageous to have real-time imply the full USER transition.
>
> We could have special code for RT, but this of course increases the
> complexity.  Which might be justified by sufficient speedup.

I'm thinking that the code would be arranged so that entering user mode in either the USER or the INTERMEDIATE state would be fully valid, just with different performance characteristics.  So the only extra complexity here would be the actual logic to choose which state to go to.  And...
>
> And yes, many distros enable NO_HZ_FULL by default.  I will refrain from
> suggesting additional static branches.  ;-)

I'm not even suggesting a static branch.  I think the potential performance wins are big enough to justify a bona fide ordinary if statement or two :)  I'm also contemplating whether it could make sense to configure this whole thing in unconditionally, provided the performance when nobody uses it (i.e. everything is INTERMEDIATE instead of USER) is good enough.
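
For the choice itself, I'm imagining nothing fancier than this on the exit-to-user path (hypothetical names: ct_set_state() and CT_STATE_INTERMEDIATE don't exist today; context_tracking_enabled_this_cpu() and rt_task() do):

    /* Both target states are valid; USER is only worth the extra cost
       when someone actually wants the stronger guarantees. */
    if (context_tracking_enabled_this_cpu() || rt_task(current))
            ct_set_state(CT_STATE_USER);          /* full transition */
    else
            ct_set_state(CT_STATE_INTERMEDIATE);  /* lazy and cheap */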

I will ponder.
