Message-ID: <aR4-Wie1wyLEu460@pavilion.home>
Date: Wed, 19 Nov 2025 23:02:02 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Gabriele Monaco <gmonaco@...hat.com>, linux-kernel@...r.kernel.org,
Anna-Maria Behnsen <anna-maria@...utronix.de>,
Waiman Long <longman@...hat.com>,
"John B. Wyatt IV" <jwyatt@...hat.com>,
"John B. Wyatt IV" <sageofredondo@...il.com>
Subject: Re: [PATCH v15 7/7] timers: Exclude isolated cpus from timer
migration
On Wed, Nov 19, 2025 at 10:23:11PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 19 2025 at 21:13, Frederic Weisbecker wrote:
> > On Wed, Nov 19, 2025 at 07:15:42PM +0100, Thomas Gleixner wrote:
> >> On Wed, Nov 19 2025 at 18:14, Frederic Weisbecker wrote:
> >> > On Wed, Nov 19, 2025 at 05:50:15PM +0100, Thomas Gleixner wrote:
> >> >> But thinking more about it. What's the actual point of moving this 'clear'
> >> >> out instead of just moving it further down?
> >> >>
> >> >> It does not matter at all whether the isol/unisol muck clears an already
> >> >> cleared bit or not. But it would keep the function name comprehensible
> >> >> and avoid all this online/offline wrapper nonsense.
> >> >
> >> > That was my suggestion.
> >> >
> >> > It's because tmigr_clear_cpu_available() and tmigr_set_cpu_available()
> >> > can now all be called concurrently through the workqueues and race and
> >> > mess up the cpumask if they all try to clear/set at the same time...
> >>
> >> Huch?
> >>
> >> cpumask_set_cpu() uses set_bit() and cpumask_clear_cpu() uses
> >> clear_bit(). Both are atomic and nothing gets messed up.
> >
> > Urgh, right...
> >
> >>
> >> The only undefined case would be if you end up setting/clearing the same
> >> bit, which would require that the unisol and isol maps overlap. But that
> >> would be a bug on its own, no?
> >
> > Ok.
> >
> > But then the "unisolate" works must be flushed before this line:
> >
> > + /* Set up the mask earlier to avoid races with the migrator CPU */
> > + cpumask_andnot(tmigr_available_cpumask, tmigr_available_cpumask, cpumask_isol);
> >
> > Because that is non-atomic and can race with the cpumask_set_cpu() from the
> > works, right?
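The lost-update window being discussed can be made concrete with a small
user-space sketch (C11 atomics stand in for the kernel's bitops; the
interleaving is forced by hand, and all names are illustrative):

```c
#include <stdatomic.h>

/* Simulates why a per-word read-modify-write like cpumask_andnot() can
 * lose a concurrent atomic set_bit(): the andnot path loads the word,
 * another CPU sets a bit atomically, then the stale andnot result is
 * stored back, wiping out the newly set bit. */

static _Atomic unsigned long mask = 0x0f;       /* CPUs 0-3 available */

unsigned long lost_update_demo(void)
{
    unsigned long isol = 0x03;                  /* isolate CPUs 0-1 */

    /* Step 1: the andnot path loads the word (non-atomic RMW begins). */
    unsigned long snapshot = atomic_load(&mask);

    /* Step 2: a worker concurrently marks CPU 5 available (atomic). */
    atomic_fetch_or(&mask, 1UL << 5);

    /* Step 3: the andnot stores its stale result; CPU 5's bit is lost. */
    atomic_store(&mask, snapshot & ~isol);

    return atomic_load(&mask);                  /* 0x0c, not 0x2c */
}
```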
>
> Correct, but I still have to understand why this has to happen
> upfront. As I said before this comment is useless:
>
> > + /* Set up the mask earlier to avoid races with the migrator CPU */
>
> which decodes to:
>
> Set up the mask earlier than something unspecified to avoid
> unspecified races with the migrator CPU.
>
> Seriously?
>
> But I finally understood what this is about after staring at it with
> less disgust again for five times in a row:
>
> tmigr_clear_cpu_available() requires a stable cpumask to find a
> migrator.
>
> So what this is about is to avoid touching the cpumask in those worker
> functions so that tmigr_isolated_exclude_cpumask() can fiddle with them
> asynchronously.
>
> Right?
>
> That's voodoo programming which nobody - even those involved - will
> understand anymore three months down the road.
>
> The idea that updating the masks upfront will provide stable state is
> flawed to begin with. Let's assume the following scenario:
>
> 1) The isol/unisol mask is just flipped around
>
> 2) The newly available CPUs are marked in the mask
>
> 3) The work is scheduled on those CPUs
>
> 4) The newly unavailable CPUs are cleared in the mask
>
> 5) The work is scheduled on those CPUs
>
> 6) The newly available CPU workers are delayed so that the newly
> unavailable workers get there first. They find a stable CPU mask,
> but none of these now "available" CPUs are actually brought into
> active state yet.
>
> That's just a perfect recipe for some completely undecodable bug to
> happen anytime soon even if it is supposed to "work" by chance.
Urgh, indeed I completely missed that...
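A toy model of the quoted scenario (hypothetical names; an 8-CPU mask
flipped from 0-3 to 4-7) shows the hazard: with the masks updated
upfront, a delayed "make unavailable" worker can hand the migrator role
to a CPU that is marked available but not yet activated:

```c
/* Toy model of the ordering hazard: available_mask is flipped upfront
 * (steps 2 and 4), but only the per-CPU workers update active_mask, so
 * a worker that runs early can pick an inactive "available" CPU. */

#define NR_CPUS 8

static unsigned long available_mask;   /* updated upfront */
static unsigned long active_mask;      /* updated only by the workers */

static int pick_migrator(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (available_mask & (1UL << cpu))
            return cpu;
    return -1;
}

/* Returns 1 if the chosen migrator is not actually active yet. */
int inversion_hazard(void)
{
    available_mask = 0xf0;  /* 4-7 marked available, 0-3 cleared upfront */
    active_mask    = 0x0f;  /* the workers for 4-7 have not run yet */

    int migrator = pick_migrator();
    return migrator >= 0 && !(active_mask & (1UL << migrator));
}
```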
>
> The way how this was designed in the first place is that changing the
> "availability" is fully serialized by the CPU hotplug machinery. Which
> means zero surprises.
>
> Now you create a side channel, which lacks these serialization
> guarantees for absolutely no reason. Changing this isolation muck at
> run-time is not a hotpath operation and trying to optimize it for no
> good reason is just stupid.
>
> The most trivial solution is to schedule the works one by one walking
> through the newly available and then through the unavailable maps.
Right.
>
> If you want to be a bit smarter, then you can just use a global mutex,
> which is taken inside the set/clear_available() functions which
> serializes the related functionality and bulk schedule/flush the unisol
> (make available) first and then proceed to the isol (make unavailable).
>
> Then nothing has to change vs. the set/clear operations and everything
> just works.
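A minimal user-space sketch of that suggestion (pthreads stand in for a
kernel mutex and a plain unsigned long for the cpumask; the names mirror
the functions under discussion but the bodies are illustrative):

```c
#include <pthread.h>

/* A global mutex taken inside the set/clear paths serializes isolation
 * updates against the hotplug path, so the isolation code can flush the
 * "make available" works before scheduling the "make unavailable" ones
 * without touching the masks behind the workers' backs. */

static pthread_mutex_t tmigr_isol_lock = PTHREAD_MUTEX_INITIALIZER;
unsigned long tmigr_available_mask;    /* stand-in for the cpumask */

void tmigr_set_cpu_available(int cpu)
{
    pthread_mutex_lock(&tmigr_isol_lock);
    tmigr_available_mask |= 1UL << cpu;
    pthread_mutex_unlock(&tmigr_isol_lock);
}

void tmigr_clear_cpu_available(int cpu)
{
    pthread_mutex_lock(&tmigr_isol_lock);
    tmigr_available_mask &= ~(1UL << cpu);
    /* The migrator handover would go here, under the same lock, so it
     * always observes a stable mask. */
    pthread_mutex_unlock(&tmigr_isol_lock);
}
```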
>
> That mutex does not do any harm in the CPU hotplug case and the
> serialization vs. the workers is not going to be the end of the world.
>
> I'm willing to bet that no real-world use-case will ever notice the
> existence of this mutex. The microbenchmark which shows off the "I'm so
> smart" metric is completely irrelevant especially when the result is
> fragile, incomprehensible and therefore unmaintainable.
>
> Keep it correct and simple is still the most important engineering
> principle. Premature optimization is a guaranteed path to failure.
>
> If there is a compelling use case which justifies the resulting
> complexity, then it can be built on top. I'm not holding my breath. See
> above...
Perhaps the only thing that worries me is if an isolated partition
is inverted. Say 0-3 is non isolated and 4-7 is isolated. And then
cpuset is overwritten so that the reverse is applied: 0-3 is isolated
and 4-7 is not isolated. If all isol works run before the unisol works,
then tmigr_clear_cpu_available() -> cpumask_any(tmigr_available_mask)
won't find any CPU left on the last call.
Now last time I tried to invert an isolated cpumask in cpuset, I got
-EINVAL, so something must be preventing that. Just in case, let's
have a WARN_ON_ONCE(cpumask_any(tmigr_available_mask) >= nr_cpu_ids).
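In toy form, the check catches exactly the empty-mask case (NR_CPU_IDS
and the helpers below are illustrative stand-ins for cpumask_any() and
nr_cpu_ids):

```c
/* cpumask_any() returns >= nr_cpu_ids when the mask is empty, which is
 * exactly the "all CPUs cleared before any were set" inversion hazard
 * the suggested WARN_ON_ONCE() would flag. */

#define NR_CPU_IDS 8

/* Toy cpumask_any(): first set bit, or NR_CPU_IDS if the mask is empty. */
static int toy_cpumask_any(unsigned long mask)
{
    for (int cpu = 0; cpu < NR_CPU_IDS; cpu++)
        if (mask & (1UL << cpu))
            return cpu;
    return NR_CPU_IDS;
}

/* Returns 1 if the check would fire, i.e. no migrator candidate left. */
int check_migrator_missing(unsigned long available)
{
    /* Kernel form: WARN_ON_ONCE(cpumask_any(mask) >= nr_cpu_ids) */
    return toy_cpumask_any(available) >= NR_CPU_IDS;
}
```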
But other than this detail, your solution looks good!
Thanks.
--
Frederic Weisbecker
SUSE Labs