Message-ID: <87fra9lnsw.ffs@tglx>
Date: Wed, 19 Nov 2025 22:23:11 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Gabriele Monaco <gmonaco@...hat.com>, linux-kernel@...r.kernel.org,
Anna-Maria Behnsen <anna-maria@...utronix.de>, Waiman Long
<longman@...hat.com>, "John B. Wyatt IV" <jwyatt@...hat.com>, "John B.
Wyatt IV" <sageofredondo@...il.com>
Subject: Re: [PATCH v15 7/7] timers: Exclude isolated cpus from timer migration

On Wed, Nov 19 2025 at 21:13, Frederic Weisbecker wrote:
> Le Wed, Nov 19, 2025 at 07:15:42PM +0100, Thomas Gleixner a écrit :
>> On Wed, Nov 19 2025 at 18:14, Frederic Weisbecker wrote:
>> > Le Wed, Nov 19, 2025 at 05:50:15PM +0100, Thomas Gleixner a écrit :
>> >> But thinking more about it. What's the actual point of moving this 'clear'
>> >> out instead of just moving it further down?
>> >>
>> >> It does not matter at all whether the isol/unisol muck clears an already
>> >> cleared bit or not. But it would keep the function name comprehensible
>> >> and avoid all this online/offline wrapper nonsense.
>> >
>> > That was my suggestion.
>> >
>> > It's because tmigr_clear_cpu_available() and tmigr_set_cpu_available()
>> > can now all be called concurrently through the workqueues and race and
>> > mess up the cpumask if they all try to clear/set at the same time...
>>
>> Huch?
>>
>> cpumask_set_cpu() uses set_bit() and cpumask_clear_cpu() uses
>> clear_bit(). Both are atomic and nothing gets messed up.
>
> Urgh, right...
>
>>
>> The only undefined case would be if you end up setting/clearing the same
>> bit, which would require that the unisol and isol maps overlap. But that
>> would be a bug on its own, no?
>
> Ok.
>
> But then the "unisolate" works must be flushed before this line:
>
> + /* Set up the mask earlier to avoid races with the migrator CPU */
> + cpumask_andnot(tmigr_available_cpumask, tmigr_available_cpumask, cpumask_isol);
>
> Because that is non-atomic and can race with the cpumask_set_cpu() from the
> works, right?

Correct, but I still have to understand why this has to happen
upfront. As I said before this comment is useless:

> + /* Set up the mask earlier to avoid races with the migrator CPU */

which decodes to:

     Set up the mask earlier than something unspecified to avoid
     unspecified races with the migrator CPU.

Seriously?

But I finally understood what this is about after staring at it again,
with less disgust, five times in a row:

    tmigr_clear_cpu_available() requires a stable cpumask to find a
    migrator.
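
Roughly like this. Just a sketch from memory to illustrate the
dependency, not the actual code:

    static void tmigr_clear_cpu_available(unsigned int cpu)
    {
            cpumask_clear_cpu(cpu, tmigr_available_cpumask);

            /*
             * If this CPU was the migrator, a new one has to be
             * elected from the remaining available CPUs. That walk
             * is only meaningful when nothing rewrites the mask
             * concurrently with a non-atomic RMW like
             * cpumask_andnot().
             */
            migrator = cpumask_first(tmigr_available_cpumask);
            ...
    }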

So what this is about is to avoid touching the cpumask in those worker
functions so that tmigr_isolated_exclude_cpumask() can fiddle with them
asynchronously.

Right?

That's voodoo programming which nobody - even those involved - will
understand anymore three months down the road.

The idea that updating the masks upfront will provide stable state is
flawed to begin with. Let's assume the following scenario:

  1) The isol/unisol mask is just flipped around

  2) The newly available CPUs are marked in the mask

  3) The work is scheduled on those CPUs

  4) The newly unavailable CPUs are cleared in the mask

  5) The work is scheduled on those CPUs

  6) The newly available CPU workers are delayed so that the newly
     unavailable workers get there first. They find a stable CPU mask,
     but none of these now "available" CPUs are actually brought into
     active state yet.

That's just a perfect recipe for some completely undecodable bug to
happen anytime soon even if it is supposed to "work" by chance.

The way this was designed in the first place is that changing the
"availability" is fully serialized by the CPU hotplug machinery. Which
means zero surprises.

Now you create a side channel, which lacks these serialization
guarantees for absolutely no reason. Changing this isolation muck at
run-time is not a hotpath operation and trying to optimize it for no
good reason is just stupid.

The most trivial solution is to schedule the works one by one walking
through the newly available and then through the unavailable maps.
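
I.e. something like this, where the function names are invented and
only the pattern matters:

    /* Make CPUs available first, strictly one at a time ... */
    for_each_cpu(cpu, unisol_mask)
            work_on_cpu(cpu, tmigr_unisolate_fn, NULL);

    /* ... then take the newly isolated ones out, again one by one */
    for_each_cpu(cpu, isol_mask)
            work_on_cpu(cpu, tmigr_isolate_fn, NULL);

work_on_cpu() is synchronous, so that gives the serialization for free
at the price of waiting for each CPU in turn.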

If you want to be a bit smarter, then you can use a global mutex, which
is taken inside the set/clear_available() functions and serializes the
related functionality. Bulk schedule/flush the unisol (make available)
works first and then proceed to the isol (make unavailable) ones.

Then nothing has to change vs. the set/clear operations and everything
just works.
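
Again just a sketch with made up names to show the scheme:

    static DEFINE_MUTEX(tmigr_isol_mutex);

    static int tmigr_set_cpu_available(unsigned int cpu)
    {
            mutex_lock(&tmigr_isol_mutex);
            cpumask_set_cpu(cpu, tmigr_available_cpumask);
            /* Activate the CPU in the hierarchy ... */
            mutex_unlock(&tmigr_isol_mutex);
            return 0;
    }

    static int tmigr_clear_cpu_available(unsigned int cpu)
    {
            mutex_lock(&tmigr_isol_mutex);
            /*
             * The mask is stable under the mutex, so electing a
             * new migrator is race free.
             */
            cpumask_clear_cpu(cpu, tmigr_available_cpumask);
            mutex_unlock(&tmigr_isol_mutex);
            return 0;
    }

The isolation path then schedules all unisol works, flushes them and
only afterwards does the same for the isol works, without ever touching
the cpumask itself.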

That mutex does not do any harm in the CPU hotplug case and the
serialization vs. the workers is not going to be the end of the world.

I'm willing to bet that no real-world use-case will ever notice the
existence of this mutex. The microbenchmark which shows off the "I'm so
smart" metric is completely irrelevant especially when the result is
fragile, incomprehensible and therefore unmaintainable.

Keeping it correct and simple is still the most important engineering
principle. Premature optimization is a guaranteed path to failure.

If there is a compelling use case which justifies the resulting
complexity, then it can be built on top. I'm not holding my breath. See
above...

Thanks,

        tglx