Message-ID: <aR4-Wie1wyLEu460@pavilion.home>
Date: Wed, 19 Nov 2025 23:02:02 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Gabriele Monaco <gmonaco@...hat.com>, linux-kernel@...r.kernel.org,
Anna-Maria Behnsen <anna-maria@...utronix.de>,
Waiman Long <longman@...hat.com>,
"John B. Wyatt IV" <jwyatt@...hat.com>,
"John B. Wyatt IV" <sageofredondo@...il.com>
Subject: Re: [PATCH v15 7/7] timers: Exclude isolated cpus from timer
migration
On Wed, Nov 19, 2025 at 10:23:11PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 19 2025 at 21:13, Frederic Weisbecker wrote:
> > On Wed, Nov 19, 2025 at 07:15:42PM +0100, Thomas Gleixner wrote:
> >> On Wed, Nov 19 2025 at 18:14, Frederic Weisbecker wrote:
> >> > On Wed, Nov 19, 2025 at 05:50:15PM +0100, Thomas Gleixner wrote:
> >> >> But thinking more about it. What's the actual point of moving this 'clear'
> >> >> out instead of just moving it further down?
> >> >>
> >> >> It does not matter at all whether the isol/unisol muck clears an already
> >> >> cleared bit or not. But it would keep the function name comprehensible
> >> >> and avoid all this online/offline wrapper nonsense.
> >> >
> >> > That was my suggestion.
> >> >
> >> > It's because tmigr_clear_cpu_available() and tmigr_set_cpu_available()
> >> > can now all be called concurrently through the workqueues and race and
> >> > mess up the cpumask if they all try to clear/set at the same time...
> >>
> >> Huch?
> >>
> >> cpumask_set_cpu() uses set_bit() and cpumask_clear_cpu() uses
> >> clear_bit(). Both are atomic and nothing gets messed up.
> >
> > Urgh, right...
> >
> >>
> >> The only undefined case would be if you end up setting/clearing the same
> >> bit, which would require that the unisol and isol maps overlap. But that
> >> would be a bug on its own, no?
> >
> > Ok.
> >
> > But then the "unisolate" works must be flushed before this line:
> >
> > + /* Set up the mask earlier to avoid races with the migrator CPU */
> > + cpumask_andnot(tmigr_available_cpumask, tmigr_available_cpumask, cpumask_isol);
> >
> > Because that is non-atomic and can race with the cpumask_set_cpu() from the
> > works, right?
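The lost-update window being discussed can be made concrete with a small
user-space sketch (C11 atomics stand in for the kernel's bitops; the
interleaving is forced by hand, and all names are illustrative):

```c
#include <stdatomic.h>

/* Simulates why a per-word read-modify-write like cpumask_andnot() can
 * lose a concurrent atomic set_bit(): the andnot path loads the word,
 * another CPU sets a bit atomically, then the stale andnot result is
 * stored back, wiping out the newly set bit. */

static _Atomic unsigned long mask = 0x0f;       /* CPUs 0-3 available */

unsigned long lost_update_demo(void)
{
    unsigned long isol = 0x03;                  /* isolate CPUs 0-1 */

    /* Step 1: the andnot path loads the word (non-atomic RMW begins). */
    unsigned long snapshot = atomic_load(&mask);

    /* Step 2: a worker concurrently marks CPU 5 available (atomic). */
    atomic_fetch_or(&mask, 1UL << 5);

    /* Step 3: the andnot stores its stale result; CPU 5's bit is lost. */
    atomic_store(&mask, snapshot & ~isol);

    return atomic_load(&mask);                  /* 0x0c, not 0x2c */
}
```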
>
> Correct, but I still have to understand why this has to happen
> upfront. As I said before this comment is useless:
>
> > + /* Set up the mask earlier to avoid races with the migrator CPU */
>
> which decodes to:
>
> Set up the mask earlier than something unspecified to avoid
> unspecified races with the migrator CPU.
>
> Seriously?
>
> But I finally understood what this is about after staring at it with
> less disgust again for five times in a row:
>
> tmigr_clear_cpu_available() requires a stable cpumask to find a
> migrator.
>
> So what this is about is to avoid touching the cpumask in those worker
> functions so that tmigr_isolated_exclude_cpumask() can fiddle with them
> asynchronously.
>
> Right?
>
> That's voodoo programming which nobody - even those involved - will
> understand anymore three months down the road.
>
> The idea that updating the masks upfront will provide stable state is
> flawed to begin with. Let's assume the following scenario:
>
> 1) The isol/unisol mask is just flipped around
>
> 2) The newly available CPUs are marked in the mask
>
> 3) The work is scheduled on those CPUs
>
> 4) The newly unavailable CPUs are cleared in the mask
>
> 5) The work is scheduled on those CPUs
>
> 6) The newly available CPU workers are delayed so that the newly
> unavailable workers get there first. They find a stable CPU mask,
> but none of these now "available" CPUs are actually brought into
> active state yet.
>
> That's just a perfect recipe for some completely undecodable bug to
> happen anytime soon even if it is supposed to "work" by chance.
Urgh, indeed I completely missed that...
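A toy model of the quoted scenario (hypothetical names; an 8-CPU mask
flipped from 0-3 to 4-7) shows the hazard: with the masks updated
upfront, a delayed "make unavailable" worker can hand the migrator role
to a CPU that is marked available but not yet activated:

```c
/* Toy model of the ordering hazard: available_mask is flipped upfront
 * (steps 2 and 4), but only the per-CPU workers update active_mask, so
 * a worker that runs early can pick an inactive "available" CPU. */

#define NR_CPUS 8

static unsigned long available_mask;   /* updated upfront */
static unsigned long active_mask;      /* updated only by the workers */

static int pick_migrator(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (available_mask & (1UL << cpu))
            return cpu;
    return -1;
}

/* Returns 1 if the chosen migrator is not actually active yet. */
int inversion_hazard(void)
{
    available_mask = 0xf0;  /* 4-7 marked available, 0-3 cleared upfront */
    active_mask    = 0x0f;  /* the workers for 4-7 have not run yet */

    int migrator = pick_migrator();
    return migrator >= 0 && !(active_mask & (1UL << migrator));
}
```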
>
> The way how this was designed in the first place is that changing the
> "availability" is fully serialized by the CPU hotplug machinery. Which
> means zero surprises.
>
> Now you create a side channel, which lacks these serialization
> guarantees for absolutely no reason. Changing this isolation muck at
> run-time is not a hotpath operation and trying to optimize it for no
> good reason is just stupid.
>
> The most trivial solution is to schedule the works one by one walking
> through the newly available and then through the unavailable maps.
Right.
>
> If you want to be a bit smarter, then you can just use a global mutex,
> which is taken inside the set/clear_available() functions which
> serializes the related functionality and bulk schedule/flush the unisol
> (make available) first and then proceed to the isol (make unavailable).
>
> Then nothing has to change vs. the set/clear operations and everything
> just works.
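A minimal user-space sketch of that suggestion (pthreads stand in for a
kernel mutex and a plain unsigned long for the cpumask; the names mirror
the functions under discussion but the bodies are illustrative):

```c
#include <pthread.h>

/* A global mutex taken inside the set/clear paths serializes isolation
 * updates against the hotplug path, so the isolation code can flush the
 * "make available" works before scheduling the "make unavailable" ones
 * without touching the masks behind the workers' backs. */

static pthread_mutex_t tmigr_isol_lock = PTHREAD_MUTEX_INITIALIZER;
unsigned long tmigr_available_mask;    /* stand-in for the cpumask */

void tmigr_set_cpu_available(int cpu)
{
    pthread_mutex_lock(&tmigr_isol_lock);
    tmigr_available_mask |= 1UL << cpu;
    pthread_mutex_unlock(&tmigr_isol_lock);
}

void tmigr_clear_cpu_available(int cpu)
{
    pthread_mutex_lock(&tmigr_isol_lock);
    tmigr_available_mask &= ~(1UL << cpu);
    /* The migrator handover would go here, under the same lock, so it
     * always observes a stable mask. */
    pthread_mutex_unlock(&tmigr_isol_lock);
}
```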
>
> That mutex does not do any harm in the CPU hotplug case and the
> serialization vs. the workers is not going to be the end of the world.
>
> I'm willing to bet that no real-world use-case will ever notice the
> existence of this mutex. The microbenchmark which shows off the "I'm so
> smart" metric is completely irrelevant especially when the result is
> fragile, incomprehensible and therefore unmaintainable.
>
> Keep it correct and simple is still the most important engineering
> principle. Premature optimization is a guaranteed path to failure.
>
> If there is a compelling use case which justifies the resulting
> complexity, then it can be built on top. I'm not holding my breath. See
> above...
Perhaps the only thing that worries me is if an isolated partition
is inverted. Say 0-3 is non isolated and 4-7 is isolated. And then
cpuset is overwritten so that the reverse is applied: 0-3 is isolated
and 4-7 is not isolated. If all isol works run before the unisol works,
then tmigr_clear_cpu_available() -> cpumask_any(tmigr_available_mask)
won't find any CPU left on the last call.
Now last time I tried to invert an isolated cpumask in cpuset, I got
-EINVAL, so something must be preventing that. Just in case, let's
have a WARN_ON_ONCE(cpumask_any(tmigr_available_mask) >= nr_cpu_ids).
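In toy form, the check catches exactly the empty-mask case (NR_CPU_IDS
and the helpers below are illustrative stand-ins for cpumask_any() and
nr_cpu_ids):

```c
/* cpumask_any() returns >= nr_cpu_ids when the mask is empty, which is
 * exactly the "all CPUs cleared before any were set" inversion hazard
 * the suggested WARN_ON_ONCE() would flag. */

#define NR_CPU_IDS 8

/* Toy cpumask_any(): first set bit, or NR_CPU_IDS if the mask is empty. */
static int toy_cpumask_any(unsigned long mask)
{
    for (int cpu = 0; cpu < NR_CPU_IDS; cpu++)
        if (mask & (1UL << cpu))
            return cpu;
    return NR_CPU_IDS;
}

/* Returns 1 if the check would fire, i.e. no migrator candidate left. */
int check_migrator_missing(unsigned long available)
{
    /* Kernel form: WARN_ON_ONCE(cpumask_any(mask) >= nr_cpu_ids) */
    return toy_cpumask_any(available) >= NR_CPU_IDS;
}
```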
But other than this detail, your solution looks good!
Thanks.
--
Frederic Weisbecker
SUSE Labs