Message-ID: <60b676cd-eeb4-4463-8ce6-fec89e6fbaae@redhat.com>
Date: Fri, 31 Oct 2025 12:14:11 -0400
From: Waiman Long <llong@...hat.com>
To: Frederic Weisbecker <frederic@...nel.org>, Waiman Long <llong@...hat.com>
Cc: Gabriele Monaco <gmonaco@...hat.com>, Thomas Gleixner
<tglx@...utronix.de>, linux-kernel@...r.kernel.org,
Anna-Maria Behnsen <anna-maria@...utronix.de>
Subject: Re: [RESEND PATCH v13 0/9] timers: Exclude isolated cpus from timer
migration
On 10/31/25 9:48 AM, Frederic Weisbecker wrote:
> On Thu, Oct 30, 2025 at 01:57:50PM -0400, Waiman Long wrote:
>> On 10/30/25 1:10 PM, Frederic Weisbecker wrote:
>>> On Thu, Oct 30, 2025 at 12:37:08PM -0400, Waiman Long wrote:
>>>> On 10/30/25 12:09 PM, Gabriele Monaco wrote:
>>>>> On Thu, 2025-10-30 at 11:37 -0400, Waiman Long wrote:
>>>>>> On 10/30/25 10:12 AM, Frederic Weisbecker wrote:
>>>>>>> Hi Waiman,
>>>>>>>
>>>>>>> On Wed, Oct 29, 2025 at 10:56:06PM -0400, Waiman Long wrote:
>>>>>>>> On 10/20/25 7:27 AM, Gabriele Monaco wrote:
>>>>>>>>> The timer migration mechanism allows active CPUs to pull timers from
>>>>>>>>> idle ones to improve the overall idle time. This is however undesired
>>>>>>>>> when CPU intensive workloads run on isolated cores, as the algorithm
>>>>>>>>> would move the timers from housekeeping to isolated cores, negatively
>>>>>>>>> affecting the isolation.
>>>>>>>>>
>>>>>>>>> Exclude isolated cores from the timer migration algorithm, extend the
>>>>>>>>> concept of unavailable cores, currently used for offline ones, to
>>>>>>>>> isolated ones:
>>>>>>>>> * A core is unavailable if isolated or offline;
>>>>>>>>> * A core is available if non isolated and online;
>>>>>>>>>
>>>>>>>>> A core is considered unavailable as isolated if it belongs to:
>>>>>>>>> * the isolcpus (domain) list
>>>>>>>>> * an isolated cpuset
>>>>>>>>> Except if it is:
>>>>>>>>> * in the nohz_full list (already idle for the hierarchy)
>>>>>>>>> * the nohz timekeeper core (must be available to handle global timers)
>>>>>>>>>
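As a reading aid, the availability rule listed above can be modelled as a small predicate. This is a minimal user-space sketch in Python, not the kernel implementation; the function name `is_available` and its set-based parameters are hypothetical and chosen only for illustration.

```python
def is_available(cpu, online, domain_isolated, nohz_full, timekeeper):
    """Return True if `cpu` would take part in the timer migration hierarchy.

    Hypothetical model of the rule in the cover letter. `online`,
    `domain_isolated` (isolcpus domain list or isolated cpuset) and
    `nohz_full` are sets of CPU numbers; `timekeeper` is one CPU number.
    """
    if cpu not in online:
        return False          # offline CPUs are always unavailable
    if cpu not in domain_isolated:
        return True           # online and not isolated
    # Isolated CPUs remain available in two exceptional cases.
    if cpu in nohz_full:
        return True           # already idle from the hierarchy's point of view
    return cpu == timekeeper  # must stay available to handle global timers
```

For example, with CPUs 0-7 online, CPUs 1-3 domain isolated, CPU 2 in nohz_full and CPU 3 assumed to be the timekeeper, only CPU 1 is unavailable among the online CPUs.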
>>>>>>>>> CPUs are added to the hierarchy during late boot, excluding isolated
>>>>>>>>> ones; the hierarchy is also adapted when the cpuset isolation changes.
>>>>>>>>>
>>>>>>>>> Due to how the timer migration algorithm works, any CPU part of the
>>>>>>>>> hierarchy can have their global timers pulled by remote CPUs and have to
>>>>>>>>> pull remote timers, only skipping pulling remote timers would break the
>>>>>>>>> logic.
>>>>>>>>> For this reason, prevent isolated CPUs from pulling remote global
>>>>>>>>> timers, but also the other way around: any global timer started on an
>>>>>>>>> isolated CPU will run there. This does not break the concept of
>>>>>>>>> isolation (global timers don't come from outside the CPU) and, if
>>>>>>>>> considered inappropriate, can usually be mitigated with other isolation
>>>>>>>>> techniques (e.g. IRQ pinning).
>>>>>>>>>
>>>>>>>>> This effect was noticed on a 128 cores machine running oslat on the
>>>>>>>>> isolated cores (1-31,33-63,65-95,97-127). The tool monopolises CPUs,
>>>>>>>>> and the CPU with lowest count in a timer migration hierarchy (here 1
>>>>>>>>> and 65) appears as always active and continuously pulls global timers
>>>>>>>>> from the housekeeping CPUs. This ends up moving driver work (e.g.
>>>>>>>>> delayed work) to isolated CPUs and causes latency spikes:
>>>>>>>>>
>>>>>>>>> before the change:
>>>>>>>>>
>>>>>>>>> # oslat -c 1-31,33-63,65-95,97-127 -D 62s
>>>>>>>>> ...
>>>>>>>>> Maximum: 1203 10 3 4 ... 5 (us)
>>>>>>>>>
>>>>>>>>> after the change:
>>>>>>>>>
>>>>>>>>> # oslat -c 1-31,33-63,65-95,97-127 -D 62s
>>>>>>>>> ...
>>>>>>>>> Maximum: 10 4 3 4 3 ... 5 (us)
>>>>>>>>>
>>>>>>>>> The same behaviour was observed on a machine with as few as 20 cores /
>>>>>>>>> 40 threads with isolcpus set to: 1-9,11-39 with rtla-osnoise-top.
>>>>>>>>>
>>>>>>>>> The first 5 patches are preparatory work to change the concept of
>>>>>>>>> online/offline to available/unavailable, keep track of those in a
>>>>>>>>> separate cpumask, clean up the setting/clearing functions and change a
>>>>>>>>> function name in cpuset code.
>>>>>>>>>
>>>>>>>>> Patch 6 and 7 adapt isolation and cpuset to prevent domain isolated and
>>>>>>>>> nohz_full from covering all CPUs not leaving any housekeeping one. This
>>>>>>>>> can lead to problems with the changes introduced in this series because
>>>>>>>>> no CPU would remain to handle global timers.
>>>>>>>>>
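The constraint described for patches 6 and 7 amounts to a simple set condition: at least one CPU must remain that is neither domain isolated nor nohz_full. The sketch below is a hypothetical Python model of that condition only; the helper name is an assumption, and the actual kernel checks in the isolation and cpuset code are not shown in this thread.

```python
def leaves_housekeeping_cpu(online, domain_isolated, nohz_full):
    """Return True if some online CPU can still handle global timers.

    Hypothetical helper: all three arguments are iterables of CPU numbers.
    A configuration is acceptable only if removing the domain isolated and
    nohz_full CPUs from the online set leaves at least one CPU.
    """
    remaining = set(online) - set(domain_isolated) - set(nohz_full)
    return bool(remaining)
```

Under this model, a request that would make the union of domain isolated and nohz_full CPUs cover every online CPU would be rejected.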
>>>>>>>>> Patch 9 extends the unavailable status to domain isolated CPUs, which
>>>>>>>>> is the main contribution of the series.
>>>>>>>>>
>>>>>>>>> This series is equivalent to v13 but rebased on v6.18-rc2.
>>>>>>>> Thomas,
>>>>>>>>
>>>>>>>> This patch series has undergone multiple rounds of review. Do you think
>>>>>>>> it is good enough to be merged into tip?
>>>>>>>>
>>>>>>>> It does contain some cpuset code, but most of the changes are in the timer
>>>>>>>> code. So I think it is better to go through the tip tree. It does have
>>>>>>>> some minor conflicts with the current for-6.19 branch of the cgroup tree,
>>>>>>>> but they can be easily resolved during merge.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>> Just wait a little, I realize I made a buggy suggestion to Gabriele and
>>>>>>> a detail needs to be fixed.
>>>>>>>
>>>>>>> My bad...
>>>>>> OK, I thought you were OK with the timer changes. I guess Gabriele will have
>>>>>> to send out a new version to address your finding.
>>>>> Sure, I'm going to have a look at this next week and send a V14.
>>>> I am going to extract out your 2 cpuset patches and send them to the cgroup
>>>> mailing list separately. So you don't need to include them in your next
>>>> version.
>>> I'm not sure this will help if you apply those to an external tree if the
>>> plan is to apply the whole to the timer tree. Or we'll create a dependency
>>> issue...
>> These 2 cpuset patches are actually independent of the timer related
>> changes. The purpose of these two patches is to prevent the cpuset code
>> from adding isolated CPUs in such a way that all the nohz_full HK CPUs
>> become domain-isolated. This is a corner case that normal users won't try to
>> do. The patches are just an insurance policy to ensure that users can't do
>> that. This is complementary to the sched/isolation patch that limits what
>> CPUs can be put to the isolcpus and nohz_full boot parameters. All these
>> patches are independent of the timer related changes, though you can say
>> that the solution will only be complete if all the pieces are in place.
> Right but there will be a conflict if the timer patches don't have
> the rename of update_unbound_workqueue_cpumask().
Yes, I missed the fact that patch 9 does have a cpuset.c hunk. Yes, it
will have a dependency on the prior cpuset patches.
>> There is another set of pending cpuset patches from Chen Ridong that does
>> some restructuring of the cpuset code that will likely have some conflicts
>> with these 2 patches. So I would like to settle the cpuset changes to avoid
>> future conflicts.
> Ok so it looks like there will be conflicts eventually during the merge
> window. In that case it makes sense to take Gabriele's cpuset patches but
> he'll need to rebase the rest on top of the timer tree.
It depends on whether the patch set can be merged into tip in time for
the next v6.19 merge window.
Cheers,
Longman
>
> Thanks.
>
>> Cheers,
>> Longman
>>