Message-ID: <4421ec3d-e4df-4645-9b68-261080bd4760@redhat.com>
Date: Thu, 30 Oct 2025 13:57:50 -0400
From: Waiman Long <llong@...hat.com>
To: Frederic Weisbecker <frederic@...nel.org>, Waiman Long <llong@...hat.com>
Cc: Gabriele Monaco <gmonaco@...hat.com>, Thomas Gleixner
 <tglx@...utronix.de>, linux-kernel@...r.kernel.org,
 Anna-Maria Behnsen <anna-maria@...utronix.de>
Subject: Re: [RESEND PATCH v13 0/9] timers: Exclude isolated cpus from timer
 migration

On 10/30/25 1:10 PM, Frederic Weisbecker wrote:
> On Thu, Oct 30, 2025 at 12:37:08PM -0400, Waiman Long wrote:
>> On 10/30/25 12:09 PM, Gabriele Monaco wrote:
>>> On Thu, 2025-10-30 at 11:37 -0400, Waiman Long wrote:
>>>> On 10/30/25 10:12 AM, Frederic Weisbecker wrote:
>>>>> Hi Waiman,
>>>>>
>>>>> On Wed, Oct 29, 2025 at 10:56:06PM -0400, Waiman Long wrote:
>>>>>> On 10/20/25 7:27 AM, Gabriele Monaco wrote:
>>>>>>> The timer migration mechanism allows active CPUs to pull timers from
>>>>>>> idle ones to improve the overall idle time. This is however undesired
>>>>>>> when CPU intensive workloads run on isolated cores, as the algorithm
>>>>>>> would move the timers from housekeeping to isolated cores, negatively
>>>>>>> affecting the isolation.
>>>>>>>
>>>>>>> Exclude isolated cores from the timer migration algorithm by
>>>>>>> extending the concept of unavailable cores, currently used for
>>>>>>> offline ones, to isolated ones:
>>>>>>> * A core is unavailable if isolated or offline;
>>>>>>> * A core is available if non-isolated and online;
>>>>>>>
>>>>>>> A core is considered unavailable as isolated if it belongs to:
>>>>>>> * the isolcpus (domain) list
>>>>>>> * an isolated cpuset
>>>>>>> Except if it is:
>>>>>>> * in the nohz_full list (already idle for the hierarchy)
>>>>>>> * the nohz timekeeper core (must be available to handle global timers)
>>>>>>>
>>>>>>> CPUs are added to the hierarchy during late boot, excluding isolated
>>>>>>> ones; the hierarchy is also adapted when the cpuset isolation changes.
>>>>>>>
>>>>>>> Due to how the timer migration algorithm works, any CPU that is part
>>>>>>> of the hierarchy can have its global timers pulled by remote CPUs and
>>>>>>> has to pull remote timers; only skipping the pulling of remote timers
>>>>>>> would break the logic.
>>>>>>> For this reason, prevent isolated CPUs from pulling remote global
>>>>>>> timers, but also the other way around: any global timer started on an
>>>>>>> isolated CPU will run there. This does not break the concept of
>>>>>>> isolation (global timers don't come from outside the CPU) and, if
>>>>>>> considered inappropriate, can usually be mitigated with other isolation
>>>>>>> techniques (e.g. IRQ pinning).
>>>>>>>
>>>>>>> This effect was noticed on a 128-core machine running oslat on the
>>>>>>> isolated cores (1-31,33-63,65-95,97-127). The tool monopolises CPUs,
>>>>>>> and the CPU with lowest count in a timer migration hierarchy (here 1
>>>>>>> and 65) appears as always active and continuously pulls global timers
>>>>>>> from the housekeeping CPUs. This ends up moving driver work (e.g.
>>>>>>> delayed work) to isolated CPUs and causes latency spikes:
>>>>>>>
>>>>>>> before the change:
>>>>>>>
>>>>>>>     # oslat -c 1-31,33-63,65-95,97-127 -D 62s
>>>>>>>     ...
>>>>>>>      Maximum:     1203 10 3 4 ... 5 (us)
>>>>>>>
>>>>>>> after the change:
>>>>>>>
>>>>>>>     # oslat -c 1-31,33-63,65-95,97-127 -D 62s
>>>>>>>     ...
>>>>>>>      Maximum:      10 4 3 4 3 ... 5 (us)
>>>>>>>
>>>>>>> The same behaviour was observed on a machine with as few as 20 cores /
>>>>>>> 40 threads with isolcpus set to: 1-9,11-39 with rtla-osnoise-top.
>>>>>>>
>>>>>>> The first 5 patches are preparatory work to change the concept of
>>>>>>> online/offline to available/unavailable, keep track of those in a
>>>>>>> separate cpumask, clean up the setting/clearing functions and change a
>>>>>>> function name in cpuset code.
>>>>>>>
>>>>>>> Patches 6 and 7 adapt isolation and cpuset to prevent domain isolated
>>>>>>> and nohz_full CPUs from covering all CPUs and leaving no housekeeping
>>>>>>> one. This can lead to problems with the changes introduced in this
>>>>>>> series because no CPU would remain to handle global timers.
>>>>>>>
>>>>>>> Patch 9 extends the unavailable status to domain isolated CPUs, which
>>>>>>> is the main contribution of the series.
>>>>>>>
>>>>>>> This series is equivalent to v13 but rebased on v6.18-rc2.
>>>>>> Thomas,
>>>>>>
>>>>>> This patch series has undergone multiple rounds of review. Do you
>>>>>> think it is good enough to be merged into tip?
>>>>>>
>>>>>> It does contain some cpuset code, but most of the changes are in the
>>>>>> timer code. So I think it is better to go through the tip tree. It
>>>>>> does have some minor conflicts with the current for-6.19 branch of the
>>>>>> cgroup tree, but they can be easily resolved during merge.
>>>>>>
>>>>>> What do you think?
>>>>> Just wait a little, I realize I made a buggy suggestion to Gabriele and
>>>>> a detail needs to be fixed.
>>>>>
>>>>> My bad...
>>>> OK, I thought you were OK with the timer changes. I guess Gabriele will have
>>>> to send out a new version to address your finding.
>>> Sure, I'm going to have a look at this next week and send a V14.
>> I am going to extract out your 2 cpuset patches and send them to the cgroup
>> mailing list separately. So you don't need to include them in your next
>> version.
> I'm not sure this will help if you apply those to an external tree if the
> plan is to apply the whole to the timer tree. Or we'll create a dependency
> issue...

These 2 cpuset patches are actually independent of the timer-related
changes. Their purpose is to prevent the cpuset code from adding
isolated CPUs in such a way that all the nohz_full HK CPUs become
domain-isolated. This is a corner case that normal users won't try to
hit. The patches are just an insurance policy to ensure that users
can't do that. This is complementary to the sched/isolation patch that
limits which CPUs can be put into the isolcpus and nohz_full boot
parameters. All these patches are independent of the timer-related
changes, though you can say that the solution will only be complete if
all the pieces are in place.
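
To illustrate the invariant being protected, here is a minimal sketch.
It is not the actual cpuset code: the helper name and the 64-bit mask
type are made up for the example, and the real patches operate on
struct cpumask. The idea is only that a requested isolation change is
acceptable if at least one nohz_full housekeeping CPU stays outside
the isolated set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpu_bits;

/*
 * nohz_full_hk: CPUs that are NOT in the nohz_full list, i.e. the
 *               housekeeping CPUs as far as the tick is concerned.
 * isolated:     CPUs that would be domain-isolated once the requested
 *               cpuset change is applied.
 *
 * Returns true if at least one nohz_full housekeeping CPU would remain
 * non-isolated, so some CPU is still left to handle global timers.
 */
static bool isolation_leaves_housekeeper(cpu_bits nohz_full_hk,
                                         cpu_bits isolated)
{
    return (nohz_full_hk & ~isolated) != 0;
}
```

With housekeeping CPUs 0-1 (mask 0x3), isolating CPU 1 alone passes the
check, while isolating both CPUs 0 and 1 fails it and would be rejected.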

There is another set of pending cpuset patches from Chen Ridong that
does some restructuring of the cpuset code and will likely conflict
with these 2 patches. So I would like to settle the cpuset changes
first to avoid future conflicts.

Cheers,
Longman

