Message-ID: <xhsmhldnegqq4.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Wed, 20 Aug 2025 13:46:59 +0200
From: Valentin Schneider <vschneid@...hat.com>
To: Adam Li <adamli@...amperecomputing.com>, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, cl@...ux.com, frederic@...nel.org,
 linux-kernel@...r.kernel.org, patches@...erecomputing.com
Subject: Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for
 ILB CPU

On 20/08/25 19:05, Adam Li wrote:
> On 8/20/2025 4:43 PM, Valentin Schneider wrote:
>> On 20/08/25 11:35, Adam Li wrote:
>>> I agree with your description about the housekeeping CPUs. In the worst case,
>>> the system only has one housekeeping CPU and this housekeeping CPU is so busy
>>> that:
>>> 1) This housekeeping CPU is unlikely to be idle;
>>> 2) and this housekeeping CPU is unlikely to be in 'nohz.idle_cpus_mask' because
>>> the tick is not stopped.
>>> Therefore find_new_ilb() is very likely to return -1. *No* CPU can be selected
>>> to do NOHZ idle load balancing.
>>>
>>> This patch tries to fix the imbalance of NOHZ idle CPUs (CPUs in nohz.idle_cpus_mask).
>>> Here is more background:
>>>
>>> When running llama on an arm64 server, some CPUs *stay* idle while others
>>> are 100% busy. All CPUs are in the 'nohz_full=' cpu list, and CONFIG_NO_HZ_FULL
>>> is set.
>>>
>>
>> I assume you mean all but one CPU is in 'nohz_full=' since you need at
>> least one housekeeping CPU. But in that case this becomes a slightly
>> different problem, since no CPU in 'nohz_full' will be in
>>
>>   housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
>>
>
> I ran the llama workload on a system with 192 CPUs. I set "nohz_full=0-191" so all CPUs
> are in the 'nohz_full' list. In this case, the kernel uses the boot CPU for housekeeping:
>
> Kernel message: "Housekeeping: must include one present CPU, using boot CPU:0"
>
> find_new_ilb() looks for a qualified CPU among the housekeeping CPUs. The search
> is likely to fail if there is only one housekeeping CPU.
>

Right
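
For reference, the selection logic in question looks roughly like this (a
paraphrase of find_new_ilb() in kernel/sched/fair.c; the exact code differs
between kernel versions): the candidate has to be a housekeeping CPU *and*
currently nohz-idle, which is why a single always-busy housekeeping CPU makes
the search fail:

/* Rough paraphrase of find_new_ilb() (kernel/sched/fair.c); details vary
 * across kernel versions, but the shape of the check is the same. */
static inline int find_new_ilb(void)
{
	const struct cpumask *hk_mask;
	int ilb_cpu;

	/* Only housekeeping CPUs are eligible to run the idle load balance... */
	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);

	/* ...and only if they are in nohz idle themselves. */
	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
		if (ilb_cpu == smp_processor_id())
			continue;

		if (idle_cpu(ilb_cpu))
			return ilb_cpu;
	}

	/* With one always-busy housekeeping CPU, we end up here. */
	return -1;
}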

>>> The problem is caused by two issues:
>>> 1) Some idle CPUs cannot be added to 'nohz.idle_cpus_mask'; this bug is
>>> fixed by another patch:
>>> https://lore.kernel.org/all/20250815065115.289337-2-adamli@os.amperecomputing.com/
>>>
>>> 2) Even if the idle CPUs are in 'nohz.idle_cpus_mask', *no* CPU can be selected to
>>> do NOHZ idle load balancing because the conditions in find_new_ilb() are too strict.
>>> This patch tries to solve this issue.
>>>
>>> Hope this information helps.
>>>
>>
>> I hadn't seen that patch; that cc list is quite small, you'll want to add
>> the scheduler people to your next submission.
>>
>
> Sure. The first patch involves both the 'tick' and 'scheduler' subsystems. I can resend
> the first patch to a broader set of reviewers if you don't mind.
>

I'd say resend the whole series with the right folks cc'd.

>> So IIUC:
>> - Pretty much all your CPUs are NOHZ_FULL
>> - When they go idle they remain so for a while despite work being available
>>
>
> Exactly.
>
>> My first question would be: is NOHZ_FULL really right for your workload?
>> It's mainly designed to be used with always-running userspace tasks,
>> generally affined to a CPU by the system administrator.
>> Here AIUI you're relying on the scheduler load balancing to distribute work
>> to the NOHZ_FULL CPUs, so you're going to be penalized a lot by the
>> NOHZ_FULL context switch overheads. What's the point? Wouldn't you have
>> less overhead with just NOHZ_IDLE?
>>
>
> I ran the llama workload to do 'Large Language Model' inference.
> The workload creates 'always-running userspace' threads doing math computation.
> There are *few* sleeps, wakeups or context switches. IIUC NOHZ_IDLE cannot help
> always-running tasks?
>

Right, NOHZ_IDLE is really about power savings while a CPU is idle (and
IIRC it helps some virtualization cases).

> The 'nohz_full' option is supposed to benefit performance by reducing kernel
> noise, I think. Could you please give more detail on the
> 'NOHZ_FULL context switch overhead'?
>

The doc briefly touches on that:

  https://docs.kernel.org/timers/no_hz.html#omit-scheduling-clock-ticks-for-cpus-with-only-one-runnable-task

The longer story is: have a look at kernel/context_tracking.c; every
transition into and out of the kernel, to and from user or idle context,
requires additional atomic operations and synchronization.
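
A very rough userspace sketch of what that boils down to (the names here are
illustrative, not the real kernel symbols; the actual state machine in
kernel/context_tracking.c is considerably more involved):

#include <stdatomic.h>

/* Illustrative only: stand-in for the per-CPU context-tracking state. */
enum ctx_state { CTX_KERNEL, CTX_USER };

struct ctx_tracking {
	_Atomic int state;	/* observed remotely by RCU / tick code */
};

static void ctx_user_enter(struct ctx_tracking *ct)
{
	/*
	 * The state change is a full-barrier atomic RMW so that remote CPUs
	 * (RCU grace-period machinery, the residual 1Hz tick, ...) can trust
	 * it without interrupting this CPU. Something like it runs on every
	 * kernel entry and exit of a nohz_full CPU, which is the overhead
	 * being discussed.
	 */
	atomic_exchange(&ct->state, CTX_USER);
}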

It would be worth quantifying how often these processes sleep/context
switch; it could be that keeping the tick enabled incurs a lower throughput
penalty than the NO_HZ_FULL overheads.

>> As for the actual balancing, yeah if you have idle NOHZ_FULL CPUs they
>> won't do the periodic balance; the residual 1Hz remote tick doesn't do that
>> either. But they should still do the newidle balance to pull work before
>> going tickless idle, and wakeup balance should help as well, albeit that
>> also depends on your topology.
>>
>
> I think the newidle balance and wakeup balance do not help in this case
> because the workload has few sleeps and wakeups.
>

Right. So other than the NO_HZ_FULL vs NO_HZ_IDLE considerations above, you
could manually affine the threads of the workload. Depending on how much
control you have over how many threads it spawns, you could either pin one
thread per CPU (see the sketch below), or just spawn the workload into a
cpuset covering the NO_HZ_FULL CPUs.
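
If you go the manual-affinity route and you control thread creation, pinning
one worker per CPU is only a few lines; here is a minimal sketch (the cpuset
alternative just needs the workload started inside a cpuset covering the
nohz_full CPUs):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Minimal sketch: pin the calling worker thread to a single CPU,
 * e.g. one worker per nohz_full CPU, with the CPU chosen by the caller. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}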

Having the scheduler do the balancing is a bit of a precarious
situation. Your single housekeeping CPU is pretty much going to be always
running things, does it make sense to have it run the NOHZ idle balance
when there are available idle NOHZ_FULL CPUs? And in the same sense, does
it make sense to disturb an idle NOHZ_FULL CPU to get it to spread load on
other NOHZ_FULL CPUs? Admins that manually affine their threads will
probably say no.

9b019acb72e4 ("sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs")
also mentions SMT being an issue.

>> Could you share your system topology and your actual nohz_full cmdline?
>>
>
> The system has 192 CPUs. I set "nohz_full=0-191".
>
> Thanks,
> -adam

