Message-ID: <d421a5ba-95cb-42fb-a376-1e04c9d6a1ac@os.amperecomputing.com>
Date: Wed, 20 Aug 2025 19:05:16 +0800
From: Adam Li <adamli@...amperecomputing.com>
To: Valentin Schneider <vschneid@...hat.com>, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, cl@...ux.com, frederic@...nel.org,
 linux-kernel@...r.kernel.org, patches@...erecomputing.com
Subject: Re: [PATCH] sched/nohz: Fix NOHZ imbalance by adding options for ILB
 CPU

On 8/20/2025 4:43 PM, Valentin Schneider wrote:
> On 20/08/25 11:35, Adam Li wrote:
>> On 8/19/2025 10:00 PM, Valentin Schneider wrote:
>>>
>>> I'm not understanding why, in the scenarios outlined above, more NOHZ idle
>>> balancing is a good thing.
>>>
>>> Considering only housekeeping CPUs, they're all covered by wakeup, periodic
>>> and idle balancing (on top of NOHZ idle balancing when relevant). So if
>>> find_new_ilb() never finds a NOHZ-idle CPU, then that means your HK CPUs
>>> are either always busy or never stopping the tick when going idle, IOW they
>>> always have some work to do within a jiffy boundary.
>>>
>>> Am I missing something?
>>>
>>
>> I agree with your description of the housekeeping CPUs. In the worst case,
>> the system has only one housekeeping CPU, and that housekeeping CPU is so busy
>> that:
>> 1) it is unlikely to be idle;
>> 2) it is unlikely to be in 'nohz.idle_cpus_mask', because its tick is never
>> stopped.
>> Therefore find_new_ilb() is very likely to return -1: *no* CPU can be selected
>> to do NOHZ idle load balancing.
>>
>> This patch tries to fix the imbalance of NOHZ idle CPUs (CPUs in nohz.idle_cpus_mask).
>> Here is more background:
>>
>> When running llama on an arm64 server, some CPUs *stay* idle while others
>> are 100% busy. All CPUs are in the 'nohz_full=' CPU list, and CONFIG_NO_HZ_FULL
>> is set.
>>
> 
> I assume you mean all but one CPU is in 'nohz_full=' since you need at
> least one housekeeping CPU. But in that case this becomes a slightly
> different problem, since no CPU in 'nohz_full' will be in
> 
>   housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
> 

I ran the llama workload on a system with 192 CPUs. I set "nohz_full=0-191", so all
CPUs are in the 'nohz_full' list. In this case, the kernel falls back to the boot CPU
for housekeeping:

Kernel message: "Housekeeping: must include one present CPU, using boot CPU:0"

find_new_ilb() looks for a qualified CPU among the housekeeping CPUs only, so the
search is very likely to fail when there is just one housekeeping CPU.

>> The problem is caused by two issues:
>> 1) Some idle CPUs cannot be added to 'nohz.idle_cpus_mask',
>> this bug is fixed by another patch:
>> https://lore.kernel.org/all/20250815065115.289337-2-adamli@os.amperecomputing.com/
>>
>> 2) Even if the idle CPUs are in 'nohz.idle_cpus_mask', *no* CPU can be selected to
>> do NOHZ idle load balancing because the conditions in find_new_ilb() are too strict.
>> This patch tries to solve this issue.
>>
>> Hope this information helps.
>>
> 
> I hadn't seen that patch; that cclist is quite small, you'll want to add
> the scheduler people to your next submission.
> 

Sure. The first patch involves both the 'tick' and 'scheduler' subsystems. I can
resend it to a broader set of reviewers if you don't mind.

> So IIUC:
> - Pretty much all your CPUs are NOHZ_FULL
> - When they go idle they remain so for a while despite work being available
> 

Exactly.

> My first question would be: is NOHZ_FULL really right for your workload?
> It's mainly designed to be used with always-running userspace tasks,
> generally affined to a CPU by the system administrator.
> Here AIUI you're relying on the scheduler load balancing to distribute work
> to the NOHZ_FULL CPUs, so you're going to be penalized a lot by the
> NOHZ_FULL context switch overheads. What's the point? Wouldn't you have
> less overhead with just NOHZ_IDLE?
> 

I ran the llama workload to do 'Large Language Model' inference.
The workload creates 'always-running userspace' threads doing math computation.
There are very *few* sleeps, wakeups, or context switches. IIUC, NOHZ_IDLE cannot
help always-running tasks?

I think the 'nohz_full' option is supposed to improve performance by reducing
kernel noise. Could you please give more detail on the
'NOHZ_FULL context switch overheads'?
  
> As for the actual balancing, yeah if you have idle NOHZ_FULL CPUs they
> won't do the periodic balance; the residual 1Hz remote tick doesn't do that
> either. But they should still do the newidle balance to pull work before
> going tickless idle, and wakeup balance should help as well, albeit that
> also depends on your topology.
>

I think newidle balance and wakeup balance do not help in this case,
because the workload has few sleeps and wakeups.
 
> Could you share your system topology and your actual nohz_full cmdline?
>

The system has 192 CPUs. I set "nohz_full=0-191".

Thanks,
-adam
