Message-ID: <d2cec8f3-781b-4a7f-9ca9-e848167e5f30@linux.ibm.com>
Date: Fri, 9 Jan 2026 20:45:48 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Valentin Schneider <vschneid@...hat.com>
Cc: kprateek.nayak@....com, juri.lelli@...hat.com, tglx@...utronix.de,
dietmar.eggemann@....com, anna-maria@...utronix.de,
frederic@...nel.org, wangyang.guo@...el.com, mingo@...nel.org,
peterz@...radead.org, vincent.guittot@...aro.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 3/3] sched/fair: Remove nohz.nr_cpus and use weight of
cpumask instead
Hi Valentin. Thanks for going through it.
On 1/9/26 8:14 PM, Valentin Schneider wrote:
> On 07/01/26 12:21, Shrikanth Hegde wrote:
>> nohz.nr_cpus was observed as contended cacheline when running
>> enterprise workload on large systems.
>>
>> Fundamental scalability challenge with nohz.idle_cpus_mask
>> and nohz.nr_cpus is the following:
>>
>> (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
>> (or nohz.idle_cpu_mask) and nohz.has_blocked to see whether there's
>> any nohz balancing work to do, in every scheduler tick.
>>
>> (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
>> (through nohz_balancer_kick() via sched_tick()) modify (write)
>> nohz.nr_cpus (and/or nohz.idle_cpu_mask) and nohz.has_blocked.
>>
>
> My first reaction on reading the whole changelog was: "but .nr_cpus and
> .idle_cpus_mask are in the same cacheline?!", which as Ingo pointed out
> somewhere down [1] isn't true for CPUMASK_OFFSTACK, so this change
> effectively gets rid of the dirtying of one extra cacheline during idle
> entry/exit.
>
> [1]: http://lore.kernel.org/r/aS3za7X9BLS5rg65@gmail.com
>
> I'd suggest adding something like so in this part of the changelog:
>
> """
> Note that nohz.idle_cpus_mask and nohz.nr_cpus reside in the same
> cacheline, however under CONFIG_CPUMASK_OFFSTACK the backing storage for
> nohz.idle_cpus_mask will be elsewhere. This implies two separate cachelines
> being dirtied upon idle entry / exit.
> """
>
ok. Will do that. Thanks.
Even with CONFIG_CPUMASK_OFFSTACK=n, NR_CPUS is usually configured as
512, 1024, 2048 or higher.
With a 64-byte cacheline, one cacheline holds the mask bits for 512 CPUs.
So idle_cpus_mask and the rest of the nohz fields, including nr_cpus, will be
in different cachelines.
Even on powerpc (128-byte cacheline), where CONFIG_CPUMASK_OFFSTACK=n, the
default is NR_CPUS=2048. That means idle_cpus_mask takes 2 cachelines and the
rest of the nohz fields are in a third cacheline.
So in most cases, this change means dirtying one less cacheline.
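A rough userspace sketch of the arithmetic above, in case it helps. The helper names mask_bytes() and mask_cachelines() are made up for illustration and are not kernel APIs, and the sketch assumes the mask starts on a cacheline boundary:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Bytes needed for a bitmap with one bit per CPU, rounded up to whole
 * unsigned longs. Hypothetical helper, only meant to mirror how a
 * statically sized cpumask scales with NR_CPUS.
 */
static size_t mask_bytes(size_t nr_cpus)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
	size_t longs = (nr_cpus + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(unsigned long);
}

/*
 * Number of cachelines the mask spans, assuming (an assumption of this
 * sketch) that the mask begins on a cacheline boundary.
 */
static size_t mask_cachelines(size_t nr_cpus, size_t cacheline_bytes)
{
	assert(cacheline_bytes > 0);
	return (mask_bytes(nr_cpus) + cacheline_bytes - 1) / cacheline_bytes;
}
```

For example, mask_cachelines(512, 64) gives 1 and mask_cachelines(2048, 128) gives 2, so any fields placed after the mask start in the following cacheline.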
Data points with CONFIG_CPUMASK_OFFSTACK=y/n are in [1].
[1]: https://lore.kernel.org/all/fdb378e7-7797-4aeb-a79f-12af4cb1b81a@linux.ibm.com/