Message-ID: <17f1a1f2-c5dc-40ed-a69e-a3af499a7068@amd.com>
Date: Thu, 25 Sep 2025 08:07:36 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>, Peter Zijlstra
<peterz@...radead.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, "Gautham R.
Shenoy" <gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, "Vincent
Guittot" <vincent.guittot@...aro.org>, Anna-Maria Behnsen
<anna-maria@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>,
"Thomas Gleixner" <tglx@...utronix.de>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 18/19] sched/fair: Optimize global "nohz.nr_cpus"
tracking

Hello Shrikanth,

On 9/25/2025 1:32 AM, Shrikanth Hegde wrote:
>
>
> On 9/4/25 9:45 AM, K Prateek Nayak wrote:
>> Optimize "nohz.nr_cpus" by tracking the number of "sd_nohz->shared" nodes
>> with a non-zero "nr_idle_cpus" count via "nohz.nr_doms", and by only
>> updating it at the boundary of "sd_nohz->shared->nr_idle_cpus" going from
>> 0 -> 1 and back from 1 -> 0.
>>
>> This also introduces a chance of double accounting when a nohz idle
>> entry or the tick races with hotplug or cpuset as described in
>> __nohz_exit_idle_tracking().
>>
>> __nohz_exit_idle_tracking(), called when the sched_domain_shared nodes
>> tracking idle CPUs are freed, is used to correct any potential double
>> accounting, which could otherwise unnecessarily trigger nohz idle balances
>> even when all the CPUs have the tick enabled.
>>
> Is it possible to get rid of this nr_cpus or nr_doms altogether?
>
> The reason being, with the current code, one updates nohz.idle_cpus_mask
> and then incs/decs nr_cpus.
>
> Its only use is to decide whether to do periodic idle balancing or not.
> If instead we could use a cpumask_empty(nohz.idle_cpus_mask) check, no?
> It may not be accurate at every tick, but that may be ok.
>
> I haven't gone through your series in detail yet, but a similar thing is
> doable: check if the list is empty or not.

Actually, we'll have to iterate over the "nohz_shared_list" and check if any
of the "sd_shared->nr_idle_cpus" counts is non-zero to decide whether we can
bail out.

Since sched_balance_trigger() is called at every tick, on every CPU, this can
add considerable overhead, but I suppose we can have a method similar to
{test,set}_idle_core()? Something along these lines:

sched_balance_trigger()
  nohz_balancer_kick()
    if (test_nohz_idle_cpus())
        set_nohz_idle_cpus(false)
        smp_mb();
        nr_doms += <iterate to check if nohz idle CPUs exist>
    ...
    if (!nr_doms)
        return;
    ...
    idle_cpus += <do nohz balance and check if nohz idle CPUs still exist>
    if (idle_cpus)
        set_nohz_idle_cpus(true)
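
The "<iterate to check if nohz idle CPUs exist>" bit could look roughly like
the sketch below ("nohz_shared_list" and "nr_idle_cpus" are from this series;
the list member name and the helper itself are made up for illustration):

  static bool nohz_idle_cpus_exist(void)
  {
          struct sched_domain_shared *sds;
          bool found = false;

          rcu_read_lock();
          /* "nohz_node" stands in for however the sds is linked into the list */
          list_for_each_entry_rcu(sds, &nohz_shared_list, nohz_node) {
                  if (READ_ONCE(sds->nr_idle_cpus)) {
                          found = true;
                          break;
                  }
          }
          rcu_read_unlock();

          return found;
  }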

In the meantime, if any CPU goes idle with the tick disabled, it can do:

nohz_balance_enter_idle()
  set_nohz_idle_cpus(true)

{test,set}_nohz_idle_cpus() are just a READ_ONCE()/WRITE_ONCE() respectively
on a global, system-wide variable.
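
Concretely, something like the below (the flag name "nohz_idle_cpus_seen" is
made up; it could just as well live in the "nohz" struct):

  /* Set when a CPU enters nohz idle; cleared when the kick path rescans. */
  static int nohz_idle_cpus_seen;

  static inline bool test_nohz_idle_cpus(void)
  {
          return READ_ONCE(nohz_idle_cpus_seen);
  }

  static inline void set_nohz_idle_cpus(bool val)
  {
          WRITE_ONCE(nohz_idle_cpus_seen, val);
  }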

That way, sched_balance_trigger() will only bail out if no nohz idle CPUs
were found after the last nohz idle balance, and no CPU has transitioned to
nohz idle since.

Or we could go more radical and have a way to trigger the nohz balance per
LLC!
--
Thanks and Regards,
Prateek