linux-kernel - Re: [PATCH 4/4] sched/fair: Remove atomic nr

Message-ID: <fdb378e7-7797-4aeb-a79f-12af4cb1b81a@linux.ibm.com>
Date: Tue, 2 Dec 2025 20:05:38 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: peterz@...radead.org, vincent.guittot@...aro.org,
        linux-kernel@...r.kernel.org, kprateek.nayak@....com,
        dietmar.eggemann@....com, vschneid@...hat.com, rostedt@...dmis.org,
        tglx@...utronix.de, tim.c.chen@...ux.intel.com,
        Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask
 instead



On 12/2/25 1:24 PM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
> 
>>> That the nr_cpus modification is an atomic op
>>> doesn't change the situation much in terms of
>>> cacheline bouncing, because the cacheline dirtying
>>> will still cause comparable levels of bouncing on
>>> modern CPUs with modern cache coherency protocols.
>>>
>>> If nr_cpus and nohz.nr_cpus are in separate
>>> cachelines, then this patch might eliminate about
>>> half of the bounces - but AFAICS they are right
>>> next to each other, so unless it's off-stack
>>> cpumasks, they should be in the same cacheline.
>>> Half of 'bad bouncing' is still kinda 'bad
>>> bouncing'. :-)
>>>
>>
>> You are right. If we have to get rid of cacheline
>> bouncing then we need to fix nohz.idle_cpus_mask too.
>>
>> I forgot about CPUMASK_OFFSTACK.
>>
>> If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and
>> nr_cpus are in the same cacheline, right? The data in
>> the cover letter is with =y. In that case, switching to
>> cpumask_empty will give minimal gains by removing the
>> additional atomic inc/dec operations.
>>
>> If CPUMASK_OFFSTACK=n, then they could be in
>> different cachelines. In that case the gains should be
>> better. Very likely our performance team ran with
>> =n. IIRC, on powerpc CPUMASK_OFFSTACK is selected based
>> on NR_CPUS. On x86 it chooses the NR_CPUS default.
> 
> Well, it's the other way around: in the 'off stack'
> case the cpumask_var_t is moved "off the stack" because
> it's too large - i.e. we allocate it separately, in a
> separate cacheline as a side effect. Even if the main
> cpumask pointer is next to nohz.nr_cpus, the mask
> itself is behind an indirect pointer, see
> <linux/cpumask_types.h>:
> 
>    #ifdef CONFIG_CPUMASK_OFFSTACK
>    typedef struct cpumask *cpumask_var_t;
>    #else
> 
> Note that idle_cpus_mask is defined as a cpumask_var_t
> and is thus affected by CONFIG_CPUMASK_OFFSTACK and may
> be allocated dynamically:
> 
>    kernel/sched/fair.c:    cpumask_var_t idle_cpus_mask;
>    ...
>    kernel/sched/fair.c:    zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
> 
> So I think it's quite possible that the performance
> measurements were done with CONFIG_CPUMASK_OFFSTACK=y:
> it's commonly enabled in server/enterprise distros -
> but even Ubuntu enables it on their desktop kernel, so
> the reduction in cacheline ping-pong is probably real
> and the change makes sense in that context.
> 

Yes. The numbers in the cover letter were collected with
CONFIG_CPUMASK_OFFSTACK=y. Those numbers are from hackbench.

What I was unsure about is whether the initial report of cacheline
contention from our performance team had it enabled or not. That was
running an enterprise workload.

> But even with OFFSTACK=n, if NR_CPUS=512 it's possible
> that a fair chunk of the cpumask ends up on the
> previous (separate) cacheline from where nr_cpus is,
> with a resulting observable reduction of the cache
> bouncing rate:
> 
>    static struct {
>            cpumask_var_t idle_cpus_mask;
>            atomic_t nr_cpus;
> 
> Note that since 'nohz' is ____cacheline_aligned, in
> this case idle_cpus_mask will take a full cacheline in
> the NR_CPUS=512 case, and nr_cpus will always be on a
> separate cacheline.
> 
> If CONFIG_NR_CPUS is 128 or smaller, then
> idle_cpus_mask and nr_cpus will be on the same
> cacheline.
> 

I set NR_CPUS=2048, which makes CONFIG_CPUMASK_OFFSTACK=n on powerpc.

baseline:
    0.97%  [k] nohz_balance_exit_idle        -      -
    0.40%  [k] nohz_balancer_kick            -      -
    0.07%  [k] nohz_run_idle_balance         -      -

with patch:
    0.39%  [k] nohz_balance_exit_idle        -      -
    0.14%  [k] nohz_balancer_kick            -      -
    0.08%  [k] nohz_run_idle_balance         -      -

It's in the 50% reduction range.


> Anyway, if the reduction in cache ping-pong is higher
> than 50%, then either something weird is going on, or
> I'm missing something. :-)
>

Yes, in both cases the reduction seems to be about 50% when running
hackbench, as mentioned.

  
> But the measurement data you provided:
> 
>     baseline: tip sched/core at 3eb593560146
>     1.01%  [k] nohz_balance_exit_idle
>     0.31%  [k] nohz_balancer_kick
>     0.05%  [k] nohz_balance_enter_idle
> 
>     With series:
>     0.45%  [k] nohz_balance_exit_idle
>     0.18%  [k] nohz_balancer_kick
>     0.01%  [k] nohz_balance_enter_idle
> 
> ... is roughly in the 50% reduction range, if profiled
> overhead is a good proxy for cache bounce overhead
> (which it may be), which supports my hypothesis that
> the tests were run with CONFIG_CPUMASK_OFFSTACK=y and
> the cache ping-pong rate in these functions got roughly
> halved.
> 

Yes. This data is with CONFIG_CPUMASK_OFFSTACK=y

> BTW., I'd expect _nohz_idle_balance() to show up too in
> the profile.
> 
> 

It is rate limited by sd_balance_interval, no? It likely won't happen
as aggressively as idle enter/exit.

>> arm64/Kconfig:	select CPUMASK_OFFSTACK if NR_CPUS > 256
>> powerpc/Kconfig:	select CPUMASK_OFFSTACK			if NR_CPUS >= 8192
>> x86/Kconfig:	select CPUMASK_OFFSTACK
>> x86/Kconfig:	default 8192 if  SMP && CPUMASK_OFFSTACK
>> x86/Kconfig:	default  512 if  SMP && !CPUMASK_OFFSTACK
> 
> Yeah, we make the cpumask a direct mask up to 512 bits
> (64 bytes) - it's allocated indirectly from that point
> onwards.
> 
>> In either case, I think the following holds:
>> 	nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
>>
>> Since this is not a correctness issue, at worst we
>> will lose a chance to do idle load balance.
> 
> Yeah, I don't think it's a correctness issue, removing
> nr_cpus should not change the ordering of modifications
> to nohz.idle_cpus_mask and nohz.has_blocked.
> 
> ( The nohz.nr_cpus and nohz.idle_cpus_mask
>    modifications were not ordered versus each other
>    previously to begin with - they are only ordered
>    versus nohz.has_blocked. )
> 

Yes. This is true.

>> Let me re-write changelog. Also see a bit more into it.
> 
> Thank you!
> 
> Note that the fundamental scalability challenge with
> the nohz_balancer_kick(), nohz_balance_enter_idle() and
> nohz_balance_exit_idle() functions is the following:
> 
>   (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
>       (or nohz.idle_cpus_mask) and nohz.has_blocked to
>       see whether there's any nohz balancing work to do,
>       in every scheduler tick.
> 
>   (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
>       modify (write) nohz.nr_cpus (and/or nohz.idle_cpus_mask)
>       and nohz.has_blocked.
> 
> The characteristic frequencies are the following:
> 
>   (1) happens at scheduler (busy-)tick frequency on
>       every CPU. This is a relatively constant frequency
>       in the ~1 kHz range or lower.
> 
>   (2) happens at idle enter/exit frequency on every CPU
>       that goes to idle. This is workload dependent, but
>       can easily be hundreds of kHz for IO-bound loads
>       and high CPU counts. Ie. can be orders of
>       magnitude higher than (1), in which case a
>       cachemiss at every invocation of (1) is almost
>       inevitable. [ Ie. the cost of getting really long
>       NOHZ idling times is the extra overhead of the
>       exit/enter nohz cycles for partially idle CPUs on
>       high-rate IO workloads. ]
> 
> There's two types of costs from these functions:
> 
>    (A) scheduler tick cost via (1): this happens on busy
>        CPUs too, and is thus a primary scalability cost.
>        But the rate here is constant and typically much
>        lower than (B), hence the absolute benefit to
>        workload scalability will be lower as well.
> 
>    (B) idle cost via (2): going-to-idle and
>        coming-from-idle costs are secondary concerns,
>        because they impact power efficiency more than
>        they impact scalability. (Ie. while 'wasting idle
>        time' isn't good, it often doesn't hurt
>        scalability, at least as long as it's done for a
>        good reason and done in moderation.)
> 
>        But in terms of absolute cost this scales up with
>        nr_cpus as well, and at a much faster rate, and thus
>        may also approach and negatively impact system
>        limits like memory bus/fabric bandwidth.
> 

Thank you. I am going to mostly copy this in the next version :)

> So I'd argue that reductions in both (A) and (B) are
> useful, but for different reasons.
> 
> The *real* breakthrough in this area would be to reduce
> the unlimited upwards frequency of (2), by
> fundamentally changing the model of NOHZ idle
> balancing:
> 
> For example by measuring the rate (frequency) of idle
> cycles on each CPU (this can be done without any
> cross-CPU logic), we would turn off NOHZ-idle for that
> CPU when the rate goes beyond a threshold.
> 
> The resulting regular idle load-balancing passes will
> be rate-limited by balance intervals and won't be as
> aggressive as nohz_balance_enter+exit_idle(). (I hope...)
> 
> Truly idle CPUs would go into NOHZ mode automatically,
> as their measured rate of idling drops below the
> threshold.
> 
> Thoughts?

Interesting.

Let me see if I get this right.

So track the idle duration over a certain past interval.
If it is below a certain threshold, mark those CPUs in nohz state
while doing idle entry/exit.
If not, reset their bits in the nohz mask and don't update the mask?

I think rq->avg_idle is there already, and we do similar checks for newidle_balance.
sched_balance_newidle
...
         if (!get_rd_overloaded(this_rq->rd) ||
             this_rq->avg_idle < sd->max_newidle_lb_cost) {

                 update_next_balance(sd, &next_balance);
                 rcu_read_unlock();
                 goto out;
         }