Message-ID: <fdb378e7-7797-4aeb-a79f-12af4cb1b81a@linux.ibm.com>
Date: Tue, 2 Dec 2025 20:05:38 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: peterz@...radead.org, vincent.guittot@...aro.org,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
dietmar.eggemann@....com, vschneid@...hat.com, rostedt@...dmis.org,
tglx@...utronix.de, tim.c.chen@...ux.intel.com,
Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask
instead
On 12/2/25 1:24 PM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
>
>>> That the nr_cpus modification is an atomic op
>>> doesn't change the situation much in terms of
>>> cacheline bouncing, because the cacheline dirtying
>>> will still cause comparable levels of bouncing on
>>> modern CPUs with modern cache coherency protocols.
>>>
>>> If nr_cpus and nohz.nr_cpus are in separate
>>> cachelines, then this patch might eliminate about
>>> half of the bounces - but AFAICS they are right
>>> next to each other, so unless it's off-stack
>>> cpumasks, they should be in the same cacheline.
>>> Half of 'bad bouncing' is still kinda 'bad
>>> bouncing'. :-)
>>>
>>
>> You are right. If we have to get rid of cacheline
>> bouncing then we need to fix nohz.idle_cpus_mask too.
>>
>> I forgot about CPUMASK_OFFSTACK.
>>
>> If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and
>> nr_cpus are in the same cacheline, right? The data in
>> the cover letter is with =y. In that case, switching to
>> cpumask_empty will give minimal gains by removing the
>> additional atomic inc/dec operations.
>>
>> If CPUMASK_OFFSTACK=n, then they could be in
>> different cacheline. In that case gains should be
>> better. Very likely our performance team would have
>> run with =n. IIRC, on powerpc we change it based on
>> NR_CPUS. On x86 it chooses NR_CPUS.
>
> Well, it's the other way around: in the 'off stack'
> case the cpumask_var_t is moved "off the stack" because
> it's too large - i.e. we allocate it separately, in a
> separate cacheline as a side effect. Even if the main
> cpumask pointer is next to nohz.nr_cpus, the mask
> itself is behind an indirect pointer, see
> <linux/cpumask_types.h>:
>
> #ifdef CONFIG_CPUMASK_OFFSTACK
> typedef struct cpumask *cpumask_var_t;
> #else
>
> Note that idle_cpus_mask is defined as a cpumask_var_t
> and is thus affected by CONFIG_CPUMASK_OFFSTACK and may
> be allocated dynamically:
>
> kernel/sched/fair.c: cpumask_var_t idle_cpus_mask;
> ...
> kernel/sched/fair.c: zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
>
> So I think it's quite possible that the performance
> measurements were done with CONFIG_CPUMASK_OFFSTACK=y:
> it's commonly enabled in server/enterprise distros -
> but even Ubuntu enables it on their desktop kernel, so
> the reduction in cacheline ping-pong is probably real
> and the change makes sense in that context.
>
Yes. The numbers in the cover letter were measured with CONFIG_CPUMASK_OFFSTACK=y.
Those numbers are from hackbench.
What I was unsure about is whether the initial report of cacheline contention
from our performance team had it enabled or not. That was an enterprise workload.
> But even with OFFSTACK=n, if NR_CPUS=512 it's possible
> that a fair chunk of the cpumask ends up on the
> previous (separate) cacheline from where nr_cpus is,
> with a resulting observable reduction of the cache
> bouncing rate:
>
> static struct {
> cpumask_var_t idle_cpus_mask;
> atomic_t nr_cpus;
>
> Note that since 'nohz' is ____cacheline_aligned, in
> this case idle_cpus_mask will take a full cacheline in
> the NR_CPUS=512 case, and nr_cpus will always be on a
> separate cacheline.
>
> If CONFIG_NR_CPUS is 128 or smaller, then
> idle_cpus_mask and nr_cpus will be on the same
> cacheline.
>
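The layout argument above can be checked with a small userspace model. This is only a sketch, not kernel code: it assumes 64-byte cachelines, the CPUMASK_OFFSTACK=n case where the mask is embedded in the struct, and a plain int in place of atomic_t. The names nohz_model_*, CACHELINE_BYTES, MASK_LONGS and cacheline_of() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cachelines. Other architectures may use 128. */
#define CACHELINE_BYTES 64
#define BITS_PER_LONG_MODEL (8 * sizeof(unsigned long))
#define MASK_LONGS(nr_cpus) \
	(((nr_cpus) + BITS_PER_LONG_MODEL - 1) / BITS_PER_LONG_MODEL)

/*
 * Hypothetical model of the first two members of 'nohz' with
 * CPUMASK_OFFSTACK=n: an embedded bitmap followed by the counter.
 */
#define DEFINE_NOHZ_MODEL(n)					\
	struct nohz_model_##n {					\
		unsigned long idle_cpus_mask[MASK_LONGS(n)];	\
		int nr_cpus;					\
	} __attribute__((aligned(CACHELINE_BYTES)))

DEFINE_NOHZ_MODEL(128);
DEFINE_NOHZ_MODEL(512);

/* Index of the cacheline, relative to the struct start, holding 'offset'. */
static inline size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE_BYTES;
}

/* Compile-time checks; they hold only under the assumptions above. */
static_assert(offsetof(struct nohz_model_128, nr_cpus) < CACHELINE_BYTES,
	      "NR_CPUS=128: counter shares the first cacheline with the mask");
static_assert(offsetof(struct nohz_model_512, nr_cpus) == CACHELINE_BYTES,
	      "NR_CPUS=512: mask fills one cacheline, counter starts the next");
```

With a 128-byte cacheline the crossover point moves accordingly, so the NR_CPUS values at which the two fields separate are architecture dependent.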
I set NR_CPUS=2048, which makes CONFIG_CPUMASK_OFFSTACK=n on powerpc.
baseline:
0.97% [k] nohz_balance_exit_idle - -
0.40% [k] nohz_balancer_kick - -
0.07% [k] nohz_run_idle_balance - -
with patch:
0.39% [k] nohz_balance_exit_idle - -
0.14% [k] nohz_balancer_kick - -
0.08% [k] nohz_run_idle_balance - -
It's in the 50% reduction range.
> Anyway, if the reduction in cache ping-pong is higher
> than 50%, then either something weird is going on, or
> I'm missing something. :-)
>
Yes, in both cases the reduction seems to be about 50% when running
hackbench, as mentioned.
> But the measurement data you provided:
>
> baseline: tip sched/core at 3eb593560146
> 1.01% [k] nohz_balance_exit_idle
> 0.31% [k] nohz_balancer_kick
> 0.05% [k] nohz_balance_enter_idle
>
> With series:
> 0.45% [k] nohz_balance_exit_idle
> 0.18% [k] nohz_balancer_kick
> 0.01% [k] nohz_balance_enter_idle
>
> ... is roughly in the 50% reduction range, if profiled
> overhead is a good proxy for cache bounce overhead
> (which it may be), which supports my hypothesis that
> the tests were run with CONFIG_CPUMASK_OFFSTACK=y and
> the cache ping-pong rate in these functions got roughly
> halved.
>
Yes. This data is with CONFIG_CPUMASK_OFFSTACK=y.
> BTW., I'd expect _nohz_idle_balance() to show up too in
> the profile.
>
>
It is rate limited by sd_balance_interval, no? Likely it won't happen
as aggressively as idle enter/exit.
>> arm64/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS > 256
>> powerpc/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS >= 8192
>> x86/Kconfig: select CPUMASK_OFFSTACK
>> x86/Kconfig: default 8192 if SMP && CPUMASK_OFFSTACK
>> x86/Kconfig: default 512 if SMP && !CPUMASK_OFFSTACK
>
> Yeah, we make the cpumask a direct mask up to 512 bits
> (64 bytes) - it's allocated indirectly from that point
> onwards.
>
>> In either case, if we think,
>> nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
>>
>> Since this is not a correctness issue, at worst we
>> will lose a chance to do idle load balance.
>
> Yeah, I don't think it's a correctness issue, removing
> nr_cpus should not change the ordering of modifications
> to nohz.idle_cpus_mask and nohz.has_blocked.
>
> ( The nohz.nr_cpus and nohz.idle_cpus_mask
> modifications were not ordered versus each other
> previously to begin with - they are only ordered
> versus nohz.has_blocked. )
>
Yes. This is true.
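As a rough illustration of why the separate counter is redundant, here is a userspace sketch in which "is any CPU idle?" is answered from the mask alone. It is not the kernel implementation: it uses C11 atomics in place of cpumask_set_cpu()/cpumask_clear_cpu()/cpumask_empty(), and idle_mask_model, MODEL_NR_CPUS and the model_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 256	/* assumed size, for illustration only */
#define BITS_PER_WORD (8 * sizeof(unsigned long))
#define MODEL_WORDS ((MODEL_NR_CPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical stand-in for nohz.idle_cpus_mask; there is no nr_cpus. */
static _Atomic unsigned long idle_mask_model[MODEL_WORDS];

/* Idle entry: set this CPU's bit. One atomic RMW instead of two. */
static void model_enter_idle(unsigned int cpu)
{
	atomic_fetch_or(&idle_mask_model[cpu / BITS_PER_WORD],
			1UL << (cpu % BITS_PER_WORD));
}

/* Idle exit: clear this CPU's bit. */
static void model_exit_idle(unsigned int cpu)
{
	atomic_fetch_and(&idle_mask_model[cpu / BITS_PER_WORD],
			 ~(1UL << (cpu % BITS_PER_WORD)));
}

/*
 * Tick-side check. The scan is not a consistent snapshot, so a racing
 * idle entry can be missed; per the discussion above, that only costs
 * one missed chance to kick an idle load balance.
 */
static bool model_any_idle(void)
{
	for (size_t i = 0; i < MODEL_WORDS; i++)
		if (atomic_load_explicit(&idle_mask_model[i],
					 memory_order_relaxed))
			return true;
	return false;
}
```

The writers now dirty only the mask's cacheline(s), which is where the saving over a mask-plus-counter scheme would come from when the two live on different cachelines.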
>> Let me re-write changelog. Also see a bit more into it.
>
> Thank you!
>
> Note that the fundamental scalability challenge with
> the nohz_balancer_kick(), nohz_balance_enter_idle() and
> nohz_balance_exit_idle() functions is the following:
>
> (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
> (or nohz.idle_cpus_mask) and nohz.has_blocked to
> see whether there's any nohz balancing work to do,
> in every scheduler tick.
>
> (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
> modify (write) nohz.nr_cpus (and/or nohz.idle_cpus_mask)
> and nohz.has_blocked.
>
> The characteristic frequencies are the following:
>
> (1) happens at scheduler (busy-)tick frequency on
> every CPU. This is a relatively constant frequency
> in the ~1 kHz range or lower.
>
> (2) happens at idle enter/exit frequency on every CPU
> that goes to idle. This is workload dependent, but
> can easily be hundreds of kHz for IO-bound loads
> and high CPU counts. Ie. can be orders of
> magnitude higher than (1), in which case a
> cachemiss at every invocation of (1) is almost
> inevitable. [ Ie. the cost of getting really long
> NOHZ idling times is the extra overhead of the
> exit/enter nohz cycles for partially idle CPUs on
> high-rate IO workloads. ]
>
> There's two types of costs from these functions:
>
> (A) scheduler tick cost via (1): this happens on busy
> CPUs too, and is thus a primary scalability cost.
> But the rate here is constant and typically much
> lower than (B), hence the absolute benefit to
> workload scalability will be lower as well.
>
> (B) idle cost via (2): going-to-idle and
> coming-from-idle costs are secondary concerns,
> because they impact power efficiency more than
> they impact scalability. (Ie. while 'wasting idle
> time' isn't good, it often doesn't hurt
> scalability, at least as long as it's done for a
> good reason and done in moderation.)
>
> but in terms of absolute cost this scales up with
> nr_cpus as well, and a much faster rate, and thus
> may also approach and negatively impact system
> limits like memory bus/fabric bandwidth.
>
Thank you. I am going to mostly copy this in next version :)
> So I'd argue that reductions in both (A) and (B) are
> useful, but for different reasons.
>
> The *real* breakthrough in this area would be to reduce
> the unlimited upwards frequency of (2), by
> fundamentally changing the model of NOHZ idle
> balancing:
>
> For example by measuring the rate (frequency) of idle
> cycles on each CPU (this can be done without any
> cross-CPU logic), we would turn off NOHZ-idle for that
> CPU when the rate goes beyond a threshold.
>
> The resulting regular idle load-balancing passes will
> be rate-limited by balance intervals and won't be as
> aggressive as nohz_balance_enter+exit_idle(). (I hope...)
>
> Truly idle CPUs would go into NOHZ mode automatically,
> as their measured rate of idling drops below the
> threshold.
>
> Thoughts?
Interesting.
Let me see if I get this right.
So track the idle duration over a certain past interval.
If it is below a certain threshold, mark those CPUs in nohz state
while doing idle entry/exit.
If not, reset their bits in the nohz mask and don't update the mask?
I think rq->avg_idle is already there, and we do similar checks for newidle_balance.
sched_balance_newidle
...
	if (!get_rd_overloaded(this_rq->rd) ||
	    this_rq->avg_idle < sd->max_newidle_lb_cost) {
		update_next_balance(sd, &next_balance);
		rcu_read_unlock();
		goto out;
	}
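
A minimal userspace sketch of the rate-based idea discussed above, to make the question concrete. Everything here is an assumption for illustration: struct idle_rate, idle_rate_on_enter(), the 10 ms window and the limit of 50 idle entries per window are all hypothetical, and this counts idle entries per window rather than reusing rq->avg_idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed tunables, picked only for illustration. */
#define WINDOW_NS		10000000ULL	/* 10 ms measurement window */
#define MAX_IDLE_ENTRIES	50		/* above this, skip NOHZ */

/* Hypothetical per-CPU state; only ever touched by its own CPU. */
struct idle_rate {
	uint64_t window_start_ns;
	unsigned int entries;	/* idle entries seen in the current window */
	bool nohz_allowed;	/* verdict from the last completed window */
};

/*
 * Called on each idle entry with a monotonic timestamp. Returns true
 * if this CPU should set its bit in the shared nohz mask this time.
 * No cross-CPU state is read or written here.
 */
static bool idle_rate_on_enter(struct idle_rate *ir, uint64_t now_ns)
{
	if (now_ns - ir->window_start_ns >= WINDOW_NS) {
		/* Window is over: judge it, then start a new one. */
		ir->nohz_allowed = ir->entries <= MAX_IDLE_ENTRIES;
		ir->window_start_ns = now_ns;
		ir->entries = 0;
	}
	ir->entries++;
	return ir->nohz_allowed;
}
```

A CPU that bounces in and out of idle many times per window stops touching the shared mask, while a CPU that stays idle for a long time sees few entries per window and is allowed back in on its next idle entry.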