Message-ID: <aS6bK4ad-wO2fsoo@gmail.com>
Date: Tue, 2 Dec 2025 08:54:19 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: peterz@...radead.org, vincent.guittot@...aro.org,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
dietmar.eggemann@....com, vschneid@...hat.com, rostedt@...dmis.org,
tglx@...utronix.de, tim.c.chen@...ux.intel.com,
Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask
instead
* Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
> > That the nr_cpus modification is an atomic op
> > doesn't change the situation much in terms of
> > cacheline bouncing, because the cacheline dirtying
> > will still cause comparable levels of bouncing on
> > modern CPUs with modern cache coherency protocols.
> >
> > If nr_cpus and nohz.nr_cpus are in separate
> > cachelines, then this patch might eliminate about
> > half of the bounces - but AFAICS they are right
> > next to each other, so unless it's off-stack
> > cpumasks, they should be in the same cacheline.
> > Half of 'bad bouncing' is still kinda 'bad
> > bouncing'. :-)
> >
>
> You are right. If we have to get rid of cacheline
> bouncing then we need to fix nohz.idle_cpus_mask too.
>
> I forgot about CPUMASK_OFFSTACK.
>
> If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and
> nr_cpus are in the same cacheline, right? The data in
> the cover letter is with =y. In that case, switching to
> cpumask_empty() will give minimal gains, by removing
> the additional atomic inc/dec operations.
>
> If CPUMASK_OFFSTACK=n, then they could be in
> different cachelines. In that case the gains should be
> better. Very likely our performance team would have
> run with =n. IIRC, on powerpc we select it based on
> NR_CPUS; on x86 NR_CPUS is chosen based on it.
Well, it's the other way around: in the 'off stack'
case the cpumask_var_t is moved "off the stack" because
it's too large - i.e. we allocate it separately, in a
separate cacheline as a side effect. Even if the main
cpumask pointer is next to nohz.nr_cpus, the mask
itself is behind an indirect pointer, see
<linux/cpumask_types.h>:
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef struct cpumask *cpumask_var_t;
#else
typedef struct cpumask cpumask_var_t[1];
#endif
Note that idle_cpus_mask is defined as a cpumask_var_t
and is thus affected by CONFIG_CPUMASK_OFFSTACK and may
be allocated dynamically:
kernel/sched/fair.c: cpumask_var_t idle_cpus_mask;
...
kernel/sched/fair.c: zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
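Just to make the indirection concrete, here's a standalone user-space
sketch (the 'sketch_*' names are made up, this is not kernel code) of
what the OFFSTACK=y layout amounts to - the nohz struct only embeds an
8-byte pointer, while the mask bits live in a separately allocated
object:

#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS	8192	/* a typical distro value with CONFIG_CPUMASK_OFFSTACK=y */

/* assumes 64-bit longs: */
struct sketch_cpumask { unsigned long bits[NR_CPUS / 64]; };

/* CONFIG_CPUMASK_OFFSTACK=y: cpumask_var_t is 'struct cpumask *' */
struct sketch_nohz {
	struct sketch_cpumask *idle_cpus_mask;	/* 8-byte pointer */
	int nr_cpus;				/* stands in for atomic_t */
};

int main(void)
{
	struct sketch_nohz nohz;

	/* rough equivalent of zalloc_cpumask_var(): the mask bits end up
	 * in a separate allocation, nowhere near nohz.nr_cpus: */
	nohz.idle_cpus_mask = calloc(1, sizeof(*nohz.idle_cpus_mask));

	printf("sizeof(nohz):      %zu bytes\n", sizeof(nohz));
	printf("sizeof(mask bits): %zu bytes\n", sizeof(*nohz.idle_cpus_mask));
	printf("&nohz.nr_cpus:     %p\n", (void *)&nohz.nr_cpus);
	printf("mask bits at:      %p\n", (void *)nohz.idle_cpus_mask);

	free(nohz.idle_cpus_mask);
	return 0;
}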
So I think it's quite possible that the performance
measurements were done with CONFIG_CPUMASK_OFFSTACK=y:
it's commonly enabled in server/enterprise distros -
but even Ubuntu enables it on their desktop kernel, so
the reduction in cacheline ping-pong is probably real
and the change makes sense in that context.
But even with OFFSTACK=n, if NR_CPUS=512 it's possible
that a fair chunk of the cpumask ends up on the
previous (separate) cacheline from where nr_cpus is,
with a resulting observable reduction of the cache
bouncing rate:
static struct {
	cpumask_var_t idle_cpus_mask;
	atomic_t nr_cpus;
	...
} nohz ____cacheline_aligned;
Note that since 'nohz' is ____cacheline_aligned,
idle_cpus_mask will take up a full cacheline in
the NR_CPUS=512 case, and nr_cpus will always be on a
separate cacheline.
If CONFIG_NR_CPUS is 128 or smaller, then
idle_cpus_mask and nr_cpus will be on the same
cacheline.
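A quick standalone sketch of that arithmetic (the 'sketch_*' names are
made up, this is not kernel code, it just mimics the OFFSTACK=n
embedded-mask layout):

#include <stdio.h>
#include <stddef.h>

#define NR_CPUS		512	/* try 128 as well */
#define CACHELINE	64

/* CONFIG_CPUMASK_OFFSTACK=n: the mask bits are embedded directly
 * (assumes 64-bit longs): */
struct sketch_cpumask { unsigned long bits[NR_CPUS / 64]; };

struct sketch_nohz {
	struct sketch_cpumask idle_cpus_mask;
	int nr_cpus;				/* stands in for atomic_t */
} __attribute__((aligned(CACHELINE)));		/* ____cacheline_aligned */

int main(void)
{
	size_t off = offsetof(struct sketch_nohz, nr_cpus);

	printf("mask size:      %zu bytes\n", sizeof(struct sketch_cpumask));
	printf("nr_cpus offset: %zu -> cacheline #%zu\n", off, off / CACHELINE);
	return 0;
}

With NR_CPUS=512 nr_cpus lands at offset 64, i.e. on the cacheline
right after the mask; with NR_CPUS=128 it lands at offset 16, on the
same cacheline as the mask bits.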
Anyway, if the reduction in cache ping-pong is higher
than 50%, then either something weird is going on, or
I'm missing something. :-)
But the measurement data you provided:
baseline: tip sched/core at 3eb593560146
1.01% [k] nohz_balance_exit_idle
0.31% [k] nohz_balancer_kick
0.05% [k] nohz_balance_enter_idle
With series:
0.45% [k] nohz_balance_exit_idle
0.18% [k] nohz_balancer_kick
0.01% [k] nohz_balance_enter_idle
... is roughly in the 50% reduction range, if profiled
overhead is a good proxy for cache bounce overhead
(which it may be), which supports my hypothesis that
the tests were run with CONFIG_CPUMASK_OFFSTACK=y and
the cache ping-pong rate in these functions got roughly
halved.
BTW., I'd expect _nohz_idle_balance() to show up too in
the profile.
> arm64/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS > 256
> powerpc/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS >= 8192
> x86/Kconfig: select CPUMASK_OFFSTACK
> x86/Kconfig: default 8192 if SMP && CPUMASK_OFFSTACK
> x86/Kconfig: default 512 if SMP && !CPUMASK_OFFSTACK
Yeah, we make the cpumask a direct mask up to 512 bits
(64 bytes) - it's allocated indirectly from that point
onwards.
> In either case, I think we can assume:
> nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
>
> Since it is not a correctness issue here, at worst we
> will lose a chance to do an idle load balance.
Yeah, I don't think it's a correctness issue: removing
nr_cpus should not change the ordering of modifications
to nohz.idle_cpus_mask and nohz.has_blocked.
( The nohz.nr_cpus and nohz.idle_cpus_mask
modifications were not ordered versus each other
previously to begin with - they are only ordered
versus nohz.has_blocked. )
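For reference, the core of the change, as I understand it (this is
only a sketch of the direction, not the exact patch):

  /* nohz_balancer_kick(): test the mask instead of the counter: */
-	if (likely(!atomic_read(&nohz.nr_cpus)))
+	if (likely(cpumask_empty(nohz.idle_cpus_mask)))
 		return;

  /* nohz_balance_exit_idle(): the mask update alone is enough: */
 	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
-	atomic_dec(&nohz.nr_cpus);

  /* nohz_balance_enter_idle(): likewise: */
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
-	atomic_inc(&nohz.nr_cpus);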
> Let me re-write the changelog. I'll also look a bit more into it.
Thank you!
Note that the fundamental scalability challenge with
the nohz_balancer_kick(), nohz_balance_enter_idle() and
nohz_balance_exit_idle() functions is the following:
(1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
(or nohz.idle_cpus_mask) and nohz.has_blocked to
see whether there's any nohz balancing work to do,
in every scheduler tick.
(2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
modify (write) nohz.nr_cpus (and/or nohz.idle_cpus_mask)
and nohz.has_blocked.
The characteristic frequencies are the following:
(1) happens at scheduler (busy-)tick frequency on
every CPU. This is a relatively constant frequency
in the ~1 kHz range or lower.
(2) happens at idle enter/exit frequency on every CPU
that goes idle. This is workload dependent, but
can easily be hundreds of kHz for IO-bound loads
and high CPU counts - i.e. it can be orders of
magnitude higher than (1), in which case a
cachemiss at every invocation of (1) is almost
inevitable. [ I.e. the cost of getting really long
NOHZ idling times is the extra overhead of the
exit/enter nohz cycles for partially idle CPUs on
high-rate IO workloads. ] (See the rough,
back-of-the-envelope numbers below.)
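To put rough numbers on that (purely illustrative, all figures made
up):

#include <stdio.h>

int main(void)
{
	unsigned long cpus = 256, hz = 1000;		/* HZ=1000 busy ticks    */
	unsigned long io_wakeups = 2UL * 1000 * 1000;	/* aggregate wakeups/sec */

	/* (1): every busy CPU reads the nohz state once per tick: */
	printf("(1) tick-time reads/sec: %lu\n", cpus * hz);

	/* (2): each IO completion that wakes an idle CPU implies one idle
	 * exit plus a later idle re-enter, i.e. two dirtying writes: */
	printf("(2) idle enter+exit/sec: %lu\n", 2 * io_wakeups);

	return 0;
}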
There are two types of costs from these functions:
(A) scheduler tick cost via (1): this happens on busy
CPUs too, and is thus a primary scalability cost.
But the rate here is constant and typically much
lower than that of (2), hence the absolute benefit to
workload scalability will be lower as well.
(B) idle cost via (2): going-to-idle and
coming-from-idle costs are secondary concerns,
because they impact power efficiency more than
they impact scalability. (I.e. while 'wasting idle
time' isn't good, it often doesn't hurt
scalability, at least as long as it's done for a
good reason and in moderation.)
But in terms of absolute cost this scales up with
nr_cpus as well, at a much faster rate, and thus
may also approach and negatively impact system
limits like memory bus/fabric bandwidth.
So I'd argue that reductions in both (A) and (B) are
useful, but for different reasons.
The *real* breakthrough in this area would be to reduce
the unlimited upwards frequency of (2), by
fundamentally changing the model of NOHZ idle
balancing:
For example by measuring the rate (frequency) of idle
cycles on each CPU (this can be done without any
cross-CPU logic), we would turn off NOHZ-idle for that
CPU when the rate goes beyond a threshold.
The resulting regular idle load-balancing passes will
be rate-limited by balance intervals and won't be as
aggressive as nohz_balance_enter+exit_idle(). (I hope...)
Truly idle CPUs would go into NOHZ mode automatically,
as their measured rate of idling drops below the
threshold.
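Something like this (user-space) toy model of the per-CPU rate
estimator - all names and the threshold are made up, it's only meant
to illustrate the decaying-rate idea:

#include <stdio.h>

#define RATE_THRESHOLD	64	/* ~1 idle entry per tick, in <<3 fixed point; arbitrary */

struct idle_rate {
	unsigned int rate;	/* decaying estimate of idle entries per tick, <<3 */
};

/* would be called from the (per-CPU) idle entry path: */
static void idle_entry_seen(struct idle_rate *r)
{
	r->rate += 8;			/* one idle entry, in fixed point */
}

/* would be called once per scheduler tick on that CPU: */
static int tick_update(struct idle_rate *r)
{
	r->rate -= r->rate / 8;			/* exponential decay */
	return r->rate < RATE_THRESHOLD;	/* 1: participate in NOHZ idle balancing */
}

int main(void)
{
	struct idle_rate r = { 0 };
	int tick, i, participate;

	for (tick = 0; tick < 5; tick++) {
		for (i = 0; i < 20; i++)	/* IO-bound: many idle entries per tick */
			idle_entry_seen(&r);
		participate = tick_update(&r);
		printf("tick %d: rate=%u participate=%d\n", tick, r.rate, participate);
	}
	return 0;
}

A CPU that hammers idle enter/exit quickly goes over the threshold and
would stop doing the nohz_balance_enter/exit_idle() bookkeeping; once
it goes (and stays) properly idle, the rate decays back below the
threshold and it rejoins NOHZ idle balancing.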
Thoughts?
Ingo