Message-ID: <CAKfTPtAqD6wTn=Qfwes1iScpmk0Lrh8HoqMoE+h4OBK2c+26RA@mail.gmail.com>
Date: Tue, 13 Jan 2026 10:23:34 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: mingo@...nel.org, peterz@...radead.org, linux-kernel@...r.kernel.org,
kprateek.nayak@....com, juri.lelli@...hat.com, vschneid@...hat.com,
tglx@...nel.org, dietmar.eggemann@....com, anna-maria@...utronix.de,
frederic@...nel.org, wangyang.guo@...el.com
Subject: Re: [PATCH v4 3/3] sched/fair: Remove nohz.nr_cpus and use weight of
cpumask instead
On Mon, 12 Jan 2026 at 06:05, Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
>
> nohz.nr_cpus was observed to be a contended cacheline when running
> an enterprise workload on large systems.
>
> The fundamental scalability challenge with nohz.idle_cpus_mask
> and nohz.nr_cpus is the following:
>
> (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
> (or nohz.idle_cpus_mask) and nohz.has_blocked to see whether there is
> any nohz balancing work to do, on every scheduler tick.
>
> (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
> (through nohz_balancer_kick() via sched_tick()) modify (write)
> nohz.nr_cpus (and/or nohz.idle_cpus_mask) and nohz.has_blocked.
>
> The characteristic frequencies are the following:
>
> (1) nohz_balancer_kick() happens at scheduler (busy) tick frequency
> on a CPU which has not gone idle. This is a relatively constant
> frequency in the ~1 kHz range or lower.
>
> (2) happens at idle enter/exit frequency on every CPU that goes idle.
> This is workload dependent, but can easily be hundreds of kHz for
> IO-bound loads and high CPU counts, i.e. it can be orders of magnitude
> higher than (1), in which case a cache miss at every invocation of (1)
> is almost inevitable. An idle exit will trigger (1) on the CPU
> which is coming out of idle.
>
> There are two types of costs from these functions:
>
> (A) scheduler tick cost via (1): this happens on busy CPUs too, and is
> thus a primary scalability cost. But the rate here is constant and
> typically much lower than (B), hence the absolute benefit to workload
> scalability will be lower as well.
>
> (B) idle cost via (2): going-to-idle and coming-from-idle costs are
> secondary concerns, because they impact power efficiency more than
> they impact scalability. But in terms of absolute cost this scales
> up with nr_cpus as well, at a much faster rate, and thus may also
> approach and negatively impact system limits like
> memory bus/fabric bandwidth.
>
> Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
> same cacheline; however, under CONFIG_CPUMASK_OFFSTACK=y the backing storage
> for nohz.idle_cpus_mask is elsewhere. With CONFIG_CPUMASK_OFFSTACK=n,
> nohz.idle_cpus_mask and the rest of the nohz fields are in different
> cachelines under typical NR_CPUS=512/2048. This implies two separate
> cachelines being dirtied upon idle entry/exit.
>
> nohz.nr_cpus can be derived from the mask itself, and its only user does
> not need an exact count, just whether the mask is empty. Dropping it means
> one less cacheline being dirtied in the idle entry/exit path, which saves
> some bus bandwidth w.r.t. those nohz functions (approx 50%). This in turn
> helps improve enterprise workload throughput.
>
> On a system with 480 CPUs, running "hackbench 40 process 10000 loops"
> (avg of 3 runs):
> baseline:
> 0.81% hackbench [k] nohz_balance_exit_idle
> 0.21% hackbench [k] nohz_balancer_kick
> 0.09% swapper [k] nohz_run_idle_balance
>
> With patch:
> 0.35% hackbench [k] nohz_balance_exit_idle
> 0.09% hackbench [k] nohz_balancer_kick
> 0.07% swapper [k] nohz_run_idle_balance
>
> [Ingo Molnar: scalability analysis changelog]
> Signed-off-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
This change makes sense to me, but I'm not convinced by patch 1.
You wrote in patch 1 that it doesn't provide any real benefit; what
are the figures with only patch 3?
> ---
> kernel/sched/fair.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c03f963f6216..3408a5beb95b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7144,7 +7144,6 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
>
> static struct {
> cpumask_var_t idle_cpus_mask;
> - atomic_t nr_cpus;
> int has_blocked_load; /* Idle CPUS has blocked load */
> int needs_update; /* Newly idle CPUs need their next_balance collated */
> unsigned long next_balance; /* in jiffy units */
> @@ -12466,7 +12465,7 @@ static void nohz_balancer_kick(struct rq *rq)
> * None are in tickless mode and hence no need for NOHZ idle load
> * balancing
> */
> - if (unlikely(!atomic_read(&nohz.nr_cpus)))
> + if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
> return;
>
> if (rq->nr_running >= 2) {
> @@ -12579,7 +12578,6 @@ void nohz_balance_exit_idle(struct rq *rq)
>
> rq->nohz_tick_stopped = 0;
> cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
> - atomic_dec(&nohz.nr_cpus);
>
> set_cpu_sd_state_busy(rq->cpu);
> }
> @@ -12637,7 +12635,6 @@ void nohz_balance_enter_idle(int cpu)
> rq->nohz_tick_stopped = 1;
>
> cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> - atomic_inc(&nohz.nr_cpus);
>
> /*
> * Ensures that if nohz_idle_balance() fails to observe our
> --
> 2.47.3
>