Message-ID: <15f8f8c6-df8f-4218-a650-eaa8f7581d67@linux.ibm.com>
Date: Tue, 2 Dec 2025 10:59:13 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: peterz@...radead.org, vincent.guittot@...aro.org,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
dietmar.eggemann@....com, vschneid@...hat.com, rostedt@...dmis.org,
tglx@...utronix.de, tim.c.chen@...ux.intel.com
Subject: Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask
instead
Hi Ingo,
Thanks for taking a look at this.
On 12/2/25 1:28 AM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
>
>> nohz_balance_enter_idle:
>> cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
>> atomic_inc(&nohz.nr_cpus)
>>
>> nohz_balance_exit_idle:
>> cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask)
>> atomic_dec(&nohz.nr_cpus)
>>
>> kick_ilb:
>> if (likely(!atomic_read(&nohz.nr_cpus)))
>> return;
>>
>> So, idle_cpus_mask contains the same information. Instead of doing a
>> costly atomic on large systems, it is better to check whether the cpumask is
>> empty to make the same decision to trigger idle load balance.
>>
>> There might be a race between the cpumask_empty check and the setting of the
>> cpumask on remote CPUs. In that case the idle load balance will be triggered
>> at the next tick. A race on clearing the bit is not a concern, since _nohz_idle_balance
>> checks whether the CPU is idle before doing the balance.
>>
>> cpumask_empty() uses ffs, so it should not be very costly.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
>
>> static struct {
>> cpumask_var_t idle_cpus_mask;
>> - atomic_t nr_cpus;
>> int has_blocked; /* Idle CPUS has blocked load */
>> int needs_update; /* Newly idle CPUs need their next_balance collated */
>> unsigned long next_balance; /* in jiffy units */
>> @@ -12450,7 +12449,7 @@ static void nohz_balancer_kick(struct rq *rq)
>> * None are in tickless mode and hence no need for NOHZ idle load
>> * balancing, do stats update if its due
>> */
>> - if (unlikely(!atomic_read(&nohz.nr_cpus)))
>> + if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
>> goto out;
>
> So the thing is, if the goal is to avoid cacheline
> bouncing, this won't fundamentally change the
> situation:
>
>> rq->nohz_tick_stopped = 0;
>> cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
>> - atomic_dec(&nohz.nr_cpus);
>
> nohz.idle_cpus_mask will be on a single 64-byte
> cacheline even on 512 CPU systems, and the
> cpumask_clear_cpu() and cpumask_set_cpu() calls will
> dirty the cacheline and make it bounce with exactly the
> same frequency as the atomic_inc/dec() of nohz.nr_cpus
> does today.
>
> From the 0/4 boilerplate description:
>
> > It was noted when running on large systems
> > nohz.nr_cpus cacheline was bouncing quite often.
> > There is atomic inc/dec and read happening on many
> > CPUs at a time and it is possible for this line to
> > bounce often.
>
> That the nr_cpus modification is an atomic op doesn't
> change the situation much in terms of cacheline
> bouncing, because the cacheline dirtying will still
> cause comparable levels of bouncing on modern CPUs with
> modern cache coherency protocols.
>
> If the idle_cpus_mask bits and nohz.nr_cpus are in separate cachelines,
> then this patch might eliminate about half of the
> bounces - but AFAICS they are right next to each other,
> so unless it's off-stack cpumasks, they should be in
> the same cacheline. Half of 'bad bouncing' is still
> kinda 'bad bouncing'. :-)
>
You are right. If we have to get rid of cacheline bouncing then
we need to fix nohz.idle_cpus_mask too.
I forgot about CPUMASK_OFFSTACK.
If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and nr_cpus are in the same
cacheline, right? The data in the cover letter is with =y. In that case,
switching to cpumask_empty() will give only minimal gains, by removing the
additional atomic inc/dec operations.
If CPUMASK_OFFSTACK=n, then they could be in different cachelines.
In that case the gains should be better. Very likely our performance team
would have run with =n.
IIRC, on powerpc (and arm64) CPUMASK_OFFSTACK is selected based on NR_CPUS,
while on x86 the NR_CPUS default depends on CPUMASK_OFFSTACK:
arm64/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS > 256
powerpc/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS >= 8192
x86/Kconfig: select CPUMASK_OFFSTACK
x86/Kconfig: default 8192 if SMP && CPUMASK_OFFSTACK
x86/Kconfig: default 512 if SMP && !CPUMASK_OFFSTACK
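
To make the layout point concrete, here is a rough userspace sketch (not the
actual kernel struct; names are simplified and the final placement of course
depends on the alignment of the whole struct): with an inline mask the bitmap
storage sits right next to nr_cpus, while with an off-stack mask only the
pointer does and the bits live in a separate allocation.

/* Rough userspace sketch, not the kernel code: shows what ends up next to
 * nr_cpus for an inline vs. an off-stack cpumask. Field names simplified. */
#include <stdio.h>
#include <stddef.h>

#define NR_CPUS 512

struct nohz_inline {                               /* CPUMASK_OFFSTACK=n */
	unsigned long mask[NR_CPUS / (8 * sizeof(unsigned long))];
	int nr_cpus;
};

struct nohz_offstack {                             /* CPUMASK_OFFSTACK=y */
	unsigned long *mask;                       /* bits allocated elsewhere */
	int nr_cpus;
};

int main(void)
{
	printf("inline:   nr_cpus at offset %zu, preceded by %zu bytes of mask bits\n",
	       offsetof(struct nohz_inline, nr_cpus),
	       sizeof(((struct nohz_inline *)0)->mask));
	printf("offstack: nr_cpus at offset %zu, preceded only by the mask pointer\n",
	       offsetof(struct nohz_offstack, nr_cpus));
	return 0;
}
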
In either case, conceptually:
nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
Since this is not a correctness issue, at worst we lose one chance to do the
idle load balance, and the next tick will do it instead.
Looking at the code, that can happen even today: first we set/clear the mask
and then we do the inc/dec. So if the mask is set, but the inc hasn't happened
yet when the read completes, we would lose a chance (though the window is very
slim).
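
Just to illustrate that existing window, here is a minimal, hypothetical
userspace model of the ordering (C11 atomics standing in for the kernel
primitives, not the actual fair.c code): the bit is published before the
counter is bumped, so a remote reader that checks the counter in between
skips the kick and waits for the next tick.

/* Hypothetical model of the ordering, not the kernel code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_ulong idle_cpus_mask;   /* stand-in for nohz.idle_cpus_mask */
static atomic_int nr_cpus;            /* stand-in for nohz.nr_cpus */

static void enter_idle(int cpu)
{
	atomic_fetch_or(&idle_cpus_mask, 1UL << cpu);  /* step 1: set the bit */
	/* a remote nohz_balancer_kick() running here still reads nr_cpus == 0
	 * and skips the kick, even though the mask already has the bit set */
	atomic_fetch_add(&nr_cpus, 1);                 /* step 2: bump the counter */
}

static bool should_kick(void)
{
	/* today's check is on the counter; the patch checks the mask instead */
	return atomic_load(&nr_cpus) != 0;
}

int main(void)
{
	enter_idle(3);
	printf("kick? %d\n", should_kick());
	return 0;
}
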
> I'm not really objecting to the patch, because it would
> reduce cacheline bouncing in the offstack-mask case,
> but the explanation isn't very clear about these
> details.
>
Let me rewrite the changelog, and also look into this a bit more.
> Thanks,
>
> Ingo