Message-ID: <15f8f8c6-df8f-4218-a650-eaa8f7581d67@linux.ibm.com>
Date: Tue, 2 Dec 2025 10:59:13 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: peterz@...radead.org, vincent.guittot@...aro.org,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
dietmar.eggemann@....com, vschneid@...hat.com, rostedt@...dmis.org,
tglx@...utronix.de, tim.c.chen@...ux.intel.com
Subject: Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask
instead
Hi Ingo,
Thanks for taking a look at this.
On 12/2/25 1:28 AM, Ingo Molnar wrote:
>
> * Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
>
>> nohz_balance_enter_idle:
>> cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
>> atomic_inc(&nohz.nr_cpus)
>>
>> nohz_balance_exit_idle:
>> cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask)
>> atomic_dec(&nohz.nr_cpus)
>>
>> kick_ilb:
>> if (likely(!atomic_read(&nohz.nr_cpus)))
>> return;
>>
>> So, idle_cpus_mask contains the same information. Instead of doing a
>> costly atomic on large systems, it is better to check whether the cpumask is
>> empty to make the same decision to trigger idle load balance.
>>
>> There might be a race between the cpumask_empty check and the setting of the
>> cpumask on remote CPUs. In that case the idle load balance will be triggered
>> at the next tick. A race on clearing the bit is not a concern, since _nohz_idle_balance
>> checks whether the CPU is idle before doing the balance.
>>
>> cpumask_empty() uses ffs, so it should not be very costly.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
>
>> static struct {
>> cpumask_var_t idle_cpus_mask;
>> - atomic_t nr_cpus;
>> int has_blocked; /* Idle CPUS has blocked load */
>> int needs_update; /* Newly idle CPUs need their next_balance collated */
>> unsigned long next_balance; /* in jiffy units */
>> @@ -12450,7 +12449,7 @@ static void nohz_balancer_kick(struct rq *rq)
>> * None are in tickless mode and hence no need for NOHZ idle load
>> * balancing, do stats update if its due
>> */
>> - if (unlikely(!atomic_read(&nohz.nr_cpus)))
>> + if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
>> goto out;
>
> So the thing is, if the goal is to avoid cacheline
> bouncing, this won't fundamentally change the
> situation:
>
>> rq->nohz_tick_stopped = 0;
>> cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
>> - atomic_dec(&nohz.nr_cpus);
>
> nohz.idle_cpus_mask will be on a single 64-byte
> cacheline even on 512 CPU systems, and the
> cpumask_clear_cpu() and cpumask_set_cpu() calls will
> dirty the cacheline and make it bounce with exactly the
> same frequency as the atomic_inc/dec() of nohz.nr_cpus
> does today.
>
> From the 0/4 boilerplate description:
>
> > It was noted when running on large systems
> > nohz.nr_cpus cacheline was bouncing quite often.
> > There is atomic inc/dec and read happening on many
> > CPUs at a time and it is possible for this line to
> > bounce often.
>
> That the nr_cpus modification is an atomic op doesn't
> change the situation much in terms of cacheline
> bouncing, because the cacheline dirtying will still
> cause comparable levels of bouncing on modern CPUs with
> modern cache coherency protocols.
>
> If the idle_cpus_mask bits and nohz.nr_cpus are in separate cachelines,
> then this patch might eliminate about half of the
> bounces - but AFAICS they are right next to each other,
> so unless it's off-stack cpumasks, they should be in
> the same cacheline. Half of 'bad bouncing' is still
> kinda 'bad bouncing'. :-)
>
You are right. If we have to get rid of cacheline bouncing then
we need to fix nohz.idle_cpus_mask too.
I forgot about CPUMASK_OFFSTACK.
If CPUMASK_OFFSTACK=y, then both idle_cpus_mask and nr_cpus are in the same
cacheline, right? The data in the cover letter is with =y. In that case,
switching to cpumask_empty() will give only minimal gains, by removing the
additional atomic inc/dec operations.
If CPUMASK_OFFSTACK=n, then they could be in different cachelines.
In that case the gains should be better. Very likely our performance team
would have run with =n.
IIRC, on powerpc (and arm64) CPUMASK_OFFSTACK is selected based on NR_CPUS,
while on x86 the NR_CPUS default depends on CPUMASK_OFFSTACK:
arm64/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS > 256
powerpc/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS >= 8192
x86/Kconfig: select CPUMASK_OFFSTACK
x86/Kconfig: default 8192 if SMP && CPUMASK_OFFSTACK
x86/Kconfig: default 512 if SMP && !CPUMASK_OFFSTACK
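
To make the layout point concrete, here is a rough userspace sketch (not the
actual kernel struct; names are simplified and the final placement of course
depends on the alignment of the whole struct): with an inline mask the bitmap
storage sits right next to nr_cpus, while with an off-stack mask only the
pointer does and the bits live in a separate allocation.

/* Rough userspace sketch, not the kernel code: shows what ends up next to
 * nr_cpus for an inline vs. an off-stack cpumask. Field names simplified. */
#include <stdio.h>
#include <stddef.h>

#define NR_CPUS 512

struct nohz_inline {                               /* CPUMASK_OFFSTACK=n */
	unsigned long mask[NR_CPUS / (8 * sizeof(unsigned long))];
	int nr_cpus;
};

struct nohz_offstack {                             /* CPUMASK_OFFSTACK=y */
	unsigned long *mask;                       /* bits allocated elsewhere */
	int nr_cpus;
};

int main(void)
{
	printf("inline:   nr_cpus at offset %zu, preceded by %zu bytes of mask bits\n",
	       offsetof(struct nohz_inline, nr_cpus),
	       sizeof(((struct nohz_inline *)0)->mask));
	printf("offstack: nr_cpus at offset %zu, preceded only by the mask pointer\n",
	       offsetof(struct nohz_offstack, nr_cpus));
	return 0;
}
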
In either case, conceptually:
nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)
Since this is not a correctness issue, at worst we lose one chance to do the
idle load balance, and the next tick will do it instead.
Looking at the code, that can happen even today: first we set/clear the mask
and then we do the inc/dec. So if the mask is set, but the inc hasn't happened
yet when the read completes, we would lose a chance (though the window is very
slim).
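
Just to illustrate that existing window, here is a minimal, hypothetical
userspace model of the ordering (C11 atomics standing in for the kernel
primitives, not the actual fair.c code): the bit is published before the
counter is bumped, so a remote reader that checks the counter in between
skips the kick and waits for the next tick.

/* Hypothetical model of the ordering, not the kernel code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_ulong idle_cpus_mask;   /* stand-in for nohz.idle_cpus_mask */
static atomic_int nr_cpus;            /* stand-in for nohz.nr_cpus */

static void enter_idle(int cpu)
{
	atomic_fetch_or(&idle_cpus_mask, 1UL << cpu);  /* step 1: set the bit */
	/* a remote nohz_balancer_kick() running here still reads nr_cpus == 0
	 * and skips the kick, even though the mask already has the bit set */
	atomic_fetch_add(&nr_cpus, 1);                 /* step 2: bump the counter */
}

static bool should_kick(void)
{
	/* today's check is on the counter; the patch checks the mask instead */
	return atomic_load(&nr_cpus) != 0;
}

int main(void)
{
	enter_idle(3);
	printf("kick? %d\n", should_kick());
	return 0;
}
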
> I'm not really objecting to the patch, because it would
> reduce cacheline bouncing in the offstack-mask case,
> but the explanation isn't very clear about these
> details.
>
Let me rewrite the changelog, and also look into this a bit more.
> Thanks,
>
> Ingo