Message-ID: <4813cd95-a176-4f2c-b45e-2b83e2e5d533@oss.qualcomm.com>
Date: Fri, 15 Aug 2025 18:18:24 +0800
From: Zhongqiu Han <zhongqiu.han@....qualcomm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
zhongqiu.han@....qualcomm.com
Subject: Re: [PATCH] sched/fair: Save cpu id locally to avoid repeated
smp_processor_id() calls
Sorry, just resending as sending to linux-kernel@...r.kernel.org failed.
On 2025/8/15 18:01, Zhongqiu Han wrote:
On 2025/8/14 22:26, Peter Zijlstra wrote:
> On Thu, Aug 14, 2025 at 10:01:41PM +0800, Zhongqiu Han wrote:
>> Avoid repeated smp_processor_id() by saving cpu id in a local variable.
>>
>> - find_new_ilb(): func called with interrupts disabled.
>> - sched_cfs_period_timer(): cpu id saved after raw_spin_lock_irqsave().
>>
>> This improves clarity and reduces overhead without changing functionality.
Hi Peter,
Thanks for your review~
> No, this makes things actively worse. It:
>
> - violates coding style by declaring a variable in the middle of code
> - fetches the CPU number even if it's never used (hopefully the compiler
> can optimize this for you)
> - moves error path logic into the normal code path
Acked. I regret the oversight and sincerely apologize for any inconvenience
it may have caused.
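For my own understanding, would something along these lines have addressed
all three points? (Just a sketch of the sched_cfs_period_timer() warning
path, not a proposed v2: the cpu id is fetched only on the slow path where
it is actually used, and the declaration sits at the top of that block.)

	if (++count > 3) {
		/* slow path only: fetch the cpu id where it is used */
		int cpu = smp_processor_id();
		u64 new, old = ktime_to_ns(cfs_b->period);

		new = old * 2;
		if (new < max_cfs_quota_period) {
			...
			pr_warn_ratelimited(
				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
				cpu,
				div_u64(new, NSEC_PER_USEC),
				div_u64(cfs_b->quota, NSEC_PER_USEC));
		} else {
			/* ... same pattern for the cannot-scale-up warning ... */
		}
		...
	}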
>> Signed-off-by: Zhongqiu Han<zhongqiu.han@....qualcomm.com>
>> ---
>> kernel/sched/fair.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e256793b9a08..60a9830fb8a4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6401,6 +6401,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> int count = 0;
>>
>> raw_spin_lock_irqsave(&cfs_b->lock, flags);
>> + int cpu = smp_processor_id();
>> +
>> for (;;) {
>> overrun = hrtimer_forward_now(timer, cfs_b->period);
>> if (!overrun)
>> @@ -6424,13 +6426,13 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>>
>> pr_warn_ratelimited(
>> "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
>> - smp_processor_id(),
>> + cpu,
>> div_u64(new, NSEC_PER_USEC),
>> div_u64(cfs_b->quota, NSEC_PER_USEC));
>> } else {
>> pr_warn_ratelimited(
>> "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
>> - smp_processor_id(),
>> + cpu,
>> div_u64(old, NSEC_PER_USEC),
>> div_u64(cfs_b->quota, NSEC_PER_USEC));
>> }
>> @@ -12195,12 +12197,13 @@ static inline int find_new_ilb(void)
>> {
>> const struct cpumask *hk_mask;
>> int ilb_cpu;
>> + int this_cpu = smp_processor_id();
> This again violates coding style.
May I ask whether the coding style issue is the "this_cpu" naming?
Should I just use "int cpu = smp_processor_id();" instead?
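That is, keeping it grouped with the other declarations, something like
(sketch against the same hunk):

	const struct cpumask *hk_mask;
	int ilb_cpu;
	int cpu = smp_processor_id();

	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);

	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {

		if (ilb_cpu == cpu)
			continue;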
>> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>>
>> for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
>>
>> - if (ilb_cpu == smp_processor_id())
>> + if (ilb_cpu == this_cpu)
>> continue;
> And have you checked if the compiler did this lift for you? It can
> generally lift loads out of loops.
I compared the compiler output on ARM64 and x86, and found that on both
architectures the patch results in a small optimization. The short summary is:
For ARM64:
The instruction count is reduced, eliminating one ldr instruction.
Before Patch:
if (ilb_cpu == smp_processor_id())
b9401303 ldr w3, [x24, #16]
6b14007f cmp w3, w20
54000080 b.eq ffff8000800af6f4 <sched_balance_trigger+0x144>
After Patch:
if (ilb_cpu == cpu)
6b14033f cmp w25, w20
54000080 b.eq ffff8000800b7240 <sched_balance_trigger+0x150>
For X86:
The comparison becomes simpler: it no longer reads per-cpu memory via the
%gs segment and instead compares directly against a register.
Before Patch:
if (ilb_cpu == smp_processor_id())
65 39 1d 00 00 00 00 cmp %ebx,%gs:0x0(%rip)
74 ca je 11def <sched_balance_trigger+0xff>
After Patch:
if (ilb_cpu == cpu)
41 39 dd cmp %ebx,%r13d
74 ce je 11df7 <sched_balance_trigger+0x107>
However, there is a slight chance that this loop is never entered at all:
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask)
in which case the cpu id would be fetched without ever being used.
May I ask whether the optimization is still considered worthwhile in that case?
Thanks a lot~
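(The listings below are objdump-style disassembly with interleaved source;
they can be reproduced with something like "objdump -dS" on the built
kernel, though addresses and register allocation will differ per build.)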
The full disassembly for reference:
ARM64 before Patch:
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
52800040 mov w0, #0x2 // #2
94004ad1 bl ffff8000800c21e8 <housekeeping_cpumask>
aa0003f7 mov x23, x0
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
52800003 mov w3, #0x0 // #0
b9417302 ldr w2, [x24, #368]
d5384118 mrs x24, sp_el0
return _find_next_and_bit(addr1, addr2, size, offset);
f9402260 ldr x0, [x19, #64]
2a0203e2 mov w2, w2
93407c63 sxtw x3, w3
aa1703e1 mov x1, x23
9419b7cc bl ffff80008071d5f8 <_find_next_and_bit>
aa0003f4 mov x20, x0
b94002a2 ldr w2, [x21]
6b02001f cmp w0, w2
54000b02 b.cs ffff8000800af838 <sched_balance_trigger+0x288>
if (ilb_cpu == smp_processor_id())
b9401303 ldr w3, [x24, #16]
6b14007f cmp w3, w20
54000080 b.eq ffff8000800af6f4 <sched_balance_trigger+0x144>
ARM64 after Patch:
int cpu = smp_processor_id();
b9401019 ldr w25, [x0, #16]
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
52800040 mov w0, #0x2 // #2
94004b4b bl ffff8000800c9f20 <housekeeping_cpumask>
aa0003f7 mov x23, x0
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
b94dd302 ldr w2, [x24, #3536]
52800003 mov w3, #0x0 // #0
d503201f nop
return _find_next_and_bit(addr1, addr2, size, offset);
f9402260 ldr x0, [x19, #64]
2a0203e2 mov w2, w2
93407c63 sxtw x3, w3
aa1703e1 mov x1, x23
9419f83e bl ffff800080735310 <_find_next_and_bit>
aa0003f4 mov x20, x0
b94002a2 ldr w2, [x21]
6b02001f cmp w0, w2
54000a82 b.cs ffff8000800b7378 <sched_balance_trigger+0x288>
if (ilb_cpu == cpu)
6b14033f cmp w25, w20
54000080 b.eq ffff8000800b7240 <sched_balance_trigger+0x150>
X86 before Patch:
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
bf 02 00 00 00 mov $0x2,%edi
e8 00 00 00 00 call 11de8 <sched_balance_trigger+0xf8>
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
31 c9 xor %ecx,%ecx
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
49 89 c4 mov %rax,%r12
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
eb 03 jmp 11df2 <sched_balance_trigger+0x102>
8d 4b 01 lea 0x1(%rbx),%ecx
8b 15 00 00 00 00 mov 0x0(%rip),%edx
48 63 c9 movslq %ecx,%rcx
return _find_next_and_bit(addr1, addr2, size, offset);
4c 89 e6 mov %r12,%rsi
48 c7 c7 00 00 00 00 mov $0x0,%rdi
e8 00 00 00 00 call 11e0a <sched_balance_trigger+0x11a>
3b 05 00 00 00 00 cmp 0x0(%rip),%eax
48 89 c3 mov %rax,%rbx
41 89 c5 mov %eax,%r13d
0f 83 44 ff ff ff jae 11d60 <sched_balance_trigger+0x70>
if (ilb_cpu == smp_processor_id())
65 39 1d 00 00 00 00 cmp %ebx,%gs:0x0(%rip)
74 ca je 11def <sched_balance_trigger+0xff>
X86 after Patch:
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
bf 02 00 00 00 mov $0x2,%edi
int cpu = smp_processor_id();
65 44 8b 2d 00 00 00 00 mov %gs:0x0(%rip),%r13d
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
e8 00 00 00 00 call 11df0 <sched_balance_trigger+0x100>
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
31 c9 xor %ecx,%ecx
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
49 89 c4 mov %rax,%r12
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
eb 03 jmp 11dfa <sched_balance_trigger+0x10a>
8d 4b 01 lea 0x1(%rbx),%ecx
8b 15 00 00 00 00 mov 0x0(%rip),%edx
48 63 c9 movslq %ecx,%rcx
return _find_next_and_bit(addr1, addr2, size, offset);
4c 89 e6 mov %r12,%rsi
48 c7 c7 00 00 00 00 mov $0x0,%rdi
e8 00 00 00 00 call 11e12 <sched_balance_trigger+0x122>
3b 05 00 00 00 00 cmp 0x0(%rip),%eax
48 89 c3 mov %rax,%rbx
41 89 c6 mov %eax,%r14d
0f 83 3c ff ff ff jae 11d60 <sched_balance_trigger+0x70>
if (ilb_cpu == cpu)
41 39 dd cmp %ebx,%r13d
74 ce je 11df7 <sched_balance_trigger+0x107>
>> if (idle_cpu(ilb_cpu))
>> --
>> 2.43.0
Thx and BRs,
Zhongqiu Han