Message-ID: <4813cd95-a176-4f2c-b45e-2b83e2e5d533@oss.qualcomm.com>
Date: Fri, 15 Aug 2025 18:18:24 +0800
From: Zhongqiu Han <zhongqiu.han@....qualcomm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
zhongqiu.han@....qualcomm.com
Subject: Re: [PATCH] sched/fair: Save cpu id locally to avoid repeated
smp_processor_id() calls
Sorry, just resending as sending to linux-kernel@...r.kernel.org failed.
On 2025/8/15 18:01, Zhongqiu Han wrote:
On 2025/8/14 22:26, Peter Zijlstra wrote:
> On Thu, Aug 14, 2025 at 10:01:41PM +0800, Zhongqiu Han wrote:
>> Avoid repeated smp_processor_id() by saving cpu id in a local variable.
>>
>> - find_new_ilb(): func called with interrupts disabled.
>> - sched_cfs_period_timer(): cpu id saved after raw_spin_lock_irqsave().
>>
>> This improves clarity and reduces overhead without changing functionality.
Hi Peter,
Thanks for your review~
> No, this makes things actively worse. It:
>
> - violates coding style by declaring a variable in the middle of code
> - fetches the CPU number even if it's never used (hopefully the compiler
> can optimize this for you)
> - moves error path logic into the normal code path
Acked. I regret the oversight and sincerely apologize for any inconvenience
it may have caused.
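For my own understanding, would something along these lines have addressed
all three points? (Just a sketch of the sched_cfs_period_timer() warning
path, not a proposed v2: the cpu id is fetched only on the slow path where
it is actually used, and the declaration sits at the top of that block.)

	if (++count > 3) {
		/* slow path only: fetch the cpu id where it is used */
		int cpu = smp_processor_id();
		u64 new, old = ktime_to_ns(cfs_b->period);

		new = old * 2;
		if (new < max_cfs_quota_period) {
			...
			pr_warn_ratelimited(
				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
				cpu,
				div_u64(new, NSEC_PER_USEC),
				div_u64(cfs_b->quota, NSEC_PER_USEC));
		} else {
			/* ... same pattern for the cannot-scale-up warning ... */
		}
		...
	}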
>> Signed-off-by: Zhongqiu Han<zhongqiu.han@....qualcomm.com>
>> ---
>> kernel/sched/fair.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e256793b9a08..60a9830fb8a4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6401,6 +6401,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> int count = 0;
>>
>> raw_spin_lock_irqsave(&cfs_b->lock, flags);
>> + int cpu = smp_processor_id();
>> +
>> for (;;) {
>> overrun = hrtimer_forward_now(timer, cfs_b->period);
>> if (!overrun)
>> @@ -6424,13 +6426,13 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>>
>> pr_warn_ratelimited(
>> "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
>> - smp_processor_id(),
>> + cpu,
>> div_u64(new, NSEC_PER_USEC),
>> div_u64(cfs_b->quota, NSEC_PER_USEC));
>> } else {
>> pr_warn_ratelimited(
>> "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
>> - smp_processor_id(),
>> + cpu,
>> div_u64(old, NSEC_PER_USEC),
>> div_u64(cfs_b->quota, NSEC_PER_USEC));
>> }
>> @@ -12195,12 +12197,13 @@ static inline int find_new_ilb(void)
>> {
>> const struct cpumask *hk_mask;
>> int ilb_cpu;
>> + int this_cpu = smp_processor_id();
> This again violates coding style.
May I ask whether the coding style issue is the "this_cpu" naming?
Should I just use "int cpu = smp_processor_id();" instead?
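That is, keeping it grouped with the other declarations, something like
(sketch against the same hunk):

	const struct cpumask *hk_mask;
	int ilb_cpu;
	int cpu = smp_processor_id();

	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);

	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {

		if (ilb_cpu == cpu)
			continue;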
>> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>>
>> for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
>>
>> - if (ilb_cpu == smp_processor_id())
>> + if (ilb_cpu == this_cpu)
>> continue;
> And have you checked if the compiler did this lift for you? It can
> generally lift loads out of loops.
I compared the compiler output on ARM64 and x86, and found that on both
architectures the patch results in a small optimization. The short summary is:
For ARM64:
The instruction count is reduced, eliminating one ldr instruction.
Before Patch:
if (ilb_cpu == smp_processor_id())
b9401303 ldr w3, [x24, #16]
6b14007f cmp w3, w20
54000080 b.eq ffff8000800af6f4 <sched_balance_trigger+0x144>
After Patch:
if (ilb_cpu == cpu)
6b14033f cmp w25, w20
54000080 b.eq ffff8000800b7240 <sched_balance_trigger+0x150>
For X86:
The comparison becomes simpler: it no longer reads per-cpu memory via the
%gs segment and instead compares directly against a register.
Before Patch:
if (ilb_cpu == smp_processor_id())
65 39 1d 00 00 00 00 cmp %ebx,%gs:0x0(%rip)
74 ca je 11def <sched_balance_trigger+0xff>
After Patch:
if (ilb_cpu == cpu)
41 39 dd cmp %ebx,%r13d
74 ce je 11df7 <sched_balance_trigger+0x107>
However, there is a slight chance that this loop is never entered at all:
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask)
in which case the cpu id would be fetched without ever being used.
May I ask whether the optimization is still considered worthwhile in that case?
Thanks a lot~
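(The listings below are objdump-style disassembly with interleaved source;
they can be reproduced with something like "objdump -dS" on the built
kernel, though addresses and register allocation will differ per build.)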
The full disassembly for reference:
ARM64 before Patch:
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
52800040 mov w0, #0x2 // #2
94004ad1 bl ffff8000800c21e8 <housekeeping_cpumask>
aa0003f7 mov x23, x0
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
52800003 mov w3, #0x0 // #0
b9417302 ldr w2, [x24, #368]
d5384118 mrs x24, sp_el0
return _find_next_and_bit(addr1, addr2, size, offset);
f9402260 ldr x0, [x19, #64]
2a0203e2 mov w2, w2
93407c63 sxtw x3, w3
aa1703e1 mov x1, x23
9419b7cc bl ffff80008071d5f8 <_find_next_and_bit>
aa0003f4 mov x20, x0
b94002a2 ldr w2, [x21]
6b02001f cmp w0, w2
54000b02 b.cs ffff8000800af838 <sched_balance_trigger+0x288>
if (ilb_cpu == smp_processor_id())
b9401303 ldr w3, [x24, #16]
6b14007f cmp w3, w20
54000080 b.eq ffff8000800af6f4 <sched_balance_trigger+0x144>
ARM64 after Patch:
int cpu = smp_processor_id();
b9401019 ldr w25, [x0, #16]
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
52800040 mov w0, #0x2 // #2
94004b4b bl ffff8000800c9f20 <housekeeping_cpumask>
aa0003f7 mov x23, x0
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
b94dd302 ldr w2, [x24, #3536]
52800003 mov w3, #0x0 // #0
d503201f nop
return _find_next_and_bit(addr1, addr2, size, offset);
f9402260 ldr x0, [x19, #64]
2a0203e2 mov w2, w2
93407c63 sxtw x3, w3
aa1703e1 mov x1, x23
9419f83e bl ffff800080735310 <_find_next_and_bit>
aa0003f4 mov x20, x0
b94002a2 ldr w2, [x21]
6b02001f cmp w0, w2
54000a82 b.cs ffff8000800b7378 <sched_balance_trigger+0x288>
if (ilb_cpu == cpu)
6b14033f cmp w25, w20
54000080 b.eq ffff8000800b7240 <sched_balance_trigger+0x150>
X86 before Patch:
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
bf 02 00 00 00 mov $0x2,%edi
e8 00 00 00 00 call 11de8 <sched_balance_trigger+0xf8>
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
31 c9 xor %ecx,%ecx
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
49 89 c4 mov %rax,%r12
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
eb 03 jmp 11df2 <sched_balance_trigger+0x102>
8d 4b 01 lea 0x1(%rbx),%ecx
8b 15 00 00 00 00 mov 0x0(%rip),%edx
48 63 c9 movslq %ecx,%rcx
return _find_next_and_bit(addr1, addr2, size, offset);
4c 89 e6 mov %r12,%rsi
48 c7 c7 00 00 00 00 mov $0x0,%rdi
e8 00 00 00 00 call 11e0a <sched_balance_trigger+0x11a>
3b 05 00 00 00 00 cmp 0x0(%rip),%eax
48 89 c3 mov %rax,%rbx
41 89 c5 mov %eax,%r13d
0f 83 44 ff ff ff jae 11d60 <sched_balance_trigger+0x70>
if (ilb_cpu == smp_processor_id())
65 39 1d 00 00 00 00 cmp %ebx,%gs:0x0(%rip)
74 ca je 11def <sched_balance_trigger+0xff>
X86 after Patch:
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
bf 02 00 00 00 mov $0x2,%edi
int cpu = smp_processor_id();
65 44 8b 2d 00 00 00 00 mov %gs:0x0(%rip),%r13d
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
e8 00 00 00 00 call 11df0 <sched_balance_trigger+0x100>
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
31 c9 xor %ecx,%ecx
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
49 89 c4 mov %rax,%r12
for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
eb 03 jmp 11dfa <sched_balance_trigger+0x10a>
8d 4b 01 lea 0x1(%rbx),%ecx
8b 15 00 00 00 00 mov 0x0(%rip),%edx
48 63 c9 movslq %ecx,%rcx
return _find_next_and_bit(addr1, addr2, size, offset);
4c 89 e6 mov %r12,%rsi
48 c7 c7 00 00 00 00 mov $0x0,%rdi
e8 00 00 00 00 call 11e12 <sched_balance_trigger+0x122>
3b 05 00 00 00 00 cmp 0x0(%rip),%eax
48 89 c3 mov %rax,%rbx
41 89 c6 mov %eax,%r14d
0f 83 3c ff ff ff jae 11d60 <sched_balance_trigger+0x70>
if (ilb_cpu == cpu)
41 39 dd cmp %ebx,%r13d
74 ce je 11df7 <sched_balance_trigger+0x107>
>> if (idle_cpu(ilb_cpu))
>> --
>> 2.43.0
Thx and BRs,
Zhongqiu Han