Message-ID: <20251219112007.2827302-3-edumazet@google.com>
Date: Fri, 19 Dec 2025 11:20:07 +0000
From: Eric Dumazet <edumazet@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, Uros Bizjak <ubizjak@...il.com>,
Linus Torvalds <torvalds@...ux-foundation.org>, x86@...nel.org,
"H . Peter Anvin" <hpa@...or.com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>, Eric Dumazet <eric.dumazet@...il.com>,
Eric Dumazet <edumazet@...gle.com>
Subject: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()
clang is generating very inefficient code for native_save_fl() which
is used for local_irq_save() in critical spots.
Allowing the "pop %0" to use memory:
1) forces the compiler to add annoying stack canaries in many places
   when CONFIG_STACKPROTECTOR_STRONG=y.
2) is almost always followed by an immediate "mov memory,register".
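To see the constraint difference outside the kernel, here is a minimal
userspace sketch. It assumes ASM_OUTPUT_RM expands to "=r" under clang and
to "=rm" otherwise; the local definition is a stand-in, not the kernel
header. save_flags_sketch() is a hypothetical name that mirrors
native_save_fl(), and the non-x86 branch exists only so the sketch compiles
on other architectures.

```c
/*
 * Userspace sketch only; the real macro lives in kernel headers.
 * Assumption: clang gets a register-only output constraint, other
 * compilers keep the "register or memory" choice.
 */
#if defined(__clang__)
#define ASM_OUTPUT_RM "=r"
#else
#define ASM_OUTPUT_RM "=rm"
#endif

/* Hypothetical stand-in for native_save_fl(). */
static inline unsigned long save_flags_sketch(void)
{
	unsigned long flags;

#if defined(__x86_64__) || defined(__i386__)
	/* Push the flags register, then pop it into the output operand. */
	__asm__ volatile("pushf ; pop %0"
			 : ASM_OUTPUT_RM (flags)
			 : /* no input */
			 : "memory");
#else
	/* Non-x86 fallback: only bit 1, which x86 always reports as set. */
	flags = 0x2UL;
#endif
	return flags;
}
```

Bit 1 of the returned value is always set on x86, since it is a reserved
flags bit that reads as 1. The kernel is built with -mno-red-zone while
userspace normally is not, so treat this as an illustration of the
constraint choice rather than code to copy.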
One good example is _raw_spin_lock_irqsave, with 8 extra instructions:
ffffffff82067a30 <_raw_spin_lock_irqsave>:
ffffffff82067a30: ...
ffffffff82067a39: 53 push %rbx
// Three instructions to adjust the stack, read the per-cpu canary
// and copy it to 8(%rsp)
ffffffff82067a3a: 48 83 ec 10 sub $0x10,%rsp
ffffffff82067a3e: 65 48 8b 05 da 15 45 02 mov %gs:0x24515da(%rip),%rax # <__stack_chk_guard>
ffffffff82067a46: 48 89 44 24 08 mov %rax,0x8(%rsp)
ffffffff82067a4b: 9c pushf
// instead of pop %rbx, compiler uses 2 instructions.
ffffffff82067a4c: 8f 04 24 pop (%rsp)
ffffffff82067a4f: 48 8b 1c 24 mov (%rsp),%rbx
ffffffff82067a53: fa cli
ffffffff82067a54: b9 01 00 00 00 mov $0x1,%ecx
ffffffff82067a59: 31 c0 xor %eax,%eax
ffffffff82067a5b: f0 0f b1 0f lock cmpxchg %ecx,(%rdi)
ffffffff82067a5f: 75 1d jne ffffffff82067a7e <_raw_spin_lock_irqsave+0x4e>
// three instructions to check the stack canary
ffffffff82067a61: 65 48 8b 05 b7 15 45 02 mov %gs:0x24515b7(%rip),%rax # <__stack_chk_guard>
ffffffff82067a69: 48 3b 44 24 08 cmp 0x8(%rsp),%rax
ffffffff82067a6e: 75 17 jne ffffffff82067a87
...
// One extra instruction to adjust the stack.
ffffffff82067a73: 48 83 c4 10 add $0x10,%rsp
...
// One more instruction in case the stack was mangled.
ffffffff82067a87: e8 a4 35 ff ff call ffffffff8205b030 <__stack_chk_fail>
This patch changes nothing for gcc, but for clang saves ~20000 bytes of text
even though more functions are inlined.
$ size vmlinux.gcc.before vmlinux.gcc.after vmlinux.clang.before vmlinux.clang.after
    text     data      bss      dec     hex filename
45565821 25005462  4704800 75276083 47c9f33 vmlinux.gcc.before
45565821 25005462  4704800 75276083 47c9f33 vmlinux.gcc.after
45121072 24638617  5533040 75292729 47ce039 vmlinux.clang.before
45093887 24638633  5536808 75269328 47c84d0 vmlinux.clang.after
$ scripts/bloat-o-meter -t vmlinux.clang.before vmlinux.clang.after
add/remove: 1/2 grow/shrink: 21/533 up/down: 2250/-22112 (-19862)
Function                                     old     new   delta
wakeup_cpu_via_vmgexit                      1002    1447    +445
rcu_tasks_trace_pregp_step                  1052    1454    +402
snp_kexec_finish                            1290    1527    +237
check_all_holdout_tasks_trace                909    1106    +197
x2apic_send_IPI_mask_allbutself               38     198    +160
hpet_set_rtc_irq_bit                         118     265    +147
x2apic_send_IPI_mask                          38     184    +146
ring_buffer_poll_wait                        261     405    +144
rb_watermark_hit                             253     386    +133
__unlikely_text_end                          368     416     +48
printk_trigger_flush                         262     298     +36
__softirqentry_text_end                        -      32     +32
pstore_dump                                 1145    1164     +19
printk_legacy_allow_panic_sync               159     178     +19
netlink_insert                               979     995     +16
console_try_replay_all                       268     283     +15
do_flush_tlb_all                             151     165     +14
__flush_tlb_all                              151     165     +14
synchronize_rcu_expedited                   2248    2259     +11
...
tcp_wfree                                    402     332     -70
stacktrace_trigger                           133      62     -71
w1_touch_bit                                 418     343     -75
w1_triplet                                   446     370     -76
link_create                                  980     902     -78
drain_dead_softirq_workfn                    425     347     -78
kcryptd_queue_crypt                          253     174     -79
perf_event_aux_pause                         448     368     -80
idle_worker_timeout                          320     240     -80
srcu_funnel_exp_start                        418     333     -85
call_rcu                                     751     666     -85
enable_IR_x2apic                             279     191     -88
bpf_link_free                                432     342     -90
synchronize_rcu                              497     403     -94
identify_cpu                                2665    2569     -96
ftrace_modify_all_code                       355     258     -97
load_gs_index                                212     104    -108
verity_end_io                                369     257    -112
bpf_prog_detach                              672     555    -117
__x2apic_send_IPI_mask                       552     275    -277
snp_cleanup_vmsa                             284       -    -284
__noinstr_text_start                        3072    1920   -1152
Total: Before=28577936, After=28558074, chg -0.07%
Signed-off-by: Eric Dumazet <edumazet@...gle.com>
Cc: Uros Bizjak <ubizjak@...il.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
---
v2: use ASM_OUTPUT_RM (Uros Bizjak)
v1: https://lore.kernel.org/lkml/CANn89iJ+HKXRn7qF4KrT6gghw6CwWcsvoj8Scw17CkCqhGbk=A@mail.gmail.com/T/#mc2322d458f07118580eca7c5fa1f0bc931c32d30
arch/x86/include/asm/irqflags.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index b30e5474c18e1be63b7c69354c26ae6a6cb02731..a1193e9d65f2000d6de88468bee58f2dae9c6cd5 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -25,7 +25,7 @@ extern __always_inline unsigned long native_save_fl(void)
 	 */
 	asm volatile("# __raw_save_flags\n\t"
 		     "pushf ; pop %0"
-		     : "=rm" (flags)
+		     : ASM_OUTPUT_RM (flags)
 		     : /* no input */
 		     : "memory");
--
2.52.0.322.g1dd061c0dc-goog