lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89i+rNFaFmbJ-VMVmMvm6rJZ-tfHEmrPwXgfZtBbL_wbYBg@mail.gmail.com>
Date: Sun, 21 Dec 2025 22:47:04 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: "H. Peter Anvin" <hpa@...or.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, Uros Bizjak <ubizjak@...il.com>, 
	Linus Torvalds <torvalds@...ux-foundation.org>, x86@...nel.org, 
	linux-kernel <linux-kernel@...r.kernel.org>, Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [PATCH 2/2] x86/irqflags: Use ASM_OUTPUT_RM in native_save_fl()

On Sun, Dec 21, 2025 at 10:40 PM H. Peter Anvin <hpa@...or.com> wrote:
>
> On December 19, 2025 3:20:07 AM PST, Eric Dumazet <edumazet@...gle.com> wrote:
> >clang is generating very inefficient code for native_save_fl() which
> >is used for local_irq_save() in critical spots.
> >
> >Allowing the "pop %0" to use memory:
> >
> >1) forces the compiler to add annoying stack canaries when
> >   CONFIG_STACKPROTECTOR_STRONG=y in many places.
> >
> >2) Almost always is followed by an immediate "move memory,register"
> >
> >One good example is _raw_spin_lock_irqsave, with 8 extra instructions
> >
> >ffffffff82067a30 <_raw_spin_lock_irqsave>:
> >ffffffff82067a30:              ...
> >ffffffff82067a39:              53                                              push   %rbx
> >
> >// Three instructions to ajust the stack, read the per-cpu canary
> >// and copy it to 8(%rsp)
> >ffffffff82067a3a:              48 83 ec 10                     sub    $0x10,%rsp
> >ffffffff82067a3e:              65 48 8b 05 da 15 45 02 mov    %gs:0x24515da(%rip),%rax            # <__stack_chk_guard>
> >ffffffff82067a46:              48 89 44 24 08                  mov    %rax,0x8(%rsp)
> >
> >ffffffff82067a4b:              9c                                              pushf
> >
> >// instead of pop %rbx, compiler uses 2 instructions.
> >ffffffff82067a4c:              8f 04 24                                pop    (%rsp)
> >ffffffff82067a4f:              48 8b 1c 24                     mov    (%rsp),%rbx
> >
> >ffffffff82067a53:              fa                                              cli
> >ffffffff82067a54:              b9 01 00 00 00                  mov    $0x1,%ecx
> >ffffffff82067a59:              31 c0                                   xor    %eax,%eax
> >ffffffff82067a5b:              f0 0f b1 0f                     lock cmpxchg %ecx,(%rdi)
> >ffffffff82067a5f:              75 1d                                   jne    ffffffff82067a7e <_raw_spin_lock_irqsave+0x4e>
> >
> >// three instructions to check the stack canary
> >ffffffff82067a61:              65 48 8b 05 b7 15 45 02 mov    %gs:0x24515b7(%rip),%rax            # <__stack_chk_guard>
> >ffffffff82067a69:              48 3b 44 24 08                  cmp    0x8(%rsp),%rax
> >ffffffff82067a6e:              75 17                                   jne    ffffffff82067a87
> >
> >...
> >
> >// One extra instruction to adjust the stack.
> >ffffffff82067a73:              48 83 c4 10                     add    $0x10,%rsp
> >...
> >
> >// One more instruction in case the stack was mangled.
> >ffffffff82067a87:              e8 a4 35 ff ff                  call   ffffffff8205b030 <__stack_chk_fail>
> >
> >This patch changes nothing for gcc, but for clang saves ~20000 bytes of text
> >even though more functions are inlined.
> >
> >$ size vmlinux.gcc.before vmlinux.gcc.after vmlinux.clang.before vmlinux.clang.after
> >   text           data         bss             dec             hex     filename
> >45565821       25005462        4704800 75276083        47c9f33 vmlinux.gcc.before
> >45565821       25005462        4704800 75276083        47c9f33 vmlinux.gcc.after
> >45121072       24638617        5533040 75292729        47ce039 vmlinux.clang.before
> >45093887       24638633        5536808 75269328        47c84d0 vmlinux.clang.after
> >
> >$ scripts/bloat-o-meter -t vmlinux.clang.before vmlinux.clang.after
> >add/remove: 1/2 grow/shrink: 21/533 up/down: 2250/-22112 (-19862)
> >Function                                                                        old     new   delta
> >wakeup_cpu_via_vmgexit                                         1002    1447    +445
> >rcu_tasks_trace_pregp_step                                     1052    1454    +402
> >snp_kexec_finish                                                       1290    1527    +237
> >check_all_holdout_tasks_trace                           909    1106    +197
> >x2apic_send_IPI_mask_allbutself                          38     198    +160
> >hpet_set_rtc_irq_bit                                            118     265    +147
> >x2apic_send_IPI_mask                                             38     184    +146
> >ring_buffer_poll_wait                                           261     405    +144
> >rb_watermark_hit                                                        253     386    +133
> >__unlikely_text_end                                             368     416     +48
> >printk_trigger_flush                                            262     298     +36
> >__softirqentry_text_end                                           -      32     +32
> >pstore_dump                                                            1145    1164     +19
> >printk_legacy_allow_panic_sync                          159     178     +19
> >netlink_insert                                                          979     995     +16
> >console_try_replay_all                                          268     283     +15
> >do_flush_tlb_all                                                        151     165     +14
> >__flush_tlb_all                                                         151     165     +14
> >synchronize_rcu_expedited                                      2248    2259     +11
> >...
> >tcp_wfree                                                                       402     332     -70
> >stacktrace_trigger                                                      133      62     -71
> >w1_touch_bit                                                            418     343     -75
> >w1_triplet                                                                      446     370     -76
> >link_create                                                             980     902     -78
> >drain_dead_softirq_workfn                                       425     347     -78
> >kcryptd_queue_crypt                                             253     174     -79
> >perf_event_aux_pause                                            448     368     -80
> >idle_worker_timeout                                             320     240     -80
> >srcu_funnel_exp_start                                           418     333     -85
> >call_rcu                                                                        751     666     -85
> >enable_IR_x2apic                                                        279     191     -88
> >bpf_link_free                                                           432     342     -90
> >synchronize_rcu                                                         497     403     -94
> >identify_cpu                                                           2665    2569     -96
> >ftrace_modify_all_code                                          355     258     -97
> >load_gs_index                                                           212     104    -108
> >verity_end_io                                                           369     257    -112
> >bpf_prog_detach                                                         672     555    -117
> >__x2apic_send_IPI_mask                                          552     275    -277
> >snp_cleanup_vmsa                                                        284       -    -284
> >__noinstr_text_start                                           3072    1920   -1152
> >Total: Before=28577936, After=28558074, chg -0.07%
> >
> >Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> >Cc: Uros Bizjak <ubizjak@...il.com>
> >Cc: Linus Torvalds <torvalds@...ux-foundation.org>
> >---
> >v2: use ASM_OUTPUT_RM (Uros Bizjak)
> >v1: https://lore.kernel.org/lkml/CANn89iJ+HKXRn7qF4KrT6gghw6CwWcsvoj8Scw17CkCqhGbk=A@mail.gmail.com/T/#mc2322d458f07118580eca7c5fa1f0bc931c32d30
> >
> > arch/x86/include/asm/irqflags.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> >index b30e5474c18e1be63b7c69354c26ae6a6cb02731..a1193e9d65f2000d6de88468bee58f2dae9c6cd5 100644
> >--- a/arch/x86/include/asm/irqflags.h
> >+++ b/arch/x86/include/asm/irqflags.h
> >@@ -25,7 +25,7 @@ extern __always_inline unsigned long native_save_fl(void)
> >        */
> >       asm volatile("# __raw_save_flags\n\t"
> >                        "pushf ; pop %0"
> >-                       : "=rm" (flags)
> >+                       : ASM_OUTPUT_RM (flags)
> >                        : /* no input */
> >                        : "memory");
> >
> >--
> >2.52.0.322.g1dd061c0dc-goog
> >
>
> Maybe they should provide an intrinsic for this operation?

They already do provide it, I mentioned this in V1 discussion:

https://lore.kernel.org/lkml/CANn89i+3F6O30Tg5kh-+O2nFa3xJMsverm1bTh-K1G0vk=-gig@mail.gmail.com/

Unfortunately I was told it was not better, some clang versions had
bugs with it.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ