linux-kernel - Re: [PATCH -tip] x86/locking/atomic: Use asm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFULd4btTdUrF6fTqafyViuaB+V8QD-s0pLE6XWb7BYzYAPmZA@mail.gmail.com>
Date: Thu, 6 Mar 2025 11:26:24 +0100
From: Uros Bizjak <ubizjak@...il.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, 
	Peter Zijlstra <peterz@...radead.org>, Thomas Gleixner <tglx@...utronix.de>, 
	Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, 
	"H. Peter Anvin" <hpa@...or.com>, Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH -tip] x86/locking/atomic: Use asm_inline for atomic
 locking insns

On Thu, Mar 6, 2025 at 10:57 AM Ingo Molnar <mingo@...nel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@...il.com> wrote:
>
> > According to:
> >
> >   https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html
> >
> > the usage of asm pseudo directives in the asm template can confuse
> > the compiler to wrongly estimate the size of the generated
> > code.
> >
> > The LOCK_PREFIX macro expands to several asm pseudo directives, so
> > its usage in atomic locking insns causes instruction length estimate
> > to fail significantly (the specially instrumented compiler reports
> > the estimated length of these asm templates to be 6 instructions long).
> >
> > This incorrect estimate further causes unoptimal inlining decisions,
> > unoptimal instruction scheduling and unoptimal code block alignments
> > for functions that use these locking primitives.
> >
> > Use asm_inline instead:
> >
> >   https://gcc.gnu.org/pipermail/gcc-patches/2018-December/512349.html
> >
> > which is a feature that makes GCC pretend some inline assembler code
> > is tiny (while it would think it is huge), instead of just asm.
> >
> > For code size estimation, the size of the asm is then taken as
> > the minimum size of one instruction, ignoring how many instructions
> > compiler thinks it is.
> >
> > The code size of the resulting x86_64 defconfig object file increases
> > for 33.264 kbytes, representing 1.2% code size increase:
> >
> >    text    data     bss     dec     hex filename
> > 27450107        4633332  814148 32897587        1f5fa33 vmlinux-old.o
> > 27483371        4633784  814148 32931303        1f67de7 vmlinux-new.o
> >
> > mainly due to different inlining decisions of -O2 build.
>
> So my request here would be not more benchmark figures (I don't think
> it's a realistic expectation for contributors to be able to measure
> much of an effect with such a type of change, let alone be certain
> what a macro or micro-benchmark measures is causally connected with
> the patch), but I'd like to ask for some qualitative analysis on the
> code generation side:
>
>  - +1.2% code size increase is a lot, especially if it's under the
>    default build flags of the kernel. Where does the extra code come
>    from?
>
>  - Is there any effect on Clang? Are its inlining decisions around
>    these asm() statements comparable, worse/better?

Bah, apparently I can't calculate... the code increase is in fact
0.14%, but still 33k.

bloat-o-meter shows top grows (> 500 bytes):

add/remove: 82/283 grow/shrink: 870/372 up/down: 76272/-43618 (32654)
Function                                     old     new   delta
copy_process                                6465   10191   +3726
balance_dirty_pages_ratelimited_flags        237    2949   +2712
icl_plane_update_noarm                      5800    7969   +2169
samsung_input_mapping                       3375    5170   +1795
ext4_do_update_inode.isra                      -    1526   +1526
__schedule                                  2416    3472   +1056
__i915_vma_resource_unhold                     -     946    +946
sched_mm_cid_after_execve                    175    1097    +922
__do_sys_membarrier                            -     862    +862
filemap_fault                               2666    3462    +796
nl80211_send_wiphy                         11185   11874    +689
samsung_input_mapping.cold                   900    1500    +600
virtio_gpu_queue_fenced_ctrl_buffer          839    1410    +571
ilk_update_pipe_csc                         1201    1735    +534
enable_step                                    -     525    +525
icl_color_commit_noarm                      1334    1847    +513
tg3_read_bc_ver                                -     501    +501

and top shrinks ( < 500 bytes):

nl80211_send_iftype_data                     580       -    -580
samsung_gamepad_input_mapping.isra.cold      604       -    -604
virtio_gpu_queue_ctrl_sgs                    724       -    -724
tg3_get_invariants                          9218    8376    -842
__i915_vma_resource_unhold.part              899       -    -899
ext4_mark_iloc_dirty                        1735     106   -1629
samsung_gamepad_input_mapping.isra          2046       -   -2046
icl_program_input_csc                       2203       -   -2203
copy_mm                                     2242       -   -2242
balance_dirty_pages                         2657       -   -2657
Total: Before=22770320, After=22802974, chg +0.14%

> A couple of concrete examples would go a long way:
>
>  - "Function XXX was inlined 3 times before the patch, and it was
>     inlined 30 times after the patch. I have reviewed two such inlining
>     locations, and they have added more code to unlikely or
>     failure-handling branches collected near the function epilogue,
>     while the fast-path of the function was more optimal."
>
> Or you might end up finding:
>
>  - "Function YYY was inlined 3x more frequently after the patch, but
>     the inlining decision increased register pressure and created less
>     optimal code in the fast-path, increasing both code size and likely
>     decreasing fast-path performance."
>
> Obviously we'd be sad about the second case, but it's well within the
> spectrum of possibilities when we look at "+1.2% object code size
> increase".
>
> What we cannot do is to throw up our hands and claim "-O2 trades
> performance for size, and thus this patch improves performance".
> We don't know that for sure and 30 years of kernel development
> created a love-and-hate relationship and a fair level of distrust
> between kernel developers and compiler inlining decisions,
> especially around x86 asm() statements ...
>
> So these are roughly the high level requirements around such patches.
> Does this make sense?

In my opinion, the writing above makes perfect sense. As far as I'm
concerned, I'm able and can do the above code analysis, the
problematic part would be precise performance measurements. Although
with your instructions, I can also try that.

Thanks,
Uros.