Message-ID: <CAFULd4ZvQkHzvaa_cpbt1x_RAj_7FHb1FyPJxacVOrsK_55=xw@mail.gmail.com>
Date: Tue, 18 Mar 2025 13:35:28 +0100
From: Uros Bizjak <ubizjak@...il.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org, 
	Thomas Gleixner <tglx@...utronix.de>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, 
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v2] x86/asm: Use asm_inline() instead of asm() in __untagged_addr()

On Tue, Mar 18, 2025 at 12:21 AM Ingo Molnar <mingo@...nel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@...il.com> wrote:
>
> > Use asm_inline() to instruct the compiler that the size of the asm()
> > statement is the minimum size of one instruction, regardless of how
> > many instructions the compiler thinks it contains. The ALTERNATIVE
> > macro, which expands to several pseudo directives, causes the
> > instruction length estimate to count more than 20 instructions.
> >
> > bloat-o-meter reports minimal code size increase
> > (x86_64 defconfig with CONFIG_ADDRESS_MASKING, gcc-14.2.1):
> >
> >   add/remove: 2/2 grow/shrink: 5/1 up/down: 2365/-1995 (370)
> >
> >       Function                          old     new   delta
> >       -----------------------------------------------------
> >       do_get_mempolicy                    -    1449   +1449
> >       copy_nodes_to_user                  -     226    +226
> >       __x64_sys_get_mempolicy            35     213    +178
> >       syscall_user_dispatch_set_config  157     332    +175
> >       __ia32_sys_get_mempolicy           31     206    +175
> >       set_syscall_user_dispatch          29     181    +152
> >       __do_sys_mremap                  2073    2083     +10
> >       sp_insert                         133     117     -16
> >       task_set_syscall_user_dispatch    172       -    -172
> >       kernel_get_mempolicy             1807       -   -1807
> >
> >   Total: Before=21423151, After=21423521, chg +0.00%
> >
> > The code size increase is due to the compiler inlining
> > more functions that inline untagged_addr(), e.g.:
> >
> > task_set_syscall_user_dispatch() is now fully inlined in
> > set_syscall_user_dispatch():
> >
> >       000000000010b7e0 <set_syscall_user_dispatch>:
> >         10b7e0:       f3 0f 1e fa             endbr64
> >         10b7e4:       49 89 c8                mov    %rcx,%r8
> >         10b7e7:       48 89 d1                mov    %rdx,%rcx
> >         10b7ea:       48 89 f2                mov    %rsi,%rdx
> >         10b7ed:       48 89 fe                mov    %rdi,%rsi
> >         10b7f0:       65 48 8b 3d 00 00 00    mov    %gs:0x0(%rip),%rdi
> >         10b7f7:       00
> >         10b7f8:       e9 03 fe ff ff          jmp    10b600 <task_set_syscall_user_dispatch>
>
> So this was a tail-call optimization that jumped to a standalone
> <task_set_syscall_user_dispatch>, right? So now we'll avoid the
> tail-jump and maybe a bit of the register parameter shuffling? Which
> would explain the bloatometer delta of -172 for
> task_set_syscall_user_dispatch?

Yes, this is correct. The register shuffling is part of the call setup
(because parameters have to be passed to the called function in
specific registers, as mandated by the psABI). Inlining avoids this
because the values can then be freely allocated to any register of the
required class.

> Could you also cite the first relevant bits of <task_set_syscall_user_dispatch>?

000000000010b600 <task_set_syscall_user_dispatch>:
  10b600:    48 89 f8                 mov    %rdi,%rax
  10b603:    48 85 f6                 test   %rsi,%rsi
  10b606:    74 54                    je     10b65c <task_set_syscall_user_dispatch+0x5c>
  10b608:    48 83 fe 01              cmp    $0x1,%rsi
  10b60c:    74 06                    je     10b614 <task_set_syscall_user_dispatch+0x14>
  10b60e:    b8 ea ff ff ff           mov    $0xffffffea,%eax
  10b613:    c3                       ret
  10b614:    48 85 d2                 test   %rdx,%rdx
  10b617:    75 7b                    jne    10b694 <task_set_syscall_user_dispatch+0x94>
  10b619:    4d 85 c0                 test   %r8,%r8
  10b61c:    74 1a                    je     10b638 <task_set_syscall_user_dispatch+0x38>
  10b61e:    4c 89 c6                 mov    %r8,%rsi
  10b621:    48 bf ef cd ab 89 67     movabs $0x123456789abcdef,%rdi
  10b628:    45 23 01
  10b62b:    90                       nop
  10b62c:    90                       nop
  10b62d:    90                       nop
  10b62e:    90                       nop
  10b62f:    90                       nop
  10b630:    90                       nop
  10b631:    90                       nop
  10b632:    90                       nop
  10b633:    48 39 f7                 cmp    %rsi,%rdi
  10b636:    72 6e                    jb     10b6a6 <task_set_syscall_user_dispatch+0xa6>
  10b638:    4c 89 80 48 08 00 00     mov    %r8,0x848(%rax)
  10b63f:    48 89 90 50 08 00 00     mov    %rdx,0x850(%rax)
  10b646:    48 89 88 58 08 00 00     mov    %rcx,0x858(%rax)
  10b64d:    c6 80 60 08 00 00 00     movb   $0x0,0x860(%rax)
  10b654:    f0 80 48 08 20           lock orb $0x20,0x8(%rax)
  10b659:    31 c0                    xor    %eax,%eax
  10b65b:    c3                       ret
  10b65c:    49 09 c8                 or     %rcx,%r8
  10b65f:    49 09 d0                 or     %rdx,%r8
  10b662:    75 aa                    jne    10b60e <task_set_syscall_user_dispatch+0xe>
  10b664:    48 c7 87 48 08 00 00     movq   $0x0,0x848(%rdi)
  10b66b:    00 00 00 00
  10b66f:    48 c7 87 50 08 00 00     movq   $0x0,0x850(%rdi)
  10b676:    00 00 00 00
  10b67a:    48 c7 87 58 08 00 00     movq   $0x0,0x858(%rdi)
  10b681:    00 00 00 00
  10b685:    c6 87 60 08 00 00 00     movb   $0x0,0x860(%rdi)
  10b68c:    f0 80 67 08 df           lock andb $0xdf,0x8(%rdi)
  10b691:    31 c0                    xor    %eax,%eax
  10b693:    c3                       ret
  10b694:    48 8d 34 0a              lea    (%rdx,%rcx,1),%rsi
  10b698:    48 39 f2                 cmp    %rsi,%rdx
  10b69b:    0f 82 78 ff ff ff        jb     10b619 <task_set_syscall_user_dispatch+0x19>
  10b6a1:    e9 68 ff ff ff           jmp    10b60e <task_set_syscall_user_dispatch+0xe>
  10b6a6:    b8 f2 ff ff ff           mov    $0xfffffff2,%eax
  10b6ab:    c3                       ret

> I don't seem to be able to reproduce this inlining decision, my version
> of GCC is:
>
>   gcc version 14.2.0 (Ubuntu 14.2.0-4ubuntu2)
>
> which is one patch version older than your 14.2.1.
>
> I tried GCC 11, 12, 13 as well, but none did this tail optimization,
> all appear to be inlining <task_set_syscall_user_dispatch> into
> <set_syscall_user_dispatch>. What am I missing?

CONFIG_ADDRESS_MASKING in current -tip depends on:

  Depends on: X86_64 [=y] && (COMPILE_TEST [=n] || !CPU_MITIGATIONS [=y])

and I chose to disable CPU_MITIGATIONS (which is why no __pfx
functions are reported and why the compiler is able to perform the
sibling-call optimization).

BTW: I found the cited functions a nice illustration of how the
compiler inlines bigger functions into smaller ones (one would expect
the opposite), while still respecting the code-size growth criteria
for the combined function.

> Another question, where do the size increases in these functions come
> from:
>
> >       __x64_sys_get_mempolicy            35     213    +178

This one now inlines kernel_get_mempolicy().

> >       syscall_user_dispatch_set_config  157     332    +175

This one now inlines task_set_syscall_user_dispatch().

> >       __ia32_sys_get_mempolicy           31     206    +175

This one now inlines kernel_get_mempolicy().

> >       set_syscall_user_dispatch          29     181    +152

This one now inlines task_set_syscall_user_dispatch().

> (I have to ask, because I have trouble reproducing with my toolchain so
> I cannot look at this myself.)

BTW: a good indication of inlining changes w.r.t. __untagged_addr() is
the number of references to the tlbstate_untag_mask variable in the
.altinstr_replacement section; if it goes up, more inlining has
happened. Also, bloat-o-meter indicates which functions changed, and
looking for a run of 8 NOPs in a function shows where the AND
instruction is to be patched in.

Uros.
