linux-kernel - Re: [tip: x86/asm] x86/percpu: Define {raw,this}_cpu_try

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wiOH-VK8XLUBU-=kzPij9X=m7HwnviXF-o8X54Z=Ey_xw@mail.gmail.com>
Date:   Fri, 15 Sep 2023 09:45:16 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     linux-kernel@...r.kernel.org
Cc:     linux-tip-commits@...r.kernel.org, Uros Bizjak <ubizjak@...il.com>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>, x86@...nel.org
Subject: Re: [tip: x86/asm] x86/percpu: Define {raw,this}_cpu_try_cmpxchg{64,128}

On Fri, 15 Sept 2023 at 04:25, tip-bot2 for Uros Bizjak
<tip-bot2@...utronix.de> wrote:
>
> Several places in mm/slub.o improve from e.g.:
>
[...]
>
> to:
>
>     53bc:       48 8d 4a 40             lea    0x40(%rdx),%rcx
>     53c0:       49 8b 1c 07             mov    (%r15,%rax,1),%rbx
>     53c4:       4c 89 f8                mov    %r15,%rax
>     53c7:       48 8d 37                lea    (%rdi),%rsi
>     53ca:       e8 00 00 00 00          call   53cf <...>
>                         53cb: R_X86_64_PLT32     this_cpu_cmpxchg16b_emu-0x4
>     53cf:       75 bb                   jne    538c <...>

Honestly, if y ou care deeply about this code sequence, I think you
should also move the "lea" out of the inline asm.

Both

    call this_cpu_cmpxchg16b_emu

and

    cmpxchg16b %gs:(%rsi)

are 5 bytes, and I suspect it's easiest to just always put the address
in %rsi - whether you call the function or not.

It doesn't really make the code generation for the non-call sequence
worse, and it gives the compiler more information (ie instead of
clobbering %rsi, the compiler knows what %rsi contains).

IOW, something like this:

-       asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call
this_cpu_cmpxchg16b_emu", \
+       asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",           \
...
-                   "c" (new__.high)                                    \
-                 : "memory", "rsi");                                   \
+                   "c" (new__.high),                                   \
+                   "S" (&_var)                                   \
+                 : "memory");                                          \

should do it.

Note that I think this is particularly true of the slub code, because
afaik, the slub code will *only* use the slow call-out.

Why? Because if the CPU actually supports the cmpxchgb16 instruction,
then the slub code won't even take this path at all - it will do the
__CMPXCHG_DOUBLE path, which does an unconditional locked cmpxchg16b.

Maybe I'm misreading it. And no, none of this matters. But since I saw
the patch fly by, and slub.o mentioned, I thought I'd point out how
silly this all is. It's optimizing a code-path that is basically never
taken, and when it *is* taken, it can be improved further, I think.

                   Linus