linux-kernel - Re: [tip: x86/asm] x86/percpu: Define {raw,this}_cpu_try

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFULd4YLzRWmXG2DHeRHYY33FrX+wNOVfQFyms5Mki1mmn0VZA@mail.gmail.com>
Date:   Sun, 17 Sep 2023 20:31:25 +0200
From:   Uros Bizjak <ubizjak@...il.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>, x86@...nel.org
Subject: Re: [tip: x86/asm] x86/percpu: Define {raw,this}_cpu_try_cmpxchg{64,128}

On Fri, Sep 15, 2023 at 6:45 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Fri, 15 Sept 2023 at 04:25, tip-bot2 for Uros Bizjak
> <tip-bot2@...utronix.de> wrote:
> >
> > Several places in mm/slub.o improve from e.g.:
> >
> [...]
> >
> > to:
> >
> >     53bc:       48 8d 4a 40             lea    0x40(%rdx),%rcx
> >     53c0:       49 8b 1c 07             mov    (%r15,%rax,1),%rbx
> >     53c4:       4c 89 f8                mov    %r15,%rax
> >     53c7:       48 8d 37                lea    (%rdi),%rsi
> >     53ca:       e8 00 00 00 00          call   53cf <...>
> >                         53cb: R_X86_64_PLT32     this_cpu_cmpxchg16b_emu-0x4
> >     53cf:       75 bb                   jne    538c <...>
>
> Honestly, if y ou care deeply about this code sequence, I think you
> should also move the "lea" out of the inline asm.

I have to say that the above asm code was shown mostly as an example
of the improvement, to illustrate how the compare sequence at the end
of the cmpxchg loop gets eliminated. Being a fairly mechanical change,
I didn't put much thought in the surrounding code.

> Both
>
>     call this_cpu_cmpxchg16b_emu
>
> and
>
>     cmpxchg16b %gs:(%rsi)
>
> are 5 bytes, and I suspect it's easiest to just always put the address
> in %rsi - whether you call the function or not.
>
> It doesn't really make the code generation for the non-call sequence
> worse, and it gives the compiler more information (ie instead of
> clobbering %rsi, the compiler knows what %rsi contains).
>
> IOW, something like this:
>
> -       asm qual (ALTERNATIVE("leaq %P[var], %%rsi; call
> this_cpu_cmpxchg16b_emu", \
> +       asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",           \
> ...
> -                   "c" (new__.high)                                    \
> -                 : "memory", "rsi");                                   \
> +                   "c" (new__.high),                                   \
> +                   "S" (&_var)                                   \
> +                 : "memory");                                          \
>
> should do it.

Yes, and the above change improves slub.o assembly from (current tip
tree with try_cmpxchg patch applied):

    53b3:    41 8b 44 24 28           mov    0x28(%r12),%eax
    53b8:    49 8b 3c 24              mov    (%r12),%rdi
    53bc:    48 8d 4a 40              lea    0x40(%rdx),%rcx
    53c0:    49 8b 1c 07              mov    (%r15,%rax,1),%rbx
    53c4:    4c 89 f8                 mov    %r15,%rax
    53c7:    48 8d 37                 lea    (%rdi),%rsi
    53ca:    e8 00 00 00 00           call   53cf <kmem_cache_alloc+0x9f>
            53cb: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
    53cf:    75 bb                    jne    538c <kmem_cache_alloc+0x5c>

to:

    53b3:    41 8b 44 24 28           mov    0x28(%r12),%eax
    53b8:    49 8b 34 24              mov    (%r12),%rsi
    53bc:    48 8d 4a 40              lea    0x40(%rdx),%rcx
    53c0:    49 8b 1c 07              mov    (%r15,%rax,1),%rbx
    53c4:    4c 89 f8                 mov    %r15,%rax
    53c7:    e8 00 00 00 00           call   53cc <kmem_cache_alloc+0x9c>
            53c8: R_X86_64_PLT32    this_cpu_cmpxchg16b_emu-0x4
    53cc:    75 be                    jne    538c <kmem_cache_alloc+0x5c>

where an effective reg-reg move "lea (%rdi), %rsi" at 537c gets
removed. And indeed, GCC figures out that %rsi holds the address of
the variable and emits:

   5:    65 48 0f c7 0e           cmpxchg16b %gs:(%rsi)

alternative replacement.

Now, here comes the best part: We can get rid of the %P modifier. With
named address spaces (__seg_gs), older GCCs had some problems with %P
and emitted "%gs:foo" instead of foo, resulting in "Warning: segment
override on `lea' is ineffectual" assembly warning. With the proposed
change, we use:

--cut here--
int __seg_gs g;

void foo (void)
{
  asm ("%0 %1" :: "m"(g), "S"(&g));
}
--cut here--

and get the desired assembly:

       movl    $g, %esi
       %gs:g(%rip) %rsi

The above is also in line with [1], where it is said that
"[__seg_gs/__seg_fs] address spaces are not considered to be subspaces
of the generic (flat) address space." So, cmpxchg16b_emu.S must use
%gs to apply segment base offset, which it does.

> Note that I think this is particularly true of the slub code, because
> afaik, the slub code will *only* use the slow call-out.
>
> Why? Because if the CPU actually supports the cmpxchgb16 instruction,
> then the slub code won't even take this path at all - it will do the
> __CMPXCHG_DOUBLE path, which does an unconditional locked cmpxchg16b.
>
> Maybe I'm misreading it. And no, none of this matters. But since I saw
> the patch fly by, and slub.o mentioned, I thought I'd point out how
> silly this all is. It's optimizing a code-path that is basically never
> taken, and when it *is* taken, it can be improved further, I think.

True, but as mentioned above, the slub.o code was used to illustrate
the effect of the patch. The new locking primitive should be usable in
a general way and could be also used in other places.

[1] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html#x86-Named-Address-Spaces

Uros.