Message-ID: <CAFULd4bt0ZjU7S+FKmSe6FG1OBPEgm1nyh_YG6=O0FazgBVaRw@mail.gmail.com>
Date:   Sat, 14 Oct 2023 12:34:14 +0200
From:   Uros Bizjak <ubizjak@...il.com>
To:     Ingo Molnar <mingo@...nel.org>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Nadav Amit <namit@...are.com>,
        Andy Lutomirski <luto@...nel.org>,
        Brian Gerst <brgerst@...il.com>,
        Denys Vlasenko <dvlasenk@...hat.com>,
        "H . Peter Anvin" <hpa@...or.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>
Subject: Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()

On Sat, Oct 14, 2023 at 12:04 PM Ingo Molnar <mingo@...nel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@...il.com> wrote:
>
> > Implement arch_raw_cpu_ptr() as a load from this_cpu_off and then
> > add the ptr value to the base. This way, the compiler can propagate
> > addend to the following instruction and simplify address calculation.
> >
> > E.g.: address calculation in amd_pmu_enable_virt() improves from:
> >
> >     48 c7 c0 00 00 00 00      mov    $0x0,%rax
> >       87b7: R_X86_64_32S      cpu_hw_events
> >
> >     65 48 03 05 00 00 00      add    %gs:0x0(%rip),%rax
> >     00
> >       87bf: R_X86_64_PC32     this_cpu_off-0x4
> >
> >     48 c7 80 28 13 00 00      movq   $0x0,0x1328(%rax)
> >     00 00 00 00
> >
> > to:
> >
> >     65 48 8b 05 00 00 00      mov    %gs:0x0(%rip),%rax
> >     00
> >       8798: R_X86_64_PC32     this_cpu_off-0x4
> >     48 c7 80 00 00 00 00      movq   $0x0,0x0(%rax)
> >     00 00 00 00
> >       87a6: R_X86_64_32S      cpu_hw_events+0x1328
> >
> > The compiler can also eliminate redundant loads from this_cpu_off,
> > reducing the number of percpu offset reads (either from this_cpu_off
> > or with rdgsbase) from 1663 to 1571.
> >
> > Additionally, the patch introduces 'rdgsbase' alternative for CPUs with
> > X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> > only decoding in the first decoder etc. But we're talking single-cycle
> > kind of effects, and the rdgsbase case should be much better from
> > a cache perspective and might use fewer memory pipeline resources to
> > offset the fact that it uses an unusual front end decoder resource...
>
> So the 'additionally' wording in the changelog should have been a big hint
> already that the introduction of RDGSBASE usage needs to be a separate
> patch. ;-)

Indeed. I think that the first part should be universally beneficial,
as it converts

mov symbol, %rax
add %gs:this_cpu_off, %rax

to:

mov %gs:this_cpu_off, %rax
add symbol, %rax

and allows the compiler to propagate the addition into the address
calculation (the latter is also similar to the code generated by the
__seg_gs approach).

At this point, the "experimental" part could either

a) introduce RDGSBASE:

As discussed with Sean, this could be problematic, at least with KVM,
and has some other drawbacks (e.g. larger binary size, limited CSE of
asm).

b) move to the __seg_gs approach via _raw_cpu_read [1]:

This approach solves the "limited CSE with assembly" compiler issue,
since it exposes the load to the compiler, and has greater
optimization potential.

[1] https://lore.kernel.org/lkml/20231010164234.140750-1-ubizjak@gmail.com/

Unfortunately, these two are mutually exclusive, since RDGSBASE is
implemented as asm.

To move things forward, I propose to proceed conservatively with the
original patch [1], but split into two parts. The first will introduce
the switch to MOV with tcp_ptr__ += (unsigned long)(ptr), and the
second will add the __seg_gs part.

At this point, we can experiment with RDGSBASE, and compare it with
both approaches, with and without __seg_gs, by just changing the asm
template to:

+       asm (ALTERNATIVE("mov " __percpu_arg(1) ", %0",        \
+                        "rdgsbase %0",                         \
+                        X86_FEATURE_FSGSBASE)                  \

Uros.
