Date:   Mon, 14 Feb 2022 15:53:24 -0800
From:   Nick Desaulniers <ndesaulniers@...gle.com>
To:     David Laight <David.Laight@...lab.com>,
        Bill Wendling <morbo@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>
Cc:     Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "H . Peter Anvin" <hpa@...or.com>,
        Nathan Chancellor <nathan@...nel.org>,
        Juergen Gross <jgross@...e.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Andy Lutomirski <luto@...nel.org>,
        "llvm@...ts.linux.dev" <llvm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3] x86: use builtins to read eflags

On Tue, Feb 8, 2022 at 1:14 AM David Laight <David.Laight@...lab.com> wrote:
>
> From: Nick Desaulniers
> > Sent: 07 February 2022 22:12
> >
> > On Thu, Feb 3, 2022 at 4:57 PM Bill Wendling <morbo@...gle.com> wrote:
> > >
> > > GCC and Clang both have builtins to read and write the EFLAGS register.
> > > This allows the compiler to determine the best way to generate this
> > > code, which can improve code generation.
> > >
> > > This issue arose due to Clang's issue with the "=rm" constraint.  Clang
> > > chooses to be conservative in these situations, and so uses memory
> > > instead of registers. This is a known issue, which is currently being
> > > addressed.
>
> How much performance would be lost by just using "=r"?

It adds undue register pressure.  When native_save_fl() is inlined
into callers, requiring the output operand to live in a register ties
that register up, so we may need to insert stack spills + reloads
around the asm stmt for values produced before it but consumed after
it, if we run out of registers to allocate for those live ranges.
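
As a contrived sketch (my own illustration, not kernel code) of what
"=r" forces:

static __always_inline unsigned long save_fl_reg_only(void)
{
	unsigned long flags;

	/*
	 * "=r": the popped value must land in a general purpose
	 * register; if the caller wants it in memory, the compiler
	 * must emit a separate store afterwards.
	 */
	asm volatile("pushf; pop %0" : "=r"(flags));
	return flags;
}

Inline that into a caller with many values live across the asm and the
register allocator may have to spill one of them to make room.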

Looking at the disassembly of native_save_fl() or calls to
__builtin_ia32_readeflags_u64() in isolation is less interesting than
what happens once native_save_fl() starts getting inlined into
callers.

>
> You get two instructions if the actual target is memory.
> This might be a marginal code size increase - but not much.
> It might also slow things down if the execution is limited
> by the instruction decoder.

"=rm" vs "=r" is more so about whether the operands to the asm are
required to be in memory or not, rather than number of instructions
generated.

>
> But on Intel cpu 'pop memory' is 2 uops, exactly the same
> as 'pop register' 'store register' (and I think amd is similar).
> So the actual execution time is exactly the same for both.

Page 10 of Agner Fog's
https://www.agner.org/optimize/instruction_tables.pdf
shows that a pop to memory is 3 "ops" while pop to register is 2. Page
9 defines "ops" as

    Number of macro-operations issued from instruction decoder to
schedulers. Instructions with more than 2 macro-operations use
microcode.
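
In other words, the two sequences being compared look roughly like
this (illustrative listing, not from an actual kernel build):

	pushf
	pop    0x8(%rsp)          # pop straight to memory: 3 "ops"

versus

	pushf
	pop    %rax               # pop to a register: 2 "ops"
	mov    %rax, 0x8(%rsp)    # plus a plain store, if one is needed at all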

Internally, it looks like we have benchmarks that show that
kmem_cache_free spends 5.18% of cycles in that pop-to-memory
instruction (looks like this is CONFIG_SLAB).  Further,
native_save_fl() is called by local_irq_save() all over the kernel.
After this patch, the hot spot disappears from the profile (the
push+pop drops from 5.18% of samples to 0.003%).

So I think those two notes above should answer the question
"why is any of this important?"

Further, page 10 also notes that push to memory is 2 "ops" and push to
register is 1, but that doesn't make a difference in the profiles.

Additionally, I compared the disassembly of kmem_cache_free
(CONFIG_SLAB) before and after this patch. There is no difference for
GCC, but you can see quite an improvement for clang:
https://gist.github.com/nickdesaulniers/7f143bae821d86603294bf00b8bd5074
This is the kind of info I would give Thomas when he asks about
changes to GCC codegen.

To the point I made last Friday about support for actually released
versions of clang, here's how v4 of the patch makes clang codegen
worse. This is ToT llvm with f3481f43bbe2c8a24e74210d310cf3be291bf52d
reverted. (That's the commit that fixed the forcibly inserted stack
frame).
https://gist.github.com/nickdesaulniers/b4d0f6e26f8cbef0ae4c5352cfeaca67

Specifically, CONFIG_UNWINDER_FRAME_POINTER=y is improved;
CONFIG_UNWINDER_ORC=y is made worse (for all released versions of
clang, since they don't yet have
f3481f43bbe2c8a24e74210d310cf3be291bf52d).

Therefore, v5 of this patch should have something like:

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index f31a035f3c6a..2d2d4d8ab823 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -19,7 +19,12 @@
 extern inline unsigned long native_save_fl(void);
 extern __always_inline unsigned long native_save_fl(void)
 {
-#ifdef CONFIG_X86_64
+/* clang would force frame pointers via the builtins until clang-14 */
+#if defined(CONFIG_CC_IS_CLANG) && defined(CONFIG_UNWINDER_ORC) && CONFIG_CLANG_VERSION < 140000
+       unsigned long flags;
+       asm volatile ("pushf; pop %0" : "=rm"(flags));
+       return flags;
+#elif defined(CONFIG_X86_64)
        return __builtin_ia32_readeflags_u64();
 #else
        return __builtin_ia32_readeflags_u32();

(though I would keep the existing code and comment for the complex
case, rather than blindly applying my full diff above; and I would
wait for https://github.com/llvmbot/llvm-project/pull/42 to actually
be merged before claiming clang was fixed. oh, looks like it just was.
I'd add `Link: https://github.com/llvm/llvm-project/issues/46875`
explicitly to the commit message.)

Then, once clang-14 is the minimum supported version of clang
according to Documentation/process/changes.rst, we can drop the
existing inline asm and simply retain the builtins.
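
That is, the end state would look roughly like this (same shape as the
hunk above, just with the clang fallback deleted):

extern inline unsigned long native_save_fl(void);
extern __always_inline unsigned long native_save_fl(void)
{
#ifdef CONFIG_X86_64
	return __builtin_ia32_readeflags_u64();
#else
	return __builtin_ia32_readeflags_u32();
#endif
}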

Thomas also asked in review of v1 that additional info about previous
commits be included in the commit message.
https://lore.kernel.org/llvm/87o85faum5.ffs@tglx/

Finally, if we need exhaustive proof that the builtins are better for
codegen, even with GCC, I think this case demonstrates it perfectly:
https://godbolt.org/z/5n3Eov1xT

Because compilers treat inline asm as a black box, they can't schedule
the instructions inside the inline asm together with the instructions
lowered from the surrounding C statements.  With the compiler builtins
they now can, because the flag read is lowered by the compiler itself
rather than passed through as opaque asm text.  Perhaps noting that
would help improve the first paragraph of v4's commit message.
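
A toy function (mine, in the spirit of the godbolt link rather than
copied from it) where that freedom is visible:

unsigned long toy(unsigned long x, unsigned long y)
{
	unsigned long flags = __builtin_ia32_readeflags_u64();

	/*
	 * With the builtin, the compiler may interleave the flag read
	 * with the multiply below; with a "pushf; pop" asm statement it
	 * must keep the blob intact and can only schedule around it.
	 */
	return flags + x * y;
}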

>
> Also it looks like clang's builtin is effectively "=r".

It's not; it's closer to how "=rm" should work: add register pressure,
and you may get a memory operand rather than a register one.  But the
builtin is resolved by different machinery than the inline asm is.
With inline asm, clang is conservative in the presence of both r and m
and picks m to avoid undue register pressure; r and m together are
treated as "r or m", with priority given to m.  In this case, we'd
prefer r's codegen when possible.

I read "=rm" as "r" or "m"; we'd prefer if llvm chose "r" for most
cases, but it will choose "m" conservatively which is legal by the
"or". "r" would be more precise for what we want. I'm not sure if
there are configs that could exhaust the register allocator though if
"m" was not specified as an option.

AFAICT
commit ab94fcf528d1 ("x86: allow "=rm" in native_save_fl()")
does not cite register exhaustion as a reason to permit "m", so
perhaps the version from
commit f1f029c7bfbf ("x86: fix assembly constraints in native_save_fl()")
would be preferred.  That said, who knows how "r" alone would affect
codegen elsewhere, which is why my suggested diff above retains "=rm"
as the constraint for the fallback code: "no functional change" for
released versions of clang.

> Compiling:
> long fl;
> void native_save_fl(void) {
>        fl = __builtin_ia32_readeflags_u64();
> }
> Not only generates a stack frame, it also generates:
> pushf; pop %rax; mov mem, %rax.

https://github.com/llvm/llvm-project/issues/46875

--
Thanks,
~Nick Desaulniers
