linux-kernel - Re: [PATCH v3 3/4] x86/alternative: Rewrite optimize

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Y+QGIiOupDKxlKKR@hirez.programming.kicks-ass.net>
Date:   Wed, 8 Feb 2023 21:29:22 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Andrew.Cooper3@...rix.com
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org, mhiramat@...nel.org,
        kirill.shutemov@...ux.intel.com, jpoimboe@...hat.com
Subject: Re: [PATCH v3 3/4] x86/alternative: Rewrite optimize_nops() some

On Wed, Feb 08, 2023 at 07:52:04PM +0000, Andrew.Cooper3@...rix.com wrote:
> On 08/02/2023 5:10 pm, Peter Zijlstra wrote:
> > This rewrite address two issues:
> >
> >  - it no longer hard requires single byte nop runs, it now accepts
> >    any NOP and NOPL encoded instruction (but not the more complicated
> >    32bit NOPs).
> >
> >  - it writes a single 'instruction' replacement.
> >
> > Specifically, ORC unwinder relies on the tail NOP of an alternative to
> > be a single instruction, in particular it relies on the inner bytes
> > not being executed.
> >
> > Once we reach the max supported NOP length (currently 8, could easily
> > be extended to 11 on x86_64), switches to JMP.d8 and INT3 padding to
> > achieve the same result.
> >
> > The ORC unwinder uses this guarantee in the analysis of
> > alternative/overlapping CFI state,
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> 
> How lucky are you feeling for your game of performance roulette?

Yeah, not very lucky.. I've been talking about this with Boris for a bit
already.

> Unconditional jmps cost branch prediction these days, and won't be
> successfully predicted until taken.

IKR, insane, but that's what it is.

> There is a point after which a jmp is more efficient that brute forcing
> through a line of nops, and where this point is is very uarch specific,
> but it's not a single nop...

Yeah, I know, even very big nops. $I prefetch is good at straight lines
etc..

> Whether you care or not is a different matter, but at least be aware
> doing a jmp like this instead of e.g. 2 or 3 nops, is contrary to the
> prior advice given by the architects.

So, it is either this, or make the CFI conflict analysis done by objtool
a lot more 'interesting'. I figured I'd try this first.

As is this is actually a correctness issue, not a performance issue.
Goes hand in hand with the patches:

  https://lkml.kernel.org/r/20230208172245.711471461@infradead.org
  https://lkml.kernel.org/r/20230208172245.783099843@infradead.org

Specifically, for short alternatives objtool adds a single nop of
size difference size at the end. The advantage is that you only get one
CFI entry for that thing and so don't have potential conflict for any of
the inner bytes.

The alternative is making it multiple NOPs and ensuring objtool and the
kernel both agree on the length of them.

As is, there were only a hand full of NOPs that turned into this jmp.d8
thing on the defconfig+kvm_guest.config I build to test this -- it is by
no means a common thing. And about half of them would be gone by
extending the max nop length to at least 10 or so.

In fact, I did that patch once, lemme see if I still have it...