linux-kernel - Re: [tip:x86/urgent] bpf: Fix ORC unwinding in non-JIT BPF code

Message-ID: <CAADnVQKWDvzsvyjGoFvSQV7VGr2hF2zzCsC9vnpncWMxOJWYdw@mail.gmail.com>
Date:   Mon, 8 Jul 2019 15:49:33 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Josh Poimboeuf <jpoimboe@...hat.com>
Cc:     Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>, Song Liu <songliubraving@...com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Steven Rostedt <rostedt@...dmis.org>,
        Kairui Song <kasong@...hat.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Alexei Starovoitov <ast@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-tip-commits@...r.kernel.org
Subject: Re: [tip:x86/urgent] bpf: Fix ORC unwinding in non-JIT BPF code

On Mon, Jul 8, 2019 at 3:38 PM Josh Poimboeuf <jpoimboe@...hat.com> wrote:
>
> On Mon, Jul 08, 2019 at 03:15:37PM -0700, Alexei Starovoitov wrote:
> > > 2)
> > >
> > >   After doing the first optimization, GCC then does another one which is
> > >   a little trickier.  It replaces:
> > >
> > >         select_insn:
> > >                 jmp *jumptable(, %rax, 8)
> > >                 ...
> > >         ALU64_ADD_X:
> > >                 ...
> > >                 jmp *jumptable(, %rax, 8)
> > >         ALU_ADD_X:
> > >                 ...
> > >                 jmp *jumptable(, %rax, 8)
> > >
> > >   with
> > >
> > >         select_insn:
> > >                 mov jumptable, %r12
> > >                 jmp *(%r12, %rax, 8)
> > >                 ...
> > >         ALU64_ADD_X:
> > >                 ...
> > >                 jmp *(%r12, %rax, 8)
> > >         ALU_ADD_X:
> > >                 ...
> > >                 jmp *(%r12, %rax, 8)
> > >
> > >   The problem is that it only moves the jumptable address into %r12
> > >   once, for the entire function, then it goes through multiple recursive
> > >   indirect jumps which rely on that %r12 value.  But objtool isn't yet
> > >   smart enough to be able to track the value across multiple recursive
> > >   indirect jumps through the jump table.
> > >
> > >   After some digging I found that the quick and easy fix is to disable
> > >   -fgcse.  In fact, this seems to be recommended by the GCC manual, for
> > >   code like this:
> > >
> > >     -fgcse
> > >         Perform a global common subexpression elimination pass.  This
> > >         pass also performs global constant and copy propagation.
> > >
> > >         Note: When compiling a program using computed gotos, a GCC
> > >         extension, you may get better run-time performance if you
> > >         disable the global common subexpression elimination pass by
> > >         adding -fno-gcse to the command line.
> > >
> > >         Enabled at levels -O2, -O3, -Os.
> > >
> > >   This code indeed relies extensively on computed gotos.  I don't know
> > >   *why* disabling this optimization would improve performance.  In fact
> > >   I really don't see how it could make much of a difference either way.
> > >
> > >   Anyway, using -fno-gcse makes optimization #2 go away and makes
> > >   objtool happy, with only a fix for #1 needed.
> > >
> > >   If -fno-gcse isn't an option, we might be able to fix objtool by using
> > >   the "first_jump_src" thing which Peter added, improving it such that
> > >   it also takes table jumps into account.
> >
> > > Sorry for the delay. I'm mostly off-grid until next week.
> > > As for -fno-gcse, I don't mind as long as it doesn't hurt performance,
> > > which I suspect it will :(
> > > All these indirect gotos are there for performance.
> > > A single indirect goto plus a bunch of "jmp select_insn"
> > > is way slower, since there is only one instruction
> > > for the CPU branch predictor to work with.
> > > When every insn is followed by "jmp *jumptable",
> > > there is more room for the CPU to speculate.
> > > It's been a long time, but when I wrote it, the difference
> > > between all indirect gotos and a single indirect goto was almost 2x.
>
> Just to clarify, -fno-gcse doesn't get rid of any of the indirect jumps.
> It still has 166 indirect jumps.  It just gets rid of the second
> optimization, where the jumptable address is placed in a register.

What about the other functions in core.c?
Maybe it's easier to teach objtool to recognize that pattern?

> If you have a benchmark which is relatively easy to use, I could try to
> run some tests.

modprobe test_bpf
selftests/bpf/test_progs
Both print runtimes.
Some of the test_progs tests have high run-to-run variation, though.
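
For readers following the dispatch discussion above, here is a minimal
sketch of the two styles being compared, using GCC's computed-goto
extension. It is not the interpreter from core.c: the opcodes, struct
toy_insn, and both function names are hypothetical, chosen only to show
"every handler ends with its own indirect jump" versus "one shared
indirect jump at select_insn".

```c
#include <stdint.h>

/* Hypothetical toy opcodes; not the real BPF instruction set. */
enum { OP_ADD, OP_SUB, OP_EXIT };

struct toy_insn {
	uint8_t op;
	int64_t imm;
};

/* Threaded dispatch: each handler ends with its own indirect jump. */
static int64_t run_threaded(const struct toy_insn *insn)
{
	/* Label addresses (&&label) are a GCC extension. */
	static void *const jumptable[] = {
		[OP_ADD]  = &&do_add,
		[OP_SUB]  = &&do_sub,
		[OP_EXIT] = &&do_exit,
	};
	int64_t acc = 0;

#define DISPATCH() goto *jumptable[insn->op]
	DISPATCH();

do_add:
	acc += insn->imm;
	insn++;
	DISPATCH();	/* separate indirect jump for this handler */
do_sub:
	acc -= insn->imm;
	insn++;
	DISPATCH();	/* and another one here */
do_exit:
	return acc;
#undef DISPATCH
}

/* Central dispatch: all handlers jump back to one indirect jump. */
static int64_t run_central(const struct toy_insn *insn)
{
	static void *const jumptable[] = {
		[OP_ADD]  = &&do_add,
		[OP_SUB]  = &&do_sub,
		[OP_EXIT] = &&do_exit,
	};
	int64_t acc = 0;

select_insn:
	goto *jumptable[insn->op];	/* the only indirect jump */

do_add:
	acc += insn->imm;
	insn++;
	goto select_insn;
do_sub:
	acc -= insn->imm;
	insn++;
	goto select_insn;
do_exit:
	return acc;
}
```

Both functions compute the same result for a given toy program; only the
shape of the generated jumps differs. Whether the compiler keeps the
jump table address in a register across the handlers is up to its
optimization passes, and this sketch does not control that.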