netdev - Re: getting bpf_tail_call to work with bpf function calls. Was: [RFC PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation

Message-ID: <20200521040551.lnfaan6uszg2qjoh@ast-mbp.dhcp.thefacebook.com>
Date:   Wed, 20 May 2020 21:05:51 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc:     Daniel Borkmann <daniel@...earbox.net>, ast@...nel.org,
        bpf@...r.kernel.org, netdev@...r.kernel.org, bjorn.topel@...el.com,
        magnus.karlsson@...el.com, lmb@...udflare.com,
        john.fastabend@...il.com
Subject: Re: getting bpf_tail_call to work with bpf function calls. Was: [RFC
 PATCH bpf-next 0/1] bpf, x64: optimize JIT prologue/epilogue generation

On Mon, May 18, 2020 at 08:44:58PM +0200, Maciej Fijalkowski wrote:
> On Sat, May 16, 2020 at 09:32:27PM -0700, Alexei Starovoitov wrote:
> > On Wed, May 13, 2020 at 01:58:55PM +0200, Maciej Fijalkowski wrote:
> > > 
> > > So to me, if we would like to get rid of maxing out stack space, then we
> > > would have to do some dancing for preserving the tail call counter - keep
> > > it in some unused register? Or epilogue would pop it from stack to some
> > > register and target program's prologue would push it to stack from that
> > > register (I am making this up probably). And rbp/rsp would need to be
> > > created/destroyed during the program-to-program transition that happens
> > > via tailcall. That would mean also more instructions.
> > 
> > How about the following:
> > The prologue will look like:
> > nop5
> > xor eax,eax  // two new bytes if bpf_tail_call() is used in this function
> > push rbp
> > mov rbp, rsp
> > sub rsp, rounded_stack_depth
> > push rax // zero init tail_call counter
> > variable number of push rbx,r13,r14,r15
> > 
> > Then bpf_tail_call will pop variable number rbx,..
> > and final 'pop rax'
> > Then 'add rsp, size_of_current_stack_frame'
> > jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov rbp, rsp'
> > 
> > This way new function will set its own stack size and will init tail call
> > counter with whatever value the parent had.
> > 
> > If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
> > Instead it would need to have 'nop2' in there.
> > That's the only downside I see.
> > Any other ideas?
> 
> Not really - had a thought with Bjorn about using one callee-saved
> register that is yet unused by x64 JIT (%r12) and i was also thinking

people keep trying to use r12 for all sorts of things, but I'd like
to keep it reserved. I hope we can add R11 to bpf ISA one day.

> about some freaky usage of SSE register as a general purpose one. However,
> your idea is pretty neat - I gave it already a shot and with a single
> tweak I managed to get it working, e.g. selftests are fine as well as two
> samples that utilize tail calls. Note also that I got rid of the stack
> clamp being done in fixup_bpf_calls.
> 
> About a tweak:
> - RETPOLINE_RAX_BPF_JIT used for indirect tail calls needed to become a
>   RETPOLINE_RCX_BPF_JIT, so that we preserve the content of %rax across
>   jumping between programs via tail calls. I looked up GCC commit that
>   Daniel quoted on a patch that implements RETPOLINE_RAX_BPF_JIT and it
>   said that for register that is holding the address of function that we
>   will be jumping onto, we are free to use most of GP registers. I picked
>   %rcx.

Good catch. Indeed. We have to use rcx for retpoline.
rdi/rsi/rdx are used to pass args and bpf_tail_call() doesn't have
4th argument.
r8 could have been used, but it will take more bytes to encode.
so imo rcx is the only choice.
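
To illustrate the encoding cost: r8-r15 need a REX prefix on any instruction
that would otherwise have none. The plain indirect jump is the simplest
instance. The sketch below is not JIT code; emit_jmp_reg() and the REG_*
names are hypothetical, only the byte encodings are standard x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86-64 register numbers as used in ModRM/REX encoding (names are ours). */
enum { REG_RAX = 0, REG_RCX = 1, REG_R8 = 8 };

/*
 * Encode 'jmp *reg' (opcode FF /4, register-direct ModRM) into buf.
 * Returns the number of bytes written. buf must hold at least 3 bytes.
 */
size_t emit_jmp_reg(uint8_t *buf, int reg)
{
	size_t n = 0;

	assert(buf != NULL && reg >= 0 && reg <= 15);
	if (reg >= 8)
		buf[n++] = 0x41;                /* REX.B selects r8-r15 */
	buf[n++] = 0xff;                        /* opcode FF */
	buf[n++] = (uint8_t)(0xe0 | (reg & 7)); /* mod=11, /4, rm=reg */
	return n;
}
```

With these encodings 'jmp *%rcx' is FF E1 (2 bytes) and 'jmp *%r8' is
41 FF E0 (3 bytes).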

> I was also thinking about a minor optimization where we would replace the
> add/sub %rsp, $off32 with a nop7 if stack depth is 0.

Why are leaf functions special?
I've been thinking about it as well, but trampoline fentry/fexit can
be attached to bpf progs too and it would unnecessarily complicate
calling the original.
So I've discarded the nop7 idea.

Instead I was thinking to add useless two byte prefix to
either 'push rbp' or 'mov rbp, rsp'.
Like 0x66, so from cpu uops point of view it will stay single uop
to execute in the ooo pipeline, whereas nop2 is a separate uop.
But it's not clear whether decoder performance will be better
with a separate nop2, or whether 0x66 will add unnecessary stress on it.
Like pushing it into 'complex' opcode that can be done by only one exclusive
decoder vs 'simple' opcode that multiple decoders process in parallel.
Some microbenchmarking is needed. Though I'm not sure that the time spent
in such micro performance analysis is worth it :)
So imo nop2 is good enough.
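
A rough sketch of the prologue layout discussed in this thread, with nop2
standing in for 'xor eax,eax' when bpf_tail_call() is not used, so that the
tail call entry point sits at the same offset either way. This is not the
JIT code: emit_prologue() and struct prologue_info are hypothetical, rounding
the stack depth up to 8 is an assumption, and the variable callee-saved
pushes are left out. The byte encodings are standard x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct prologue_info {
	size_t len;             /* bytes emitted, excluding callee-saved pushes */
	size_t tail_call_entry; /* offset a tail call would jump to */
};

/* Copy n bytes to buf[pos] and return the new position. */
static size_t put(uint8_t *buf, size_t pos, const uint8_t *bytes, size_t n)
{
	memcpy(buf + pos, bytes, n);
	return pos + n;
}

/* buf must hold at least 19 bytes. */
struct prologue_info emit_prologue(uint8_t *buf, uint32_t stack_depth,
				   int uses_tail_call)
{
	static const uint8_t nop5[]        = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
	static const uint8_t xor_eax_eax[] = { 0x31, 0xc0 };
	static const uint8_t nop2[]        = { 0x66, 0x90 };
	static const uint8_t push_rbp[]    = { 0x55 };
	static const uint8_t mov_rbp_rsp[] = { 0x48, 0x89, 0xe5 };
	static const uint8_t push_rax[]    = { 0x50 };
	/* Assumption: "rounded" means rounded up to a multiple of 8. */
	uint32_t depth = (stack_depth + 7u) & ~7u;
	/* sub rsp, imm32: 48 81 EC followed by a little-endian immediate */
	uint8_t sub_rsp[7] = { 0x48, 0x81, 0xec,
			       (uint8_t)depth, (uint8_t)(depth >> 8),
			       (uint8_t)(depth >> 16), (uint8_t)(depth >> 24) };
	struct prologue_info info;
	size_t pos = 0;

	assert(buf != NULL);
	pos = put(buf, pos, nop5, sizeof(nop5));
	if (uses_tail_call)	/* zero rax, pushed below as the counter */
		pos = put(buf, pos, xor_eax_eax, sizeof(xor_eax_eax));
	else			/* same length, so offsets do not change */
		pos = put(buf, pos, nop2, sizeof(nop2));
	pos = put(buf, pos, push_rbp, sizeof(push_rbp));
	pos = put(buf, pos, mov_rbp_rsp, sizeof(mov_rbp_rsp));
	info.tail_call_entry = pos;	/* a tail call skips everything above */
	pos = put(buf, pos, sub_rsp, sizeof(sub_rsp));
	pos = put(buf, pos, push_rax, sizeof(push_rax));
	info.len = pos;
	return info;
}
```

In this layout the tail call entry is at offset 5 + 2 + 1 + 3 = 11, and
'sub rsp, imm32' is 7 bytes long, which is why a nop7 would have been its
drop-in replacement.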

> About a way forward - I reached out to Bjorn to co-operate on providing
> the benchmark for measuring the impact of new tail call handling as well
> as providing a proof in a form of selftests that bpf2bpf is working
> together with tail calls.
> 
> About a benchmark, we think that having tests for best and worst cases
> would tell us what is going on. So:
> - have a main program that is not using any of callee registers that will
>   be tailcalling onto another program that is also not using any of R6-R9.

I don't think such micro benchmark will be realistic. You can probably
craft one in assembler, but when there is a function call and 'ctx' is
passed into that function the llvm will use R6 at least.

> - have the same flow but both programs will be using R6, R7, R8, R9; main
>   program needs to use them because we will be popping these registers
>   before the tail call and target program will be doing pushes.

I would create a benchmark out of progs/tailcall[1-5].c and progs/bpf_flow.c
I think it will be good enough to assess performance before/after.

> Daniel, John, is there some Cilium benchmark that we could incorporate? I
> don't think we will be able to come up with a program that would mimic what you
> have previously described, e.g. 6 static jumps where every program would
> be utilizing every callee-saved register. Any help/pointers on how should
> we approach it would be very appreciated.

progs/bpf_flow.c is pretty much such an example, or am I missing something?
Every jump of tail_call is using r6 and r7 at least there.
But the bodies of the functions are not empty, so the more registers are
used, the more work the function is doing and the less noticeable the
overhead of the new tail_call will be.