netdev - Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and tailcall handling in JIT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200714033630.2fw5wzljbkkfle3j@ast-mbp.dhcp.thefacebook.com>
Date:   Mon, 13 Jul 2020 20:36:30 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        bpf <bpf@...r.kernel.org>,
        Network Development <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>
Subject: Re: [RFC PATCH bpf-next 4/5] bpf, x64: rework pro/epilogue and
 tailcall handling in JIT

On Tue, Jul 14, 2020 at 03:00:45AM +0200, Maciej Fijalkowski wrote:
> On Fri, Jul 10, 2020 at 08:25:20PM -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 10, 2020 at 8:20 PM Alexei Starovoitov
> > <alexei.starovoitov@...il.com> wrote:
> > >
> > > Of course you are right.
> > > pop+nop+push is incorrect.
> > >
> > > How about the following instead:
> > > - during JIT:
> > > emit_jump(to_skip_below)  <- poke->tailcall_bypass
> 
> That's the jump to the instruction right after the poke->tailcall_target.

right. Mainly looking for better names than ip and ip_aux.

> > > pop_callee_regs
> > > emit_jump(to_tailcall_target) <- poke->tailcall_target
> 
> During JIT there's no tailcall_target so this will be nop5, right?

I thought it will be always jmp, but with new info I agree that
it will start with nop.

> 
> > >
> > > - Transition from one target to another:
> > > text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> > > if (new_jmp != NULL)
> > >   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> > > else
> > >   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> > 
> > One more correction. I meant:
> > 
> > if (new_jmp != NULL) {
> >   text_poke(poke->tailcall_target, MOD_JMP, old_jmp, new_jmp)
> 
> Problem with having the old_jmp here is that you could have the
> tailcall_target removed followed by the new program being inserted. So for
> that case old_jmp is NULL but we decided to not poke the
> poke->tailcall_target when removing the program, only the tailcall_bypass
> is poked back to jmp from nop. IOW old_jmp is not equal to what
> poke->tailcall_target currently stores. This means that
> bpf_arch_text_poke() would not be successful for this update and that is
> the reason of faking it in this patch.

got it.
I think it can be solved two ways:
1. add synchronize_rcu() after poking of tailcall_bypass into jmp
and then update tailcall_target into nop.
so the race you've described in cover letter won't happen.
In the future with sleepable progs we'd need to call sync_rcu_tasks_trace too.
Which will make poke_run even slower.

2. add a flag to bpf_arch_text_poke() to ignore 5 bytes in there
and update tailcall_target to new jmp.
The speed of poke_run will be faster,
but considering the speed of text_poke_bp() it's starting to feel like
premature optimization.

I think approach 1 is cleaner.
Then the pseudo code will be:
if (new_jmp != NULL) {
   text_poke(poke->tailcall_target, MOD_JMP, old ? old_jmp : NULL, new_jmp);
   if (!old)
     text_poke(poke->tailcall_bypass, MOD_JMP, bypass_addr, NULL /* into nop */);
} else {
   text_poke(poke->tailcall_bypass, MOD_JMP, NULL /* from nop */, bypass_addr);
   sync_rcu(); /* let progs finish */
   text_poke(poke->tailcall_target, MOD_JMP, old_jmp, NULL /* into nop */)
}

> 
> >   text_poke(poke->tailcall_bypass, MOD jmp into nop);
> > } else {
> >   text_poke(poke->tailcall_bypass, MOD nop into jmp);
> > }
> 
> I think that's what we currently (mostly) have. map_poke_run() is skipping
> the poke of poke->tailcall_target if new bpf_prog is NULL, just like
> you're proposing above. Of course I can rename the members in poke
> descriptor to names you're suggesting. I also assume that by text_poke you
> meant the bpf_arch_text_poke?

yep.

> 
> I've been able to hide the nop5 detection within the bpf_arch_text_poke so
> map_poke_run() is arch-independent in that approach. My feeling is that
> we don't need the old bpf_prog at all.
> 
> Some bits might change here due to the jump target alignment that I'm
> trying to introduce.

> Can you explain under what circumstances bpf_jit_binary_alloc() would not
> use get_random_int() ? Out of curiosity as from a quick look I can't tell
> when.

I meant when you're doing benchmarking get rid of that randomization
from bpf_jit_binary_alloc in your test kernel.

> I'm hitting the following check in do_jit():

I think aligning bypass_addr is a bit too much. Let it all be linear for now.
Since iTLB is sporadic it could be due to randomization and nothing to do
with additional jmp and unwind that this set is introducing.