netdev - Re: 32-bit zext time complexity (Was Re: [PATCH bpf-next] selftests/bpf: two scale tests)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190427030512.zs3tfdudjbfpyawh@ast-mbp>
Date:   Fri, 26 Apr 2019 20:05:13 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Jiong Wang <jiong.wang@...ronome.com>
Cc:     Alexei Starovoitov <ast@...nel.org>, daniel@...earbox.net,
        netdev@...r.kernel.org, bpf@...r.kernel.org,
        Jakub Kicinski <jakub.kicinski@...ronome.com>,
        "oss-drivers@...ronome.com" <oss-drivers@...ronome.com>
Subject: Re: 32-bit zext time complexity (Was Re: [PATCH bpf-next]
 selftests/bpf: two scale tests)

On Fri, Apr 26, 2019 at 02:06:33PM +0100, Jiong Wang wrote:
> 
> Alexei Starovoitov writes:
> 
> > On Thu, Apr 25, 2019 at 08:25:44AM +0100, Jiong Wang wrote:
> >> 
> >> Alexei Starovoitov writes:
> >> 
> >> > On Thu, Apr 25, 2019 at 12:07:06AM +0100, Jiong Wang wrote:
> >> >> 
> >> >> Alexei Starovoitov writes:
> >> >> 
> >> >> > Add two tests to check that sequence of 1024 jumps is verifiable.
> >> >> >
> >> >> > Signed-off-by: Alexei Starovoitov <ast@...nel.org>
> >> >> > ---
> >> >> >  tools/testing/selftests/bpf/test_verifier.c  | 70 ++++++++++++++++++++
> >> >> >  tools/testing/selftests/bpf/verifier/scale.c | 18 +++++
> >> >> 
> >> >> I am rebasing 32-bit opt pass on top of latest bpf-next and found these new
> >> >> tests take more than 20 minutes to run and had not finished after that.
> >> >> 
> >> >> The reason the following insn filling insde bpf_fill_scale1 is generating
> >> >> nearly 1M insn whose results are recognized as safe to be poisoned.
> >> >> 
> >> >>   bpf_fill_scale1:
> >> >>     while (i < MAX_TEST_INSNS - 1025)
> >> >>       insn[i++] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 42);
> >> >> 
> >> >> For each hi32 poisoning, there will be one call to "bpf_patch_insn_data"
> >> >> which actually is not cheap (adjust jump insns, insn aux info etc). Now,
> >> >> 1M call to it has exhausted server resources as described, 20minutes running
> >> >> still not finished.
> >> >> 
> >> >> For real world applications, we don't do hi32 poisoning, and there isn't much
> >> >> lo32 zext. Benchmarking those bpf programs inside Cilium shows the final
> >> >> zext pass adds about 8% ~ 15% verification time.
> >> >> 
> >> >> The zext pass based on top of "bpf_patch_insn_data" looks more and more is
> >> >> not the best approach to utilize the read32 analysis results.
> >> >> 
> >> >> Previously, in v1 cover letter, I listed some of my other thoughts on how to
> >> >> utilize the liveness analysis results:
> >> >> 
> >> >>    1 Minor change on back-end JIT hook, also pass aux_insn information to
> >> >>      back-ends so they could have per insn information and they could do
> >> >>      zero extension for the marked insn themselves using the most
> >> >>      efficient native insn.
> >> >> 
> >> >>    2 Introduce zero extension insn for eBPF. Then verifier could insert
> >> >>      the new zext insn instead of lshift + rshift. zext could be JITed
> >> >>      more efficiently.
> >> >> 
> >> >>    3 Otherwise JIT back-ends need to do peephole to catch lshift + rshift
> >> >>      and turn them into native zext.
> >> >
> >> > all options sounds like hacks to workaround inefficient bpf_patch_insn_data().
> >> > Especially option 2 will work only because single insn is replaced
> >> > with another insn ?
> >> 
> >> Option 1 should be a generic solution. It is passing verifier analysis
> >> results generated by insn walk down to JIT back-ends. The information
> >> passed down could be any analysis result useful for JIT code-gen.
> >> 
> >> > Let's fix the algo of bpf_patch_insn_data() instead, so that 1 insn -> 2+ insn
> >> > is also fast.
> >> 
> >> The issue with 1 insn -> 2+ insn should be calling of bpf_adj_branches
> >> which is doing another for_each_insn_in_prog traversal, so the zext
> >> insertion becomes something like:
> >> 
> >>   for_each_insn_in_prog
> >>   ...
> >>      if (zext)
> >>      ...
> >>        for_each_insn_in_prog
> >> 
> >> which is quadratic. One solution is we chain all branch insns during
> >> previous insn traversal in for example cfg check, and keep the information
> >> somewhere info bpf_prog (env->insn_aux_data is a good place to keep such
> >> information, but insn patch helpers are supposed to work with bpf_prog)
> >> then bpf_adj_branches could traversal this chain instead of iterating
> >> through all insns.
> 
> Thanks very much for the feedbacks.
> 
> > I don't think it will make it much faster. There could be just as many
> > jumps as there are instructions.
> 
> Benchmarked a basic implemention on a couple of bpf programs in Cilium repo
> (the impl is a relocation list kept in a array built as by-product of
> check_subprogs. during patch, walk this relocation list instead of all
> insns. Then calculate time by 10 times run and take average)
> 
>  - The time spent on zero-extension pass is ~30% less
>  - The time spent on convert_ctx_accesses + fixup_bpf_call +
>    fixup_call_args is ~15% less
>  - The time spent on dead code elim pass is not changed which makes sense
>    as dead code is rare so patch infra is not triggered much.
> 
> So, looks like could help a little bit on daily program, but agree no
> fundamental improvements, it is still N(insn_needs_patch) * N(branch), and
> both N could go very big.
> 
> > Note that bpf_patch_insn_single() is calling bpf_adj_branches() twice too.
> > And dead_code + convert_ctx + fixup_bpf_calls are calling
> > bpf_patch_insn_single() a lot.
> > How about before dead_code pass we convert the program into basic-block
> > format, patch it all, and then convert from bb back to offsets.
> > Patching will become very cheap, since no loop over program will be
> > necessary. A jump from bb-N to bb-M will stay as-is regardless
> > of amount of patching was done inside each bb.
> > The loops inside these patching passes will be converted from:
> > for (i = 0; i < insn_cnt; i++, insn++)
> > into:
> > for each bb
> >   for each insn in bb
> 
> Interesting. If I am understanding correctly, BB then needs to support
> dynamic insn buffer resize. And after all insn patching finished, all BBs
> are finalized, we then linearized BBs (in a best order) to generate the
> final bpf image.

dynamic BB resize could be done similar to existing prog resize.
It grows in page increments.

> > As far as option 1 "also pass aux_insn information to JITs"...
> > in theory it's fine, but looks like big refactoring to existing code.
> 
> Will do quick explore, might turn out to be small change.
> 
> > So if you want to make this bb conversion as step 2 and unblock the
> > current patch set faster I suggest to go with option 2 "Introduce zero
> > extension insn".
> 
> A second think, even zero extension insn introduced, it is inserted after
> the sub-register write insn, so we are still doing insert *not*
> replacement, insn_delta inside bpf_patch_insn_single will be 1, so the slow
> path will always be taken (memmove + bpf_prog_realloc + 2 x
> bpf_adj_branches).

ahh. right.

> For the 1M-scale test, bpf_patch_insn_single is triggered because of hi32
> poisoning, not lo32 zext. So we just need to change MOV32 to MOV64 in the
> testcase which doesn't break the initial testing purpose of this testcase
> from my understanding. This then could avoid 1M call to
> bpf_patch_insn_single and pass the test after 32-bit opt patch set applied.
> 
> Without this change and with hi32 randomization enabled, scale tests will
> still hang before insn patch infra improved.
> 
> @@ -228,7 +228,7 @@ static void bpf_fill_scale1(struct bpf_test *self)           
> -             insn[i++] = BPF_ALU32_IMM(BPF_MOV, BPF_REG_0, 42);
> +             insn[i++] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 42);
> 
> This is change is not to paperover the underlying issue. We now know the
> existing insn patch infra doesn't scale to million level, so could work on
> improving it in the next step.

I'm hesitant to step back.
Do you see a program that can hit this patch_insn issue already?
(I mean without your hi32/lo32 zext changes).

> At the same time the issue exposed from hi32 poisoning does raise alarm
> that there could be the same issue for lo32 zext, therefore this patch set
> doesn't scale if there are lots of insns to be zero extended, even though
> this may not happen in real world bpf prog.
> 
> IMHO, avoid using insn patching when possible might always be better. So,
> if option 1 turns out to also generate clean patch set and introduce small
> changes, I am going to follow it in the update version.
> 
> Please let me know if you have different opinion.

if you can craft a test that shows patch_insn issue before your set,
then it's ok to hack bpf_fill_scale1 to use alu64.
I would also prefer to go with option 2 (new zext insn) for JITs.
I still don't have good feeling about option 1.
Exposing all of aux_data to JITs may become a headache
in the verifier development. It needs to be done carefully.