Date:   Tue, 3 Apr 2018 16:37:19 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Edward Cree <ecree@...arflare.com>
Cc:     Daniel Borkmann <daniel@...earbox.net>, netdev@...r.kernel.org
Subject: Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call
 simplifications

On Tue, Apr 03, 2018 at 02:39:11PM +0100, Edward Cree wrote:
> On 03/04/18 02:08, Alexei Starovoitov wrote:
> > I like patch 3 and am going to play with it.
> > How did you test it?
> Just test_verifier and test_progs (the latter has a failure
>  "test_bpf_obj_id:FAIL:get-prog-info(fd) err 0 errno 2 i 0 type 1(1) info_len 80(40) jit_enabled 0 jited_prog_len 0 xlated_prog_len 72"
>  but that was already there before my patch).

hmm, that doesn't fail for me, and none of the other bots complained.
Are you sure you're running the latest kernel and tests?

You may see an issue with the build-id test:
test_stacktrace_build_id:FAIL:get build_id with readelf err -1 errno 2
but that's due to the build-id actually being missing from the binary.

> > Do you have processed_insn numbers for
> > cilium or selftests programs before and after?
> > There should be no difference, right?
> That's a good check, I'll do that.
> 
> > As far as patch 1, it was very difficult to review, since several logically
> > different things are lumped together. So breaking it apart:
> > - converting the two arrays of subprog_starts and subprog_stack_depth into
> >   a single array of struct bpf_subprog_info is a good thing
> > - tsort is interesting, but I'm not sure it's correct. more below
> > - but the main change of combining subprog logic with do_check is no good.
> <snip>
> > There will be no 'do_check() across the whole program' walk.
> > Combining the subprog pass with do_check goes in the opposite direction
> > of this long-term work. Divide and conquer: combining more things into
> > do_check is the opposite of that programming principle.
> The main object of my change here was to change the data structures, to
>  store a subprogno in each insn aux rather than using bsearch() on the
>  subprog_starts.  I have now figured out the algorithm to do this in its
>  own pass (previously I thought this needed a recursive walk which is why
>  I wanted to roll it into do_check() - doing more than one whole-program
>  recursive walk seems like a bad idea.)

hmm. what's wrong with bsearch? It's trivial and fast.
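For reference, the kind of lookup in question is only a few lines. A simplified
sketch (not the actual in-tree helper; the name and signature here are made up):

    /* Return the index of the subprog containing insn_idx, i.e. the
     * largest i with subprog_starts[i] <= insn_idx, or -1 if none.
     * subprog_starts[] is sorted in ascending order.
     */
    static int subprog_of(const int *subprog_starts, int nr, int insn_idx)
    {
        int lo = 0, hi = nr - 1, ans = -1;

        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;

            if (subprog_starts[mid] <= insn_idx) {
                ans = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return ans;
    }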

> > My short term plan is to split basic instruction correctness checks
> > out of do_check() loop into separate pass and run it early on.
> I agree with that short term plan, sounds like a good idea.
> I'm still not sure I understand the long-term plan, though; since most
>  insns' input registers will still need to be checked (I'm assuming
>  the bulk of most real eBPF programs consists of computing and
>  dereferencing pointers), the data flow analysis will have to end up
>  doing all the same register updates the current do_check() does (though
>  potentially in a different order), e.g. if a function is called three
>  times it will have to analyse it with three sets of input registers.
> Unless you have some way of specifying function preconditions, I don't
>  see how it works.  In particular something like
>     char *f(char *p)
>     {
>         *p++ = 0;
>         return p;
>     }
>     int main(void)
>     {
>         char *p = "abc"; /* represents getting a ptr from ctx or map */
>         p = f(p);
>         p = f(p);
>         p = f(p);
>         return 0;
>     }
>  seems as though it would be difficult to analyse in any way more
>  scalable than the current full recursive walk.  Unless you somehow
>  propagate register state _backwards_ as constraints when analysing a
>  function...?  In any case it seems like there are likely to be things
>  which current verifier accepts which require 'whole-program' analysis
>  to determine the dataflow (e.g. what if there were some conditional
>  branches in f(), and the preconditions on p depended on the value of
>  some other arg in r2?)

The main point is that we have to make the verifier scale. That's my #1 priority.
Even if we don't see the solution today, we have to work towards it.
I believe adopting compiler-style analysis is the right way forward.
Independent passes that determine and _keep_ info about subprograms,
basic blocks, the cfg, the dom graph, liveness, and dataflow are necessary steps
not only for the main purpose of the verifier (the safety check),
but also for the additional analysis that JITs, like NFP, have to do.
"Walk all instructions and apply a single transformation" is a _good thing_.
llvm has tens, if not hundreds, of loops like:
for_each_bb_in_func(bb, func)
  for_each_insn_in_bb(insn, bb)
    do stuff with insn
Compiler designers could have combined several such passes into
fewer ones, but that isn't done, because it increases complexity and
causes tough bugs where a pass is only partially complete. For the same
reasons the cfg and/or dataflow is sometimes recomputed independently of a
transformation instead of being updated during that pass.
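To make the shape concrete, here is roughly what such a single-purpose pass
looks like when written out (sketch only; the struct names below are made up
for illustration, not the in-kernel types):

    /* Hypothetical IR types for illustration only. */
    struct insn { int opcode; };
    struct bb   { struct insn *insns; int n_insns; };
    struct func { struct bb *bbs; int n_bbs; };

    /* One pass, one job: count how many insns use a given opcode.
     * It reads the IR, computes one fact, and touches nothing else.
     */
    static int count_opcode(const struct func *f, int opcode)
    {
        int i, j, count = 0;

        for (i = 0; i < f->n_bbs; i++) {
            const struct bb *bb = &f->bbs[i];

            for (j = 0; j < bb->n_insns; j++)
                if (bb->insns[j].opcode == opcode)
                    count++;
        }
        return count;
    }

Each such pass can be reviewed, tested and recomputed in isolation, which is
exactly why compilers keep them separate.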

As far as scaling the verifier, the number to shoot for is 1M bpf instructions.
For certain program types, like XDP, we will probably always keep the 4k insn limit,
but in some cases, where the program runs in user context, we can let it run longer.
There, 64k insns would be fine. The verifier is already capable of injecting code
into the program. It could easily inject cond_resched after every 4k insns or so.
We'd need to make progs preemptible and be smarter with rcu-based map lookups.
It's all solvable and easy to do as long as the core of the verifier scales.
The prime example where more than 4k instructions and loops are mandatory
is user space stack analysis inside the program. Walking a python stack, for
example, requires non-trivial pointer chasing. With 'pragma unroll' the stack depth
limit today is ~18. That's not really usable. Looping through 100 python
frames would require about 16k bpf assembler instructions. That's 16k JITed
cpu instructions. It's plenty fast and safe in user context, but if we increase
the bpf max_insn limit to 16k, I'm afraid the verifier will start falling apart
due to the insn_processed counter.
Hence the do_check approach must go. The rough idea is to compute, per basic
block, the set of INs (registers and stack) that the basic block needs
to see to be safe, and the corresponding set of OUTs,
then propagate this knowledge across cfg edges.
Once we have such a set per bpf function, it essentially answers the question
'what arguments does this function need to see to be safe, and what does it return?'
To make bpf libraries scale we'd need to keep such information
around after verification, so that dynamic linking and indirect calls
are fast to verify.
This is all very high level, obviously; there are many gotchas to resolve.
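As a very rough illustration of the IN/OUT propagation (nothing more than a
sketch: register state is collapsed to a bitmask of "needs to be initialized",
the struct and function names are made up, and real sets would carry types,
ranges and stack slots):

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_SUCCS 2

    /* Hypothetical per-basic-block summary. */
    struct bb_info {
        uint16_t use;            /* regs read before being written in this bb */
        uint16_t def;            /* regs written in this bb */
        uint16_t in, out;        /* computed IN/OUT sets */
        int succ[MAX_SUCCS];     /* successor bb indices, -1 if unused */
    };

    /* Iterate to a fixed point over the cfg:
     * OUT(bb) = union of IN(succ), IN(bb) = use | (OUT & ~def).
     */
    static void compute_in_out(struct bb_info *bbs, int n)
    {
        bool changed = true;

        while (changed) {
            changed = false;
            for (int i = 0; i < n; i++) {
                uint16_t out = 0, in;
                int s;

                for (s = 0; s < MAX_SUCCS; s++)
                    if (bbs[i].succ[s] >= 0)
                        out |= bbs[bbs[i].succ[s]].in;

                in = bbs[i].use | (out & ~bbs[i].def);
                if (in != bbs[i].in || out != bbs[i].out) {
                    bbs[i].in = in;
                    bbs[i].out = out;
                    changed = true;
                }
            }
        }
    }

The IN set of a function's entry block is then the "what does this function
need to see to be safe" summary mentioned above.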

> > As far as tsort approach for determining max stack depth...
> > It's an interesting idea, but the implementation suffers from the same
> > 'combine everything into one loop' coding issue.
> > I think the total_stack_depth math should be separate from the sorting,
> > since the same function can be part of different stacks with different maxima.
> > I don't see how updating the global subprog_info[i].total_stack_depth can
> > be correct. It has to be different for every stack, and the longest
> > stack is not necessarily the deepest. Maybe I'm missing something,
> > but the combined sort and stack_depth math didn't make it easy to review.
> > I find the existing check_max_stack_depth() algo much easier to understand.
> The sort _is_ the way to compute total stack depth.  The sort order isn't
>  being stored anywhere; it's being done just so that each subprog gets
>  looked at after all its callers have been considered.  So when it gets
>  selected as a 'frontier node', its maximum stack depth is known, and can
>  thus be used to update its callees (note that we do a max_t() with each
>  callee's existing total_stack_depth, thus getting the deepest stack of
>  all call chains to the function).
> It may help to imagine drawing the call graph and labelling each node with
>  a stack depth as it is visited; sadly that's difficult to show in an email
>  (or a code comment).  But I can try to explain it a bit better than
>  "/* Update callee total stack depth */".

Please do, since that's my concern with tsort.
The verifier is the key piece of bpf infra, and to be effective maintainers
we need to thoroughly understand the verifier code.
We cannot just take the patch based on the cover letter. The author may
disappear tomorrow, and then what are we going to do with the code?
Hence today, with the information I have, I prefer to keep the current
check_max_stack_depth() as-is.
In other words, it looks to me that the tsort complexity is not justified.
