netdev - Re: [RFC PATCH net-next 1/2] bpf: Save original ebpf instructions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <589AF867.3020009@iogearbox.net>
Date:   Wed, 08 Feb 2017 11:52:23 +0100
From:   Daniel Borkmann <daniel@...earbox.net>
To:     David Ahern <dsa@...ulusnetworks.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>
CC:     Quentin Monnet <quentin.monnet@...nd.com>, netdev@...r.kernel.org,
        roopa@...ulusnetworks.com
Subject: Re: [RFC PATCH net-next 1/2] bpf: Save original ebpf instructions

On 02/07/2017 06:22 PM, David Ahern wrote:
> On 2/6/17 12:21 PM, Alexei Starovoitov wrote:
>>> Well, worst case cost would be ~8 additional pages per program that
>>> are very rarely used; assume you want to attach a prog per netdevice,
>>> or worse, one for ingress, one for egress ... and yet we already
>>> complain that netdevice itself is way too big and needs a diet ...
>>> That makes it much much worse, though. Using the remainder of the
>>> used page is thus not enough, I'm afraid.
>
> sure, 8 bytes per instruction, 4k instructions so worst case is ~8 pages.
>
>>> But also then, what do you do with 4k insns (or multiple of these in
>>> case of tail calls), is there a decompiler in the works? Understandably
>>> that for vrf it might be trivial to just read disasm, but that's just
>>> one very specific use-case. I can see the point of dumping for the
>
> Dumping bpf bytecode is no different than dumping asm with objdump. To
> those who know/learn how to read it and make sense of the asm, it is an
> important tool. I contend the ability to retrieve and dump BPF code is
> equally important.

I agree that dumping might be useful particularly with further
tooling on top (f.e. decompiler), but in any case the interface for
that needs to be designed carefully and generic, so it covers various
use cases (meaning also CRIU). Above, I was rather referring to what
an admin can make sense of when you have worst-case 4k insns with
complex control flow and you just dump the asm (or in case of [1] it
was pure opcodes) in, say, ss or tc; goal was on debugability and
the people debugging are not always kernel devs. Actually currently,
for cBPF dumps it looks like this in ss. Can you tell me what these
11 insns do? Likely you can, but can a normal admin?

# ss -0 -b
Netid  Recv-Q Send-Q                                       Local Address:Port                                                        Peer Address:Port
p_raw  0      0                                                        *:em1                                                                *
	bpf filter (11):  0x28 0 0 12, 0x15 0 8 2048, 0x30 0 0 23, 0x15 0 6 17, 0x28 0 0 20, 0x45 4 0 8191, 0xb1 0 0 14, 0x48 0 0 16, 0x15 0 1 68, 0x06 0 0 4294967295, 0x06 0 0 0,

Certainly that can and should be further improved (f.e. see bpf_disasm()
in tools/net/bpf_dbg.c) to make it actually more readable w/o any
external tooling that does not ship with iproute2. That could help
already, but is still hard. Oddly enough, for cBPF everyone was fine
so far with plain opcode dump it seems?! For eBPF, it could be same
kind of disasm that people are used from verifier with further debug
help when it comes to context access and helpers. So I'm not against
this, but if it's just plain opcodes smashed to the user w/o further
tooling/infra on top aiding the reverse engineering process, it's
pretty much obfuscated and unusable. Thus, both needs to be provided
here. Admittedly, I don't know how your iproute2 plans looked like.

   [1] https://github.com/6WIND/iproute2/commit/288af23f4dc0adac6c288b3f70ae25997ea5bebf

>>> sake of CRIU use-case given that cBPF is supported there in the past,
>>> and CRIU strives to catch up with all kind of kernel APIs. It's
>>> restricted wrt restore to either the same kernel version or newer
>>> kernels, but that's generic issue for CRIU. So far I'm not aware of
>>> user requests for CRIU support, but it could come in future, though.
>>> Thus, should we have some dump interface it definitely needs to be
>>> i) generic and ii) usable for CRIU, so we don't end up with multiple
>>> dump mechanisms due to one API not being designed good enough to
>>> already cover it.
>>>
>>> Dump result would need to look a bit similar to what we have in ELF
>>> file, that is, you'd need to construct map descriptions and reference
>>> them in the insn sequence similar to relocations in the loader. I
>
> That is a userspace problem, not a kernel problem. This discussion is on
> the API to return BPF code to userspace. How that program is presented
> to the user is up to the tools.

It's not a user space problem. When you want to CRIU your program,
then you also need the correlation to maps that are used in the program.
So the dump needs to export both, the map meta data and references to
it in the insns stream where src_reg is BPF_PSEUDO_MAP_FD. With map meta
data, I mean the attrs used to originally create the map via bpf(2), so
that you can later restore the same thing (taking the actual map data
aside for now as also not part of that specific api). Since a bpf prog
has all the maps referenced in its aux data, it should be doable.
Perhaps in case the map was pinned, that meta data should also contain
dev/indoe of the pinned map. There's still the issue of generating a
snapshot of map data itself, and to extract/dump prog insns from a tail
call map, though.

>>> haven't looked into whether it's feasible, but we should definitely
>>> evaluate possibilities to transform the post-verifier insn sequence
>>> back into a pre-verifier one for dumping, so we don't need to pay the
>>> huge additional cost (also a major fraction of the insns might be
>>> just duplicated here as they weren't rewritten). Perhaps it could
>>> be done with the help of small annotations, or some sort of diff
>>> that we only need to store. Not saying it's trivial, but we should
>>> definitely try hard first and design it carefully if we want to
>>> support such API.
>
> I looked at a reverse_convert_ctx_access at the end of last week. I only
> have code to reverse sock_filter_convert_ctx_access, but scanning the
> other convert functions nothing jumped out suggesting it is not doable.

Ok.

> Saving the original bpf code is the simplest solution, hence I decided
> to use it for this discussion.

Simplest but with downside of huge worst-case cost, see my earlier
comparison to bloating net devices, where the progs are eventually
attached to, at least in tc case. So lets find ways to avoid this big
overhead when designing this API.

>> +1
>>
>> such instruction dump is meaningful only for stateless programs.
>> Out of the top of my head the minium requirement is that they
>> shouldn't use maps and no tailcalls.
>> Also the extra memory is the concern, so I think we need a flag
>> for BPF_PROG_LOAD command like 'store stateless original prog'.
>> Then at load time we can check that maps are not used and so on.
>> The get_attach approach from patch 2 is imo too specific.
>> I think two independent commands would be better.
>> One to return new prog_fd from attachment point and
>> second to do the actual instruction dump based on given prog_fd.
>> Most of the programs today are stateful, so I'm reluctant to add
>> all that complexity for small fraction of programs, so I would
>> like to understand the motivation first.
>> 'ss --bpf' is imo part of the answer. It would need to print
>> attachment info as well, right? and this is for debugging only?
>> In case of 'ip vrf' the programs are trivial, so it's really
>> to debug 'ip vrf' only? and tracepoints somehow not enough?
>> Or there is more to it that I'm missing?
>
> My motivation for pursing this discussion now is to verify cgroups are
> correctly configured on running systems. But, I am also interested in
> the generic OS case and being able to debug and understand networking
> problems because of bugs in BPF code.
>
> For example, consider the case of enterprise distributions used by
> companies as part of their platforms and products with BPF programs
> coming from the OS vendor, the hardware vendors, and custom, locally
> developed programs (remember that BPF store comment I made in October? I
> was only half-joking). As an open OS, you have to anticipate programs
> coming from multiple sources, and those programs can and will have bugs.
> Further, people make mistakes; people forget commands that were run. It
> is fairly easy to see that the wrong bpf program could get attached to
> some object.
>
> In both the VRF/cgroup case and Quentin's case, we are talking about
> retrieving programs attached to specific objects, not just retrieving
> some BPF code that has been loaded into the kernel. bpf programs are
> loaded using the bpf system call, but they are attached to objects using
> different APIs. Given that, retrieval of the programs attached to
> specific objects need to be retrieved with object specific APIs. No?
> e.g., In Quentin's patch, the bpf code is added as part of the tc dump
> which seems appropriate. For the cgroup case, programs are attached to
> cgroups using the bpf system call hence an extension to it is needed to
> retrieve the attached program.
>
> Tracepoints are wrong here for multiple reasons: tracepoints do not and
> should not look at the bpf code, and auditing / verifying a config can
> not rely on executing code that can, hopefully, trigger the tracepoint
> to return some information that might shed light on a problem. That is
> not a reliable mechanism for verifying something is (or is not in the
> case of the default VRF) properly configured.
>