netdev - Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers

Message-ID: <CAEf4BzaYYhK8PpO4Swcj0dqjYg+bn_3OkEnqjCXUgfkkHZgWMw@mail.gmail.com>
Date:   Mon, 13 Apr 2020 12:59:54 -0700
From:   Andrii Nakryiko <andrii.nakryiko@...il.com>
To:     Yonghong Song <yhs@...com>
Cc:     Andrii Nakryiko <andriin@...com>, bpf <bpf@...r.kernel.org>,
        Martin KaFai Lau <kafai@...com>,
        Networking <netdev@...r.kernel.org>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Kernel Team <kernel-team@...com>
Subject: Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers

On Fri, Apr 10, 2020 at 5:23 PM Yonghong Song <yhs@...com> wrote:
>
>
>
> On 4/10/20 4:25 PM, Andrii Nakryiko wrote:
> > On Wed, Apr 8, 2020 at 4:26 PM Yonghong Song <yhs@...com> wrote:
> >>
> >> Given a loaded dumper bpf program, which already
> >> knows which target it should bind to, there are
> >> two ways to create a dumper:
> >>    - a file based dumper under hierarchy of
> >>      /sys/kernel/bpfdump/ which users can
> >>      "cat" to print out the output.
> >>    - an anonymous dumper which user application
> >>      can "read" the dumping output.
> >>
> >> For file based dumper, BPF_OBJ_PIN syscall interface
> >> is used. For anonymous dumper, BPF_PROG_ATTACH
> >> syscall interface is used.
> >>
> >> To facilitate target seq_ops->show() to get the
> >> bpf program easily, dumper creation increased
> >> the target-provided seq_file private data size
> >> so bpf program pointer is also stored in seq_file
> >> private data.
> >>
> >> Further, a seq_num which represents how many times
> >> bpf_dump_get_prog() has been called is also
> >> available to the target seq_ops->show().
> >> Such information can be used to e.g., print
> >> banner before printing out actual data.
> >
> > So I looked up seq_operations struct and did a very cursory read of
> > fs/seq_file.c and seq_file documentation, so I might be completely off
> > here.
> >
> > start() is called before iteration begins, stop() is called after
> > iteration ends. Would it be a bit better and more user-friendly interface
> > to have two extra calls to BPF program, say with NULL input element,
> > but with extra enum/flag that specifies that this is a START or END of
> > iteration, in addition to seq_num?
>
> The current design always passes a valid object (task, file, netlink_sock,
> fib6_info). That is, access to fields of those data structures won't
> cause runtime exceptions.
>
> Therefore, with the existing seq_ops implementation for ipv6_route
> and netlink, etc, we don't have END information. We can get START
> information though.

Right, I understand this about current implementation, because it
calls BPF program from show. But I noticed also stop(), which

>
> >
> > Also, right now it's impossible to write stateful dumpers that do any
> > kind of stats calculation, because it's impossible to determine when
> > iteration restarted (it starts from the very beginning, not from the
> > last element). It's impossible to just remember last processed
> > seq_num, because BPF program might be called for a new "session" in
> > parallel with the old one.
>
> Theoretically, session end can be detected by checking the return
> value of last bpf_seq_printf() or bpf_seq_write(). If it indicates
> an overflow, that means session end.

That's not what I meant by session end. If there is an overflow, the
session is going to be restarted from the start (but it's still the same
session, we just got bigger output buffer).

>
> Or bpfdump infrastructure can help do this work to provide
> session id.

Well, come to think of it, the seq_file pointer itself is unique per
session, so that one can be used as session id, is that right?

>
> >
> > So it seems like few things would be useful:
> >
> > 1. end flag for post-aggregation and/or footer printing (seq_num == 0
> > is providing similar means for start flag).
>
> the end flag is a problem. We could say hijack next or stop so we
> can detect the end, but passing a NULL pointer as the object
> to the bpf program may be problematic without verifier enforcement
> as it may cause a lot of exceptions... Although all these exceptions
> will be silenced by bpf infra, but still not sure whether this
> is acceptable or not.

Right, verifier will need to know that item can be valid pointer or
NULL. It's not perfect, but not too big of a deal for user to check
for NULL at the very beginning.

What I'm aiming for with this end flag is the ability for a BPF program to
collect data during show() calls, and then at the end get extra call
to give ability to post-aggregate this data and emit some sort of
summary into seq_file. Think about printing out summary stats across
all tasks (e.g., p50 of run queue latency, or something like that). In
that case, I need to iterate all tasks, I don't need to emit anything
for any individual tasks, but I need to produce an aggregation and
output after the last task was iterated. Right now it's impossible to
do, but seems like an extremely powerful and useful feature. drgn
could utilize this to speed up its scripts. There are plenty of tools
that would like to have a frequent but cheap view into internals of
the system, which currently is implemented through netlink (taskstats)
or procfs, both quite expensive, if polled every second.

Anonymous bpfdump, though, is going to be much cheaper, because a lot
of aggregation can happen in the kernel and only minimal output at the
end will be read by user-space.
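
To make the idea concrete, here is a minimal user-space sketch of the
"collect during show(), summarize on the end call" pattern. It is not
bpfdump code: struct lat_state, struct fake_task, dump_task() and
run_session() are hypothetical names invented for illustration, and the
end of iteration is signalled by a NULL item as discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_SAMPLES 64

/* Hypothetical per-session state; none of these names come from the patch. */
struct lat_state {
	unsigned long samples[MAX_SAMPLES];
	size_t n;
	unsigned long p50;	/* filled in by the final (NULL item) call */
	int done;
};

/* Stand-in for a kernel object such as a task. */
struct fake_task {
	unsigned long runq_latency_ns;
};

static int cmp_ulong(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return (x > y) - (x < y);
}

/* Called once per item, then once more with item == NULL at the end. */
static int dump_task(struct lat_state *st, const struct fake_task *item)
{
	assert(st != NULL);

	if (!item) {
		/* End of iteration: post-aggregate and produce the summary. */
		if (st->n == 0)
			return 0;
		qsort(st->samples, st->n, sizeof(st->samples[0]), cmp_ulong);
		st->p50 = st->samples[(st->n - 1) / 2];	/* lower median */
		st->done = 1;
		return 0;
	}

	/* Regular call: only collect, emit nothing per task. */
	if (st->n < MAX_SAMPLES)
		st->samples[st->n++] = item->runq_latency_ns;
	return 0;
}

/* Driver mimicking show() for every element followed by one end call. */
static unsigned long run_session(const struct fake_task *tasks, size_t n)
{
	struct lat_state st = {0};
	size_t i;

	for (i = 0; i < n; i++)
		dump_task(&st, &tasks[i]);
	dump_task(&st, NULL);
	return st.p50;
}
```

The only thing the program has to do differently from today is check
the item for NULL at the very top; everything else is ordinary
per-item collection.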

>
> > 2. Some sort of "session id", so that bpfdumper can maintain
> > per-session intermediate state. Plus with this it would be possible to
> > detect restarts (if there is some state for the same session and
> > seq_num == 0, this is restart).
>
> I guess we can do this.

See above, probably using seq_file pointer is good enough.
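
A rough user-space sketch of how that could look, assuming the seq_file
pointer value is stable for the lifetime of a session and that stale
state is released before a pointer value is reused (neither is
guaranteed by anything in this patch). The names session_state,
session_lookup() and on_show() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SESSIONS 8

/* Hypothetical per-session record keyed by the seq_file pointer value. */
struct session_state {
	uintptr_t id;		/* seq_file pointer used as session id */
	int in_use;
	unsigned long items;	/* items seen since the last (re)start */
	unsigned int restarts;
};

static struct session_state sessions[MAX_SESSIONS];

/* Find the state for this session, optionally creating it. */
static struct session_state *session_lookup(const void *seq, int create)
{
	struct session_state *free_slot = NULL;
	size_t i;

	for (i = 0; i < MAX_SESSIONS; i++) {
		if (sessions[i].in_use && sessions[i].id == (uintptr_t)seq)
			return &sessions[i];
		if (!sessions[i].in_use && !free_slot)
			free_slot = &sessions[i];
	}
	if (!create || !free_slot)
		return NULL;

	free_slot->id = (uintptr_t)seq;
	free_slot->in_use = 1;
	free_slot->items = 0;
	free_slot->restarts = 0;
	return free_slot;
}

/*
 * Returns 1 if this call restarts a known session (state already exists
 * and seq_num == 0), 0 for a normal call, -1 if no slot is available.
 */
static int on_show(const void *seq, unsigned long seq_num)
{
	struct session_state *st = session_lookup(seq, 0);
	int restarted = 0;

	if (st && seq_num == 0) {
		/* Same session starting over: drop intermediate state. */
		st->items = 0;
		st->restarts++;
		restarted = 1;
	} else if (!st) {
		st = session_lookup(seq, 1);
		if (!st)
			return -1;
	}
	st->items++;
	return restarted;
}
```

Two sessions running in parallel get separate records, so a restart in
one does not disturb the other.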

>
> >
> > It seems like it might be a bit more flexible to, instead of providing
> > seq_file * pointer directly, actually provide a bpfdumper_context
> > struct, which would have seq_file * as one of fields, other being
> > session_id and start/stop flags.
>
> As you mentioned, if we have more fields related to seq_file passing
> to bpf program, yes, grouping them into a structure makes sense.
>
> >
> > A bit unstructured thoughts, but what do you think?
> >
> >>
> >> Note the seq_num does not represent the num
> >> of unique kernel objects the bpf program has
> >> seen. But it should be a good approximation.
> >>
> >> A target feature BPF_DUMP_SEQ_NET_PRIVATE
> >> is implemented, which is specifically useful for
> >> net based dumpers. It sets net namespace
> >> as the current process net namespace.
> >> This avoids changing existing net seq_ops
> >> in order to retrieve net namespace from
> >> the seq_file pointer.
> >>
> >> For open dumper files, anonymous or not, the
> >> fdinfo will show the target and prog_id associated
> >> with that file descriptor. For dumper file itself,
> >> a kernel interface will be provided to retrieve the
> >> prog_id in one of the later patches.
> >>
> >> Signed-off-by: Yonghong Song <yhs@...com>
> >> ---
> >>   include/linux/bpf.h            |   5 +
> >>   include/uapi/linux/bpf.h       |   6 +-
> >>   kernel/bpf/dump.c              | 338 ++++++++++++++++++++++++++++++++-
> >>   kernel/bpf/syscall.c           |  11 +-
> >>   tools/include/uapi/linux/bpf.h |   6 +-
> >>   5 files changed, 362 insertions(+), 4 deletions(-)
> >>
> >
> > [...]
> >