Message-ID: <alpine.LRH.2.21.2004171518090.16765@localhost>
Date: Fri, 17 Apr 2020 16:02:09 +0100 (BST)
From: Alan Maguire <alan.maguire@...cle.com>
To: Yonghong Song <yhs@...com>
cc: David Ahern <dsahern@...il.com>, Andrii Nakryiko <andriin@...com>,
bpf@...r.kernel.org, Martin KaFai Lau <kafai@...com>,
netdev@...r.kernel.org, Alexei Starovoitov <ast@...com>,
Daniel Borkmann <daniel@...earbox.net>, kernel-team@...com
Subject: Re: [RFC PATCH bpf-next v2 00/17] bpf: implement bpf based dumping
of kernel data structures
On Wed, 15 Apr 2020, Yonghong Song wrote:
>
>
> On 4/15/20 7:23 PM, David Ahern wrote:
> > On 4/15/20 1:27 PM, Yonghong Song wrote:
> >>
> >> As there are ongoing discussions regarding the kernel interface/steps to
> >> create file/anonymous dumpers, I think it will be beneficial to share
> >> this work in progress for discussion.
> >>
> >> Motivation:
> >> The current ways to dump kernel data structures are mostly:
> >> 1. /proc system
> >> 2. various specific tools like "ss" which require kernel support.
> >> 3. drgn
> >> The drawback of the first two is that whenever you want to dump more,
> >> you need to change the kernel. For example, Martin wants to dump socket local
> >
> > If kernel support is needed for bpfdump of kernel data structures, you
> > are not really solving the kernel support problem. i.e., to dump
> > ipv4_route's you need to modify the relevant proc show function.
>
> Yes, as mentioned two paragraphs below, a kernel change is required.
> The tradeoff is that this is a one-time investment. Once the kernel change
> is in place, printing new fields (in most cases, except new fields
> which need additional locks etc.) needs no further kernel change.
>
One thing I struggled with initially when reading the cover
letter was understanding how BPF dumper programs get run.
Patch 7 deals with that, I think, and the answer seems to be to
add seq file infrastructure alongside the existing one, which
executes the BPF dumper programs where appropriate.
Have I got this right? I guess more lightweight methods
such as instrumenting functions associated with an existing /proc
dumper are a bit too messy?
Thanks!
Alan
> >
> >
> >> storage with "ss". Kernel change is needed for it to work ([1]).
> >> This is also the direct motivation for this work.
> >>
> >> drgn ([2]) solves this problem nicely and no kernel change is needed.
> >> But since drgn is not able to verify the validity of a particular
> >> pointer value, it might present wrong results in rare cases.
> >>
> >> In this patch set, we introduce bpf based dumping. Initial kernel
> >> changes are still needed, but a data structure change will not require
> >> kernel changes any more. The bpf program itself is used to adapt to
> >> data structure changes. This gives a certain flexibility with
> >> guaranteed correctness.
> >>
> >> Here, kernel seq_ops is used to facilitate dumping, similar to current
> >> /proc and many other lossless kernel dumping facilities.
> >>
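A note on "lossless" here, as I understand the seq_file read path: when a
record does not fit in the buffer, the core frees the buffer, doubles its
size and calls show() again for the same record, so output is not silently
truncated. The userspace sketch below mimics only that retry loop. It is
not kernel code: struct mock_seq, mock_seq_puts(), show_long_record() and
emit_record() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the buffer part of struct seq_file. */
struct mock_seq {
    char *buf;
    size_t size;
    size_t count;
};

/* Append a string; on overflow mark the buffer as full (count == size). */
static void mock_seq_puts(struct mock_seq *m, const char *s)
{
    size_t len = strlen(s);

    if (m->count + len < m->size) {
        memcpy(m->buf + m->count, s, len);
        m->count += len;
    } else {
        m->count = m->size; /* overflow marker */
    }
}

static int mock_overflowed(const struct mock_seq *m)
{
    return m->count == m->size;
}

typedef void (*show_fn)(struct mock_seq *m, int record);

/* Example show() callback that emits a 43-byte line for one record. */
static void show_long_record(struct mock_seq *m, int record)
{
    char line[64];

    snprintf(line, sizeof(line),
             "record %d: 0123456789abcdef0123456789abcdef\n", record);
    mock_seq_puts(m, line);
}

/*
 * Emit one record, retrying with a doubled buffer until it fits.
 * Returns the number of show() attempts, or -1 if allocation fails.
 */
static int emit_record(struct mock_seq *m, show_fn show, int record)
{
    int attempts = 0;

    assert(m->size > 0);
    for (;;) {
        m->count = 0;
        show(m, record);
        attempts++;
        if (!mock_overflowed(m))
            return attempts;
        free(m->buf);
        m->size *= 2;
        m->buf = malloc(m->size);
        if (!m->buf)
            return -1;
    }
}
```

Starting from a 16-byte buffer, the record above needs two retries before it
fits; nothing is dropped along the way, which is the contrast with a ring
buffer that can overwrite or discard records.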
> >> User Interfaces:
> >> 1. A new mount file system, bpfdump at /sys/kernel/bpfdump, is
> >> introduced. Different from /sys/fs/bpf, this is a single user mount.
> >> The mount command can be:
> >>     mount -t bpfdump bpfdump /sys/kernel/bpfdump
> >> 2. Kernel bpf dumpable data structures are represented as directories
> >> under /sys/kernel/bpfdump, e.g.,
> >> /sys/kernel/bpfdump/ipv6_route/
> >> /sys/kernel/bpfdump/netlink/
> >
> > The names of bpfdump fs entries do not match actual data structure names
> > - e.g., there is no ipv6_route struct. On the one hand that is a good
> > thing since structure names can change, but that also means a mapping is
> > needed between the dumper filesystem entries and what you get for context.
>
> Yes, the later bpftool patch implements a new command to dump such
> information.
>
> $ bpftool dumper show target
> target        prog_ctx_type
> task          bpfdump__task
> task/file     bpfdump__task_file
> bpf_map       bpfdump__bpf_map
> ipv6_route    bpfdump__ipv6_route
> netlink       bpfdump__netlink
>
> In vmlinux.h, generated from vmlinux BTF, we have
>
> struct bpf_dump_meta {
>     struct seq_file *seq;
>     u64 session_id;
>     u64 seq_num;
> };
>
> struct bpfdump__ipv6_route {
>     struct bpf_dump_meta *meta;
>     struct fib6_info *rt;
> };
>
> Here, bpfdump__ipv6_route is the bpf program context type.
> Users can write the bpf program based on this.
>
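To make the context type concrete, here is a minimal userspace sketch of
what a dumper's logic over bpfdump__ipv6_route might look like. It is not a
real BPF program and does not use the helpers from the patch set: struct
seq_file and struct fib6_info are cut-down stand-ins with a hypothetical
subset of fields, mock_seq_printf() is a made-up helper, and the assumption
that seq_num is 0 for the first object of a session is mine, not something
stated above.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins, NOT the kernel definitions. */
struct seq_file {
    char buf[256];
    size_t count;
};

struct fib6_info {
    uint32_t fib6_metric; /* hypothetical subset of fields */
    uint32_t fib6_flags;
};

/* Layout as quoted from vmlinux.h, with u64 spelled as uint64_t. */
struct bpf_dump_meta {
    struct seq_file *seq;
    uint64_t session_id;
    uint64_t seq_num;
};

struct bpfdump__ipv6_route {
    struct bpf_dump_meta *meta;
    struct fib6_info *rt;
};

/* Hypothetical helper: append formatted text to the mock seq_file. */
static void mock_seq_printf(struct seq_file *seq, const char *fmt, ...)
{
    size_t room = sizeof(seq->buf) - seq->count;
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(seq->buf + seq->count, room, fmt, ap);
    va_end(ap);
    if (n > 0 && (size_t)n < room)
        seq->count += (size_t)n;
    else
        seq->buf[seq->count] = '\0'; /* drop a truncated write */
}

/* Dumper logic: print a header once, then one line per route. */
static int dump_ipv6_route(struct bpfdump__ipv6_route *ctx)
{
    struct seq_file *seq;
    struct fib6_info *rt;

    assert(ctx && ctx->meta && ctx->meta->seq);
    seq = ctx->meta->seq;
    rt = ctx->rt;

    if (!rt) /* defensive: nothing to print for this call */
        return 0;
    if (ctx->meta->seq_num == 0) /* assumed: first object of the session */
        mock_seq_printf(seq, "metric flags\n");
    mock_seq_printf(seq, "%u %08x\n",
                    (unsigned)rt->fib6_metric, (unsigned)rt->fib6_flags);
    return 0;
}
```

The point is only that everything the program prints comes from the ctx
pointers, so adding a field to the output means changing the program, not
the kernel.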
> >
> > Further, what is the expectation in terms of stable API for these fs
> > entries? Entries in the context can change. Data structure names can
> > change. Entries in the structs can change. All of that breaks the idea
> > of stable programs that are compiled once and run for all future
> > releases. When structs change, those programs will break - and
> > structures will change.
>
> Yes, the API (ctx) we present to the bpf program is indeed unstable.
> CO-RE should help to a certain extent, but if some fields are gone, for
> example, the bpf program will need to be rewritten for that particular
> kernel version, or the kernel bpfdump infrastructure can be enhanced to
> change its ctx structure to provide more information to the program
> for that kernel version. In summary, I agree with you that this is
> an unstable API, similar to other tracing programs,
> since it accesses kernel internal data structures.
>
> >
> > What does bpfdumper provide that you can not do with a tracepoint on a
> > relevant function and then putting a program on the tracepoint? i.e., why
> > not just put a tracepoint in the relevant dump functions?
>
> When I first began exploring bpfdump, a kprobe on the "show" function was
> one of the options. But we quickly realized that we do not want
> to just piggyback on the "show" function, but want to replace it with
> bpf. This is useful in the following use cases:
> 1. First, a catable dumper file: similar to /proc/net/ipv6_route,
> we want /sys/kernel/bpfdump/ipv6_route/my_dumper, which you can cat
> to get the output.
>
> Using kprobe when you are doing `cat /proc/net/ipv6_route`
> is complicated. You probably need an application which
> runs `cat /proc/net/ipv6_route` and discards its output,
> and at the same time gets the result from the bpf program
> (filtered by pid, since somebody else may run
> `cat /proc/net/ipv6_route` at the same time). You may use a
> perf ring buffer to send the result back to the application.
>
> Note that the perf ring buffer may lose records for whatever
> reason, whereas seq_ops are implemented not to lose records,
> by means of built-in retries.
>
> Using the kprobe approach above is complicated, and for each dumper
> you need an application. We would like it to be just catable,
> with minimal user overhead to create such a dumper.
>
> 2. Second, an anonymous dumper: kprobe/tracepoint will incur the
> original overhead of seq_printf per object, but the user may
> be interested in only a very small portion of the information.
> In such cases, a bpf program doing the filtering directly in
> the kernel can potentially speed things up a lot if there are many
> records to traverse.
>
> 3. For data structures which do not have catable dumpers,
> for example task, hopefully, as demonstrated in this patch set,
> the kernel implementation and writing a bpf program are not
> too hard. This especially enables people to do in-kernel
> filtering, which is the strength of bpf.
>
>
>
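On points 2 and 3, the saving from in-kernel filtering can be illustrated
with a small userspace sketch: every object is still visited, but only the
matching ones pay the formatting cost. This is not code from the patch set;
struct mock_task, struct dump_stats, format_task() and dump_tasks() are
hypothetical names, and snprintf() merely stands in for seq_printf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a kernel object being dumped. */
struct mock_task {
    int pid;
    int tgid;
};

/* Counters to compare how much work each approach does. */
struct dump_stats {
    size_t visited;
    size_t formatted;
};

/* Format one record into out; returns bytes written (0 if it did not fit). */
static size_t format_task(char *out, size_t room,
                          const struct mock_task *t, struct dump_stats *st)
{
    int n = snprintf(out, room, "%d %d\n", t->tgid, t->pid);

    st->formatted++;
    return (n > 0 && (size_t)n < room) ? (size_t)n : 0;
}

/*
 * Walk all tasks and format only those whose tgid matches want_tgid.
 * A negative want_tgid disables filtering, which models a dumper that
 * formats every object and leaves the filtering to userspace.
 */
static size_t dump_tasks(const struct mock_task *tasks, size_t n,
                         int want_tgid, char *out, size_t room,
                         struct dump_stats *st)
{
    size_t used = 0;

    assert(tasks && out && st);
    for (size_t i = 0; i < n; i++) {
        st->visited++;
        if (want_tgid >= 0 && tasks[i].tgid != want_tgid)
            continue; /* filtered out before any formatting */
        used += format_task(out + used, room - used, &tasks[i], st);
    }
    return used;
}
```

With four tasks of which two share a tgid, the unfiltered walk formats four
records and the filtered walk formats two, while both visit all four.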