[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55558493-4cc5-15f2-70b4-48ff30d39e06@fb.com>
Date: Thu, 16 Apr 2020 00:15:51 -0700
From: Yonghong Song <yhs@...com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>
CC: Andrii Nakryiko <andriin@...com>, bpf <bpf@...r.kernel.org>,
Martin KaFai Lau <kafai@...com>,
Networking <netdev@...r.kernel.org>,
Alexei Starovoitov <ast@...com>,
Daniel Borkmann <daniel@...earbox.net>,
Kernel Team <kernel-team@...com>
Subject: Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers
On 4/15/20 6:48 PM, Andrii Nakryiko wrote:
> On Wed, Apr 15, 2020 at 9:46 AM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
>>
>> On Tue, Apr 14, 2020 at 09:45:08PM -0700, Andrii Nakryiko wrote:
>>>>
>>>>> FD is closed, dumper program is detached and dumper is destroyed
>>>>> (unless pinned in bpffs, just like with any other bpf_link.
>>>>> 3. At this point bpf_dumper_link can be treated like a factory of
>>>>> seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
>>>>> illustration purposes) command, that accepts dumper link FD and
>>>>> returns a new seq_file FD, which can be read() normally (or, e.g.,
>>>>> cat'ed from shell).
>>>>
>>>> In this case, link_query may not be accurate if a bpf_dumper_link
>>>> is created but no corresponding bpf_dumper_open_file. What we really
>>>> need to iterate through all dumper seq_file FDs.
>>>
>>> If the goal is to iterate all the open seq_files (i.e., bpfdump active
>>> sessions), then bpf_link is clearly not the right approach. But I
>>> thought we are talking about iterating all the bpfdump programs
>>> attachments, not **sessions**, in which case bpf_link is exactly the
>>> right approach.
>>
>> That's an important point. What is the pinned /sys/kernel/bpfdump/tasks/foo ?
>
> Assuming it's not a rhetorical question, foo is a pinned bpf_dumper
> link (in my interpretation of all this).
>
>> Every time 'cat' opens it a new seq_file is created with new FD, right ?
>
> yes
>
>> Reading of that file can take infinite amount of time, since 'cat' can be
>> paused in the middle.
>
> yep, correct (though most use case probably going to be very short-lived)
>
>> I think we're dealing with several different kinds of objects here.
>> 1. "template" of seq_file that is seen with 'ls' in /sys/kernel/bpfdump/
>
> Let's clarify here again, because this can be interpreted differently.
>
> Are you talking about, e.g., /sys/fs/bpfdump/task directory that
> defines what class of items should be iterated? Or you are talking
> about named dumper: /sys/fs/bpfdump/task/my_dumper?
>
> If the former, I agree that it's not a link. If the latter, then
> that's what we've been so far calling "a named bpfdumper". Which is
> what I argue is a link, pinned in bpfdumpfs (*not bpffs*).
>
> UPD: reading further, seems like it's some third interpretation, so
> please clarify.
>
>> 2. given instance of seq_file after "template" was open
>
> Right, corresponding to "bpfdump session" (has its own unique session_id).
>
>> 3. bpfdumper program
>
> Yep, BPF_PROG_LOAD returns FD to verified bpfdumper program.
>
>> 4. and now links. One bpf_link from seq_file template to bpf prog and
>
> So I guess "seq_file template" is /sys/kernel/bpfdump/tasks direntry
> itself, which has to be specified as FD during BPF_PROG_LOAD, is that
> right? If yes, I agree, "seq_file template" + attached bpf_prog is a
> link.
>
>> many other bpf_links from actual seq_file kernel object to bpf prog.
>
> I think this one is not a link at all. It's a bpfdumper session. For
> me this is equivalent of a single BPF program invocation on cgroup due
> to a single packet. I understand that in this case it's multiple BPF
> program invocations, so it's not exactly 1:1, but if we had an easy
> way to do iteration from inside BPF program over all, say, tasks, that
> would be one BPF program invocation with a loop inside. So to me one
> seq_file session is analogous to a single BPF program execution (or,
> say one hardware event triggering one execution of perf_event BPF
> program).
>
>> I think both kinds of links need to be iteratable via get_next_id.
>>
>> At the same time I don't think 1 and 2 are links.
>> read-ing link FD should not trigger program execution. link is the connecting
>> abstraction. It shouldn't be used to trigger anything. It's static.
>> Otherwise read-ing cgroup-bpf link would need to trigger cgroup bpf prog too.
>> FD that points to actual seq_file is the one that should be triggering
>> iteration of kernel objects and corresponding execution of linked prog.
>
> Yep, I agree totally, reading bpf_link FD directly as if it was
> seq_file seems weird and would support only a single time to read.
>
>> That FD can be anon_inode returned from raw_tp_open (or something else)
>
> raw_tp_open currently always returns bpf_link FDs, so if this suddenly
> returns readable seq_file instead, that would be weird, IMO.
>
>
>> or FD from open("/sys/kernel/bpfdump/foo").
>
> Agreed.
>
>>
>> The more I think about all the objects involved the more it feels that the
>> whole process should consist of three steps (instead of two).
>> 1. load bpfdump prog
>> 2. create seq_file-template in /sys/kernel/bpfdump/
>> (not sure which api should do that)
>
> Hm... ok, I think seq_file-template means something else entirely.
> It's not an attached BPF program, but also not a /sys/fs/bpfdump/task
> "provider". What is it and what is its purpose? Also, how is it
> cleaned up if application crashes between creating "seq_file-template"
> and attaching BPF program to it?
>
>> 3. use bpf_link_create api to attach bpfdumper prog to that seq_file-template
>>
>> Then when the file is opened a new bpf_link is created for that reading session.
>> At the same time both kinds of links (to teamplte and to seq_file) should be
>> iteratable for observability reasons, but get_fd_from_id on them should probably
>> be disallowed, since holding such FD to these special links by other process
>> has odd semantics.
>
> This special get_fd_from_id handling for bpfdumper links (in your
> interpretation) looks like a sign that using bpf_link to represent a
> specific bpfdumper session is not the right design.
>
> As for obserabilitiy of bpfdumper sessions, I think using bpfdump
> program + task/file provider will give a good way to do this,
> actually, with no need to maintain a separate IDR just for bpfdumper
> sessions.
>
>>
>> Similarly for anon seq_file it should be three step process as well:
>> 1. load bpfdump prog
>> 2. create anon seq_file (api is tbd) that returns FD
>> 3. use bpf_link_create to attach prog to seq_file FD
>>
>> May be it's all overkill. These are just my thoughts so far.
>
> Just to contrast, in a condensed form, what I was proposing:
>
> For named dumper:
> 1. load bpfdump prog
> 2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
> bpf_link anon FD back
I actually tried to prototype earlier today.
for existing tracing program (non-raw-tracepoint, e.g., fentry/fexit),
when raw_tracepoint_open() is called,
bpf_trampoline_link_prog() is called, trampoline is actually
updated with bpf_program and program start running. You can hold
this link_fd at user application or pin it to /sys/fs/bpf.
That is what I refers to in my previous email whether
we can have 'cat'-able link or not. But looks it is pretty hard.
Alternatively, we could still return a bpf_link.
The only thing bpf_link did is to hold a reference count for bpf_prog
and nothing else. Later on, we can use this bpf_link to pin dumper
or open anonymous seq_file.
But since bpf_link just holds a reference for prog and nothing more
and that is why I mentioned not 100% sure whether bpf_link is needed
as I could achieve the same thing with bpf_prog. Further,
it does not provide ability to query open files (a bpf program
for task/file target should be able to do it.)
But if for API consistency, we prefer raw_tracepoint_open() to
return a link fd. I can still do it, I guess. Or maybe link_query
still useful in some way.
> 3. pin link in bpfdumpfs (e.g., /sys/fs/bpfdump/task/my_dumper)
> 4. each open() of /sys/fs/bpfdump/task/my_dumper produces new
> bpfdumper session/seq_file
>
> For anon dumper:
> 1. load bpfdump prog
> 2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
> bpf_link anon FD back
> 3. give bpf_link FD to some new API (say, BPF_DUMP_NEW_SESSION or
> whatever name) to create seq_file/bpfdumper session, which will create
> FD that can be read(). One can do that many times, each time getting
> its own bpfdumper session.
>
> First two steps are exactly the same, as it should be, because
> named/anon dumper is still the same dumper. Note also that we can use
> bpf_link FD of named dumper and BPF_DUMP_NEW_SESSION command to also
> create sessions, which further underlines that the only difference
> between named and anon dumper is this bpfdumpfs direntry that allows
> to create new seq_file/session by doing normal open(), instead of
> BPF's BPF_DUMP_NEW_SESSION.
>
> Named vs anon dumper is like "named" vs "anon" bpf_link -- we don't
> even talk in those terms about bpf_link, because the only difference
> is pinned direntry in a special FS, really.
>
Powered by blists - more mailing lists