Message-ID: <CAEf4BzbW20hYtFqTf+cynVqbOtVJK7oO7ySBB+WP6yRRYdQoNw@mail.gmail.com>
Date: Tue, 1 Sep 2020 17:34:21 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Yonghong Song <yhs@...com>
Cc: bpf <bpf@...r.kernel.org>, Networking <netdev@...r.kernel.org>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Kernel Team <kernel-team@...com>
Subject: Re: [PATCH bpf-next 2/5] bpf: add main_thread_only customization for
task/task_file iterators
On Thu, Aug 27, 2020 at 11:09 AM Yonghong Song <yhs@...com> wrote:
>
>
>
> On 8/26/20 10:07 PM, Andrii Nakryiko wrote:
> > On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@...com> wrote:
> >>
> >> Currently, the task and task_file iterators by default iterate
> >> through all tasks. For task_file, by default, all files from all
> >> tasks will be traversed.
> >>
> >> But for a user process, the file_table is typically shared by all
> >> threads of that process. So traversing only the main thread of each
> >> process should be enough to visit all files, and this can save a
> >> lot of cpu time if some process has a large number of threads and
> >> each thread has lots of open files.
> >>
> >> This patch implements a customization for the task/task_file
> >> iterators, permitting traversal of only those kernel tasks whose
> >> pid equals their tgid. This includes some kernel threads and the
> >> main threads of user processes. This solves the above potential
> >> performance issue for task_file. The customization may be useful
> >> for the task iterator too if traversing only main threads is enough.
> >>
> >> Signed-off-by: Yonghong Song <yhs@...com>
> >> ---
> >>  include/linux/bpf.h            |  3 ++-
> >>  include/uapi/linux/bpf.h       |  5 ++++
> >>  kernel/bpf/task_iter.c         | 46 +++++++++++++++++++++++-----------
> >>  tools/include/uapi/linux/bpf.h |  5 ++++
> >>  4 files changed, 43 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index a6131d95e31e..058eb9b0ba78 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -1220,7 +1220,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >>  	int __init bpf_iter_ ## target(args) { return 0; }
> >>
> >>  struct bpf_iter_aux_info {
> >> -	struct bpf_map *map;
> >> +	struct bpf_map *map;	/* for iterator traversing map elements */
> >> +	bool main_thread_only;	/* for task/task_file iterator */
> >
> > As a user of the task_file iterator I'd hate to make this decision,
> > honestly, especially if I can't prove that all processes share the
> > same file table (I think the clone() syscall allows weird
> > combinations like that, right?). It does make sense for a task/
>
> Right. The clone() syscall permits different kinds of sharing;
> sharing the address space and sharing files are separate. It is
> possible that the address space is shared while files are not.
> That is why I want to add flags for task_file: if users know they
> do not have cases where the address space is shared but files are
> not, they can use main_thread_only to improve performance.
The problem with such options is that for a lot of users it won't be
clear at all when and if those options can be used. E.g., when I
imagine myself building some generic tool utilizing the task_file
bpf_iter that is supposed to run on any server, how do I know if it's
safe to specify "main_thread_only"? I can't guarantee that. So I'll be
forced to either risk it or fall back to the default and unnecessarily
slow behavior.
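
E.g., nothing prevents an application from creating a thread that
shares the address space but keeps a private file table. A rough,
untested sketch of such a case (which main_thread_only would miss):

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

static int worker(void *arg)
{
	/* any files opened here live in this thread's private fd
	 * table, invisible when traversing the group leader only */
	return 0;
}

int main(void)
{
	const size_t stack_sz = 1024 * 1024;
	char *stack = malloc(stack_sz);

	/* CLONE_THREAD makes this a thread of the calling process,
	 * but without CLONE_FILES it gets its own copy of the fd
	 * table */
	clone(worker, stack + stack_sz,
	      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
	return 0;
}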
This is different for task/ bpf_iter, though, so I support that for
task/ only. But you've already done that split, so thank you! :)
>
> > iterator, though, if I need to iterate a user-space process (group of
> > tasks). So can we do this:
> >
> > 1a. Either add a new bpf_iter type process/ (or in kernel lingo
> > task_group/) to iterate only main threads (group_leader in kernel
> > lingo);
> > 1b. Or make this main_thread_only an option for only a task/ iterator
> > (and maybe call it "group_leader_only" or something to reflect kernel
> > terminology?)
>
> The following is the kernel pid_type definition,
>
> enum pid_type
> {
> 	PIDTYPE_PID,
> 	PIDTYPE_TGID,
> 	PIDTYPE_PGID,
> 	PIDTYPE_SID,
> 	PIDTYPE_MAX,
> };
>
> Right now, the task iterator traverses following
> PIDTYPE_PID, i.e., every task in the system.
>
> To iterate through main threads only, we need to traverse
> following PIDTYPE_TGID.
>
> In the future, it is possible that people will want to traverse
> following PIDTYPE_PGID (process group) or PIDTYPE_SID (session).
>
> Or we might have other customizations, e.g., cgroup_id, which can
> be filtered in the bpf program, but in-kernel filtering can
> definitely improve performance.
>
> So I prefer 1b.
Sounds good, but let's use a proper enum, not a set of bit fields. We
can support all 4 from the get-go. There is no need to wait for the
actual use cases to appear, as the iteration semantics are pretty clear.
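
Something like this, mirroring the kernel's pid_type (just a sketch,
names are placeholders and not a final UAPI proposal):

enum bpf_iter_task_type {
	BPF_TASK_ITER_TID = 0,	/* every task, current behavior */
	BPF_TASK_ITER_TGID,	/* thread group leaders (main threads) only */
	BPF_TASK_ITER_PGID,	/* process group leaders only */
	BPF_TASK_ITER_SID,	/* session leaders only */
};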
>
> Yes, naming is hard.
> The name "main_thread_only" is mostly from userspace perspective.
> "group_leader_only" might not be good as it may be confused with
> possible future process group.
>
> >
> > 2. For the task_file iterator, still iterate all tasks, but if the task's
> > files struct is the same as group_leader's files struct, then go to
> > the next one. This should eliminate tons of duplication of iterating
> > the same files over and over. It would still iterate a bunch of tasks,
> > but compared to the number of files that's generally a much smaller
> > number, so should still give sizable savings. I don't think we need an
> > extra option for this, tbh, this behavior was my intuitive
> > expectation, so I don't think you'll be breaking any sane user of this
> > iterator.
>
> What you suggested makes sense. For the task_file iterator, we only
> promise to visit all files from all tasks. We can do the necessary
> filtering to remove duplicates in the kernel without breaking that
> promise. I will remove the customization from the task_file iterator.
Thanks for doing that!
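
For the record, the in-kernel check I had in mind is roughly this
(untested sketch; helper name made up, locking elided):

/* Skip a non-leader task whose file table is shared with its group
 * leader; the leader's traversal already covers those files. Threads
 * with a private file table (clone() without CLONE_FILES) are still
 * visited.
 */
static bool task_files_seen_via_leader(struct task_struct *task)
{
	return task != task->group_leader &&
	       task->files == task->group_leader->files;
}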
>
> >
> > Disclaimer: I haven't got a chance to look through kernel code much,
> > so I'm sorry if what I'm proposing is something that is impossible to
> > implement or extremely hard/unreliable. But thought I'll put this idea
> > out there before we decide on this.
> >
> > WDYT?
> >
> > [...]
> >