Message-ID: <CAADnVQJcTAgcbwrOWO8EnbTdAcQ91HQmtpn7aKJGwHc=mEpJ1g@mail.gmail.com>
Date: Tue, 8 Feb 2022 13:20:44 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Hao Luo <haoluo@...gle.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
KP Singh <kpsingh@...nel.org>,
Shakeel Butt <shakeelb@...gle.com>,
Joe Burton <jevburton.kernel@...il.com>,
Stanislav Fomichev <sdf@...gle.com>, bpf <bpf@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link

On Tue, Feb 8, 2022 at 12:07 PM Hao Luo <haoluo@...gle.com> wrote:
>
> On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@...gle.com> wrote:
> > > >
> > > > > In our use case, we can't ask the users who create cgroups to do the
> > > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > > non-root users who can create cgroup directories and still want to
> > > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > > inheritance is a requirement for us. With inheritance, they only need
> > > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > > >
> > > > > It's also not a good idea in our use case to add a userspace
> > > > > privileged process to monitor cgroupfs operations and perform the
> > > > > pinning. It's more complex and has a higher maintenance cost and
> > > > > runtime overhead, compared to the solution of asking whoever makes
> > > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > > the data center that don't have the userspace process deployed, the
> > > > > stats will be unavailable, which is a no-no for some of our users.
> > > >
> > > > The commit log says that there will be a daemon that does that
> > > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > > directories in bpffs when a new cgroup is created, no?
> > > > The kernel is only doing inheritance of bpf progs into
> > > > new dirs. I think that daemon can pin as well.
> > > >
> > > > The cgroup creation is typically managed by an agent like systemd.
> > > > Sounds like you have your own agent that creates cgroups?
> > > > If so it has to be privileged and it can mkdir in bpffs and pin too ?
> > >
> > > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > > the top-level cgroup for each job to run inside. However, the job can
> > > create its own cgroups inside the top-level cgroup, for fine grained
> > > resource control. This doesn't go through the daemon. The job-created
> > > cgroups don't have the pinned objects and this is a no-no for our
> > > users.
> >
> > We can whitelist certain tracepoints to be sleepable and extend
> > tp_btf prog type to include everything from prog_type_syscall.
> > Such prog would attach to cgroup_mkdir and cgroup_release
> > and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> > We can allow prog_type_syscall to do mkdir in bpffs as well.
> >
> > This feature could be useful for similar monitoring/introspection tasks.
> > We can write a program that would monitor bpf prog load/unload
> > and would pin an iterator prog that would show debug info about a prog.
> > Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> > With this feature we can implement:
> > ls /sys/fs/bpf/all_progs.debug/
> > and each loaded prog would have a corresponding file.
> > The file name would be a program name, for example.
> > cat /sys/fs/bpf/all_progs.debug/my_prog
> > would pretty print info about 'my_prog' bpf program.
> >
> > This way the kernfs/cgroupfs specific logic from patches 1-4
> > will not be necessary.
> >
> > wdyt?
>
> Thanks Alexei. I gave it more thought in the last couple of days.
> Actually I think it's a good idea, more flexible. It gets rid of the
> need of a user space daemon for monitoring cgroup creation and
> destruction. We could monitor task creations and exits as well, so
> that we can export per-task information (e.g. task_vma_iter) more
> efficiently.

Yep. Monitoring task creation and exposing via bpf_iter sounds
useful too.

> A couple of thoughts when thinking about the details:
>
> - Regarding parameterized pinning, I don't think we can have one
> single bpf_iter_link object, but with different parameters. Because
> parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
> time we pin, we have to attach iter in order to get a new link object
> first. So we need to add attach and detach in bpf_sys_bpf().

Makes sense.
I'm adding bpf_link_create to bpf_sys_bpf as part of
the "lskel for kernel" patch set.
The detach is sys_close. It's already available.

> - We also need to add those syscalls for cleanup: (1) unlink for
> removing pinned obj and (2) rmdir for removing the directory in
> prog_type_syscall.

Yes. These two would be needed.
And obj_pin too.

> With these extensions, we can shift some of the bpf operations
> currently performed in system daemons into the kernel. IMHO it's a
> great thing, making system monitoring more flexible.

Awesome. Sounds like we're converging :)
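
For reference, the pin and cleanup steps listed above (mkdir, obj_pin, unlink, rmdir) can be sketched from userspace roughly as follows. This is only an illustration under stated assumptions: the helper names pin_in_dir() and unpin_and_rmdir() are hypothetical, the object fd is assumed to come from an earlier link create, and a prog_type_syscall program would have to issue the equivalent commands through bpf_sys_bpf() once those commands are allowed there.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/bpf.h>

/* Thin wrapper around the bpf(2) syscall. */
static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

/* Build "dir/name" into buf. Returns 0, or -1 with errno = ENAMETOOLONG. */
static int join_path(char *buf, size_t len, const char *dir, const char *name)
{
	int n = snprintf(buf, len, "%s/%s", dir, name);

	if (n < 0 || (size_t)n >= len) {
		errno = ENAMETOOLONG;
		return -1;
	}
	return 0;
}

/*
 * Hypothetical helper: create dir (normally a bpffs directory) if it does
 * not exist yet, then pin obj_fd at dir/name with BPF_OBJ_PIN.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int pin_in_dir(int obj_fd, const char *dir, const char *name)
{
	char path[4096];
	union bpf_attr attr;

	if (join_path(path, sizeof(path), dir, name) < 0)
		return -1;
	if (mkdir(dir, 0755) < 0 && errno != EEXIST)
		return -1;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	attr.bpf_fd = (uint32_t)obj_fd;
	return sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr)) < 0 ? -1 : 0;
}

/*
 * Hypothetical cleanup helper: unlink the pinned object (a missing file is
 * tolerated), then remove the now-empty directory.
 */
static int unpin_and_rmdir(const char *dir, const char *name)
{
	char path[4096];

	if (join_path(path, sizeof(path), dir, name) < 0)
		return -1;
	if (unlink(path) < 0 && errno != ENOENT)
		return -1;
	return rmdir(dir);
}
```

On cgroup creation one would call something like pin_in_dir(link_fd, "/sys/fs/bpf/<cgroup>", "stats") and then close(link_fd), since the pin holds its own reference; on cgroup release, unpin_and_rmdir() with the same arguments. The directory layout and the "stats" file name are made up for the example.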