linux-kernel - Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup


Message-ID: <CAADnVQ+-29CS7nSXghKMgZjKte84L0nRDegUE0ObFm3d7E=eWw@mail.gmail.com>
Date:   Sat, 5 Feb 2022 20:29:10 -0800
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Hao Luo <haoluo@...gle.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Andrii Nakryiko <andrii@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        KP Singh <kpsingh@...nel.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Joe Burton <jevburton.kernel@...il.com>,
        Stanislav Fomichev <sdf@...gle.com>, bpf <bpf@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for
 cgroup_view link

On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@...gle.com> wrote:
> >
> > > In our use case, we can't ask the users who create cgroups to do the
> > > pinning. Pinning requires root privilege. In our use case, we have
> > > non-root users who can create cgroup directories and still want to
> > > read bpf stats. They can't do pinning by themselves. This is why
> > > inheritance is a requirement for us. With inheritance, they only need
> > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > operation is required. Patch 1-4 are needed to implement inheritance.
> > >
> > > It's also not a good idea in our use case to add a userspace
> > > privileged process to monitor cgroupfs operations and perform the
> > > pinning. It's more complex and has a higher maintenance cost and
> > > runtime overhead, compared to the solution of asking whoever makes
> > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > the data center that don't have the userspace process deployed, the
> > > stats will be unavailable, which is a no-no for some of our users.
> >
> > The commit log says that there will be a daemon that does that
> > monitoring of cgroupfs. And that daemon needs to mkdir
> > directories in bpffs when a new cgroup is created, no?
> > The kernel is only doing inheritance of bpf progs into
> > new dirs. I think that daemon can pin as well.
> >
> > The cgroup creation is typically managed by an agent like systemd.
> > Sounds like you have your own agent that creates cgroups?
> > If so it has to be privileged and it can mkdir in bpffs and pin too ?
>
> Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> the top-level cgroup for each job to run inside. However, the job can
> create its own cgroups inside the top-level cgroup, for fine grained
> resource control. This doesn't go through the daemon. The job-created
> cgroups don't have the pinned objects and this is a no-no for our
> users.

We can whitelist certain tracepoints to be sleepable and extend the
tp_btf prog type to include everything from prog_type_syscall.
Such a prog would attach to cgroup_mkdir and cgroup_release
and would call the bpf_sys_bpf() helper to pin progs in new bpffs dirs.
We can allow prog_type_syscall to do mkdir in bpffs as well.
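
To make the pin step concrete, here is a rough userspace sketch of the
BPF_OBJ_PIN request such a prog would issue. Inside a BPF prog the same
attr would be passed to bpf_sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr))
instead of going through syscall(2). The function names pin_path() and
pin_obj(), and the "<root>/<cgroup id>/<name>" directory layout, are
assumptions for illustration only; nothing in this thread fixes them.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/*
 * Build "<root>/<cgroup_id>/<name>" into out.
 * The per-cgroup directory layout is hypothetical.
 * Returns 0 on success, -1 if the path does not fit in len bytes.
 */
int pin_path(char *out, size_t len, const char *root,
             unsigned long long cgroup_id, const char *name)
{
	int n;

	assert(out && root && name);
	n = snprintf(out, len, "%s/%llu/%s", root, cgroup_id, name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Pin the bpf object referred to by fd at path in bpffs.
 * Userspace equivalent of what the prog would do via bpf_sys_bpf().
 * Returns 0 on success, -1 with errno set on failure.
 */
int pin_obj(int fd, const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	attr.bpf_fd = (uint32_t)fd;

	return (int)syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}
```

A handler for cgroup_mkdir would build the path for the new cgroup with
something like pin_path() and then pin the link fd there; the
cgroup_release side would remove the pinned file and the directory again.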

This feature could be useful for similar monitoring/introspection tasks.
We can write a program that would monitor bpf prog load/unload
and would pin an iterator prog that would show debug info about a prog.
Today, cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
With this feature we can implement:
ls /sys/fs/bpf/all_progs.debug/
and each loaded prog would have a corresponding file.
The file name would be a program name, for example.
cat /sys/fs/bpf/all_progs.debug/my_prog
would pretty print info about 'my_prog' bpf program.

This way the kernfs/cgroupfs specific logic from patches 1-4
will not be necessary.

wdyt?