netdev - Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter

Message-ID: <CAJD7tkaJQjfSy+YARFRkqQ8m7OGJHO9v91mSk-cFeo9Z5UVJKg@mail.gmail.com>
Date:   Fri, 20 May 2022 14:18:42 -0700
From:   Yosry Ahmed <yosryahmed@...gle.com>
To:     Hao Luo <haoluo@...gle.com>
Cc:     Tejun Heo <tj@...nel.org>, Yonghong Song <yhs@...com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Zefan Li <lizefan.x@...edance.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Shuah Khan <shuah@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Michal Hocko <mhocko@...nel.org>,
        Stanislav Fomichev <sdf@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Greg Thelen <gthelen@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Networking <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Cgroups <cgroups@...r.kernel.org>
Subject: Re: [PATCH bpf-next v1 3/5] bpf: Introduce cgroup iter

On Fri, May 20, 2022 at 12:43 PM Hao Luo <haoluo@...gle.com> wrote:
>
> Hi Tejun and Yonghong,
>
> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@...nel.org> wrote:
> > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> > > Maybe you can have a bpf program signature like below:
> > >
> > > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
> > > struct cgroup *parent_cgrp)
> > >
> > > parent_cgrp is NULL when cgrp is the root cgroup.
> > >
> > > I would like the bpf program to send the following information to
> > > user space:
> > >    <parent cgroup dir name> <current cgroup dir name>
> >
> > I don't think parent cgroup dir name would be sufficient to reconstruct the
> > path given that multiple cgroups in different subtrees can have the same
> > name. For live cgroups, userspace can find the path from id (or ino) without
> > traversing anything by constructing the fhandle, opening it with
> > open_by_handle_at(), and then reading the /proc/self/fd/$FD symlink -
> > https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
> > but I'm not sure how much that'd matter given that they aren't visible from
> > userspace anyway.
> >
>
> Sending cgroup id is better than cgroup dir name, also because IIUC
> the path obtained from cgroup id depends on the namespace of the
> userspace process. So if the dump file may be potentially read by
> processes within a container, it's better to have the output
> namespaced IMO.
>
> > >    <various stats of interest to the user>
> > >
> > > This way, user space can easily construct the cgroup hierarchy stat like
> > >                            cpu   mem   cpu pressure   mem pressure ...
> > >    cgroup1                 ...
> > >       child1               ...
> > >         grandchild1        ...
> > >       child2               ...
> > >    cgroup 2                ...
> > >       child 3              ...
> > >         ...                ...
> > >
> > > the bpf iterator can have additional parameter like
> > > cgroup_id = ... to only call bpf program once with that
> > > cgroup_id if specified.
>
> Yep, this should work. We just need to make the cgroup_id parameter
> optional. If it is specified when creating bpf_iter_link, we print for
> that cgroup only. If it is not specified, we iterate over all cgroups.
> If I understand correctly, sounds doable.
>
> > > The kernel part of cgroup_iter can call cgroup_rstat_flush()
> > > before calling cgroup_iter bpf program.
>
> Sounds good to me as well. But my knowledge of rstat_flush is limited.
> Yosry can give this a try.
>
> >
> > Would it work to just pass in @cgrp and provide a group of helpers so that
> > the program can do whatever it wants to do, including looking up the full path
> > and passing that to userspace?
> >
>
> My understanding is, yes, doable. If we need the full path information
> of a cgroup, helpers or kfuncs are needed.
>
> The userspace needs to specify the identity of the cgroup, when
> creating bpf_iter. This identity could be cgroup id or fd. This
> identity needs to be converted to cgroup object somewhere before
> passing into bpf program to use.


Let's sum up the discussion here; I feel like we are losing track of
the main problem. IIUC, the main concern is that cgroup_iter is not
really an iterator: it dumps information for a single cgroup. I
like the suggestion to make it iterate over all cgroups by default,
with an optional cgroup_id parameter that makes it "iterate" over only
that one cgroup. IIUC, this cgroup_id parameter would be a link
parameter, similar to the current approach. Basically, we extend the
current patch so that, if cgroup_id is not specified, the iterator gets
called for all cgroups instead of one. This fixes the problem for our
use case and also keeps cgroup_iter generic enough. Is my understanding
correct? If yes, I don't see a need to flush rstat in the kernel on
behalf of cgroup_iter progs.
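
To make the semantics summarized above concrete, here is a rough userspace
sketch in C. It is a toy model, not kernel code: every toy_* name is made up
for illustration, the pre-order walk is only one possible order, and treating
cgroup_id == 0 as "not specified" is an assumption of this sketch rather than
something decided in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a cgroup; NOT the kernel's struct cgroup. */
struct toy_cgroup {
	uint64_t id;
	struct toy_cgroup *parent;       /* NULL for the root */
	struct toy_cgroup *first_child;  /* NULL for a leaf */
	struct toy_cgroup *next_sibling; /* NULL for the last child */
};

/* Stand-in for the attached bpf program. */
typedef void (*toy_prog_fn)(const struct toy_cgroup *cgrp, void *ctx);

/* Example "program": record the ids it is called with, in order. */
struct toy_id_log {
	uint64_t ids[16];
	size_t n;
};

static void toy_log_id(const struct toy_cgroup *cgrp, void *ctx)
{
	struct toy_id_log *log = ctx;

	if (log->n < 16)
		log->ids[log->n++] = cgrp->id;
}

/*
 * Walk the subtree under @root in pre-order.
 * Assumption: cgroup_id == 0 means "not specified", so @prog runs for
 * every cgroup; otherwise @prog runs only for the matching cgroup.
 * Returns the number of times @prog was called.
 */
static int toy_cgroup_iter(const struct toy_cgroup *root, uint64_t cgroup_id,
			   toy_prog_fn prog, void *ctx)
{
	const struct toy_cgroup *pos = root;
	int calls = 0;

	while (pos) {
		if (cgroup_id == 0 || pos->id == cgroup_id) {
			prog(pos, ctx);
			calls++;
			if (cgroup_id != 0)
				break; /* single-cgroup mode: done */
		}

		if (pos->first_child) {
			pos = pos->first_child;
			continue;
		}

		/* Climb until a sibling exists or we are back at root. */
		while (pos != root && !pos->next_sibling)
			pos = pos->parent;
		if (pos == root)
			break;
		pos = pos->next_sibling;
	}

	return calls;
}
```

With a root (id 1) that has children 2 and 4, and a grandchild 3 under 2,
passing cgroup_id 0 calls the program for 1, 2, 3, 4 in that order, while
passing cgroup_id 3 calls it once for cgroup 3 only.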