netdev - Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALOAHbCvUxQn5Zkp2FJ+eL1VgjeRSq1xQhzdiY87C1Cbib-nig@mail.gmail.com>
Date:   Sat, 20 Aug 2022 10:25:59 +0800
From:   Yafang Shao <laoar.shao@...il.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>, Martin Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        john fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Stanislav Fomichev <sdf@...gle.com>,
        Hao Luo <haoluo@...gle.com>, jolsa@...nel.org,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Zefan Li <lizefan.x@...edance.com>,
        Cgroups <cgroups@...r.kernel.org>,
        netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>
Subject: Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map

On Sat, Aug 20, 2022 at 1:06 AM Tejun Heo <tj@...nel.org> wrote:
>
> Hello,
>
> On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> > On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@...nel.org> wrote:
> > >
> > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > > We have the exact same problem for any resources which span multiple
> > > > instances of a service including page cache, tmpfs instances and any other
> > > > thing which can persist longer than procss life time. My current opinion is
> > >
> > > To expand a bit more on this point, once we start including page cache and
> > > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > > not-yet-but-eventually CPU usage.
> >
> > Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?
>
> Introducing a new layer in cgroup2 doesn't mean that any specific resource
> controller is enabled, so there is no runtime overhead difference. In terms
> of logical complexity, introducing a localized layer seems a lot more
> straightforward than building a whole separate tree.
>
> Note that the same applies to cgroup1 where collapsed controller tree is
> represented by simply not creating those layers in that particular
> controller tree.
>

No, we have observed on our product env that multiple-layers cpuacct
would cause obvious performance hit due to cache miss.

> No matter how we cut the problem here, if we want to track these persistent
> resources, we have to create a cgroup to host them somewhere. The discussion
> we're having is mostly around where to put them. With your proposal, it can
> be anywhere and you draw out an example where the persistent cgroups form
> their own separate tree. What I'm saying is that the logical place to put it
> is where the current resource consumption is and we just need to put the
> persistent entity as the parent of the instances.
>
> Flexibility, just like anything else, isn't free. Here, if we extrapolate
> this approach, the cost is evidently hefty in that it doesn't generically
> work with the basic resource control structure.
>
> > > Once you start splitting the tree like
> > > you're suggesting here, all those will break down and now we have to worry
> > > about how to split resource accounting and control for the same entities
> > > across two split branches of the tree, which doesn't really make any sense.
> >
> > The k8s has already been broken thanks to the memcg accounting on  bpf memory.
> > If you ignored it, I paste it below.
> > [0]"1. The memory usage is not consistent between the first generation and
> > new generations."
> >
> > This issue will persist even if you introduce a new layer.
>
> Please watch your tone.
>

Hm? I apologize if my words offend you.
But, could you pls take a serious look at the patchset  before giving a NACK?
You didn't even want to know the background before you sent your NACK.

> Again, this isn't a problem specific to k8s. We have the same problem with
> e.g. persistent tmpfs. One idea which I'm not against is allowing specific
> resources to be charged to an ancestor. We gotta think carefully about how
> such charges should be granted / denied but an approach like that jives well
> with the existing hierarchical control structure and because introducing a
> persistent layer does too, the combination of the two works well.
>
> > > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > > is a dead-end. Please ditch it.
> >
> > It makes non-sensen to ditch it.
> > Because, the hierarchy I described in the commit log is *one* use case
> > of the selectable memcg, but not *the only one* use case of it. If you
> > dislike that hierarchy, I will remove it to avoid misleading you.
>
> But if you drop that, what'd be the rationale for adding what you're
> proposing? Why would we want bpf memory charges to be attached any part of
> the hierarchy?
>

I have explained it to you.
But unfortunately you ignored it again.
But I don't mind explaining to you again.

                 Parent-memcg
                     \
                   Child-memcg (k8s pod)

The user can charge the memory to the parent directly without charging
into the k8s pod.
Then the memory.stat is consistent between different generations.

> > Even if you introduce a new layer, you still need the selectable memcg.
> > For example, to avoid the issue I described in [0],  you still need to
> > charge to the parent cgroup instead of the current cgroup.
>
> As I wrote above, we've been discussing the above. Again, I'd be a lot more
> amenable to such approach because it fits with how everything is structured.
>
> > That's why I described in the commit log that the selectable memcg is flexible.
>
> Hopefully, my point on this is clear by now.
>

Unfortunately, you didn't want to get my point.


-- 
Regards
Yafang