lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 19 Aug 2022 07:06:51 -1000
From:   Tejun Heo <tj@...nel.org>
To:     Yafang Shao <laoar.shao@...il.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>, Martin Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        john fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Stanislav Fomichev <sdf@...gle.com>,
        Hao Luo <haoluo@...gle.com>, jolsa@...nel.org,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        lizefan.x@...edance.com, Cgroups <cgroups@...r.kernel.org>,
        netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>
Subject: Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for
 bpf map

Hello,

On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@...nel.org> wrote:
> >
> > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > We have the exact same problem for any resources which span multiple
> > > instances of a service including page cache, tmpfs instances and any other
> > > thing which can persist longer than procss life time. My current opinion is
> >
> > To expand a bit more on this point, once we start including page cache and
> > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > not-yet-but-eventually CPU usage.
> 
> Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?

Introducing a new layer in cgroup2 doesn't mean that any specific resource
controller is enabled, so there is no runtime overhead difference. In terms
of logical complexity, introducing a localized layer seems a lot more
straightforward than building a whole separate tree.

Note that the same applies to cgroup1 where collapsed controller tree is
represented by simply not creating those layers in that particular
controller tree.

No matter how we cut the problem here, if we want to track these persistent
resources, we have to create a cgroup to host them somewhere. The discussion
we're having is mostly around where to put them. With your proposal, it can
be anywhere and you draw out an example where the persistent cgroups form
their own separate tree. What I'm saying is that the logical place to put it
is where the current resource consumption is and we just need to put the
persistent entity as the parent of the instances.

Flexibility, just like anything else, isn't free. Here, if we extrapolate
this approach, the cost is evidently hefty in that it doesn't generically
work with the basic resource control structure.

> > Once you start splitting the tree like
> > you're suggesting here, all those will break down and now we have to worry
> > about how to split resource accounting and control for the same entities
> > across two split branches of the tree, which doesn't really make any sense.
> 
> The k8s has already been broken thanks to the memcg accounting on  bpf memory.
> If you ignored it, I paste it below.
> [0]"1. The memory usage is not consistent between the first generation and
> new generations."
> 
> This issue will persist even if you introduce a new layer.

Please watch your tone.

Again, this isn't a problem specific to k8s. We have the same problem with
e.g. persistent tmpfs. One idea which I'm not against is allowing specific
resources to be charged to an ancestor. We gotta think carefully about how
such charges should be granted / denied but an approach like that jives well
with the existing hierarchical control structure and because introducing a
persistent layer does too, the combination of the two works well.

> > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > is a dead-end. Please ditch it.
> 
> It makes non-sensen to ditch it.
> Because, the hierarchy I described in the commit log is *one* use case
> of the selectable memcg, but not *the only one* use case of it. If you
> dislike that hierarchy, I will remove it to avoid misleading you.

But if you drop that, what'd be the rationale for adding what you're
proposing? Why would we want bpf memory charges to be attached any part of
the hierarchy?

> Even if you introduce a new layer, you still need the selectable memcg.
> For example, to avoid the issue I described in [0],  you still need to
> charge to the parent cgroup instead of the current cgroup.

As I wrote above, we've been discussing the above. Again, I'd be a lot more
amenable to such approach because it fits with how everything is structured.

> That's why I described in the commit log that the selectable memcg is flexible.

Hopefully, my point on this is clear by now.

Thanks.

-- 
tejun

Powered by blists - more mailing lists