[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALOAHbCvUxQn5Zkp2FJ+eL1VgjeRSq1xQhzdiY87C1Cbib-nig@mail.gmail.com>
Date: Sat, 20 Aug 2022 10:25:59 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Tejun Heo <tj@...nel.org>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, Martin Lau <kafai@...com>,
Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
john fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...gle.com>,
Hao Luo <haoluo@...gle.com>, jolsa@...nel.org,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeelb@...gle.com>,
Muchun Song <songmuchun@...edance.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Zefan Li <lizefan.x@...edance.com>,
Cgroups <cgroups@...r.kernel.org>,
netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
Linux MM <linux-mm@...ck.org>
Subject: Re: [PATCH bpf-next v2 00/12] bpf: Introduce selectable memcg for bpf map
On Sat, Aug 20, 2022 at 1:06 AM Tejun Heo <tj@...nel.org> wrote:
>
> Hello,
>
> On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote:
> > On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@...nel.org> wrote:
> > >
> > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote:
> > > > We have the exact same problem for any resources which span multiple
> > > > instances of a service including page cache, tmpfs instances and any other
> > > > thing which can persist longer than procss life time. My current opinion is
> > >
> > > To expand a bit more on this point, once we start including page cache and
> > > tmpfs, we now get entangled with memory reclaim which then brings in IO and
> > > not-yet-but-eventually CPU usage.
> >
> > Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead?
>
> Introducing a new layer in cgroup2 doesn't mean that any specific resource
> controller is enabled, so there is no runtime overhead difference. In terms
> of logical complexity, introducing a localized layer seems a lot more
> straightforward than building a whole separate tree.
>
> Note that the same applies to cgroup1 where collapsed controller tree is
> represented by simply not creating those layers in that particular
> controller tree.
>
No, we have observed on our product env that multiple-layers cpuacct
would cause obvious performance hit due to cache miss.
> No matter how we cut the problem here, if we want to track these persistent
> resources, we have to create a cgroup to host them somewhere. The discussion
> we're having is mostly around where to put them. With your proposal, it can
> be anywhere and you draw out an example where the persistent cgroups form
> their own separate tree. What I'm saying is that the logical place to put it
> is where the current resource consumption is and we just need to put the
> persistent entity as the parent of the instances.
>
> Flexibility, just like anything else, isn't free. Here, if we extrapolate
> this approach, the cost is evidently hefty in that it doesn't generically
> work with the basic resource control structure.
>
> > > Once you start splitting the tree like
> > > you're suggesting here, all those will break down and now we have to worry
> > > about how to split resource accounting and control for the same entities
> > > across two split branches of the tree, which doesn't really make any sense.
> >
> > The k8s has already been broken thanks to the memcg accounting on bpf memory.
> > If you ignored it, I paste it below.
> > [0]"1. The memory usage is not consistent between the first generation and
> > new generations."
> >
> > This issue will persist even if you introduce a new layer.
>
> Please watch your tone.
>
Hm? I apologize if my words offend you.
But, could you pls take a serious look at the patchset before giving a NACK?
You didn't even want to know the background before you sent your NACK.
> Again, this isn't a problem specific to k8s. We have the same problem with
> e.g. persistent tmpfs. One idea which I'm not against is allowing specific
> resources to be charged to an ancestor. We gotta think carefully about how
> such charges should be granted / denied but an approach like that jives well
> with the existing hierarchical control structure and because introducing a
> persistent layer does too, the combination of the two works well.
>
> > > So, we *really* don't wanna paint ourselves into that kind of a corner. This
> > > is a dead-end. Please ditch it.
> >
> > It makes non-sensen to ditch it.
> > Because, the hierarchy I described in the commit log is *one* use case
> > of the selectable memcg, but not *the only one* use case of it. If you
> > dislike that hierarchy, I will remove it to avoid misleading you.
>
> But if you drop that, what'd be the rationale for adding what you're
> proposing? Why would we want bpf memory charges to be attached any part of
> the hierarchy?
>
I have explained it to you.
But unfortunately you ignored it again.
But I don't mind explaining to you again.
Parent-memcg
\
Child-memcg (k8s pod)
The user can charge the memory to the parent directly without charging
into the k8s pod.
Then the memory.stat is consistent between different generations.
> > Even if you introduce a new layer, you still need the selectable memcg.
> > For example, to avoid the issue I described in [0], you still need to
> > charge to the parent cgroup instead of the current cgroup.
>
> As I wrote above, we've been discussing the above. Again, I'd be a lot more
> amenable to such approach because it fits with how everything is structured.
>
> > That's why I described in the commit log that the selectable memcg is flexible.
>
> Hopefully, my point on this is clear by now.
>
Unfortunately, you didn't want to get my point.
--
Regards
Yafang
Powered by blists - more mailing lists