linux-kernel - Re: [PATCH] memcg: expose root cgroup's memory.stat

Message-ID: <20200515180920.GC94522@carbon.DHCP.thefacebook.com>
Date:   Fri, 15 May 2020 11:09:20 -0700
From:   Roman Gushchin <guro@...com>
To:     Shakeel Butt <shakeelb@...gle.com>
CC:     Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>, Mel Gorman <mgorman@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Yafang Shao <laoar.shao@...il.com>,
        Linux MM <linux-mm@...ck.org>,
        Cgroups <cgroups@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] memcg: expose root cgroup's memory.stat

On Fri, May 15, 2020 at 10:49:22AM -0700, Shakeel Butt wrote:
> On Fri, May 15, 2020 at 8:00 AM Roman Gushchin <guro@...com> wrote:
> >
> > On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner <hannes@...xchg.org> wrote:
> > > >
> > > > On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > > > > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > > > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner <hannes@...xchg.org> wrote:
> > > > > > >
> > > > > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > > > > > > ratio (pgscan+pgrefill)/pgsteal. However, at the moment these stats are
> > > > > > > > > not updated consistently at the system level, and their ratio is
> > > > > > > > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > > > > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > > > > > reclaim.
> > > > > > > >
> > > > > > > > Please note that this difference is only for system level vmstats. The
> > > > > > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > > > > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > > > > > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > > > > > > these stats from root's memory.stat, so, expose memory.stat for the root
> > > > > > > > cgroup.
> > > > > > > >
> > > > > > > >       from Johannes Weiner:
> > > > > > > >       There are subtle differences between /proc/vmstat and
> > > > > > > >       memory.stat, and cgroup-aware code that wants to watch the full
> > > > > > > >       hierarchy currently has to know about these intricacies and
> > > > > > > >       translate semantics back and forth.
> > > > >
> > > > > Can we have those subtle differences documented please?
> > > > >
> > > > > > > >
> > > > > > > >       Generally having the fully recursive memory.stat at the root
> > > > > > > >       level could help a broader range of usecases.
> > > > > > >
> > > > > > > The changelog begs the question why we don't just "fix" the
> > > > > > > system-level stats. It may be useful to include the conclusions from
> > > > > > > that discussion, and why there is value in keeping the stats this way.
> > > > > > >
> > > > > >
> > > > > > Right. Andrew, can you please add the following para to the changelog?
> > > > > >
> > > > > > Why not fix the stats by including both the global and cgroup reclaim
> > > > > > activity instead of exposing root cgroup's memory.stat? The reason is
> > > > > > the benefit of having metrics exposing the activity that happens
> > > > > > purely due to machine capacity rather than localized activity that
> > > > > > happens due to the limits throughout the cgroup tree. Additionally
> > > > > > > there are userspace tools like sysstat (sar) which read these stats to
> > > > > > inform about the system level reclaim activity. So, we should not
> > > > > > break such use-cases.
> > > > > >
> > > > > > > > Signed-off-by: Shakeel Butt <shakeelb@...gle.com>
> > > > > > > > Suggested-by: Johannes Weiner <hannes@...xchg.org>
> > > > > > >
> > > > > > > Acked-by: Johannes Weiner <hannes@...xchg.org>
> > > > > >
> > > > > > Thanks a lot.
> > > > >
> > > > > I was quite surprised that the patch is so simple TBH. For some reason
> > > > > I've still had memories that we do not account for root memcg (likely
> > > > > > because mem_cgroup_is_root(memcg) bails out in try_charge). But stats
> > > > > are slightly different here.
> > > >
> > > > Yep, we skip the page_counter for root, but keep in mind that cgroup1
> > > > *does* have a root-level memory.stat, so (for the most part) we've
> > > > been keeping consumer stats for the root level the whole time.
> > > >
> > > > > counters because they are not really all the same. E.g.
> > > > > - mem_cgroup_charge_statistics accounts for each memcg
> > > >
> > > > Yep, that's heritage from cgroup1.
> > > >
> > > > > - memcg_charge_kernel_stack relies on pages being associated with a
> > > > >   memcg and that in turn relies on __memcg_kmem_charge_page which bails
> > > > >   out on root memcg
> > > >
> > > > You're right. It should only bypass the page_counter, but still set
> > > > page->mem_cgroup = root_mem_cgroup, just like user pages.
> >
> > What about kernel threads? We consider them belonging to the root memory
> > cgroup. Should their memory consumption be considered in root-level stats?
> >
> > I'm not sure we really want it, but I guess we need to document how
> > kernel threads are handled.
> >
> 
> What will be the cons of updating root-level stats for kthreads?

It makes total sense for stacks, but not so much for slab memory,
because that is really "some part of the total slab memory, which is
accounted on the memcg level". It also comes with some performance
overhead.

I'm not really opposing any solution, just saying we need to document
what's included in these statistics and what is not.

Thanks!
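
For reference, the reclaim-efficiency ratio from the original changelog
quoted above, (pgscan+pgrefill)/pgsteal, can be computed from any text in
the flat "key value" format that memory.stat uses. The following is only a
minimal sketch: the helper names parse_stat and reclaim_ratio and the sample
numbers are made up for illustration, and reading the root cgroup's
memory.stat from a real path assumes a kernel with this patch applied.

```python
def parse_stat(text):
    """Parse flat 'key value' lines into a dict of integers.

    Lines that do not consist of exactly a key and a non-negative
    integer are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def reclaim_ratio(stats):
    """Return (pgscan + pgrefill) / pgsteal, or None if nothing was reclaimed."""
    steal = stats.get("pgsteal", 0)
    if steal == 0:
        return None
    return (stats.get("pgscan", 0) + stats.get("pgrefill", 0)) / steal


# Hypothetical sample in memory.stat format; the numbers are invented.
SAMPLE = """\
pgrefill 200
pgscan 1000
pgsteal 400
"""

if __name__ == "__main__":
    print(reclaim_ratio(parse_stat(SAMPLE)))
```

A lower ratio means fewer pages had to be scanned or refilled per page
actually reclaimed. The ratio is only meaningful when all three counters
cover the same kinds of reclaim, which is the consistency problem the
changelog describes for the system-level stats.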