Date:   Thu, 8 Aug 2019 16:02:57 -0700
From:   Andrew Morton <akpm@...ux-foundation.org>
To:     Roman Gushchin <guro@...com>
Cc:     "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Michal Hocko <mhocko@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH] mm: memcontrol: flush slab vmstats on kmem offlining

On Thu, 8 Aug 2019 21:47:11 +0000 Roman Gushchin <guro@...com> wrote:

> On Thu, Aug 08, 2019 at 02:21:46PM -0700, Andrew Morton wrote:
> > On Thu, 8 Aug 2019 13:36:04 -0700 Roman Gushchin <guro@...com> wrote:
> > 
> > > I've noticed that the "slab" value in memory.stat is sometimes 0,
> > > even if some child memory cgroups have a non-zero "slab" value.
> > > Investigation showed that this is the result of kmem_cache
> > > reparenting in combination with the per-cpu batching of slab
> > > vmstats.
> > > 
> > > At offlining, some vmstat values may be left in the percpu cache,
> > > never having been propagated up the cgroup hierarchy. This means
> > > that the stats at ancestor levels are lower than the actual values.
> > > Later, when slab pages are released, the precise number of pages is
> > > subtracted at the parent level, driving the value negative. Negative
> > > values aren't shown; 0 is printed instead.
> > > 
> > > To fix this issue, let's flush the percpu slab vmstats (both the
> > > memcg- and lruvec-level counters) on memcg offlining. This
> > > guarantees that the numbers at all ancestor levels are accurate
> > > and match the actual number of outstanding slab pages.
> > > 
> > 
> > Looks expensive.  How frequently can these functions be called?
> 
> Once per memcg lifetime.

iirc there are some workloads in which this can be rapid?
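
For concreteness, here is a rough sketch of the shape of such a flush.
This is an illustration, not the actual patch: the field names
vmstats_percpu and vmstats, and the exact stat range, are assumptions,
and the hunk quoted below additionally makes a per-node pass for the
lruvec counters. Note the walk over all possible CPUs, which is the
cost discussed next.

/*
 * Sketch: drain the batched percpu slab counters of a memcg being
 * offlined into the atomic counters of the memcg and all of its
 * ancestors, so that "slab" at every ancestor level matches the
 * real number of outstanding slab pages.
 */
static void memcg_flush_slab_stats(struct mem_cgroup *memcg)
{
	long stat[NR_SLAB_UNRECLAIMABLE - NR_SLAB_RECLAIMABLE + 1] = { 0 };
	struct mem_cgroup *mi;
	int cpu, i;

	/* Collect whatever is still sitting in the percpu caches. */
	for_each_possible_cpu(cpu)
		for (i = NR_SLAB_RECLAIMABLE; i <= NR_SLAB_UNRECLAIMABLE; i++)
			stat[i - NR_SLAB_RECLAIMABLE] +=
				per_cpu(memcg->vmstats_percpu->stat[i], cpu);

	/* Propagate the collected deltas up the hierarchy. */
	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
		for (i = NR_SLAB_RECLAIMABLE; i <= NR_SLAB_UNRECLAIMABLE; i++)
			atomic_long_add(stat[i - NR_SLAB_RECLAIMABLE],
					&mi->vmstats[i]);
}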

> > > +	for_each_node(node)
> > > +		memcg_flush_slab_node_stats(memcg, node);
> > 
> > This loops across all possible CPUs once for each possible node.  Ouch.
> > 
> > Implementing hotplug handlers in here (which is surprisingly simple)
> > brings this down to num_online_nodes * num_online_cpus, which is, I
> > think, potentially vastly better.
> >
> 
> Hm, maybe I'm biased because we don't play much with offlining, and
> don't have many NUMA nodes. What's the real-world scenario? Disabling
> hyperthreading?

I assume it's machines which could take a large number of CPUs but in
fact have few.  I've asked this in response to many patches down the
ages and have never really got a clear answer.

A concern is that if such machines do exist, it will take a long time
for the regression reports to get to us.  Especially if such machines
are rare.

> Idk, given that it happens once per memcg lifetime, and memcg destruction
> isn't cheap anyway, I'm not sure it's worth it. But if you are, I'm happy
> to add hotplug handlers.

I think it's worth taking a look.  As I mentioned, it can turn out to
be stupidly simple.
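
For reference, mm/memcontrol.c already registers a dead-CPU callback,
memcg_hotplug_cpu_dead(), via cpuhp_setup_state_nocalls(CPUHP_MM_MEMCG_DEAD,
...), which (at least) drains the percpu charge stock. A rough sketch of
the kind of extension being suggested; memcg_flush_cpu_vmstats() below
is a hypothetical helper, not an existing function:

/*
 * Sketch: when a CPU goes away, fold its percpu vmstat deltas into
 * the atomic counters of every memcg.  Once a dead CPU can no longer
 * hold stale deltas, the offlining flush above only has to walk
 * online CPUs instead of all possible ones.
 */
static int memcg_hotplug_cpu_dead(unsigned int cpu)
{
	struct mem_cgroup *memcg;

	drain_stock(&per_cpu(memcg_stock, cpu));	/* existing behavior */

	for_each_mem_cgroup(memcg)
		memcg_flush_cpu_vmstats(memcg, cpu);	/* hypothetical */

	return 0;
}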

> I also thought about merging the per-memcg stats and the per-memcg-per-node
> stats (the reading side can aggregate over 2? 4? NUMA nodes each time).
> That would make everything cheaper overall. But it's a separate topic.
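
A rough sketch of the read-side aggregation being hinted at here; this
is purely illustrative, and memcg_node_stat() is a hypothetical
accessor:

/*
 * Sketch: if only per-node counters are maintained on the write side,
 * a memcg-level reader sums them on demand.  With a handful of NUMA
 * nodes this stays cheap, and writers no longer have to update two
 * parallel sets of counters.
 */
static unsigned long memcg_read_stat(struct mem_cgroup *memcg, int idx)
{
	unsigned long sum = 0;
	int node;

	for_each_node(node)
		sum += memcg_node_stat(memcg, node, idx);	/* hypothetical */

	return sum;
}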
