linux-kernel - Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu

Message-ID: <20200203222853.GD6781@xps.dhcp.thefacebook.com>
Date:   Mon, 3 Feb 2020 14:28:53 -0800
From:   Roman Gushchin <guro@...com>
To:     Johannes Weiner <hannes@...xchg.org>
CC:     <linux-mm@...ck.org>, Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...nel.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        <linux-kernel@...r.kernel.org>, <kernel-team@...com>,
        Bharata B Rao <bharata@...ux.ibm.com>,
        Yafang Shao <laoar.shao@...il.com>
Subject: Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in
 struct per_cpu_nodestat

On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > > It works fine because the overfill threshold can't exceed 125.
> > > > 
> > > > But if some counters are in bytes (and the next commit in the series
> > > > will convert slab counters to bytes), it's not gonna work:
> > > > value in bytes can easily exceed s8 without exceeding the threshold
> > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > > vmstats correctness, let's use s32 instead.
> > > > 
> > > > This doesn't affect per-zone statistics. There are no plans to use
> > > > zone-level byte-sized counters, so no reasons to change anything.
> > > 
> > > Wait, is this still necessary? AFAIU, the node counters will account
> > > full slab pages, including free space, and only the memcg counters
> > > that track actual objects will be in bytes.
> > > 
> > > Can you please elaborate?
> > 
> > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> > being in different units depending on the accounting scope.
> > So I do convert all slab counters: global, per-lruvec,
> > and per-memcg to bytes.
> 
> Since the node counters track allocated slab pages and the memcg
> counter tracks allocated objects, arguably they shouldn't use the same
> name anyway.
> 
> > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> > NR_SLAB_RECLAIMABLE_OBJ
> > NR_SLAB_UNRECLAIMABLE_OBJ
> 
> Can we alias them and reuse their slots?
> 
> 	/* Reuse the node slab page counter items for charged objects */
> 	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
> 	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,

Yeah, lgtm.

Isn't the MEMCG_ prefix bad, though, because it makes everybody think
that these items belong to enum memcg_stat_item?
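
To make the aliasing idea concrete, here is a minimal userspace sketch. It
is not kernel code: the enum below is a toy stand-in for enum
node_stat_item with illustrative values, and toy_memcg_stats and
toy_mod_stat() are hypothetical names. Only the two alias lines come from
the snippet quoted above.

```c
#include <assert.h>

/* Toy stand-in for the kernel's enum node_stat_item (values illustrative). */
enum node_stat_item {
	NR_SLAB_RECLAIMABLE,
	NR_SLAB_UNRECLAIMABLE,
	NR_VM_NODE_STAT_ITEMS,

	/* Aliases from the quoted proposal: same slot, different name. */
	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,
};

/* The alias costs no extra array slot; it is only a second spelling. */
static_assert(MEMCG_SLAB_RECLAIMABLE == NR_SLAB_RECLAIMABLE,
	      "alias must share the node counter slot");

/* Hypothetical per-memcg counter array indexed by node_stat_item. */
struct toy_memcg_stats {
	long count[NR_VM_NODE_STAT_ITEMS];
};

/* Hypothetical helper: add val to the counter selected by idx. */
static void toy_mod_stat(struct toy_memcg_stats *s,
			 enum node_stat_item idx, long val)
{
	s->count[idx] += val;
}
```

Because the aliases live in the same enum as the NR_ items, the compiler
sees one type, which is also why a MEMCG_ prefix could mislead readers
into looking for them in enum memcg_stat_item.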

> 
> > and keep global counters untouched. If going this way, I'd prefer to make
> > them per-memcg, because it will simplify things on charging paths:
> > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> > bump per-lruvec counters.
> 
> I don't quite follow. Don't you still have to update the global
> counters?

Global counters are updated only if an allocation requires a new slab
page, which isn't the most common path.
In the generic case, post_hook is required because it's the only place
where we have both the page (to get the node) and the memcg pointer.

If NR_SLAB_RECLAIMABLE is tracked only per-memcg (like MEMCG_SOCK),
then post_hook only needs to handle the rare "allocation failed" case.

I'm not sure which option is better.

> 
> > Btw, I wonder if we really need per-lruvec counters at all (at least
> > being enabled by default). For the significant number of users who
> > have a single-node machine it doesn't bring anything except performance
> > overhead.
> 
> Yeah, for single-node systems we should be able to redirect everything
> to the memcg counters, without allocating and tracking lruvec copies.

Sounds good. It can lead to significant savings on single-node machines.

> 
> > For those who have multiple nodes (and most likely many many
> > memory cgroups) it provides way too much data except for debugging
> > some weird mm issues.
> > I guess in the absolute majority of cases having global per-node + per-memcg
> > counters will be enough.
> 
> Hm? Reclaim uses the lruvec counters.

Can you please provide some examples? It looks like it's mostly based
on per-zone lruvec size counters.

Anyway, this seems to be a bit off-topic for this patchset, so let's
discuss it separately.
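
For reference, the s8 concern from the commit message quoted at the top of
this thread can be illustrated with a small userspace sketch. It assumes a
4096-byte page and uses the maximum threshold of 125 mentioned there; all
toy_* names are hypothetical and this is not the kernel's vmstat code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE		4096L	/* assumed page size */
#define TOY_THRESHOLD_PAGES	125L	/* max overfill threshold, per the commit message */

/* With page-sized units, any cached value up to the threshold fits in s8. */
static_assert(TOY_THRESHOLD_PAGES <= INT8_MAX,
	      "page-based threshold fits in s8");

/* The same threshold expressed in bytes. */
static long toy_threshold_bytes(void)
{
	return TOY_THRESHOLD_PAGES * TOY_PAGE_SIZE;
}

/* True if a pending delta stays in the per-cpu cache instead of being flushed. */
static bool toy_stays_cached(long pending_bytes)
{
	long mag = pending_bytes < 0 ? -pending_bytes : pending_bytes;

	return mag <= toy_threshold_bytes();
}

/* Range checks for the two candidate per-cpu diff types. */
static bool toy_fits_s8(long v)
{
	return v >= INT8_MIN && v <= INT8_MAX;
}

static bool toy_fits_s32(long v)
{
	return v >= INT32_MIN && v <= INT32_MAX;
}
```

A pending delta of, say, 300 bytes is far below the byte threshold, so it
would stay cached, yet it is already outside the s8 range; every value up
to the byte threshold fits in s32.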