linux-kernel - Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20200204014446.GB9138@xps.dhcp.thefacebook.com>
Date:   Mon, 3 Feb 2020 17:44:46 -0800
From:   Roman Gushchin <guro@...com>
To:     Johannes Weiner <hannes@...xchg.org>
CC:     <linux-mm@...ck.org>, Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...nel.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        <linux-kernel@...r.kernel.org>, <kernel-team@...com>,
        Bharata B Rao <bharata@...ux.ibm.com>,
        Yafang Shao <laoar.shao@...il.com>
Subject: Re: [PATCH v2 12/28] mm: vmstat: use s32 for vm_node_stat_diff in
 struct per_cpu_nodestat

On Mon, Feb 03, 2020 at 05:39:54PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 02:28:53PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 03:34:50PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 10:25:06AM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 12:58:18PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:37AM -0800, Roman Gushchin wrote:
> > > > > > Currently s8 type is used for per-cpu caching of per-node statistics.
> > > > > > It works fine because the overfill threshold can't exceed 125.
> > > > > > 
> > > > > > But if some counters are in bytes (and the next commit in the series
> > > > > > will convert slab counters to bytes), it's not gonna work:
> > > > > > value in bytes can easily exceed s8 without exceeding the threshold
> > > > > > converted to bytes. So to avoid overfilling per-cpu caches and breaking
> > > > > > vmstats correctness, let's use s32 instead.
> > > > > > 
> > > > > > This doesn't affect per-zone statistics. There are no plans to use
> > > > > > zone-level byte-sized counters, so no reasons to change anything.
> > > > > 
> > > > > Wait, is this still necessary? AFAIU, the node counters will account
> > > > > full slab pages, including free space, and only the memcg counters
> > > > > that track actual objects will be in bytes.
> > > > > 
> > > > > Can you please elaborate?
> > > > 
> > > > It's weird to have a counter with the same name (e.g. NR_SLAB_RECLAIMABLE_B)
> > > > being in different units depending on the accounting scope.
> > > > So I do convert all slab counters: global, per-lruvec,
> > > > and per-memcg to bytes.
> > > 
> > > Since the node counters tracks allocated slab pages and the memcg
> > > counter tracks allocated objects, arguably they shouldn't use the same
> > > name anyway.
> > > 
> > > > Alternatively I can fork them, e.g. introduce per-memcg or per-lruvec
> > > > NR_SLAB_RECLAIMABLE_OBJ
> > > > NR_SLAB_UNRECLAIMABLE_OBJ
> > > 
> > > Can we alias them and reuse their slots?
> > > 
> > > 	/* Reuse the node slab page counters item for charged objects */
> > > 	MEMCG_SLAB_RECLAIMABLE = NR_SLAB_RECLAIMABLE,
> > > 	MEMCG_SLAB_UNRECLAIMABLE = NR_SLAB_UNRECLAIMABLE,
> > 
> > Yeah, lgtm.
> > 
> > Isn't MEMCG_ prefix bad because it makes everybody think that it belongs to
> > the enum memcg_stat_item?
> 
> Maybe, not sure that's a problem. #define CG_SLAB_RECLAIMABLE perhaps?

Maybe not. I'll probably go with 
    MEMCG_SLAB_RECLAIMABLE_B = NR_SLAB_RECLAIMABLE,
    MEMCG_SLAB_UNRECLAIMABLE_B = NR_SLAB_UNRECLAIMABLE,

Please, let me know if you're not ok with it.

> 
> > > > and keep global counters untouched. If going this way, I'd prefer to make
> > > > them per-memcg, because it will simplify things on charging paths:
> > > > now we do get task->mem_cgroup->obj_cgroup in the pre_alloc_hook(),
> > > > and then obj_cgroup->mem_cgroup in the post_alloc_hook() just to
> > > > bump per-lruvec counters.
> > > 
> > > I don't quite follow. Don't you still have to update the global
> > > counters?
> > 
> > Global counters are updated only if an allocation requires a new slab
> > page, which isn't the most common path.
> 
> Right.
> 
> > In generic case post_hook is required because it's the only place where
> > we have both page (to get the node) and memcg pointer.
> > 
> > If NR_SLAB_RECLAIMABLE is tracked only per-memcg (as MEMCG_SOCK),
> > then post_hook can handle only the rare "allocation failed" case.
> > 
> > I'm not sure here what's better.
> 
> If it's tracked only per-memcg, you still have to account it every
> time you charge an object to a memcg, no? How is it less frequent than
> acconting at the lruvec level?

It's not less frequent, it just can be done in the pre-alloc hook
when there is a memcg pointer available.

The problem with the obj_cgroup thing is that we get it indirectly
from current memcg in the pre_alloc_hook, then pass it to obj_cgroup API,
internally we might need to get the memcg from it to charge a page,
and then again in the post_hook we need to get memcg to bump
per-lruvec stats. In other words we make several memcg <-> objcg
conversions, which isn't very nice on the hot path.

I see that in the future we might optimize the initial lookup
of objcg, but getting memcg just to bump vmstats looks unnecessarily expensive.
One option I think about is to handle byte-sized stats on obj_cgroup
level and flush whole pages to memcg level.

> 
> > > > Btw, I wonder if we really need per-lruvec counters at all (at least
> > > > being enabled by default). For the significant amount of users who
> > > > have a single-node machine it doesn't bring anything except performance
> > > > overhead.
> > > 
> > > Yeah, for single-node systems we should be able to redirect everything
> > > to the memcg counters, without allocating and tracking lruvec copies.
> > 
> > Sounds good. It can lead to significant savings on single-node machines.
> > 
> > > 
> > > > For those who have multiple nodes (and most likely many many
> > > > memory cgroups) it provides way too many data except for debugging
> > > > some weird mm issues.
> > > > I guess in the absolute majority of cases having global per-node + per-memcg
> > > > counters will be enough.
> > > 
> > > Hm? Reclaim uses the lruvec counters.
> > 
> > Can you, please, provide some examples? It looks like it's mostly based
> > on per-zone lruvec size counters.
> 
> It uses the recursive lruvec state to decide inactive_is_low(),
> whether refaults are occuring, whether to trim cache only or go for
> anon etc. We use it to determine refault distances and how many shadow
> nodes to shrink.
> 
> Grep for lruvec_page_state().

I see... Thanks!

> 
> > Anyway, it seems to be a little bit off from this patchset, so let's
> > discuss it separately.
> 
> True