Message-ID: <CAJD7tkaBfWWS32VYAwkgyfzkD_WbUUbx+rrK-Cc6OT7UN27DYA@mail.gmail.com>
Date: Wed, 14 Aug 2024 16:48:42 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Nhat Pham <nphamcs@...il.com>, Jesper Dangaard Brouer <hawk@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>, Yu Zhao <yuzhao@...gle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Meta kernel team <kernel-team@...a.com>,
cgroups@...r.kernel.org
Subject: Re: [PATCH v2] memcg: use ratelimited stats flush in the reclaim
On Wed, Aug 14, 2024 at 4:42 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> On Wed, Aug 14, 2024 at 04:03:13PM GMT, Nhat Pham wrote:
> > On Wed, Aug 14, 2024 at 9:32 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > >
> > >
> > > Ccing Nhat
> > >
> > > On Wed, Aug 14, 2024 at 02:57:38PM GMT, Jesper Dangaard Brouer wrote:
> > > > I suspect the next whack-a-mole will be the rstat flush for the slab
> > > > code that kswapd also activates via shrink_slab, which, via
> > > > shrinker->count_objects(), invokes count_shadow_nodes().
> > > >
> > >
> > > Actually, count_shadow_nodes() is already using the ratelimited
> > > version. However, zswap_shrinker_count() is still using the sync
> > > version. Nhat is modifying this code at the moment, and we can ask
> > > whether we really need the most accurate values of MEMCG_ZSWAP_B and
> > > MEMCG_ZSWAPPED for the zswap writeback heuristic.
> >
> > You are referring to this, correct:
> >
> > mem_cgroup_flush_stats(memcg);
> > nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> >
> > It's already a bit less than accurate - as you pointed out in another
> > discussion, it takes into account the objects and sizes of the entire
> > subtree, rather than just the ones charged to the current (memcg,
> > node) combo. Feel free to optimize this away!
> >
> > In fact, I should probably replace this with another (atomic?) counter
> > in the zswap_lruvec_state struct, which tracks the post-compression
> > size. That way, we'll have a better estimate of the compression
> > factor - total post-compression size / (length of LRU * page size) -
> > and perhaps avoid the whole stat flushing path altogether...
> >
>
> That sounds like a much better solution than relying on rstat for
> accurate stats.

We can also use such atomic counters in obj_cgroup_may_zswap() and
eliminate the rstat flush there as well. Same for zswap_current_read()
probably.
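
Roughly, the counter could look something like this (a sketch only;
the field and helper names below are invented for illustration and the
existing fields of struct zswap_lruvec_state are omitted):

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical: track post-compression bytes per (memcg, node) LRU. */
struct zswap_lruvec_state {
	atomic64_t nr_compressed_bytes;
};

/* Called after a successful compression + store: */
static inline void zswap_account_store(struct zswap_lruvec_state *state,
				       size_t compressed_len)
{
	atomic64_add(compressed_len, &state->nr_compressed_bytes);
}

/* Called on invalidation or writeback: */
static inline void zswap_account_free(struct zswap_lruvec_state *state,
				      size_t compressed_len)
{
	atomic64_sub(compressed_len, &state->nr_compressed_bytes);
}

zswap_shrinker_count() could then compute nr_backing as
atomic64_read(&state->nr_compressed_bytes) >> PAGE_SHIFT, with no rstat
flush on that path at all.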

Most in-kernel flushers really only need a few stats, so I am
wondering if it's better to incrementally move those stats outside of
the rstat framework and completely eliminate in-kernel flushers. For
instance, MGLRU does not require the flush that reclaim does, as
Shakeel pointed out.

This would solve many of the scalability problems that all of us have
observed at one point or another and tried to optimize. I believe
using rstat for userspace reads was the original intention anyway.
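
Going back to obj_cgroup_may_zswap(): since the zswap limit is
enforced hierarchically, such a counter would have to be propagated up
the tree on charge/uncharge. A rough sketch, again with an invented
'zswap_bytes' field that does not exist in struct mem_cgroup today
(only zswap_max does):

/* Hypothetical: charge compressed bytes up the memcg hierarchy. */
static void zswap_charge_bytes(struct mem_cgroup *memcg, long nr_bytes)
{
	for (; memcg; memcg = parent_mem_cgroup(memcg))
		atomic64_add(nr_bytes, &memcg->zswap_bytes);
}

/* The limit check then becomes a plain read, no flush needed: */
static bool zswap_below_limit(struct mem_cgroup *memcg)
{
	unsigned long pages;

	pages = atomic64_read(&memcg->zswap_bytes) >> PAGE_SHIFT;
	return pages < READ_ONCE(memcg->zswap_max);
}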