Message-ID: <vyi7d5fw4d3h5osolpu4reyhcqylgnfi6uz32z67dpektbc2dz@jpu4ob34a2ug>
Date: Wed, 14 Aug 2024 09:32:36 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Yosry Ahmed <yosryahmed@...gle.com>, Nhat Pham <nphamcs@...il.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Muchun Song <muchun.song@...ux.dev>, Yu Zhao <yuzhao@...gle.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Meta kernel team <kernel-team@...a.com>, cgroups@...r.kernel.org
Subject: Re: [PATCH v2] memcg: use ratelimited stats flush in the reclaim


Cc'ing Nhat

On Wed, Aug 14, 2024 at 02:57:38PM GMT, Jesper Dangaard Brouer wrote:
> 
> 
> On 14/08/2024 00.30, Shakeel Butt wrote:
> > On Tue, Aug 13, 2024 at 02:58:51PM GMT, Yosry Ahmed wrote:
> > > On Tue, Aug 13, 2024 at 2:54 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > > > 
> > > > Meta production is seeing a large number of stalls in memcg stats
> > > > flush from the memcg reclaim code path. At the moment, this specific
> > > > callsite is doing a synchronous memcg stats flush. The rstat flush is
> > > > an expensive and time consuming operation, so concurrent reclaimers
> > > > will busywait on the lock, potentially for a long time. This issue is
> > > > not unique to Meta and has been observed by Cloudflare [1] as well.
> > > > In the Cloudflare case, the stalls were due to contention between
> > > > kswapd threads running on their 8-NUMA-node machines, which does not
> > > > make sense, as the rstat flush is global and a flush from one kswapd
> > > > thread should be sufficient for all of them. Simply replace the
> > > > synchronous flush with the ratelimited one.
> > > > 
> > > > One may raise a concern about using stats that are (at worst) 2
> > > > seconds stale for heuristics like the desirable inactive:active
> > > > ratio and preferring inactive file pages over anon pages, but these
> > > > specific heuristics do not require very precise stats and are also
> > > > ignored under severe memory pressure.
> > > > 
> > > > More specifically for this code path, the stats are needed for two
> > > > specific heuristics:
> > > > 
> > > > 1. Deactivate LRUs
> > > > 2. Cache trim mode
> > > > 
> > > > The deactivate LRUs heuristic is to maintain a desirable
> > > > inactive:active ratio of the LRUs. The specific stats needed are
> > > > WORKINGSET_ACTIVATE* and the hierarchical LRU size. WORKINGSET_ACTIVATE*
> > > > is needed to check if there has been a refault since the last
> > > > snapshot, and the LRU sizes are needed for the desirable ratio
> > > > between the inactive and active LRUs. See the table below on how
> > > > the desirable ratio is calculated.
> > > > 
> > > > /* total     target    max
> > > >   * memory    ratio     inactive
> > > >   * -------------------------------------
> > > >   *   10MB       1         5MB
> > > >   *  100MB       1        50MB
> > > >   *    1GB       3       250MB
> > > >   *   10GB      10       0.9GB
> > > >   *  100GB      31         3GB
> > > >   *    1TB     101        10GB
> > > >   *   10TB     320        32GB
> > > >   */
> > > > 
> > > > The desirable ratio only changes at the boundaries of 1 GiB, 10 GiB,
> > > > 100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate
> > > > LRU size information to calculate this ratio. In addition, if
> > > > deactivation is skipped for some LRU, the kernel will force
> > > > deactivation under severe memory pressure.
> > > > 
> > > > For the cache trim mode, the inactive file LRU size is read and the
> > > > kernel scales it down based on the reclaim iteration
> > > > (file >> sc->priority) and only checks whether it is zero or not.
> > > > Again, precise information is not needed.
> > > > 
> > > > This patch has been running on Meta fleet for several months and we have
> > > > not observed any issues. Please note that MGLRU is not impacted by this
> > > > issue at all as it avoids rstat flushing completely.
> > > > 
> > > > Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
> > > > Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
> > > 
> > > Just curious, does Jesper's patch help with this problem?
> > 
> > If you are asking whether I have tested Jesper's patch in Meta's
> > production, then no, I have not. I also have not looked at the latest
> > from Jesper, as I was stuck on some other issues.
> > 
> 
> I see this patch as a whack-a-mole approach.  But it should be applied
> as a stopgap, because my patches are still not ready to be merged.
> 
> My patch is more generic, but *only* solves the rstat lock contention
> part of the issue.  The remaining issue is that rstat is flushed too
> often, which I address in my other patch [2], "cgroup/rstat: introduce
> ratelimited rstat flushing".  In [2], I explicitly excluded memcg, as
> Shakeel's patch demonstrates that memcg already has a ratelimit API of
> its own.
> 
>  [2] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
> 
> I suspect the next whack-a-mole will be the rstat flush for the slab
> code that kswapd also activates via shrink_slab(), which invokes
> count_shadow_nodes() via shrinker->count_objects().
>

Actually, count_shadow_nodes() is already using the ratelimited version.
However, zswap_shrinker_count() is still using the sync version. Nhat is
modifying this code at the moment, and we can ask whether we really need
the most accurate values of MEMCG_ZSWAP_B and MEMCG_ZSWAPPED for the
zswap writeback heuristic.

> --Jesper
