linux-kernel - Re: [PATCH] memcg: use ratelimited stats flush in the reclaim

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <rhvafiag6fjkj66ohex3eamoqpsw62bxmwbvd7shsa72rqcile@fvo4nsggjpwg>
Date: Mon, 17 Jun 2024 11:01:36 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: Yosry Ahmed <yosryahmed@...gle.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Hocko <mhocko@...e.com>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Yu Zhao <yuzhao@...gle.com>, Muchun Song <songmuchun@...edance.com>, 
	Facebook Kernel Team <kernel-team@...a.com>, linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	kernel-team <kernel-team@...udflare.com>
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim

On Mon, Jun 17, 2024 at 05:31:21PM GMT, Jesper Dangaard Brouer wrote:
> 
> 
> On 16/06/2024 02.28, Yosry Ahmed wrote:
> > On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > > 
> > > The Meta prod is seeing large amount of stalls in memcg stats flush
> > > from the memcg reclaim code path. At the moment, this specific callsite
> > > is doing a synchronous memcg stats flush. The rstat flush is an
> > > expensive and time consuming operation, so concurrent relaimers will
> > > busywait on the lock potentially for a long time. Actually this issue is
> > > not unique to Meta and has been observed by Cloudflare [1] as well. For
> > > the Cloudflare case, the stalls were due to contention between kswapd
> > > threads running on their 8 numa node machines which does not make sense
> > > as rstat flush is global and flush from one kswapd thread should be
> > > sufficient for all. Simply replace the synchronous flush with the
> > > ratelimited one.
> > > 
> 
> Like Yosry, I don't agree that simply using ratelimited flush here is
> the right solution, at-least other options need to be investigated first.

I added more detail in my reply to Yosry on why using ratelimited flush
for this specific case is fine.

[...]
> > 
> > I think you already know my opinion about this one :) I don't like it
> > at all, and I will explain why below. I know it may be a necessary
> > evil, but I would like us to make sure there is no other option before
> > going forward with this.
> > 
> I'm signing up to solving this somehow, as this is a real prod issue.
> 
> An easy way to solve the kswapd issue, would be to reintroduce
> "stats_flush_ongoing" concept, that was reverted in 7d7ef0a4686a ("mm:
> memcg: restore subtree stats flushing") (Author: Yosry Ahmed), and
> introduced in 3cd9992b9302 ("memcg: replace stats_flush_lock with an
> atomic") (Author: Yosry Ahmed).
> 

The skipping flush for "stats_flush_ongoing" was there from the start.

> The concept is: If there is an ongoing rstat flush, this time limited to
> the root cgroup, then don't perform the flush.  We can only do this for
> the root cgroup tree, as flushing can be done for subtrees, but kswapd
> is always for root tree, so it is good enough to solve the kswapd
> thundering herd problem.  We might want to generalize this beyond memcg.
> 

No objection from me for this skipping root memcg flush idea.

> 
[...]
> 
> > - With the added thresholding code, a flush is only done if there is a
> > significant number of pending updates in the relevant subtree.
> > Choosing the ratelimited approach is intentionally ignoring a
> > significant change in stats (although arguably it could be irrelevant
> > stats).
> > 
> 
> My production observations are that the thresholding code isn't limiting
> the flushing in practice.
> 

Here we need more production data. I remember you mentioned MEMCG_KMEM
being used for most of the updates. Is it possible to get top 5 (or 10)
most updated stats for your production environment?

> 
> > - Reclaim code is an iterative process, so not updating the stats on
> > every retry is very counterintuitive. We are retrying reclaim using
> > the same stats and heuristics used by a previous iteration,
> > essentially dismissing the effects of those previous iterations.
> > 
> > - Indeterministic behavior like this one is very difficult to debug if
> > it causes problems. The missing updates in the last 2s (or whatever
> > period) could be of any magnitude. We may be ignoring GBs of
> > free/allocated memory. What's worse is, if it causes any problems,
> > tracing it back to this flush will be extremely difficult.
> > 
> 
> The 2 sec seems like a long period for me.
> 
> > What can we do?
> > 
> > - Try to make more fundamental improvements to the flushing code (for
> > memcgs or cgroups in general). The per-memcg flushing thresholding is
> > an example of this. For example, if flushing is taking too long
> > because we are flushing all subsystems, it may make sense to have
> > separate rstat trees for separate subsystems.
> > 
> > One other thing we can try is add a mutex in the memcg flushing path.
> > I had initially had this in my subtree flushing series [1], but I
> > dropped it as we thought it's not very useful.
> 
> I'm running an experimental kernel with rstat lock converted to mutex on
> a number of production servers, and we have not observed any regressions.
> The kswapd thundering herd problem also happen on these machines, but as
> these are sleep-able background threads, it is fine to sleep on the mutex.
> 

Sorry but a global mutex which can be taken by userspace applications
and is needed by node controller (to read stats) is a no from me. On a
multi-tenant systems, global locks causing priority inversion is a real
issue.

> 
[...]
> 
> My pipe dream is that kernel can avoiding the cost of maintain the
> cgroup threshold stats for flushing, and instead rely on a dynamic time
> based threshold (in ms area) that have no fast-path overhead :-P
> 

Please do expand on what you mean by dynamic time based threshold.