Date: Mon, 24 Jun 2024 13:01:39 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, 
	Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...e.com>, 
	Roman Gushchin <roman.gushchin@...ux.dev>, Jesper Dangaard Brouer <hawk@...nel.org>, 
	Yu Zhao <yuzhao@...gle.com>, Muchun Song <songmuchun@...edance.com>, 
	Facebook Kernel Team <kernel-team@...a.com>, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim

On Mon, Jun 24, 2024 at 12:06:28PM GMT, Yosry Ahmed wrote:
> On Mon, Jun 24, 2024 at 11:59 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> >
> > On Mon, Jun 24, 2024 at 10:15:38AM GMT, Yosry Ahmed wrote:
> > > On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > > > > and I will explain why below. I know it may be a necessary
> > > > > > > evil, but I would like us to make sure there is no other option before
> > > > > > > going forward with this.
> > > > > >
> > > > > > Instead of a necessary evil, I would call it a pragmatic approach, i.e.
> > > > > > resolve the ongoing pain with a good enough solution and work on a long
> > > > > > term solution later.
> > > > >
> > > > > It seems like there are a few ideas for solutions that may address
> > > > > longer-term concerns; let's make sure we try those out first before
> > > > > we fall back to the short-term mitigation.
> > > > >
> > > >
> > > > Why? More specifically, why try out other things before this patch? Both
> > > > can be done in parallel. This patch has been running in production at
> > > > Meta for several weeks without issues. Also, I don't see how merging this
> > > > would impact our work on long term solutions.
> > >
> > > The problem is that once this is merged, it will be difficult to
> > > change this back to a normal flush once other improvements land. We
> > > don't have a test that reproduces the problem which we could use to
> > > make sure it's safe to revert this change later; we are only relying
> > > on data from prod.
> > >
> >
> > I am pretty sure the work on the long term solution will be iterative,
> > involving many reverts and redoing things differently. So, I think it
> > is understandable that we may need to revert, or revert the reverts.
> >
> > > Once this mitigation goes in, I think everyone will be less motivated
> > > to get more data from prod about whether it's safe to revert the
> > > ratelimiting later :)
> >
> > As I said, I don't expect "safe in prod" to be a strict requirement for
> > a change.
> 
> If everyone agrees that we can experiment with reverting this change
> later without having to prove that it is safe, then I think it's fine.
> Let's document this in the commit log though, so that whoever tries to
> revert this in the future (if any) does not have to re-explain all of
> this :)

Sure.

> 
> [..]
> > > > > >
> > > > > > For the cache trim mode, the inactive file LRU size is read and the
> > > > > > kernel scales it down based on the reclaim iteration (file >>
> > > > > > sc->priority), only checking whether it is zero or not. Again,
> > > > > > precise information is not needed.
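> > > > > >
> > > > > > The check is roughly the following (a simplified sketch of the
> > > > > > cache trim mode test in mm/vmscan.c, trimmed for illustration):
> > > > > >
> > > > > > 	/*
> > > > > > 	 * If we have plenty of inactive file pages that are not
> > > > > > 	 * thrashing, try to reclaim those first before touching
> > > > > > 	 * anonymous pages.
> > > > > > 	 */
> > > > > > 	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
> > > > > > 	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
> > > > > > 		sc->cache_trim_mode = 1;
> > > > > > 	else
> > > > > > 		sc->cache_trim_mode = 0;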
> > > > >
> > > > > It sounds like it is possible that we enter the cache trim mode when
> > > > > we shouldn't if the stats are stale. Couldn't this lead to
> > > > > over-reclaiming file memory?
> > > > >
> > > >
> > > > Can you explain how this over-reclaiming of file memory will happen?
> > >
> > > In one reclaim iteration, we could flush the stats, read the inactive
> > > file LRU size, confirm that (file >> sc->priority) > 0 and enter the
> > > cache trim mode, reclaiming file memory only. Let's assume that we
> > > reclaimed enough file memory such that the condition (file >>
> > > sc->priority) > 0 does not hold anymore.
> > >
> > > In a subsequent reclaim iteration, the flush could be skipped due to
> > > ratelimiting. Now we will enter the cache trim mode again and reclaim
> > > file memory only, even though the actual amount of file memory is low.
> > > This will cause over-reclaiming from file memory and skipping anon
> > > memory that we should have reclaimed, which means that we will need
> > > additional reclaim iterations to actually free memory.
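> > >
> > > Concretely (hypothetical numbers): suppose inactive file is 8192
> > > pages and sc->priority is 12. The first iteration flushes, reads
> > > 8192, sees 8192 >> 12 = 2 > 0, enters cache trim mode, and reclaims
> > > file memory down to 2048 pages. A fresh flush would read 2048 >> 12
> > > = 0 and skip cache trim mode, but a ratelimited flush may still
> > > report the stale 8192, so we trim the cache again.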
> > >
> > > I believe this scenario would be possible with ratelimiting, right?
> > >
> >
> > So, (old_file >> sc->priority) > 0 is true but (new_file >>
> > sc->priority) > 0 is false. In the next iteration, (old_file >>
> > (sc->priority-1)) > 0 will still be true but somehow (new_file >>
> > (sc->priority-1)) > 0 is false. This can happen if, in the previous
> > iteration, the kernel somehow reclaimed more than double what it was
> > supposed to reclaim, or there are concurrent reclaimers. In addition,
> > nr_reclaimed must still be less than nr_to_reclaim and there must be
> > no file deactivation request.
> >
> > Yeah, it can happen, but a lot of weird conditions need to happen
> > concurrently for this to occur.
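> >
> > To put rough numbers on it (hypothetical): with old_file = 8 pages
> > and sc->priority = 3, the stale check 8 >> 3 = 1 > 0 passes, and it
> > keeps passing at priority 2 (8 >> 2 = 2). For the fresh check to fail
> > at priority 2, new_file >> 2 must be 0, i.e. new_file <= 3: more than
> > half of the LRU gone, far beyond the priority-scaled scan target.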
> 
> Not necessarily sc->priority-1. Consider two separate sequential
> reclaim attempts. At the same priority, the first reclaim attempt
> could rightfully enter cache trim mode, while the second one
> wrongfully enters cache trim mode due to stale stats, over-reclaims
> file memory, and stalls longer to actually reclaim the anon memory.
> 

For two different reclaim attempts, even more things need to go wrong.
Anyway, we are talking too much in the abstract here and focusing on
corner cases, which almost all heuristics have. Unless there is a clear
explanation of why the probability of these corner cases will increase,
I don't think spending time discussing them is useful.

> I am sure such a scenario is not going to be common, but I am also
> sure that if it happens it will be a huge pain to debug.
> 
> If others agree that this is fine, let's document this with a comment
> and in the commit log. I am not sure how common the cache trim mode is
> in practice to understand the potential severity of such problems.
> There may also be other consequences that I am not aware of.

What is your definition of "others" though?
