Date: Mon, 24 Jun 2024 10:15:38 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Hocko <mhocko@...e.com>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, Yu Zhao <yuzhao@...gle.com>, 
	Muchun Song <songmuchun@...edance.com>, Facebook Kernel Team <kernel-team@...a.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim

On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > and I will explain why below. I know it may be a necessary
> > > > evil, but I would like us to make sure there is no other option before
> > > > going forward with this.
> > >
> > > Instead of a necessary evil, I would call it a pragmatic approach, i.e.
> > > resolve the ongoing pain with a good enough solution and work on a
> > > long-term solution later.
> >
> > It seems like there are a few ideas for solutions that may address
> > longer-term concerns, let's make sure we try those out first before we
> > fall back to the short-term mitigation.
> >
>
> Why? More specifically, why try out other things before this patch? Both
> can be done in parallel. This patch has been running in production at
> Meta for several weeks without issues. Also, I don't see how merging
> this would impact our work on long-term solutions.

The problem is that once this is merged, it will be difficult to
change this back to a normal flush once other improvements land. We
don't have a test that reproduces the problem, so there is no way to
verify that reverting this change later is safe; the only evidence we
have is data from prod.

Once this mitigation goes in, I think everyone will be less motivated
to get more data from prod about whether it's safe to revert the
ratelimiting later :)

>
> [...]
> >
> > Thanks for explaining this in such detail. It does make me feel
> > better, but keep in mind that the above heuristics may change in the
> > future and become more sensitive to stale stats, and very likely no
> > one will remember that we decided that stale stats are fine
> > previously.
> >
>
> When was the last time this heuristic changed? It was introduced in
> 2008 for anon pages and extended to file pages in 2016. In 2019 the
> ratio enforcement at the 'reclaim root' was introduced. I am pretty
> sure we will improve the whole rstat flushing thing within a year or
> so :P

Fair point, although I meant it's easy to miss that the flush is
ratelimited and the stats are potentially stale in general :)

>
> > >
> > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > only checks if it is zero or not. Again precise information is not
> > > needed.
> >
> > It sounds like it is possible that we enter the cache trim mode when
> > we shouldn't if the stats are stale. Couldn't this lead to
> > over-reclaiming file memory?
> >
>
> Can you explain how this over-reclaiming of file memory would happen?

In one reclaim iteration, we could flush the stats, read the inactive
file LRU size, confirm that (file >> sc->priority) > 0 and enter the
cache trim mode, reclaiming file memory only. Let's assume that we
reclaimed enough file memory such that the condition (file >>
sc->priority) > 0 does not hold anymore.

In a subsequent reclaim iteration, the flush could be skipped due to
ratelimiting. Now we will enter the cache trim mode again and reclaim
file memory only, even though the actual amount of file memory is low.
This will cause over-reclaiming of file memory and skipping anon
memory that we should have reclaimed, which means that we will need
additional reclaim iterations to actually free memory.

I believe this scenario would be possible with ratelimiting, right?

[..]
> > >
> > > Please note that this is not some user API which cannot be changed
> > > later. We can change and dissect it however we want. My only point is
> > > not to wait for the perfect solution but to have some intermediate,
> > > good enough solution.
> >
> > I agree that we shouldn't wait for a perfect solution, but it also
> > seems like there are a few easy-ish solutions that we can discover
> > first (Jesper's patch, investigating update paths, etc). If none of
> > those pan out, we can fall back to the ratelimited flush, ideally with
> > a plan on next steps for a longer-term solution.
>
> I think I already explained why there is no need to wait. One thing we
> should agree on is that this is a hard problem and will need multiple
> iterations to come up with a solution which is acceptable to most. Until
> then I don't see any reason to block mitigations that reduce the pain.

Agreed, but I expressed above why I think we should explore other
solutions first. Please correct me if I am wrong.
