Date: Mon, 24 Jun 2024 12:06:28 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Hocko <mhocko@...e.com>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Jesper Dangaard Brouer <hawk@...nel.org>, Yu Zhao <yuzhao@...gle.com>, 
	Muchun Song <songmuchun@...edance.com>, Facebook Kernel Team <kernel-team@...a.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim

On Mon, Jun 24, 2024 at 11:59 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> On Mon, Jun 24, 2024 at 10:15:38AM GMT, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > > > and I will explain why below. I know it may be a necessary
> > > > > > evil, but I would like us to make sure there is no other option before
> > > > > > going forward with this.
> > > > >
> > > > > Instead of a necessary evil, I would call it a pragmatic approach, i.e.
> > > > > resolve the ongoing pain with a good enough solution and work on a
> > > > > long-term solution later.
> > > >
> > > > It seems like there are a few ideas for solutions that may address
> > > > longer-term concerns, let's make sure we try those out first before we
> > > > fall back to the short-term mitigation.
> > > >
> > >
> > > Why? More specifically why try out other things before this patch? Both
> > > can be done in parallel. This patch has been running in production at
> > > Meta for several weeks without issues. Also I don't see how merging this
> > > would impact us on working on long term solutions.
> >
> > The problem is that once this is merged, it will be difficult to
> > change this back to a normal flush once other improvements land. We
> > don't have a test that reproduces the problem that we can use to make
> > sure it's safe to revert this change later, it's only using data from
> > prod.
> >
>
> I am pretty sure the work on a long-term solution will be iterative and
> will involve many reverts and redoing things differently. So, I think it
> is understandable that we may need to revert or revert the reverts.
>
> > Once this mitigation goes in, I think everyone will be less motivated
> > to get more data from prod about whether it's safe to revert the
> > ratelimiting later :)
>
> As I said I don't expect "safe in prod" as a strict requirement for a
> change.

If everyone agrees that we can experiment with reverting this change
later without having to prove that it is safe, then I think it's fine.
Let's document this in the commit log though, so that whoever tries to
revert this in the future (if anyone does) does not have to re-explain
all of this :)

[..]
> > > > >
> > > > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > > > only checks if it is zero or not. Again precise information is not
> > > > > needed.
> > > >
> > > > It sounds like it is possible that we enter the cache trim mode when
> > > > we shouldn't if the stats are stale. Couldn't this lead to
> > > > over-reclaiming file memory?
> > > >
> > >
> > > Can you explain how this over-reclaiming file will happen?
> >
> > In one reclaim iteration, we could flush the stats, read the inactive
> > file LRU size, confirm that (file >> sc->priority) > 0 and enter the
> > cache trim mode, reclaiming file memory only. Let's assume that we
> > reclaimed enough file memory such that the condition (file >>
> > sc->priority) > 0 does not hold anymore.
> >
> > In a subsequent reclaim iteration, the flush could be skipped due to
> > ratelimiting. Now we will enter the cache trim mode again and reclaim
> > file memory only, even though the actual amount of file memory is low.
> > This will cause over-reclaiming from file memory and dismissing anon
> > memory that we should have reclaimed, which means that we will need
> > additional reclaim iterations to actually free memory.
> >
> > I believe this scenario would be possible with ratelimiting, right?
> >
>
> So, the (old_file >> sc->priority) > 0 is true but the (new_file >>
> sc->priority) > 0 is false. In the next iteration, (old_file >>
> (sc->priority-1)) > 0 will still be true but somehow (new_file >>
> (sc->priority-1)) > 0 is false. This can happen if, in the previous
> iteration, the kernel somehow reclaimed more than double what it was
> supposed to reclaim, or if there are concurrent reclaimers. In addition,
> nr_reclaim is still less than nr_to_reclaim and there is no file
> deactivation request.
>
> Yeah, it can happen, but a lot of weird conditions need to occur
> concurrently for this to happen.

Not necessarily sc->priority-1. Consider two separate sequential
reclaim attempts. At the same priority, the first reclaim attempt
could rightfully enter cache trim mode, while the second one
wrongfully enters cache trim mode due to stale stats, over-reclaims
file memory, and stalls longer before actually reclaiming the anon
memory.
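
To make the two-attempt scenario concrete, here is a minimal userspace
sketch. It is not kernel code: struct stats_model, flush_ratelimited(),
cache_trim_mode(), FLUSH_WINDOW, the tick values, and the page counts are
all hypothetical names and numbers chosen for illustration. It only assumes
what is described above, namely that the check is (file >> sc->priority) > 0
and that a ratelimited flush can be skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed ratelimit window in arbitrary ticks (not the kernel's value). */
#define FLUSH_WINDOW 2

/* Hypothetical model of one stat: the true value and the last flushed copy. */
struct stats_model {
	uint64_t actual_file;	/* real inactive file LRU size, in pages */
	uint64_t cached_file;	/* value reclaim reads after the last flush */
	uint64_t last_flush;	/* tick at which the last flush happened */
};

/* Refresh the cached value unless a flush happened within the window. */
static bool flush_ratelimited(struct stats_model *m, uint64_t now)
{
	if (now - m->last_flush < FLUSH_WINDOW)
		return false;	/* skipped: cached_file may be stale */
	m->cached_file = m->actual_file;
	m->last_flush = now;
	return true;
}

/* The check as described in this thread: (file >> priority) > 0. */
static bool cache_trim_mode(uint64_t file, int priority)
{
	assert(priority >= 0 && priority < 64);
	return (file >> priority) > 0;
}

/* Two sequential reclaim attempts at the same priority. */
static bool second_attempt_trims_wrongly(void)
{
	struct stats_model m = { .actual_file = 5000 };
	const int priority = 12;	/* 1 << 12 == 4096 */
	bool first, stale, fresh;

	/* Attempt 1 at tick 2: the flush runs, 5000 >> 12 == 1. */
	flush_ratelimited(&m, 2);
	first = cache_trim_mode(m.cached_file, priority);

	/* File-only reclaim leaves 3000 pages, 3000 >> 12 == 0. */
	m.actual_file -= 2000;

	/* Attempt 2 at tick 3: the flush is skipped, reclaim still sees 5000. */
	flush_ratelimited(&m, 3);
	stale = cache_trim_mode(m.cached_file, priority);
	fresh = cache_trim_mode(m.actual_file, priority);

	return first && stale && !fresh;
}
```

In this model the second attempt decides on the stale value (5000) and
enters cache trim mode, while the up-to-date value (3000) would not. How
often the real window and LRU sizes line up this way is exactly the open
question.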

I am sure such a scenario is not going to be common, but I am also
sure if it happens it will be a huge pain to debug.

If others agree that this is fine, let's document this with a comment
and in the commit log. I do not know how common the cache trim mode is
in practice, so I cannot judge the potential severity of such problems.
There may also be other consequences that I am not aware of.
