Message-ID: <ZsTOwBffg5xSCUbP@gmail.com>
Date: Tue, 20 Aug 2024 10:13:36 -0700
From: Breno Leitao <leitao@...ian.org>
To: Usama Arif <usamaarif642@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, yuzhao@...gle.com,
david@...hat.com, huangzhaoyang@...il.com, bharata@....com,
willy@...radead.org, vbabka@...e.cz, linux-kernel@...r.kernel.org,
kernel-team@...a.com, Johannes Weiner <hannes@...xchg.org>,
zhaoyang.huang@...soc.com, Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH RESEND] mm: drop lruvec->lru_lock if contended when skipping folio

On Tue, Aug 20, 2024 at 11:45:11AM -0400, Usama Arif wrote:
> So Johannes pointed out to me that this is not going to properly fix
> the problem of holding the lru_lock for a long time introduced in [1],
> for two reasons:
>
> - The task doing the lock break hoards folios on folios_skipped and
>   makes the lru shorter. I didn't see it in the use case I was trying,
>   but yielding the lock to the other task may be of little use, as it
>   will go through a much shorter lru list, or even an empty one, and
>   would OOM while the folio it is looking for sits on folios_skipped.
>   We would be substituting one OOM problem for another with this
>   patch.
> - Compaction code goes through pages by pfn rather than using the
>   list. As this patch does not clear the lru flag, compaction could
>   claim this folio.
>
> The patch in [1] is severely breaking production at Meta and it's not a
> proper fix to the problem that the commit was trying to solve. It
> results in holding the lru_lock for a very significant amount of time,
> stalling all other processes trying to claim memory, creating very
> high memory pressure for large periods of time and causing OOM.
>
> The way forward would be to revert it and try to come up with a
> longer-term solution to the problem the original commit tried to
> solve. If no one is opposed, I will wait a couple of days for
> comments and send a revert patch.

I agree with the concern, but for a different reason. Commit
5da226dbfce3 ("mm: skip CMA pages when they are not available") was
intended as an optimization, but it changed the behavior of
isolate_lru_folios() in a way that had significant, unintended
consequences.

One such consequence was a notable increase in lock contention, as
described in [1]. Addressing this lock contention with a quick fix
seems like a suboptimal solution for such a core part of the system.

Instead, a better approach would be to rethink the original
optimization: revisit the changes introduced by commit 5da226dbfce3 and
explore alternative optimization strategies that do not have such
far-reaching and difficult-to-diagnose effects.
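
To make the lock-hold concern concrete, here is a toy user-space model.
It is not the kernel code and not the patch under discussion. It
assumes, as a simplification, that folios skipped during isolation do
not count against the scan budget, so a list dominated by skippable
folios can be walked almost end to end in a single pass, which in the
kernel would happen with the lru_lock held. All names (toy_folio,
toy_result, toy_isolate, count_skipped) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a folio on an LRU list; not the kernel's struct folio. */
struct toy_folio {
	bool skippable;	/* e.g. a folio the current allocation cannot use */
};

struct toy_result {
	size_t isolated;	/* folios taken for reclaim */
	size_t skipped;		/* folios parked on a modelled folios_skipped list */
	size_t visited;		/* entries walked in one pass (lock held throughout) */
};

/*
 * Walk the array from its tail, as an LRU scan would, until the scan
 * budget nr_to_scan is used up or the list is exhausted.
 *
 * count_skipped == false models the assumed behavior: a skipped folio
 * does not consume budget, so the walk only ends once enough eligible
 * folios have been found or the list runs out.
 * count_skipped == true models a bounded walk for comparison.
 */
static struct toy_result toy_isolate(const struct toy_folio *lru, size_t n,
				     size_t nr_to_scan, bool count_skipped)
{
	struct toy_result r = { 0, 0, 0 };
	size_t scan = 0;
	size_t i = n;

	while (scan < nr_to_scan && i > 0) {
		const struct toy_folio *f = &lru[--i];

		r.visited++;
		if (f->skippable) {
			r.skipped++;
			if (count_skipped)
				scan++;
			continue;
		}
		scan++;
		r.isolated++;
	}
	return r;
}
```

With a 1000-entry list in which only the entry at the head is eligible
and nr_to_scan set to 32, the unbounded variant visits all 1000 entries
and parks 999 of them, while the bounded variant stops after 32. The
model says nothing about which behavior is right for reclaim; it only
shows why the length of the walk, and hence the lock hold time, depends
on how skipped folios are accounted.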
[1] Link: https://lore.kernel.org/all/ZrssOrcJIDy8hacI@gmail.com/