Message-ID: <CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com>
Date: Fri, 29 Nov 2024 16:07:29 +1300
From: Barry Song <21cnbao@...il.com>
To: chenridong <chenridong@...wei.com>
Cc: Yu Zhao <yuzhao@...gle.com>, Matthew Wilcox <willy@...radead.org>, Chris Li <chrisl@...nel.org>,
Chen Ridong <chenridong@...weicloud.com>, akpm@...ux-foundation.org, mhocko@...e.com,
hannes@...xchg.org, yosryahmed@...gle.com, david@...hat.com,
ryan.roberts@....com, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
wangweiyang2@...wei.com, xieym_ict@...mail.com
Subject: Re: [RFC PATCH v2 1/1] mm/vmscan: move the written-back folios to the
tail of LRU after shrinking
On Fri, Nov 29, 2024 at 3:25 PM chenridong <chenridong@...wei.com> wrote:
>
>
>
> On 2024/11/29 7:08, Barry Song wrote:
> > On Mon, Nov 25, 2024 at 2:19 PM chenridong <chenridong@...wei.com> wrote:
> >>
> >>
> >>
> >> On 2024/11/18 12:21, Matthew Wilcox wrote:
> >>> On Mon, Nov 18, 2024 at 05:14:14PM +1300, Barry Song wrote:
> >>>> On Mon, Nov 18, 2024 at 5:03 PM Matthew Wilcox <willy@...radead.org> wrote:
> >>>>>
> >>>>> On Sat, Nov 16, 2024 at 09:16:58AM +0000, Chen Ridong wrote:
> >>>>>> 2. In the shrink_page_list function, if folioN is a THP (2M), it may be
> >>>>>> split and added to the swap cache folio by folio. After a folio is added
> >>>>>> to the swap cache, IO is submitted to write it back to swap, which is
> >>>>>> asynchronous. When shrink_page_list finishes, the isolated folio list is
> >>>>>> moved back to the head of the inactive LRU. The inactive LRU may then
> >>>>>> look like this, with 512 folios having been moved to its head.
> >>>>>
> >>>>> I was hoping that we'd be able to stop splitting the folio when adding
> >>>>> to the swap cache. Ideally, we'd add the whole 2MB and write it back
> >>>>> as a single unit.
> >>>>
> >>>> This is already the case: adding to the swapcache doesn’t require splitting
> >>>> THPs, but failing to allocate 2MB of contiguous swap slots will.
> >>>
> >>> Agreed we need to understand why this is happening. As I've said a few
> >>> times now, we need to stop requiring contiguity. Real filesystems don't
> >>> need the contiguity (they become less efficient, but they can scatter a
> >>> single 2MB folio to multiple places).
> >>>
> >>> Maybe Chris has a solution to this in the works?
> >>>
> >>
> >> Hi, Chris, do you have a better idea to solve this issue?
> >
> > Not Chris, but as I read the code again, we already have the code below in
> > evict_folios() to fix up the "missed folio_rotate_reclaimable()" issue:
> >
> >     /* retry folios that may have missed folio_rotate_reclaimable() */
> >     list_move(&folio->lru, &clean);
> >
> > It doesn't work for you?
> >
> > commit 359a5e1416caaf9ce28396a65ed3e386cc5de663
> > Author: Yu Zhao <yuzhao@...gle.com>
> > Date: Tue Nov 15 18:38:07 2022 -0700
> >
> >     mm: multi-gen LRU: retry folios written back while isolated
> >
> > The page reclaim isolates a batch of folios from the tail of one of the
> > LRU lists and works on those folios one by one. For a suitable
> > swap-backed folio, if the swap device is async, it queues that folio for
> > writeback. After the page reclaim finishes an entire batch, it puts back
> > the folios it queued for writeback to the head of the original LRU list.
> >
> > In the meantime, the page writeback flushes the queued folios also by
> > batches. Its batching logic is independent from that of the page reclaim.
> > For each of the folios it writes back, the page writeback calls
> > folio_rotate_reclaimable() which tries to rotate a folio to the tail.
> >
> >
> > folio_rotate_reclaimable() only works for a folio after the page reclaim
> > has put it back. If an async swap device is fast enough, the page
> > writeback can finish with that folio while the page reclaim is still
> > working on the rest of the batch containing it. In this case, that folio
> > will remain at the head and the page reclaim will not retry it before
> > reaching there.
> >
> > This patch adds a retry to evict_folios(). After evict_folios() has
> > finished an entire batch and before it puts back folios it cannot free
> > immediately, it retries those that may have missed the rotation.
> > Before this patch, ~60% of folios swapped to an Intel Optane missed
> > folio_rotate_reclaimable(). After this patch, ~99% of missed folios were
> > reclaimed upon retry.
> >
> > This problem affects relatively slow async swap devices like Samsung 980
> > Pro much less and does not affect sync swap devices like zram or zswap at
> > all.
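
For context on why the rotation is "missed": an isolated folio has PG_lru
cleared, so if its writeback completes while the reclaim batch still holds the
folio on its private list, the guard in folio_rotate_reclaimable() bails out
and the folio is never queued for the move to the tail. Roughly (paraphrasing
the check in mm/swap.c, not quoting it verbatim):

	/* folio_rotate_reclaimable(), called from folio_end_writeback() */
	if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
	    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
		/* queue the folio to be moved to the tail of its LRU */
		...
	}
	/* an isolated folio fails folio_test_lru(), so nothing happens */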
> >
> >>
> >> Best regards,
> >> Ridong
> >
> > Thanks
> > Barry
>
> Thank you for your reply, Barry.
> I found this issue on the 5.10 kernel. I reproduced it on the next tree,
> but with CONFIG_LRU_GEN_ENABLED disabled. When I tested again with
> CONFIG_LRU_GEN_ENABLED enabled, the issue was fixed.
>
> IIUC, commit 359a5e1416caaf9ce28396a65ed3e386cc5de663 only takes effect
> when CONFIG_LRU_GEN_ENABLED is enabled, but the issue also exists when
> CONFIG_LRU_GEN_ENABLED is disabled, and that case should be fixed as well.
>
> I read the code of commit 359a5e1416caaf9ce28396a65ed3e386cc5de663; it
> finds the folios that missed rotation in a more complicated way, but it
> makes what is being done much clearer. Should I implement the fix in Yu
> Zhao's way?
Yes, this is exactly the same issue. Since Yu only fixed it in MGLRU and you
are still using the active/inactive LRU, the same fix should be applied to
the active/inactive LRU; a rough sketch of what that could look like is below.
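
Something along these lines in shrink_inactive_list(), after
shrink_folio_list() returns and before move_folios_to_lru(), might be a
starting point. This is only an untested sketch mirroring the MGLRU retry
(names follow current mm/vmscan.c; accounting and flag handling would need
more care in a real patch):

	LIST_HEAD(clean);
	struct folio *folio, *next;
	bool skip_retry = false;

retry:
	nr_reclaimed += shrink_folio_list(&folio_list, pgdat, sc, &stat, false);

	list_for_each_entry_safe_reverse(folio, next, &folio_list, lru) {
		/*
		 * Writeback completed while the folio was still isolated, so
		 * folio_rotate_reclaimable() could not rotate it; give it one
		 * more pass instead of parking it at the head of the LRU.
		 */
		if (!skip_retry && !folio_test_active(folio) &&
		    !folio_test_referenced(folio) && !folio_mapped(folio) &&
		    !folio_test_locked(folio) && !folio_test_dirty(folio) &&
		    !folio_test_writeback(folio))
			list_move(&folio->lru, &clean);
	}

	if (!list_empty(&clean)) {
		/* run the missed folios through shrink_folio_list() once more */
		list_splice_init(&clean, &folio_list);
		skip_retry = true;
		goto retry;
	}

	spin_lock_irq(&lruvec->lru_lock);
	move_folios_to_lru(lruvec, &folio_list);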
>
> Best regards,
> Ridong
Thanks,
Barry