Message-ID: <CACePvbWpbGa9w3MNsATYHMcTSkzOu6OWw6tdiGS_=PdXYXzH1w@mail.gmail.com>
Date: Tue, 26 Nov 2024 16:17:03 -0800
From: Chris Li <chrisl@...nel.org>
To: Barry Song <21cnbao@...il.com>
Cc: chenridong <chenridong@...wei.com>, Matthew Wilcox <willy@...radead.org>,
Chen Ridong <chenridong@...weicloud.com>, akpm@...ux-foundation.org, mhocko@...e.com,
hannes@...xchg.org, yosryahmed@...gle.com, yuzhao@...gle.com,
david@...hat.com, ryan.roberts@....com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, wangweiyang2@...wei.com, xieym_ict@...mail.com,
Kairui Song <ryncsn@...il.com>
Subject: Re: [RFC PATCH v2 1/1] mm/vmscan: move the written-back folios to the
tail of LRU after shrinking
On Mon, Nov 18, 2024 at 1:56 AM Barry Song <21cnbao@...il.com> wrote:
>
> On Mon, Nov 18, 2024 at 10:41 PM chenridong <chenridong@...wei.com> wrote:
> >
> >
> >
> > On 2024/11/18 12:14, Barry Song wrote:
> > > On Mon, Nov 18, 2024 at 5:03 PM Matthew Wilcox <willy@...radead.org> wrote:
> > >>
> > >> On Sat, Nov 16, 2024 at 09:16:58AM +0000, Chen Ridong wrote:
> > >>> 2. In the shrink_page_list function, if folioN is a THP (2M), it may be
> > >>> split and added to the swap cache folio by folio. After the folios are
> > >>> added to the swap cache, IO is submitted to write them back to swap,
> > >>> which is asynchronous. When shrink_page_list is finished, the isolated
> > >>> folio list is moved back to the head of the inactive LRU. The inactive
> > >>> LRU may then look like this, with 512 folios having been moved to its head.
> > >>
> > >> I was hoping that we'd be able to stop splitting the folio when adding
> > >> to the swap cache. Ideally, we'd add the whole 2MB and write it back
> > >> as a single unit.
> > >
> > > This is already the case: adding to the swapcache doesn’t require splitting
> > > THPs, but failing to allocate 2MB of contiguous swap slots will.
> > >
> > >>
> > >> This is going to become much more important with memdescs. We'd have to
> > >> allocate 512 struct folios to do this, which would be about 10 4kB pages,
> > >> and if we're trying to swap out memory, we're probably low on memory.
> > >>
> > >> So I don't like this solution you have at all because it doesn't help us
> > >> get to the solution we're going to need in about a year's time.
> > >>
> > >
> > > Ridong might need to clarify why this splitting is occurring. If it’s due to the
> > > failure to allocate swap slots, we still need a solution to address it.
> > >
> > > Thanks
> > > Barry
> >
> > shrink_folio_list
> >   add_to_swap
> >     folio_alloc_swap
> >       get_swap_pages
> >         scan_swap_map_slots
> >           /*
> >            * Swapfile is not block device or not using clusters so unable
> >            * to allocate large entries.
> >            */
> >           if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
> >                   return 0;
> >
> > In my test, I use a file as swap, which does not have 'SWP_BLKDEV' set,
> > so get_swap_pages failed.
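(For context: a trimmed sketch of the fallback path in shrink_folio_list()
that turns the failed large allocation above into a THP split; the exact
code varies by kernel version, and the THP_SWPOUT_FALLBACK accounting is
elided.)

        if (!add_to_swap(folio)) {
                if (!folio_test_large(folio))
                        goto activate_locked_split;
                /* Fallback: split the THP and swap out its base pages */
                if (split_folio_to_list(folio, folio_list))
                        goto activate_locked;
                /* Retry with the now-order-0 folio; the rest of the
                 * split folios were queued back on folio_list. */
                if (!add_to_swap(folio))
                        goto activate_locked_split;
        }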
>
> Alright, a proper non-rotating swap block device would be much
> better. In your case, though, cluster allocation isn’t supported.
Ah yes. The later part of the swap allocation series removes the
non-cluster allocation code path.
It is not merged into mm-unstable yet. Once it lands, even a swapfile
that is not a block device will get the cluster allocator.
>
> >
> > I think this is a race between 'shrink_folio_list' executing and the
> > asynchronous writeback. In my test, 512 folios (from a THP split) were
> > added to swap, and only about 60 folios had not yet been written back by
> > the time 'move_folios_to_lru' was invoked after 'shrink_folio_list'.
> > What if writeback were faster? This could happen even with only 32
> > folios (without THP) in the 'folio_list' passed to shrink_folio_list.
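(For context: whether a folio returns to the head or the tail of the LRU
comes down to which mm_inline.h helper is used; a trimmed sketch follows,
with the real helpers also handling MGLRU and LRU size accounting. Reclaim
isolates folios from the tail, which is why tail placement makes a
written-back folio the next reclaim candidate.)

        static __always_inline
        void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
        {
                enum lru_list lru = folio_lru_list(folio);

                /* Head placement: the folio becomes the last folio
                 * reclaim will reach on this list. */
                list_add(&folio->lru, &lruvec->lists[lru]);
        }

        static __always_inline
        void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
        {
                enum lru_list lru = folio_lru_list(folio);

                /* Tail placement: the folio is the next one reclaim
                 * will isolate and try to free. */
                list_add_tail(&folio->lru, &lruvec->lists[lru]);
        }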
>
> On a real non-rotating swap device, the race condition would occur only
> when contiguous 2MB swap slots are unavailable.
>
> Hi Chris,
> I recall you mentioned unifying the code for swap devices and swap files, or
> for non-rotating and rotating devices. I assume a swap file (not a block device)
> would also be a practical use case?
I assume you mean non-SSD vs SSD devices. In this follow-up series to
the swap allocator from Kairui, the old non-cluster allocator gets
removed and the cluster allocator is used all the time.
https://lore.kernel.org/linux-mm/20241022192451.38138-4-ryncsn@gmail.com/
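(Illustratively, that series removes the very check quoted earlier, so
large-order allocation no longer bails out for a swapfile. This is a
sketch of the effect, not the literal diff; see the link for the actual
patches.)

        -       /*
        -        * Swapfile is not block device or not using clusters so unable
        -        * to allocate large entries.
        -        */
        -       if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
        -               return 0;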
Chris