Message-ID: <CACePvbXjbtowr5wSKR_F_2ou6nVhxK3-+HvSs+P71PYOo0h3UA@mail.gmail.com>
Date: Wed, 3 Sep 2025 05:35:22 -0700
From: Chris Li <chrisl@...nel.org>
To: Barry Song <21cnbao@...il.com>
Cc: Kairui Song <kasong@...cent.com>, linux-mm@...ck.org,
Andrew Morton <akpm@...ux-foundation.org>, Matthew Wilcox <willy@...radead.org>,
Hugh Dickins <hughd@...gle.com>, Baoquan He <bhe@...hat.com>, Nhat Pham <nphamcs@...il.com>,
Kemeng Shi <shikemeng@...weicloud.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Ying Huang <ying.huang@...ux.alibaba.com>, Johannes Weiner <hannes@...xchg.org>,
David Hildenbrand <david@...hat.com>, Yosry Ahmed <yosryahmed@...gle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
On Tue, Sep 2, 2025 at 4:31 PM Barry Song <21cnbao@...il.com> wrote:
>
> On Wed, Sep 3, 2025 at 1:17 AM Chris Li <chrisl@...nel.org> wrote:
> >
> > On Tue, Sep 2, 2025 at 4:15 AM Barry Song <21cnbao@...il.com> wrote:
> > >
> > > On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@...il.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@...cent.com>
> > > >
> > > > Now the swap table is cluster based, which means a free cluster
> > > > can free its table, since no one should modify it.
> > > >
> > > > There could be speculative readers, like swap cache lookup;
> > > > protect against them by making the swap table RCU safe. All swap
> > > > tables should be filled with null entries before being freed, so
> > > > such readers will see either a NULL pointer or a null-filled
> > > > table being lazily freed.
> > > >
> > > > On allocation, allocate the table when a cluster is used by any order.
> > > >
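(Side note, for anyone following along: the speculative-reader pattern
described above is roughly the sketch below. The struct layout and the
helper name are made up for illustration, not the exact code in this
series.)

#include <linux/rcupdate.h>

/* Minimal stand-in for the real structure, just for this sketch. */
struct swap_cluster_info {
	unsigned long __rcu *table;	/* swap table, NULL while the cluster is free */
};

/*
 * Speculative lookup, e.g. from a swap cache lookup. The table is only
 * freed after being filled with null entries and after an RCU grace
 * period, so a racing reader sees NULL, a live entry, or a null entry
 * in a table that is being lazily freed -- never a dangling pointer.
 */
static unsigned long swap_table_get(struct swap_cluster_info *ci, unsigned int off)
{
	unsigned long *table;
	unsigned long entry = 0;	/* null swap table entry */

	rcu_read_lock();
	table = rcu_dereference(ci->table);
	if (table)
		entry = READ_ONCE(table[off]);
	rcu_read_unlock();

	return entry;
}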
> > >
> > > Might be a silly question.
> > >
> > > Just curious—what happens if the allocation fails? Does the swap-out
> > > operation also fail? We sometimes encounter strange issues when memory is
> > > very limited, especially if the reclamation path itself needs to allocate
> > > memory.
> > >
> > > Assume a case where we want to swap out a folio using clusterN. We then
> > > attempt to swap out the following folios with the same clusterN. But if
> > > the allocation of the swap_table keeps failing, what will happen?
> >
> > I think this is the same behavior as the XArray failing to allocate
> > a node when there is no memory. The swap allocator will fail to
> > isolate this cluster and gets a NULL ci pointer as the return value.
> > The swap allocator will then try the other cluster lists, e.g.
> > non_full, fragment, etc.
>
> What I’m actually concerned about is that we keep iterating on this
> cluster. If we try others, that sounds good.
No, the isolation of the current cluster removes the cluster from the
head of the list and eventually puts it back at the tail of the
appropriate list. It will not keep iterating on the same cluster.
Otherwise, trying to allocate a high-order swap entry would also loop
forever on the first cluster whenever it fails to allocate swap
entries.
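Something roughly like the sketch below (all names are made up just to
illustrate the idea, not the actual code in Kairui's series):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Minimal stand-ins for the real structures, just for this sketch. */
struct swap_cluster_info {
	struct list_head list;
};

struct swap_info_struct {
	spinlock_t lock;
};

/* Hypothetical helper: allocate this cluster's swap table, may fail. */
bool cluster_table_alloc(struct swap_cluster_info *ci);

static struct swap_cluster_info *
isolate_and_prepare(struct swap_info_struct *si, struct list_head *list)
{
	struct swap_cluster_info *ci;

	spin_lock(&si->lock);
	if (list_empty(list)) {
		spin_unlock(&si->lock);
		return NULL;
	}
	/* Isolation: take the cluster off the head of the list. */
	ci = list_first_entry(list, struct swap_cluster_info, list);
	list_del_init(&ci->list);
	spin_unlock(&si->lock);

	if (cluster_table_alloc(ci))
		return ci;	/* success, the caller owns this cluster */

	/*
	 * Allocation failed: the cluster goes back to the tail, so a retry
	 * does not spin on the same cluster. The caller moves on to the
	 * other lists (non_full, fragment, ...) and eventually returns
	 * -ENOMEM if everything fails.
	 */
	spin_lock(&si->lock);
	list_add_tail(&ci->list, list);
	spin_unlock(&si->lock);
	return NULL;
}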
>
> > If all of them fail, folio_alloc_swap() will return -ENOMEM, which
> > will propagate back to the swap-out attempt and then to the shrink
> > folio list, which will put this page back on the LRU.
> >
> > The shrink folio list either frees enough memory (the happy path) or
> > fails to free enough memory, in which case it causes an OOM kill.
> >
> > I believe previously the XArray would also return -ENOMEM when
> > inserting a pointer and unable to allocate a node to hold that
> > pointer. It has the same error propagation path. We did not change
> > that.
>
> Yes, I agree there was an -ENOMEM, but the difference is that we
> are allocating something much larger now :-)
Even that is not 100% true. The XArray uses a kmem_cache. Most of the
time a node is allocated from a page already cached by the kmem_cache,
without hitting the system page allocator. Only when the kmem_cache
runs out of the current cached page does it allocate from the system
via the page allocator, at least a page at a time.
So from the page allocator's point of view, the swap table allocation
is not bigger either.
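For concreteness, assuming a 64-bit kernel, 4K pages and 512 slots per
cluster (numbers picked for illustration, and the helper names are made
up): the per-cluster table is exactly one order-0 page, i.e. the same
unit a kmem_cache refill ends up requesting from the page allocator
anyway.

#include <linux/build_bug.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define SWAP_TABLE_ENTRIES	512	/* slots per cluster in this example */

/*
 * 512 entries * 8 bytes == 4096 bytes == one order-0 page, so the table
 * allocation below asks the page allocator for a single page -- no more
 * than the slab page kmem_cache grabs when it runs out of cached nodes.
 */
static unsigned long *swap_table_alloc(gfp_t gfp)
{
	BUILD_BUG_ON(SWAP_TABLE_ENTRIES * sizeof(unsigned long) != PAGE_SIZE);
	return (unsigned long *)get_zeroed_page(gfp);
}

static void swap_table_free(unsigned long *table)
{
	free_page((unsigned long)table);
}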
> One option is to organize every 4 or 8 swap slots into a group for
> allocating or freeing the swap table. This way, we avoid the worst
> case where a single unfreed slot consumes a whole swap table, and
> the allocation size also becomes smaller. However, it’s unclear
> whether the memory savings justify the added complexity and effort.
Keep in mind that the XArray has this fragmentation issue as well.
When a 64-pointer node is freed, it returns to the kmem_cache as a
free area of the cached page. Only when every object in that page is
free can the page return to the page allocator. The difference is
that unused area sitting in the swap table can be used immediately,
while an unused XArray node sits in the kmem_cache and needs an extra
kmem_cache_alloc() before it can be used in the XArray again.
There is also a subtle difference: all XArrays share the same
kmem_cache pool across all XArray users; there is no dedicated
kmem_cache pool for swap. Swap nodes might be mixed with other XArray
nodes, making it even harder to release the underlying page. The swap
table uses the page directly and does not have this issue. If you
have a swing of batch jobs causing a lot of swap, when the jobs are
done those swap entries will be freed and the swap table can return
those pages. But the XArray might not be able to release as many
pages because of the mixed usage; it depends on what other XArray
nodes were allocated during the swap usage.
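To make that last point concrete (again only a sketch, with made-up
names and a struct layout that is not the one in the series): freeing a
cluster's table can hand the whole page straight back to the page
allocator once the RCU grace period has passed, while a freed xa_node
only becomes a free object inside a shared slab page.

#include <linux/container_of.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

/* Minimal stand-in for the real structure, just for this sketch. */
struct swap_cluster_info {
	unsigned long __rcu *table;	/* page-sized array of entries */
	unsigned long *freeing;		/* old table waiting for the grace period */
	struct rcu_head rcu;
};

static void swap_table_free_rcu(struct rcu_head *rcu)
{
	struct swap_cluster_info *ci = container_of(rcu, struct swap_cluster_info, rcu);

	/*
	 * The table page returns to the page allocator here; nothing else
	 * shares it. A freed xa_node, by contrast, only becomes a free
	 * object in the shared kmem_cache, and its slab page can only be
	 * released once every other node in it (possibly belonging to an
	 * unrelated XArray user) is free as well.
	 */
	free_page((unsigned long)ci->freeing);
	ci->freeing = NULL;
}

/* Called when a cluster becomes free and all its entries are null. */
static void swap_cluster_free_table(struct swap_cluster_info *ci)
{
	ci->freeing = rcu_dereference_protected(ci->table, true);
	/* New readers see NULL from here on; readers still holding the old
	 * pointer only ever see null entries until the page is freed. */
	rcu_assign_pointer(ci->table, NULL);
	call_rcu(&ci->rcu, swap_table_free_rcu);
}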
I guess that is too much detail.
>
> Anyway, I’m glad to see the current swap_table moving towards merge
> and look forward to running it on various devices. This should help
> us see if it causes any real issues.
Agree.
Chris