Message-ID: <CAGsJ_4wK2vXzc0z7+JNwChhtk2gSPcrZ=vHJx=qfXcr+cZiTgw@mail.gmail.com>
Date: Thu, 4 Sep 2025 04:52:35 +0800
From: Barry Song <21cnbao@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: Kairui Song <kasong@...cent.com>, linux-mm@...ck.org, 
	Andrew Morton <akpm@...ux-foundation.org>, Matthew Wilcox <willy@...radead.org>, 
	Hugh Dickins <hughd@...gle.com>, Baoquan He <bhe@...hat.com>, Nhat Pham <nphamcs@...il.com>, 
	Kemeng Shi <shikemeng@...weicloud.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Ying Huang <ying.huang@...ux.alibaba.com>, Johannes Weiner <hannes@...xchg.org>, 
	David Hildenbrand <david@...hat.com>, Yosry Ahmed <yosryahmed@...gle.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table

On Wed, Sep 3, 2025 at 8:35 PM Chris Li <chrisl@...nel.org> wrote:
>
> On Tue, Sep 2, 2025 at 4:31 PM Barry Song <21cnbao@...il.com> wrote:
> >
> > On Wed, Sep 3, 2025 at 1:17 AM Chris Li <chrisl@...nel.org> wrote:
> > >
> > > On Tue, Sep 2, 2025 at 4:15 AM Barry Song <21cnbao@...il.com> wrote:
> > > >
> > > > On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@...il.com> wrote:
> > > > >
> > > > > From: Kairui Song <kasong@...cent.com>
> > > > >
> > > > > Now the swap table is cluster based, which means a free cluster can
> > > > > free its table, since no one should be modifying it.
> > > > >
> > > > > There could be speculative readers, like swap cache lookups; protect
> > > > > them by making the tables RCU safe. All swap tables should be filled
> > > > > with null entries before being freed, so such readers will see either
> > > > > a NULL pointer or a null-filled table being lazily freed.
> > > > >
> > > > > On allocation, allocate the table when a cluster is used by any order.
> > > > >
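If I read the description right, a speculative reader would look roughly
like this (a minimal sketch of my understanding; the struct layout and
the helper name are placeholders, not code from the patch):

#include <linux/atomic.h>
#include <linux/rcupdate.h>

struct swap_cluster_info {
        /* Illustrative field only; the real struct has more members. */
        atomic_long_t __rcu *table;     /* NULL once the table is freed */
};

static unsigned long swap_table_try_get(struct swap_cluster_info *ci,
                                        unsigned int off)
{
        atomic_long_t *table;
        unsigned long ent = 0;

        rcu_read_lock();
        table = rcu_dereference(ci->table);
        /*
         * Either the table pointer is already NULL, or the table is
         * still null-filled while being lazily freed; both cases read
         * as "no entry" here.
         */
        if (table)
                ent = atomic_long_read(&table[off]);
        rcu_read_unlock();

        return ent;
}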
> > > >
> > > > Might be a silly question.
> > > >
> > > > Just curious: what happens if the allocation fails? Does the swap-out
> > > > operation also fail? We sometimes encounter strange issues when memory is
> > > > very limited, especially if the reclamation path itself needs to allocate
> > > > memory.
> > > >
> > > > Assume a case where we want to swap out a folio using clusterN, and we
> > > > then attempt to swap out subsequent folios with the same clusterN. What
> > > > happens if the allocation of the swap_table keeps failing?
> > >
> > > I think this is the same behavior as XArray node allocation with no
> > > memory. The swap allocator will fail to isolate this cluster and get a
> > > NULL ci pointer as the return value. The swap allocator will then try
> > > other cluster lists, e.g. non_full, fragment, etc.
> >
> > What I’m actually concerned about is that we keep iterating on this
> > cluster. If we try others, that sounds good.
>
> No, the isolation of the current cluster will remove the cluster from
> the head and eventually put it back at the tail of the appropriate
> list. It will not keep iterating over the same cluster. Otherwise,
> trying to allocate a high-order swap entry would also deadloop on the
> first cluster whenever it failed to allocate swap entries.
>
> >
> > > If all of them fail, folio_alloc_swap() will return -ENOMEM, which
> > > will propagate back to the swap-out attempt and then to the shrink
> > > folio list, which will put this page back on the LRU.
> > >
> > > The shrink folio list either frees enough memory (the happy path) or
> > > fails to free enough memory, in which case it will cause an OOM kill.
> > >
> > > I believe XArray would previously also return -ENOMEM when inserting
> > > a pointer while unable to allocate a node to hold that pointer. It
> > > has the same error propagation path. We did not change that.
> >
> > Yes, I agree there was an -ENOMEM before, but the difference is that
> > we are allocating much larger chunks now :-)
>
> Even that is not 100% true. XArray uses a kmem_cache. Most of the
> time nodes are allocated from a page already cached by the kmem_cache,
> without hitting the system page allocator. Only when the kmem_cache
> runs out of its current cached page will it allocate from the system
> via page allocation, at least a page at a time.
>

Exactly, that's what I mean. When we hit the cache, allocation is far
more predictable than when it has to come from the buddy allocator.

> So from the page allocator's point of view, the swap table allocation
> is not bigger either.

I think the fundamental difference lies in how much pressure we place
on the buddy allocator.
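
To make that concrete, here is a rough sketch of the two paths as the
page allocator sees them (illustrative only; node_cache stands in for
the XArray's private node cache, which is not exported):

#include <linux/gfp.h>
#include <linux/slab.h>

/* XArray node path: usually served from a slab page the kmem_cache
 * already holds; only occasionally falls through to the buddy
 * allocator to refill the cache. */
static void *alloc_node_like_xarray(struct kmem_cache *node_cache)
{
        return kmem_cache_alloc(node_cache, GFP_KERNEL);
}

/* Swap table path: one full page, taken straight from the buddy
 * allocator every time. */
static void *alloc_swap_table_page(void)
{
        return (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
}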

>
> > One option is to organize every 4 or 8 swap slots into a group for
> > allocating or freeing the swap table. This way, we avoid the worst
> > case where a single unfreed slot consumes a whole swap table, and
> > the allocation size also becomes smaller. However, it’s unclear
> > whether the memory savings justify the added complexity and effort.
>
> Keep in mind that XArray has this fragmentation issue as well. When a
> 64-pointer node is freed, it returns to the kmem_cache as free area in
> the cache page. Only when every object in that page is free can the
> page return to the page allocator. The difference is that the unused
> area sitting in the swap table can be used immediately, while an
> unused XArray node will sit in the kmem_cache and needs an extra
> kmem_cache_alloc() before it can serve an XArray again. There is also
> a subtle difference: all XArray users share the same kmem_cache pool;
> there is no dedicated kmem_cache pool for swap. The swap nodes might
> be mixed with other XArray nodes, making it even harder to release the
> underlying page. The swap table uses pages directly and does not have
> this issue. If you have a swing of batch jobs causing a lot of swap,
> then when the jobs are done, those swap entries will be freed and the
> swap table can return those pages. But XArray might not be able to
> release as many pages because of the mixed usage of the XArray; it
> depends on what other XArray nodes were allocated during the swap
> usage.

Yes. If we organize the swap_table in group sizes of 16, 32, 64, 128,
and so on, we might gain the same benefit: those small objects become
immediately available to other allocations, regardless of whether they
are visible to the buddy allocator.
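
What I have in mind is roughly the following (purely hypothetical; the
group and cluster sizes and struct names are made up for illustration,
and locking is omitted):

#include <linux/atomic.h>
#include <linux/slab.h>

#define SWAP_TABLE_GROUP        64      /* entries per group: 16/32/64/128... */
#define SLOTS_PER_CLUSTER       512     /* illustrative cluster size */
#define SWAP_TABLE_GROUPS       (SLOTS_PER_CLUSTER / SWAP_TABLE_GROUP)

struct swap_table_group {
        atomic_long_t ents[SWAP_TABLE_GROUP];
};

struct swap_cluster_table {
        struct swap_table_group *groups[SWAP_TABLE_GROUPS];
};

/* Look up (and lazily allocate) the group holding slot @off, so one
 * stale slot pins only a small kmalloc object instead of a full page,
 * and a freed group is immediately reusable by any kmalloc user. */
static atomic_long_t *swap_table_slot(struct swap_cluster_table *t,
                                      unsigned int off)
{
        unsigned int g = off / SWAP_TABLE_GROUP;

        if (!t->groups[g]) {
                t->groups[g] = kzalloc(sizeof(*t->groups[g]), GFP_KERNEL);
                if (!t->groups[g])
                        return NULL;
        }
        return &t->groups[g]->ents[off % SWAP_TABLE_GROUP];
}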

Anyway, I don't have data to show whether the added complexity would be
worth it. I'm just glad the current approach will hopefully land, and I
look forward to running it on real phones.

>
> I guess that is too much detail.
>
> >
> > Anyway, I'm glad to see the current swap_table moving toward being
> > merged, and I look forward to running it on various devices. This
> > should help us see whether it causes any real issues.
>

Thanks
Barry
