linux-kernel - Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7BjsuooaHr=6cYTzGsj1nm+xG7jzCVdEsFgxyBHwq4GXw@mail.gmail.com>
Date: Sun, 31 Aug 2025 00:52:39 +0800
From: Kairui Song <ryncsn@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Matthew Wilcox <willy@...radead.org>, Hugh Dickins <hughd@...gle.com>, Barry Song <baohua@...nel.org>, 
	Baoquan He <bhe@...hat.com>, Nhat Pham <nphamcs@...il.com>, 
	Kemeng Shi <shikemeng@...weicloud.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	Ying Huang <ying.huang@...ux.alibaba.com>, Johannes Weiner <hannes@...xchg.org>, 
	David Hildenbrand <david@...hat.com>, Yosry Ahmed <yosryahmed@...gle.com>, 
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and
 switch API

On Sat, Aug 30, 2025 at 11:43 AM Chris Li <chrisl@...nel.org> wrote:
>
> On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@...il.com> wrote:
> >
> > From: Kairui Song <kasong@...cent.com>
> >
> > Introduce basic swap table infrastructures, which are now just a
> > fixed-sized flat array inside each swap cluster, with access wrappers.
> >
> > Each cluster contains a swap table of 512 entries. Each table entry is
> > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > a folio type (pointer), or NULL.
> >
> > In this first step, it only supports storing a folio or shadow, and it
> > is a drop-in replacement for the current swap cache. Convert all swap
> > cache users to use the new sets of APIs. Chris Li has been suggesting
> > using a new infrastructure for swap cache for better performance, and
> > that idea combined well with the swap table as the new backing
> > structure. Now the lock contention range is reduced to 2M clusters,
> > which is much smaller than the 64M address_space. And we can also drop
> > the multiple address_space design.
> >
> > All the internal works are done with swap_cache_get_* helpers. Swap
> > cache lookup is still lock-less like before, and the helper's contexts
> > are same with original swap cache helpers. They still require a pin
> > on the swap device to prevent the backing data from being freed.
> >
> > Swap cache updates are now protected by the swap cluster lock
> > instead of the Xarray lock. This is mostly handled internally, but new
> > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > few new cluster access and locking helpers are also introduced.
> >
> > A fully cluster-based unified swap table can be implemented on top
> > of this to take care of all count tracking and synchronization work,
> > with dynamic allocation. It should reduce the memory usage while
> > making the performance even better.
> >
> > Co-developed-by: Chris Li <chrisl@...nel.org>
> > Signed-off-by: Chris Li <chrisl@...nel.org>
> > Signed-off-by: Kairui Song <kasong@...cent.com>
> > ---
> >  /*
> > - * This must be called only on folios that have
> > - * been verified to be in the swap cache and locked.
> > - * It will never put the folio into the free list,
> > - * the caller has a reference on the folio.
> > + * Replace an old folio in the swap cache with a new one. The caller must
> > + * hold the cluster lock and set the new folio's entry and flags.
> >   */
> > -void delete_from_swap_cache(struct folio *folio)
> > +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> > +                               struct folio *old, struct folio *new)
> > +{
> > +       unsigned int ci_off = swp_cluster_offset(entry);
> > +       unsigned long nr_pages = folio_nr_pages(new);
> > +       unsigned int ci_end = ci_off + nr_pages;
> > +
> > +       VM_WARN_ON_ONCE(entry.val != new->swap.val);
> > +       VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> > +       VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> > +       do {
> > +               WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> > +               __swap_table_set_folio(ci, ci_off, new);
>
> I recall in my original experiment swap cache replacement patch I used
> atomic compare exchange somewhere. It has been a while. Is there a
> reason to not use atomic cmpexchg() or that is in the later part of
> the series?

For now all swap table modifications are protected by ci lock, extra
atomic / cmpxchg is not needed.

We might be able to make use of cmpxchg in later phases. e.g. when
locking a folio is enough to ensure the final consistency of swap
count, cmpxchg can be used as a fast path to increase the swap count.

We can't do that now as the swap count is still managed by swap_map,
not swap table. And swap allocation / dup does not have a clear
definition about how they interact with folios, and range operations
all need the ci lock...  We might be able to figure out a stable way
to handle range operations too once we sort out how folios interact
with SWAP in a later phase, I tried that in the previous long series
and this part seems doable.

I'm not sure if that will benefit a lot, or will it make it more
complex for the high order swap table to be implemented. The cluster
lock is already very fine grained. We can do some experiments in the
future to verify it.

But the good thing is in either case, this is on the right path :)

> > +       } while (++ci_off < ci_end);
> > +
> > +       /*
> > +        * If the old folio is partially replaced (e.g., splitting a large
> > +        * folio, the old folio is shrunk in place, and new split sub folios
> > +        * are added to cache), ensure the new folio doesn't overlap it.
> > +        */
> > +       if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> > +           folio_order(old) != folio_order(new)) {
> > +               ci_off = swp_cluster_offset(old->swap);
> > +               ci_end = ci_off + folio_nr_pages(old);
> > +               while (ci_off++ < ci_end)
> > +                       WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
>
> Will this cause the swap cache to replace less than full folio range
> of the swap entry in range?
> The swap cache set folio should atomically set the full range of swap
> entries. If there is some one race to set some partial range. I
> suspect it should fail and undo the particle set. I recall there are
> some bugs on xarray accidentally fixed by one of your patches related
> to that kind of atomic behavior.
>
> I want to make sure a similar bug does not happen here.
>
> It is worthwhile to double check if the atomic folio set behavior.

Right, some callers that hold the ci lock by themselves (migration /
huge_mm split) have to ensure they do the folio replacement in a
correct way by themselves.

This is the same story for Xarray. These callers just used to hold the
xa lock and manipulate the xarray directly: e.g. split generates new
folios, new sub folios have to be added to swap cache in the right
place to override the old folio. The behavior is the same before /
after this commit, I just added a sanity check here to ensure nothing
went wrong, only to make it more reliable by adding checks in the
debug build.

I checked the logic here multiple times and tested it on multiple
kernel versions that have slightly different code for huge_mm split,
all went well.

>
> Looks good to me otherwise. Just waiting for confirmation of the swap
> cache atomic set behavior.
>
> Chris