Message-ID: <CAMgjq7CWPt8Ue7AjkfwAtFEtsZx8wk3Ge4LcFHx9=ur9vUCEyA@mail.gmail.com>
Date: Tue, 2 Sep 2025 19:51:04 +0800
From: Kairui Song <ryncsn@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, Hugh Dickins <hughd@...gle.com>, Barry Song <baohua@...nel.org>,
Baoquan He <bhe@...hat.com>, Nhat Pham <nphamcs@...il.com>,
Kemeng Shi <shikemeng@...weicloud.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Ying Huang <ying.huang@...ux.alibaba.com>, Johannes Weiner <hannes@...xchg.org>,
David Hildenbrand <david@...hat.com>, Yosry Ahmed <yosryahmed@...gle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and
switch API
On Sun, Aug 31, 2025 at 9:00 AM Chris Li <chrisl@...nel.org> wrote:
>
> On Sat, Aug 30, 2025 at 9:53 AM Kairui Song <ryncsn@...il.com> wrote:
> >
> > On Sat, Aug 30, 2025 at 11:43 AM Chris Li <chrisl@...nel.org> wrote:
> > >
> > > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@...il.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@...cent.com>
> > > >
> > > > Introduce basic swap table infrastructures, which are now just a
> > > > fixed-sized flat array inside each swap cluster, with access wrappers.
> > > >
> > > > Each cluster contains a swap table of 512 entries. Each table entry is
> > > > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > > > a folio type (pointer), or NULL.
> > > >
> > > > In this first step, it only supports storing a folio or shadow, and it
> > > > is a drop-in replacement for the current swap cache. Convert all swap
> > > > cache users to use the new sets of APIs. Chris Li has been suggesting
> > > > using a new infrastructure for swap cache for better performance, and
> > > > that idea combined well with the swap table as the new backing
> > > > structure. Now the lock contention range is reduced to 2M clusters,
> > > > which is much smaller than the 64M address_space. And we can also drop
> > > > the multiple address_space design.
> > > >
> > > > All the internal work is done with swap_cache_get_* helpers. Swap
> > > > cache lookup is still lock-less like before, and the helpers' usage
> > > > contexts are the same as those of the original swap cache helpers.
> > > > They still require a pin on the swap device to prevent the backing
> > > > data from being freed.
> > > >
> > > > Swap cache updates are now protected by the swap cluster lock
> > > > instead of the Xarray lock. This is mostly handled internally, but new
> > > > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > > > few new cluster access and locking helpers are also introduced.
> > > >
> > > > A fully cluster-based unified swap table can be implemented on top
> > > > of this to take care of all count tracking and synchronization work,
> > > > with dynamic allocation. It should reduce the memory usage while
> > > > making the performance even better.
> > > >
> > > > Co-developed-by: Chris Li <chrisl@...nel.org>
> > > > Signed-off-by: Chris Li <chrisl@...nel.org>
> > > > Signed-off-by: Kairui Song <kasong@...cent.com>
> > > > ---
> > > > /*
> > > > - * This must be called only on folios that have
> > > > - * been verified to be in the swap cache and locked.
> > > > - * It will never put the folio into the free list,
> > > > - * the caller has a reference on the folio.
> > > > + * Replace an old folio in the swap cache with a new one. The caller must
> > > > + * hold the cluster lock and set the new folio's entry and flags.
> > > > */
> > > > -void delete_from_swap_cache(struct folio *folio)
> > > > +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> > > > + struct folio *old, struct folio *new)
> > > > +{
> > > > + unsigned int ci_off = swp_cluster_offset(entry);
> > > > + unsigned long nr_pages = folio_nr_pages(new);
> > > > + unsigned int ci_end = ci_off + nr_pages;
> > > > +
> > > > + VM_WARN_ON_ONCE(entry.val != new->swap.val);
> > > > + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> > > > + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> > > > + do {
> > > > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> > > > + __swap_table_set_folio(ci, ci_off, new);
> > >
> > > I recall that in my original experimental swap cache replacement patch
> > > I used an atomic compare-exchange somewhere. It has been a while. Is
> > > there a reason not to use an atomic cmpxchg() here, or is that in a
> > > later part of the series?
> >
> > For now all swap table modifications are protected by the ci lock, so
> > an extra atomic / cmpxchg is not needed.
> >
> > We might be able to make use of cmpxchg in later phases. e.g. when
> > locking a folio is enough to ensure the final consistency of swap
> > count, cmpxchg can be used as a fast path to increase the swap count.
>
> You did not get what I am asking. Let me clarify.
>
> I mean even if we keep the ci lock and don't change that locking
> requirement: in the above code, why can't we use cmpxchg() to make sure
> that we only ever overwrite "old" -> "new"?
> I am not saying we need to do the lockless part here.
>
> I mean in this possible sequence:
>
> WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
>         // still "old" here, no warning issued
> // another CPU races and writes "old" to "old2" because of a bug
> __swap_table_set_folio(ci, ci_off, new);
>         // now "new" overwrites "old2" without any warning
>
> This is the typical check-then-write race: you check that the value is
> "old", then you write the value "new". But what if "old" changes to
> "old2" before you overwrite it with "new"?
> You overwrite "old2" silently.
>
> I mean to catch that.
>
> Using cmpxchg will make sure we only change "old" -> "new", and it lets
> us catch the buggy situation above where "old2" would be overwritten
> with "new".
> Also, when we do find the entry is "old2" rather than "old", is
> WARN_ONCE enough?
>
> I also want to discuss what we should do if we did catch "old2" there
> in the swap cache instead of "old".
> I feel that just continuing after a WARN_ONCE might not be good enough.
> It will let the data corruption propagate.
>
> Should we roll back the new value and fail the swap cache folio set
> function to avoid the possible data corruption?
> If we find "old2", the new caller can't set the folio to the new value
> and has to deal with that error. Will that avoid data corruption? Not
> being able to make forward progress is still much better than making
> forward progress with data corruption.
>
> I just don't want silent overwritten values we aren't expecting.
Right, I just thought this through. If we are super cautious during the
early phase, we can have more non-debug checks for potential bugs.

There are currently three kinds of places modifying the swap table:
replace (huge_mm, migration, shmem) / insert (swapin / swapout) /
delete. I checked the details, and basically in all cases there is no
way to roll back: once the data is somehow corrupted, any further
operation could go in the wrong direction.

So yeah, let me add some more checks.
I'll slightly adjust swap_cache_add_folio too. In this V1,
swap_cache_add_folio is designed to allow races and returns an int to
report a potential conflict. But it should never fail in V1, because
there is currently no racing caller: we still rely on SWAP_HAS_CACHE to
pin slots before installing the swap cache. We will kill this ugly
dance very soon in phase 3 (phase 2, which removes SYNC_IO swapin, is
an important step towards that). I took this version of
swap_cache_add_folio from later phases just to make things easier
later. So in V1 let's make it WARN/BUG if any conflicting folio exists
and always return void; that's safer for catching potential bugs. I'll
change swap_cache_add_folio to allow the race again in a later phase.
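
Something roughly like this is what I have in mind (just a sketch to
illustrate the direction, not the actual code: swp_tb_is_folio() and
the exact signature are made up here, and the real function also has to
set folio->swap, the swapcache flag and references under the ci lock):

/*
 * Sketch: a void swap_cache_add_folio() that treats any pre-existing
 * folio in the target slots as a bug instead of returning an error.
 */
void swap_cache_add_folio(struct swap_cluster_info *ci, swp_entry_t entry,
			  struct folio *folio)
{
	unsigned int ci_off = swp_cluster_offset(entry);
	unsigned int ci_end = ci_off + folio_nr_pages(folio);

	do {
		/*
		 * SWAP_HAS_CACHE already pins these slots in V1, so only
		 * an empty slot or a shadow is legal here; finding a
		 * folio means something has gone badly wrong.
		 */
		WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
		__swap_table_set_folio(ci, ci_off, folio);
	} while (++ci_off < ci_end);
}
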
For other places, I think a relaxed xchg with WARN/BUG should be just fine.
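
For the replace path, this is the kind of thing I mean (again just a
sketch: __swap_table_xchg() and folio_to_swp_tb() don't exist in this
series, but since every table entry is an atomic long, a relaxed
atomic_long_xchg() based wrapper would be trivial to add):

	/*
	 * Sketch: replace the check-then-set pair in
	 * __swap_cache_replace_folio() with an exchange, so an unexpected
	 * old value can never be overwritten silently.
	 */
	do {
		unsigned long old_tb;

		old_tb = __swap_table_xchg(ci, ci_off, folio_to_swp_tb(new));
		/* The slot must still hold "old"; anything else is a bug. */
		WARN_ON_ONCE(swp_tb_to_folio(old_tb) != old);
	} while (++ci_off < ci_end);
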
Later phases can also use something like a CONFIG_DEBUG_VM_SWAP to
wrap these, after things are verified to be stable.