Message-ID: <CANeU7Q=_TveYGa2OK93YCRyJRYBc7QVi1pTTfQM7vm4Qh9tNbA@mail.gmail.com>
Date: Sat, 30 Aug 2025 18:00:24 -0700
From: Chris Li <chrisl@...nel.org>
To: Kairui Song <ryncsn@...il.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, Hugh Dickins <hughd@...gle.com>, Barry Song <baohua@...nel.org>,
Baoquan He <bhe@...hat.com>, Nhat Pham <nphamcs@...il.com>,
Kemeng Shi <shikemeng@...weicloud.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
Ying Huang <ying.huang@...ux.alibaba.com>, Johannes Weiner <hannes@...xchg.org>,
David Hildenbrand <david@...hat.com>, Yosry Ahmed <yosryahmed@...gle.com>,
Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, Zi Yan <ziy@...dia.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and
switch API
On Sat, Aug 30, 2025 at 9:53 AM Kairui Song <ryncsn@...il.com> wrote:
>
> On Sat, Aug 30, 2025 at 11:43 AM Chris Li <chrisl@...nel.org> wrote:
> >
> > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@...il.com> wrote:
> > >
> > > From: Kairui Song <kasong@...cent.com>
> > >
> > > Introduce basic swap table infrastructures, which are now just a
> > > fixed-sized flat array inside each swap cluster, with access wrappers.
> > >
> > > Each cluster contains a swap table of 512 entries. Each table entry is
> > > an opaque atomic long. It can be one of three types: a shadow type
> > > (XA_VALUE), a folio type (pointer), or NULL.
> > >
> > > In this first step, it only supports storing a folio or shadow, and it
> > > is a drop-in replacement for the current swap cache. Convert all swap
> > > cache users to use the new set of APIs. Chris Li has been suggesting
> > > using a new infrastructure for swap cache for better performance, and
> > > that idea combined well with the swap table as the new backing
> > > structure. Now the lock contention range is reduced to 2M clusters,
> > > which is much smaller than the 64M address_space. And we can also drop
> > > the multiple address_space design.
> > >
> > > All the internal work is done with swap_cache_get_* helpers. Swap
> > > cache lookup is still lock-less like before, and the helpers' contexts
> > > are the same as the original swap cache helpers'. They still require a
> > > pin on the swap device to prevent the backing data from being freed.
> > >
> > > Swap cache updates are now protected by the swap cluster lock
> > > instead of the Xarray lock. This is mostly handled internally, but new
> > > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > > few new cluster access and locking helpers are also introduced.
> > >
> > > A fully cluster-based unified swap table can be implemented on top
> > > of this to take care of all count tracking and synchronization work,
> > > with dynamic allocation. It should reduce the memory usage while
> > > making the performance even better.
> > >
> > > Co-developed-by: Chris Li <chrisl@...nel.org>
> > > Signed-off-by: Chris Li <chrisl@...nel.org>
> > > Signed-off-by: Kairui Song <kasong@...cent.com>
> > > ---
> > > /*
> > > - * This must be called only on folios that have
> > > - * been verified to be in the swap cache and locked.
> > > - * It will never put the folio into the free list,
> > > - * the caller has a reference on the folio.
> > > + * Replace an old folio in the swap cache with a new one. The caller must
> > > + * hold the cluster lock and set the new folio's entry and flags.
> > > */
> > > -void delete_from_swap_cache(struct folio *folio)
> > > +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> > > + struct folio *old, struct folio *new)
> > > +{
> > > + unsigned int ci_off = swp_cluster_offset(entry);
> > > + unsigned long nr_pages = folio_nr_pages(new);
> > > + unsigned int ci_end = ci_off + nr_pages;
> > > +
> > > + VM_WARN_ON_ONCE(entry.val != new->swap.val);
> > > + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> > > + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> > > + do {
> > > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> > > + __swap_table_set_folio(ci, ci_off, new);
> >
> > I recall that in my original experimental swap cache replacement patch I
> > used an atomic compare-exchange somewhere. It has been a while. Is there
> > a reason not to use atomic cmpxchg() here, or is that in a later part of
> > the series?
>
> For now all swap table modifications are protected by the ci lock, so an
> extra atomic / cmpxchg is not needed.
>
> We might be able to make use of cmpxchg in later phases, e.g. when
> locking a folio is enough to ensure the final consistency of the swap
> count, cmpxchg can be used as a fast path to increase the swap count.
You did not get what I am asking. Let me clarify.
I mean even if we keep the ci lock and do not change that locking
requirement at all: in the above code, why can't we use cmpxchg to
make sure that we only ever overwrite in the form "old" -> "new"?
I am not saying we need to do the lockless part here.
I mean in this possible sequence:

WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
	// still "old" here, no warning issued
/* another CPU races and writes "old" -> "old2" because of a bug */
__swap_table_set_folio(ci, ci_off, new);
	// now "new" overwrites "old2" without any warning
This has the typical check-then-write race: you check that the value is
"old", then you overwrite it with "new". But what if "old" changes to
"old2" before you overwrite it with "new"? You silently overwrite "old2".
I want to catch that.
Using cmpxchg would make sure we only ever change "old" -> "new", and it
would catch the buggy situation above instead of overwriting "old2" with
"new".
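Roughly what I have in mind, as a sketch only (I am making up the names
here: assuming ci->table is the atomic_long_t slot array and
folio_to_swp_tb() builds the folio-type table entry, which may not match
your actual helpers):

static void __swap_table_set_folio_checked(struct swap_cluster_info *ci,
					   unsigned int ci_off,
					   struct folio *old,
					   struct folio *new)
{
	unsigned long old_tb = folio_to_swp_tb(old);
	unsigned long new_tb = folio_to_swp_tb(new);

	/*
	 * Still called under the ci lock; the cmpxchg only exists to catch
	 * a buggy writer that changed the slot to "old2" behind our back.
	 */
	WARN_ON_ONCE(atomic_long_cmpxchg(&ci->table[ci_off],
					 old_tb, new_tb) != old_tb);
}

That is the kind of check I mean, not a lockless fast path.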
Also, when we do find that the entry there is "old2" and not "old", is a
WARN_ON_ONCE enough?
I also want to discuss what we should do if we did catch "old2" there in
the swap cache instead of "old".
I feel that just continuing after the WARN_ON_ONCE might not be good
enough. It will let the data corruption propagate.
Should we roll back the new value and fail the swap cache folio set
function to avoid the possible data corruption?
If we found "old2", the new caller can't set the folio to the new value
and has to deal with that error. Will that avoid data corruption? Not
being able to make forward progress is still much better than making
forward progress with data corruption.
I just don't want values we aren't expecting to be silently overwritten.
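To make the rollback idea concrete, another rough sketch (again with
made-up names, under the same assumptions as above), so the caller can
fail gracefully instead of corrupting data:

static int __swap_cache_replace_folio_checked(struct swap_cluster_info *ci,
					      unsigned int ci_start,
					      unsigned int ci_end,
					      struct folio *old,
					      struct folio *new)
{
	unsigned long old_tb = folio_to_swp_tb(old);
	unsigned long new_tb = folio_to_swp_tb(new);
	unsigned int ci_off;

	for (ci_off = ci_start; ci_off < ci_end; ci_off++) {
		if (atomic_long_cmpxchg(&ci->table[ci_off],
					old_tb, new_tb) == old_tb)
			continue;
		/* Unexpected entry: undo what we already wrote and bail. */
		while (ci_off-- > ci_start)
			atomic_long_set(&ci->table[ci_off], old_tb);
		return -EBUSY;
	}
	return 0;
}

Then the caller of the replace has to handle the error instead of
continuing with a possibly corrupted swap cache.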
> We can't do that now as the swap count is still managed by swap_map,
> not the swap table. And swap allocation / dup do not have a clear
> definition of how they interact with folios, and range operations
> all need the ci lock... We might be able to figure out a stable way
> to handle range operations too once we sort out how folios interact
> with SWAP in a later phase, I tried that in the previous long series
> and this part seems doable.
See above, I don't mean to change the locking logic here, only to assert
that the previous value is "old" when overwriting it with "new".
>
> I'm not sure if that will be a big benefit, or if it will make the
> high order swap table more complex to implement. The cluster
> lock is already very fine grained. We can do some experiments in the
> future to verify it.
Not asking to change the locking logic, see above.
I feel that we should always use cmpxchg to assign values to the swap
table, just to be paranoid. It is data corruption we are risking here.
> But the good thing is in either case, this is on the right path :)
>
> > > + } while (++ci_off < ci_end);
> > > +
> > > + /*
> > > + * If the old folio is partially replaced (e.g., splitting a large
> > > + * folio, the old folio is shrunk in place, and new split sub folios
> > > + * are added to cache), ensure the new folio doesn't overlap it.
> > > + */
> > > + if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> > > + folio_order(old) != folio_order(new)) {
> > > + ci_off = swp_cluster_offset(old->swap);
> > > + ci_end = ci_off + folio_nr_pages(old);
> > > + while (ci_off++ < ci_end)
> > > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> >
> > Will this cause the swap cache to replace less than the full folio
> > range of the swap entries in range?
> > The swap cache folio set should atomically set the full range of swap
> > entries. If someone races to set some partial range, I suspect it
> > should fail and undo the partial set. I recall some bugs around the
> > xarray were accidentally fixed by one of your patches related to that
> > kind of atomic behavior.
> >
> > I want to make sure a similar bug does not happen here.
> >
> > It is worthwhile to double check the atomic folio set behavior.
>
> Right, some callers that hold the ci lock themselves (migration /
> huge_mm split) have to ensure they do the folio replacement
> correctly on their own.
>
> This is the same story as with the Xarray. These callers used to hold
> the xa lock and manipulate the xarray directly: e.g. a split generates
> new folios, and the new sub folios have to be added to the swap cache
> in the right place to override the old folio. The behavior is the same
> before / after this commit; I just added a sanity check here to ensure
> nothing goes wrong, only making it more reliable by adding checks in
> the debug build.
>
> I checked the logic here multiple times and tested it on multiple
> kernel versions that have slightly different code for the huge_mm
> split, and all went well.
Thanks for double checking.
Chris