linux-kernel - Re: [PATCH 1/3] mm, swap: change back to use each swap device's percpu cluster

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aRXE0ppned4Kprnz@yjaykim-PowerEdge-T330>
Date: Thu, 13 Nov 2025 20:45:22 +0900
From: YoungJun Park <youngjun.park@....com>
To: Kairui Song <ryncsn@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, cgroups@...r.kernel.org,
	linux-kernel@...r.kernel.org, chrisl@...nel.org, hannes@...xchg.org,
	mhocko@...nel.org, roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
	muchun.song@...ux.dev, shikemeng@...weicloud.com, nphamcs@...il.com,
	bhe@...hat.com, baohua@...nel.org, gunho.lee@....com,
	taejoon.song@....com
Subject: Re: [PATCH 1/3] mm, swap: change back to use each swap device's
 percpu cluster

On Thu, Nov 13, 2025 at 02:07:59PM +0800, Kairui Song wrote:
> On Sun, Nov 9, 2025 at 8:54 PM Youngjun Park <youngjun.park@....com> wrote:
> >
> > This reverts commit 1b7e90020eb7 ("mm, swap: use percpu cluster as
> > allocation fast path").
> >
> > Because in the newly introduced swap tiers, the global percpu cluster
> > will cause two issues:
> > 1) it will cause caching oscillation in the same order of different si
> >    if two different memcg can only be allowed to access different si and
> >    both of them are swapping out.
> > 2) It can cause priority inversion on swap devices. Imagine a case where
> >    there are two memcg, say memcg1 and memcg2. Memcg1 can access si A, B
> >    and A is higher priority device. While memcg2 can only access si B.
> >    Then memcg 2 could write the global percpu cluster with si B, then
> >    memcg1 take si B in fast path even though si A is not exhausted.
> >
> > Hence in order to support swap tier, revert commit 1b7e90020eb7 to use
> > each swap device's percpu cluster.
> >
> > Co-developed-by: Baoquan He <bhe@...hat.com>
> > Suggested-by: Kairui Song <kasong@...cent.com>
> > Signed-off-by: Baoquan He <bhe@...hat.com>
> > Signed-off-by: Youngjun Park <youngjun.park@....com>
>
> Hi Youngjun, Baoquan, Thanks for the work on the percpu cluster thing.

Hello Kairui,

> It will be better if you can provide some benchmark result since the
> whole point of global percpu cluster is to improve the performance and
> get rid of the swap slot cache.

After RFC stage,
I will try to prepare benchmark results.

> I'm fine with a small regression but we better be aware of it. And we
> can still figure out some other ways to optimize it. e.g. I remember
> Chris once mentioned an idea of having a per device slot cache, that
> is different from the original slot cache (swap_slot.c): the allocator
> will be aware of it so it will be much cleaner.

Ack, we will work on better optimization.

> > ---
> >  include/linux/swap.h |  13 +++-
> >  mm/swapfile.c        | 151 +++++++++++++------------------------------
> >  2 files changed, 56 insertions(+), 108 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 38ca3df68716..90fa27bb7796 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -250,10 +250,17 @@ enum {
> >  #endif
> >
> >  /*
> > - * We keep using same cluster for rotational device so IO will be sequential.
> > - * The purpose is to optimize SWAP throughput on these device.
> > + * We assign a cluster to each CPU, so each CPU can allocate swap entry from
> > + * its own cluster and swapout sequentially. The purpose is to optimize swapout
> > + * throughput.
> >   */
> > +struct percpu_cluster {
> > +       local_lock_t lock; /* Protect the percpu_cluster above */
>
> I think you mean "below"?

This comment was originally written this way in the earlier code, and it
seems to refer to the percpu_cluster structure itself rather than the
fields below. But I agree it's a bit ambiguous. I'll just remove this
comment since the structure name is self-explanatory. Or change it to below. :)

> > +       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > +};
> > +
> >
> > -/*
> > - * Fast path try to get swap entries with specified order from current
> > - * CPU's swap entry pool (a cluster).
> > - */
> > -static bool swap_alloc_fast(swp_entry_t *entry,
> > -                           int order)
> > -{
> > -       struct swap_cluster_info *ci;
> > -       struct swap_info_struct *si;
> > -       unsigned int offset, found = SWAP_ENTRY_INVALID;
> > -
> > -       /*
> > -        * Once allocated, swap_info_struct will never be completely freed,
> > -        * so checking it's liveness by get_swap_device_info is enough.
> > -        */
> > -       si = this_cpu_read(percpu_swap_cluster.si[order]);
> > -       offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> > -       if (!si || !offset || !get_swap_device_info(si))
> > -               return false;
> > -
> > -       ci = swap_cluster_lock(si, offset);
> > -       if (cluster_is_usable(ci, order)) {
> > -               if (cluster_is_empty(ci))
> > -                       offset = cluster_offset(si, ci);
> > -               found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> > -               if (found)
> > -                       *entry = swp_entry(si->type, found);
> > -       } else {
> > -               swap_cluster_unlock(ci);
> > -       }
> > -
> > -       put_swap_device(si);
> > -       return !!found;
> > -}
> > -
> >  /* Rotate the device and switch to a new cluster */
> > -static bool swap_alloc_slow(swp_entry_t *entry,
> > +static void swap_alloc_entry(swp_entry_t *entry,
> >                             int order)
>
> It seems you also changed the rotation rule here so every allocation
> of any order is causing a swap device rotation? Before 1b7e90020eb7
> every 64 allocation causes a rotation as we had slot cache
> (swap_slot.c). The global cluster makes the rotation happen for every
> cluster so the overhead is even lower on average. But now a per
> allocation rotation seems a rather high overhead and may cause serious
> fragmentation.

Yeah... The rotation rule has indeed changed. I remember the
discussion about rotation behavior:
https://lore.kernel.org/linux-mm/aPc3lmbJEVTXoV6h@yjaykim-PowerEdge-T330/

After that discussion, I've been thinking about the rotation.
Currently, the requeue happens after every priority list traversal, and this logic
is easily affected by changes.
The rotation logic change behavior change is not not mentioned somtimes.
(as you mentioned in commit 1b7e90020eb7).

I'd like to share some ideas and hear your thoughts:

1. Getting rid of the same priority requeue rule
   - same priority devices get priority - 1 or + 1 after requeue
     (more add or remove as needed to handle any overlapping priority appropriately)

2. Requeue only when a new cluster is allocated
   - Instead of requeueing after every priority list traversal, we
     requeue only when a cluster is fully used
   - This might have some performance impact, but the rotation behavior
     would be similar to the existing one (though slightly different due
     to synchronization and logic processing changes)

Going further with these approaches, if we remove the requeue mechanism
entirely, we could potentially reduce synchronization overhead during
plist traversal. (degrade the lock)

I'm curious what do you think about these possibilities?

Youngjun Park