Message-ID: <CAMgjq7CAGSjocez9C+Vs9r4-hUyaPFVTa55E9WFVi9qB=x0hqQ@mail.gmail.com>
Date: Fri, 8 Aug 2025 02:26:06 +0800
From: Kairui Song <ryncsn@...il.com>
To: Chris Li <chrisl@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-mm@...ck.org, Kemeng Shi <shikemeng@...weicloud.com>, 
	Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list

On Thu, Aug 7, 2025 at 1:32 PM Chris Li <chrisl@...nel.org> wrote:
>
> Acked-by: Chris Li <chrisl@...nel.org>
>
> Chris
>
> On Wed, Aug 6, 2025 at 9:18 AM Kairui Song <ryncsn@...il.com> wrote:
> >
> > From: Kairui Song <kasong@...cent.com>
> >
> > Fragment clusters were mostly failing high order allocation already.
> > The reason we scan them through now is that a swap slot may get freed
> > without releasing the swap cache, so a swap map entry can end up in
> > HAS_CACHE-only status, and the cluster won't be moved back to the
> > non-full or free cluster list. This may cause a higher allocation
> > failure rate.
> >
> > Usually only !SWP_SYNCHRONOUS_IO devices may have a large number of
> > slots stuck in HAS_CACHE-only status, because when a
> > !SWP_SYNCHRONOUS_IO device's usage is low (!vm_swap_full()), swap
> > will lazily free the swap cache.
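
As a side illustration, here is a tiny userspace sketch of that
HAS_CACHE-only state (SWAP_HAS_CACHE is the kernel's flag bit from
include/linux/swap.h; the helper itself is hypothetical, for
illustration only):

    #include <stdbool.h>

    #define SWAP_HAS_CACHE	0x40	/* flag bit in a swap_map byte */

    /*
     * A slot whose map entry is exactly SWAP_HAS_CACHE has no swap
     * count left, only the pinned swap cache, so it can be reclaimed
     * by dropping that cache.
     */
    static bool slot_is_cache_only(unsigned char map_entry)
    {
            return map_entry == SWAP_HAS_CACHE;
    }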
> >
> > But scanning the whole fragment list like this is a bit overkill. Fragmentation
> > is only an issue for the allocator when the device is getting full,
> > and by that time, swap will be releasing the swap cache aggressively
> > already. Only scan one fragment cluster at a time is good enough to
>
> Only *scanning* one fragment cluster...

Thanks.

Hi Andrew, could you help update this word in the commit message?

>
> > reclaim already pinned slots, and move the cluster back to nonfull.
> >
> > Besides, only high order allocations require iterating over the
> > list; an order 0 allocation will succeed on the first attempt, and
> > a high order allocation failure isn't a serious problem.
> >
> > So the gain from iterating the fragment clusters is trivial, but it
> > can slow down large allocations by a lot when the fragment cluster
> > list is long. It's better to drop this fragment cluster iteration
> > design.
>
> One side note is that we make a trade-off here: we trade a lower
> success rate for >4K swap entry allocations on the fragment list for
> an overall faster swap entry allocation time, because we stop
> searching the fragment list early.
>
> I notice this patch will suffer from a fragment list trap behavior.
> Clusters go from free -> non-full -> fragment -> free again (ignoring
> the full list for now). Only when a cluster is completely free again
> is it reset back to the free list. Given a random swap entry access
> pattern and the long life cycle of some swap entries, free clusters
> are very hard to come by: once a cluster is in the fragment list, it
> is not easy to move back to the non-full list, so clusters will
> eventually gravitate toward the fragment list and get trapped there.
> This kind of problem is not easy to expose with the kernel compile
> workload, which is a batch job in nature with very few long-running
> processes. If most of the clusters in the swapfile end up in the
> fragment list, this will cause us to give up too early and force the
> more expensive swap cache reclaim path more often.
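
To restate the life cycle you describe as a sketch (the list names and
transition rule below are simplified for illustration, not the actual
mm/swapfile.c structures):

    #include <stdbool.h>

    enum cluster_list { CL_FREE, CL_NONFULL, CL_FRAG };

    /*
     * A cluster only returns to the free list once every slot in it
     * is free, so a single long-lived entry can keep it trapped on
     * the fragment list indefinitely.
     */
    static enum cluster_list next_list(enum cluster_list cur,
                                       unsigned int used_slots,
                                       bool high_order_failed)
    {
            if (used_slots == 0)
                    return CL_FREE;     /* the only way out of CL_FRAG */
            if (high_order_failed)
                    return CL_FRAG;     /* demoted after a failed scan */
            if (cur == CL_FREE)
                    return CL_NONFULL;  /* first slot allocated */
            return cur;
    }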
>
> To counter that fragment list trap effect, one idea is to recognize
> that not all clusters in the fragment list are equal. We could split
> the fragment list into a few buckets by how empty each cluster is,
> e.g. >50% vs <50% free. I expect a <50% free cluster to have a very
> low success rate for order >0 allocations. Given an order "o", we can
> write a formula P(o, count) for the success rate if slots are
> randomly and evenly distributed with count slots used. A >50% free
> cluster will likely have a much higher success rate. We should set a
> different search termination threshold for each bucket class; that
> way we give a cluster a chance to move up or down between bucket
> classes. We should try the high-free bucket before the low-free
> bucket.
>
> That can combat the fragmentation list trap behavior.
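
For a rough feel of the P(o, count) you mention, here is a userspace
approximation. It assumes 512 slots per cluster and used slots placed
independently and uniformly at random; both are simplifying
assumptions, not measurements:

    #include <math.h>
    #include <stdio.h>

    #define SLOTS 512	/* assumed slots per cluster */

    /*
     * Approximate P(o, count): the chance that at least one aligned
     * run of 2^o slots is entirely free when "count" slots are used.
     */
    static double P(int order, int count)
    {
            int run = 1 << order;
            int nruns = SLOTS / run;
            double p_used = (double)count / SLOTS;
            double p_run_free = pow(1.0 - p_used, run);

            return 1.0 - pow(1.0 - p_run_free, nruns);
    }

    int main(void)
    {
            /* order-4 = 64kB with 4K pages: 75%-free vs 25%-free */
            printf("P(4, %d) ~= %.3f\n", SLOTS / 4, P(4, SLOTS / 4));
            printf("P(4, %d) ~= %.3f\n", SLOTS * 3 / 4, P(4, SLOTS * 3 / 4));
            return 0;
    }

Under these assumptions a 75%-free cluster gives roughly a one-in-four
chance of an order-4 hit, while a 25%-free cluster is effectively
hopeless, which supports trying the high-free bucket first.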

That's a very good point!

I'm also thinking that after we remove HAS_CACHE, maybe we can
improve the lazy free policy or the scanning design, making use of the
better-defined swap allocation / freeing workflows.

Just a random idea for now; I'll keep your suggestion in mind.


> BTW, we can keep some simple bucket statistics to see the
> distribution of fragmented clusters. The bucket class threshold can
> be adjusted dynamically using the overall fullness of the swapfile.
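
A rough sketch of such a dynamically adjusted bucket boundary (the
function, field names, and scaling policy are made up for
illustration):

    /*
     * Pick a bucket for a fragment cluster. The boundary starts at
     * 50% used and shifts upward as the whole swapfile fills, so
     * fewer clusters land in the "low success" bucket early on.
     * Bucket 0 (emptier clusters) should be tried first.
     */
    static int frag_bucket(unsigned int used, unsigned int slots,
                           unsigned int si_used, unsigned int si_slots)
    {
            unsigned int thresh = slots / 2 +
                                  (slots / 4) * si_used / si_slots;

            return used < thresh ? 0 : 1;
    }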
>
> Chris
>
> >
> > Tested on a 48c96t system, building the Linux kernel with 10G ZRAM,
> > make -j48, defconfig, 768M cgroup memory limit, on top of tmpfs,
> > 4K folios only:
> >
> > Before: sys time: 4432.56s
> > After:  sys time: 4430.18s
> >
> > Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
> >
> > Before: sys time: 11609.69s  64kB/swpout: 1787051  64kB/swpout_fallback: 20917
> > After:  sys time: 5572.85s   64kB/swpout: 1797612  64kB/swpout_fallback: 19254
> >
> > Change to 8G ZRAM:
> >
> > Before: sys time: 21524.35s  64kB/swpout: 1687142  64kB/swpout_fallback: 128496
> > After:  sys time: 6278.45s   64kB/swpout: 1679127  64kB/swpout_fallback: 130942
> >
> > Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
> >
> > Before: sys time: 7393.50s  64kB/swpout: 1788246  64kB/swpout_fallback: 0
> > After:  sys time: 7399.88s  64kB/swpout: 1784257  64kB/swpout_fallback: 0
> >
> > Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
> >
> > Before: sys time: 26292.26s 64kB/swpout: 1645236  64kB/swpout_fallback: 138945
> > After:  sys time: 9463.16s  64kB/swpout: 1581376  64kB/swpout_fallback: 259979
> >
> > The performance is a lot better for large folios, and the large order
> > allocation failure rate is only very slightly higher or unchanged,
> > even for !SWP_SYNCHRONOUS_IO devices under high pressure.
> >
> > Signed-off-by: Kairui Song <kasong@...cent.com>
> > Acked-by: Nhat Pham <nphamcs@...il.com>
> > ---
> >  mm/swapfile.c | 23 ++++++++---------------
> >  1 file changed, 8 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index b4f3cc712580..1f1110e37f68 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -926,32 +926,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                 swap_reclaim_full_clusters(si, false);
> >
> >         if (order < PMD_ORDER) {
> > -               unsigned int frags = 0, frags_existing;
> > -
> >                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> >                                                         order, usage);
> >                         if (found)
> >                                 goto done;
> > -                       /* Clusters failed to allocate are moved to frag_clusters */
> > -                       frags++;
> >                 }
> >
> > -               frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > -               while (frags < frags_existing &&
> > -                      (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
> > -                       atomic_long_dec(&si->frag_cluster_nr[order]);
> > -                       /*
> > -                        * Rotate the frag list to iterate, they were all
> > -                        * failing high order allocation or moved here due to
> > -                        * per-CPU usage, but they could contain newly released
> > -                        * reclaimable (eg. lazy-freed swap cache) slots.
> > -                        */
> > +               /*
> > +                * Scanning only one fragment cluster is good enough. Order 0
> > +                * allocation will surely succeed, and large allocation
> > +                * failure is not critical. Scanning one cluster still
> > +                * keeps the list rotated and reclaimed (for HAS_CACHE).
> > +                */
> > +               ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
> > +               if (ci) {
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> >                                                         order, usage);
> >                         if (found)
> >                                 goto done;
> > -                       frags++;
> >                 }
> >         }
> >
> > --
> > 2.50.1
> >
> >
>
