linux-kernel - Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters

Message-ID: <CAF8kJuPCZ8BEEAWT4QYVAMW4BD_1=77NuJySnWo3TopegOH5Gg@mail.gmail.com>
Date: Tue, 5 Aug 2025 16:35:30 -0700
From: Chris Li <chrisl@...nel.org>
To: Kairui Song <kasong@...cent.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Kemeng Shi <shikemeng@...weicloud.com>, Nhat Pham <nphamcs@...il.com>, 
	Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>, 
	"Huang, Ying" <ying.huang@...ux.alibaba.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters

Acked-by: Chris Li <chrisl@...nel.org>

On Mon, Aug 4, 2025 at 10:25 AM Kairui Song <ryncsn@...il.com> wrote:
>
> From: Kairui Song <kasong@...cent.com>
>
> We prefer a free cluster over a nonfull cluster whenever a CPU local
> cluster is drained to respect the SSD discard behavior [1]. It's not
> a best practice for non-discarding devices, and it causes a
> higher fragmentation rate.

It does more than raise the fragmentation rate. Even a workload with a
limited working set can, over a long period of continuous swapping, end
up writing to the whole swap partition. That is bad from the SSD point
of view if the swap page access pattern is random, because with random
access very few clusters ever have all 512 entries free, which is what
it takes to reach the discard. The previous approach of preferring a new
cluster works best with batched jobs that run for short to medium
cycles: at the end of a batch there is a point where most of the swap
working set is released, which turns nonfull clusters back into free
clusters. For long-running jobs with random access to swap entries,
there is very little chance of freeing a whole cluster so that it can be
discarded.
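
To put a rough number on why a fully free cluster is rare under random
access, here is a minimal sketch. It assumes each of the 512 entries in
a cluster is in use independently with the same probability. That is a
simplification for illustration only, not something the kernel or the
patch states, and the helper name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Entries per cluster, taken from the "all 512 free" figure above. */
#define ENTRIES_PER_CLUSTER 512

/*
 * Hypothetical helper: probability that every entry of one cluster is
 * free, ASSUMING each entry is in use independently with probability
 * `in_use` (0.0 to 1.0). Real swap usage is not independent per entry.
 */
static double prob_cluster_fully_free(double in_use)
{
	double p = 1.0;
	size_t i;

	assert(in_use >= 0.0 && in_use <= 1.0);

	/* Multiply the per-entry "free" probability 512 times. */
	for (i = 0; i < ENTRIES_PER_CLUSTER; i++)
		p *= 1.0 - in_use;
	return p;
}
```

Under that assumption, even with only 1% of entries in use, well under
1% of clusters would be completely free.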

With this patch, a limited working set writes only to a limited swap
area, which is a good thing from the SSD wear point of view.

Chris

> So for a non-discarding device, prefer nonfull over free clusters. This
> reduces the fragmentation issue by a lot.
>
> Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
>
> Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
> After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071
>
> Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
>
> Before: sys time: 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
> After:  sys time: 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0
>
> Performance is basically unchanged, and the large allocation failure rate
> is lower. Enabling all mTHP sizes showed a more significant result:
>
> Using the same test setup with 10G ZRAM and enabling all mTHP sizes:
>
> 128kB swap failure rate:
> Before: swpout:449548 swpout_fallback:55894
> After:  swpout:497519 swpout_fallback:3204
>
> 256kB swap failure rate:
> Before: swpout:63938  swpout_fallback:2154
> After:  swpout:65698  swpout_fallback:324
>
> 512kB swap failure rate:
> Before: swpout:11971  swpout_fallback:2218
> After:  swpout:14606  swpout_fallback:4
>
> 2M swap failure rate:
> Before: swpout:12     swpout_fallback:1578
> After:  swpout:1253   swpout_fallback:15
>
> The success rate of large allocations is much higher.
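
For readers who want to turn the counters above into rates, a minimal
sketch follows. It assumes the failure rate is swpout_fallback divided
by (swpout + swpout_fallback); the quoted message only lists the raw
counters, so this definition and the helper name are assumptions.

```c
/*
 * Hypothetical helper: fraction of large swap-out attempts that fell
 * back, ASSUMING rate = fallback / (swpout + fallback).
 * Returns 0.0 when there were no attempts at all.
 */
static double swpout_fallback_rate(unsigned long swpout,
				   unsigned long fallback)
{
	unsigned long attempts = swpout + fallback;

	if (attempts == 0)
		return 0.0;
	return (double)fallback / (double)attempts;
}
```

With the 2M numbers quoted above, swpout_fallback_rate(12, 1578) is
about 0.99 before the patch and swpout_fallback_rate(1253, 15) is about
0.01 after it.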
>
> Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
> Signed-off-by: Kairui Song <kasong@...cent.com>
> ---
>  mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
>  1 file changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5fdb3cb2b8b7..4a0cf4fb348d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>         }
>
>  new_cluster:
> -       ci = isolate_lock_cluster(si, &si->free_clusters);
> -       if (ci) {
> -               found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                               order, usage);
> -               if (found)
> -                       goto done;
> +       /*
> +        * If the device need discard, prefer new cluster over nonfull
> +        * to spread out the writes.
> +        */
> +       if (si->flags & SWP_PAGE_DISCARD) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
>         }
>
> -       /* Try reclaim from full clusters if free clusters list is drained */
> -       if (vm_swap_full())
> -               swap_reclaim_full_clusters(si, false);
> -
>         if (order < PMD_ORDER) {
>                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> @@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         if (found)
>                                 goto done;
>                 }
> +       }
>
> +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
> +       }
> +
> +       /* Try reclaim full clusters if free and nonfull lists are drained */
> +       if (vm_swap_full())
> +               swap_reclaim_full_clusters(si, false);
> +
> +       if (order < PMD_ORDER) {
>                 /*
>                  * Scan only one fragment cluster is good enough. Order 0
>                  * allocation will surely success, and large allocation
> --
> 2.50.1
>
>
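
The reordering in the diff above can be summarized with a small
user-space sketch. It is not kernel code: the enum and function names
are hypothetical, `needs_discard` stands in for the
(si->flags & SWP_PAGE_DISCARD) test, and the later steps (reclaiming
full clusters, scanning fragment clusters) are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the two cluster lists touched by the patch. */
enum cluster_source {
	SRC_FREE,	/* si->free_clusters */
	SRC_NONFULL,	/* si->nonfull_clusters[order], only if order < PMD_ORDER */
};

/*
 * Fill `out` with the order in which the two lists are tried once the
 * CPU local cluster is drained, as the patch arranges it:
 *   discard device:     free clusters first, then nonfull
 *   non-discard device: nonfull clusters first, then free
 * Returns the number of entries written (always 2).
 */
static size_t cluster_try_order(bool needs_discard,
				enum cluster_source out[2])
{
	if (needs_discard) {
		out[0] = SRC_FREE;
		out[1] = SRC_NONFULL;
	} else {
		out[0] = SRC_NONFULL;
		out[1] = SRC_FREE;
	}
	return 2;
}
```

Before the patch, every device used the order that the sketch returns
for needs_discard == true.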