linux-kernel - [PATCH 1/2] mm, swap: don't scan every fragment cluster

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250804172439.2331-2-ryncsn@gmail.com>
Date: Tue,  5 Aug 2025 01:24:38 +0800
From: Kairui Song <ryncsn@...il.com>
To: linux-mm@...ck.org
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Kemeng Shi <shikemeng@...weicloud.com>,
	Chris Li <chrisl@...nel.org>,
	Nhat Pham <nphamcs@...il.com>,
	Baoquan He <bhe@...hat.com>,
	Barry Song <baohua@...nel.org>,
	"Huang, Ying" <ying.huang@...ux.alibaba.com>,
	linux-kernel@...r.kernel.org,
	Kairui Song <kasong@...cent.com>
Subject: [PATCH 1/2] mm, swap: don't scan every fragment cluster

From: Kairui Song <kasong@...cent.com>

Fragment clusters were mostly failing high order allocation already.
The reason we scan it now is that a swap slot may get freed without
releasing the swap cache, so a swap map entry will end up in HAS_CACHE
only status, and the cluster won't be moved back to non-full or free
cluster list.

Usually this only happens for !SWP_SYNCHRONOUS_IO devices when the swap
device usage is low (!vm_swap_full()) since swap will try to lazy free
the swap cache.

It's unlikely to cause any real issue. Fragmentation is only an issue
when the device is getting full, and by  that time, swap will already
be releasing the swap cache aggressively. And swap cache reclaim happens
when the allocator scans a cluster too. Scanning one fragment cluster
should be good enough to reclaim these pinned slots.

And besides, only high order allocation requires iterating over a
cluster list, order 0 allocation will succeed on the first attempt.
And high order allocation failure isn't a serious problem.

So the iteration of fragment clusters is trivial, but it will slow down
mTHP allocation by a lot when the fragment cluster list is long.
So it's better to drop this fragment cluster iteration design. Only
scanning one fragment cluster is good enough in case any cluster is
stuck in the fragment list; this ensures order 0 allocation never
falls, and large allocations still have an acceptable success rate.

Test on a 48c96t system, build linux kernel using 10G ZRAM, make -j48,
defconfig with 768M cgroup memory limit, on top of tmpfs, 4K folio
only:

Before: sys time: 4407.28s
After:  sys time: 4425.22s

Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:

Before: sys time: 10230.22s  64kB/swpout: 1793044  64kB/swpout_fallback: 17653
After:  sys time: 5527.90s   64kB/swpout: 1789358  64kB/swpout_fallback: 17813

Change to 8G ZRAM:

Before: sys time: 21929.17s  64kB/swpout: 1634681  64kB/swpout_fallback: 173056
After:  sys time: 6121.01s   64kB/swpout: 1638155  64kB/swpout_fallback: 189562

Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:

Before: sys time: 7368.41s  64kB/swpout:1787599  swpout_fallback: 0
After:  sys time: 7338.27s  64kB/swpout:1783106  swpout_fallback: 0

Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:

Before: sys time: 28139.60s 64kB/swpout:1645421  swpout_fallback: 148408
After:  sys time: 8941.90s  64kB/swpout:1592973  swpout_fallback: 265010

The performance is a lot better and large order allocation failure rate
is only very slightly higher or unchanged.

Signed-off-by: Kairui Song <kasong@...cent.com>
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 30 ++++++++----------------------
 2 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..a060d102e0d1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,7 +310,6 @@ struct swap_info_struct {
 					/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
-	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b4f3cc712580..5fdb3cb2b8b7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -470,11 +470,6 @@ static void move_cluster(struct swap_info_struct *si,
 	else
 		list_move_tail(&ci->list, list);
 	spin_unlock(&si->lock);
-
-	if (ci->flags == CLUSTER_FLAG_FRAG)
-		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
-	else if (new_flags == CLUSTER_FLAG_FRAG)
-		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
 	ci->flags = new_flags;
 }
 
@@ -926,32 +921,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		swap_reclaim_full_clusters(si, false);
 
 	if (order < PMD_ORDER) {
-		unsigned int frags = 0, frags_existing;
-
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			/* Clusters failed to allocate are moved to frag_clusters */
-			frags++;
 		}
 
-		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
-		while (frags < frags_existing &&
-		       (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
-			atomic_long_dec(&si->frag_cluster_nr[order]);
-			/*
-			 * Rotate the frag list to iterate, they were all
-			 * failing high order allocation or moved here due to
-			 * per-CPU usage, but they could contain newly released
-			 * reclaimable (eg. lazy-freed swap cache) slots.
-			 */
+		/*
+		 * Scan only one fragment cluster is good enough. Order 0
+		 * allocation will surely success, and large allocation
+		 * failure is not critical. Scanning one cluster still
+		 * keeps the list rotated and reclaimed (for HAS_CACHE).
+		 */
+		ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
+		if (ci) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			frags++;
 		}
 	}
 
@@ -972,7 +960,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		 * allocation, but reclaim may drop si->lock and race with another user.
 		 */
 		while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
-			atomic_long_dec(&si->frag_cluster_nr[o]);
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							0, usage);
 			if (found)
@@ -3224,7 +3211,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		atomic_long_set(&si->frag_cluster_nr[i], 0);
 	}
 
 	/*
-- 
2.50.1