Message-ID: <20251205233217.3344186-5-joshua.hahnjy@gmail.com>
Date: Fri,  5 Dec 2025 15:32:15 -0800
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: 
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Alistair Popple <apopple@...dia.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Axel Rasmussen <axelrasmussen@...gle.com>,
	Brendan Jackman <jackmanb@...gle.com>,
	Byungchul Park <byungchul@...com>,
	Christophe Leroy <christophe.leroy@...roup.eu>,
	David Hildenbrand <david@...nel.org>,
	Gregory Price <gourry@...rry.net>,
	Johannes Weiner <hannes@...xchg.org>,
	Jonathan Corbet <corbet@....net>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Madhavan Srinivasan <maddy@...ux.ibm.com>,
	Matthew Brost <matthew.brost@...el.com>,
	Michael Ellerman <mpe@...erman.id.au>,
	Michal Hocko <mhocko@...e.com>,
	Mike Rapoport <rppt@...nel.org>,
	Nicholas Piggin <npiggin@...il.com>,
	Qi Zheng <zhengqi.arch@...edance.com>,
	Rakie Kim <rakie.kim@...com>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Wei Xu <weixugc@...gle.com>,
	Ying Huang <ying.huang@...ux.alibaba.com>,
	Yuanchu Xie <yuanchu@...gle.com>,
	Zi Yan <ziy@...dia.com>,
	linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	linuxppc-dev@...ts.ozlabs.org
Subject: [RFC LPC2025 PATCH 4/4] mm/vmscan: Deprecate zone_reclaim_mode

zone_reclaim_mode was introduced in 2005 to work around the NUMA
penalties associated with allocating memory on remote nodes. It changed
the page allocator's behavior to prefer stalling and performing direct
reclaim locally over allocating on a remote node.

In 2014, zone_reclaim_mode was disabled by default, as it was deemed
unsuitable for most workloads [1]. Since then, and especially since
2005, a lot has changed. NUMA penalties are lower than they used to
be, and we now have much more extensive infrastructure to control
NUMA spillage (NUMA balancing, memory.reclaim, tiering / promotion /
demotion). Together, these changes make remote memory access a much more
appealing alternative to stalling the system when there might be free
memory on other nodes.

This is not to say that there are no workloads that perform better with
NUMA locality. However, zone_reclaim_mode is a system-wide setting that
makes this bet for all running workloads on the machine. Today, we have
many more alternatives that can provide more fine-grained control over
allocation strategy, such as mbind or set_mempolicy.
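
As a rough illustration of the per-range control mentioned above, the
sketch below binds one anonymous mapping to a single node with mbind(2).
It is only a sketch under stated assumptions: the sketch_* names are
hypothetical, MPOL_BIND is redefined locally (value 2, as in
include/uapi/linux/mempolicy.h) so that libnuma's <numaif.h> is not
needed, and the "+ 1" on maxnode mirrors what libnuma passes rather than
anything this patch requires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: same value as MPOL_BIND in include/uapi/linux/mempolicy.h. */
#define SKETCH_MPOL_BIND 2

/* Hypothetical helper: one-word nodemask with only bit `node` set. */
unsigned long sketch_nodemask(int node)
{
	assert(node >= 0 && node < (int)(8 * sizeof(unsigned long)));
	return 1UL << node;
}

/*
 * Hypothetical helper: map `len` bytes of anonymous memory and ask the
 * kernel to place its pages on `node` only. Returns the mapping, or NULL
 * with errno set (for example when the node is not online or the caller
 * is not allowed to use mbind).
 */
void *sketch_alloc_on_node(size_t len, int node)
{
	unsigned long mask = sketch_nodemask(node);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* Raw syscall; maxnode is one more than the mask width, as libnuma does. */
	if (syscall(SYS_mbind, p, len, SKETCH_MPOL_BIND, &mask,
		    8 * sizeof(mask) + 1, 0) != 0) {
		int saved = errno;

		munmap(p, len);
		errno = saved;
		return NULL;
	}
	return p;
}
```

A caller would use sketch_alloc_on_node(len, 0) for the one buffer that
is locality-sensitive and leave every other allocation on the default
policy, which is the kind of per-workload choice a system-wide sysctl
cannot express.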

Deprecate zone_reclaim_mode in favor of modern alternatives, such as
NUMA balancing, membinding, and promotion/demotion mechanisms. This
improves code readability and maintainability, especially in the page
allocation code.

[1] Commit 4f9b16a64753 ("mm: disable zone_reclaim_mode by default")

Signed-off-by: Joshua Hahn <joshua.hahnjy@...il.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 41 -------------------------
 arch/powerpc/include/asm/topology.h     |  4 ---
 include/linux/topology.h                |  6 ----
 include/uapi/linux/mempolicy.h          | 14 ---------
 mm/internal.h                           | 11 -------
 mm/page_alloc.c                         |  4 +--
 mm/vmscan.c                             | 18 -----------
 7 files changed, 2 insertions(+), 96 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index ea2fd3feb9c6..635b16c1867e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -76,7 +76,6 @@ Currently, these files are in /proc/sys/vm:
 - vfs_cache_pressure_denom
 - watermark_boost_factor
 - watermark_scale_factor
-- zone_reclaim_mode
 
 
 admin_reserve_kbytes
@@ -1046,43 +1045,3 @@ going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
 that the number of free pages kswapd maintains for latency reasons is
 too small for the allocation bursts occurring in the system. This knob
 can then be used to tune kswapd aggressiveness accordingly.
-
-
-zone_reclaim_mode
-=================
-
-Zone_reclaim_mode allows someone to set more or less aggressive approaches to
-reclaim memory when a zone runs out of memory. If it is set to zero then no
-zone reclaim occurs. Allocations will be satisfied from other zones / nodes
-in the system.
-
-This is value OR'ed together of
-
-=	===================================
-1	Zone reclaim on
-2	Zone reclaim writes dirty pages out
-4	Zone reclaim swaps pages
-=	===================================
-
-zone_reclaim_mode is disabled by default.  For file servers or workloads
-that benefit from having their data cached, zone_reclaim_mode should be
-left disabled as the caching effect is likely to be more important than
-data locality.
-
-Consider enabling one or more zone_reclaim mode bits if it's known that the
-workload is partitioned such that each partition fits within a NUMA node
-and that accessing remote memory would cause a measurable performance
-reduction.  The page allocator will take additional actions before
-allocating off node pages.
-
-Allowing zone reclaim to write out pages stops processes that are
-writing large amounts of data from dirtying pages on other nodes. Zone
-reclaim will write out dirty pages if a zone fills up and so effectively
-throttle the process. This may decrease the performance of a single process
-since it cannot use all of system memory to buffer the outgoing writes
-anymore but it preserve the memory on other nodes so that the performance
-of other processes running on other nodes will not be affected.
-
-Allowing regular swap effectively restricts allocations to the local
-node unless explicitly overridden by memory policies or cpuset
-configurations.
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f19ca44512d1..49015b2b0d8d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -10,10 +10,6 @@ struct drmem_lmb;
 
 #ifdef CONFIG_NUMA
 
-/*
- * If zone_reclaim_mode is enabled, a RECLAIM_DISTANCE of 10 will mean that
- * all zones on all nodes will be eligible for zone_reclaim().
- */
 #define RECLAIM_DISTANCE 10
 
 #include <asm/mmzone.h>
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 6575af39fd10..37018264ca1e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -50,12 +50,6 @@ int arch_update_cpu_topology(void);
 #define node_distance(from,to)	((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
 #endif
 #ifndef RECLAIM_DISTANCE
-/*
- * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
- * (in whatever arch specific measurement units returned by node_distance())
- * and node_reclaim_mode is enabled then the VM will only call node_reclaim()
- * on nodes within this distance.
- */
 #define RECLAIM_DISTANCE 30
 #endif
 
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8fbbe613611a..194f922dad9b 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -65,18 +65,4 @@ enum {
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
 
-/*
- * Enabling zone reclaim means the page allocator will attempt to fulfill
- * the allocation request on the current node by triggering reclaim and
- * trying to shrink the current node.
- * Fallback allocations on the next candidates in the zonelist are considered
- * when reclaim fails to free up enough memory in the current node/zone.
- *
- * These bit locations are exposed in the vm.zone_reclaim_mode sysctl.
- * New bits are OK, but existing bits should not be changed.
- */
-#define RECLAIM_ZONE	(1<<0)	/* Enable zone reclaim */
-#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
-
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/internal.h b/mm/internal.h
index 743fcebe53a8..a2df0bf3f458 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1197,24 +1197,13 @@ static inline void mminit_verify_zonelist(void)
 #endif /* CONFIG_DEBUG_MEMORY_INIT */
 
 #ifdef CONFIG_NUMA
-extern int node_reclaim_mode;
-
 extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 #else
-#define node_reclaim_mode 0
-
 static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	return NUMA_NO_NODE;
 }
 #endif
-
-static inline bool node_reclaim_enabled(void)
-{
-	/* Is any node_reclaim_mode bit set? */
-	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
-}
-
 /*
  * mm/memory-failure.c
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9524713c81b7..bf4faec4ebe6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3823,8 +3823,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		 * If kswapd is already active on a node, keep looking
 		 * for other nodes that might be idle. This can happen
 		 * if another process has NUMA bindings and is causing
-		 * kswapd wakeups on only some nodes. Avoid accidental
-		 * "node_reclaim_mode"-like behavior in this case.
+		 * kswapd wakeups on only some nodes. Avoid accidentally
+		 * overpressuring the local node when remote nodes are free.
 		 */
 		if (skip_kswapd_nodes &&
 		    !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e23289efba4..f480a395df65 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7503,16 +7503,6 @@ static const struct ctl_table vmscan_sysctl_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_TWO_HUNDRED,
 	},
-#ifdef CONFIG_NUMA
-	{
-		.procname	= "zone_reclaim_mode",
-		.data		= &node_reclaim_mode,
-		.maxlen		= sizeof(node_reclaim_mode),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= SYSCTL_ZERO,
-	}
-#endif
 };
 
 static int __init kswapd_init(void)
@@ -7529,14 +7519,6 @@ static int __init kswapd_init(void)
 module_init(kswapd_init)
 
 #ifdef CONFIG_NUMA
-/*
- * Node reclaim mode
- *
- * If non-zero call node_reclaim when the number of free pages falls below
- * the watermarks.
- */
-int node_reclaim_mode __read_mostly;
-
 /*
  * Try to free up some pages from this node through reclaim.
  */
-- 
2.47.3
