Message-ID: <20260128101358.20954-1-kprateek.nayak@amd.com>
Date: Wed, 28 Jan 2026 10:13:58 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
<linux-kernel@...r.kernel.org>
CC: Peter Zijlstra <peterz@...radead.org>, Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>, André Almeida
<andrealmeid@...lia.com>, K Prateek Nayak <kprateek.nayak@....com>
Subject: [RFC PATCH] futex: Dynamically allocate futex_queues depending on nr_node_ids
CONFIG_NODES_SHIFT (which determines MAX_NUMNODES) is often configured
generously by distros, while the actual number of possible NUMA nodes
on most systems is much smaller.

Instead of reserving MAX_NUMNODES worth of space for futex_queues,
dynamically allocate it based on "nr_node_ids" at the time of
futex_init().
"nr_node_ids" at the time of futex_init() is cached as "nr_futex_queues"
to compensate for the extra dereference necessary to access the elements
of futex_queues which ends up in a different cacheline now.

Five runs of perf bench futex showed no measurable impact for any
variant on a dual-socket 3rd Generation AMD EPYC system (2 x 64C/128T):

    variant                locking/futex    base + patch    %diff
    futex/hash               1220783.2       1333296.2      (9%)
    futex/wake                 0.71186         0.72584      (2%)
    futex/wake-parallel        0.00624         0.00664      (6%)
    futex/requeue              0.25088         0.26102      (4%)
    futex/lock-pi                 57.6            57.8      (0%)

Note: futex/hash showed noticeable run-to-run variance on the test
machine.
"nr_node_ids" can rarely be larger than num_possible_nodes() but the
additional space allows for simpler handling of node index in presence
of sparse node_possible_map.
Reported-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
Sebastian,

Does this address your concerns about the large "MAX_NUMNODES" values
used by most distros? It does put the "queues" pointer into a separate
cacheline from the rest of __futex_data.
The other option is to dynamically allocate the entire __futex_data as:

    struct {
        unsigned long hashmask;
        unsigned int hashshift;
        unsigned int nr_queues;
        struct futex_hash_bucket *queues[] __counted_by(nr_queues);
    } *__futex_data __ro_after_init;

with a variable-length "queues" array at the end, if we want to ensure
everything lands in the same cacheline; but every __futex_data member
access would then go through a pointer dereference, which might not be
ideal.

Thoughts?
---
kernel/futex/core.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 125804fbb5cb..d8567c2ca72a 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -56,11 +56,13 @@
static struct {
unsigned long hashmask;
unsigned int hashshift;
- struct futex_hash_bucket *queues[MAX_NUMNODES];
+ unsigned int nr_queues;
+ struct futex_hash_bucket **queues;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_hashmask (__futex_data.hashmask)
#define futex_hashshift (__futex_data.hashshift)
+#define nr_futex_queues (__futex_data.nr_queues)
#define futex_queues (__futex_data.queues)
struct futex_private_hash {
@@ -439,10 +441,10 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
* NOTE: this isn't perfectly uniform, but it is fast and
* handles sparse node masks.
*/
- node = (hash >> futex_hashshift) % nr_node_ids;
+ node = (hash >> futex_hashshift) % nr_futex_queues;
if (!node_possible(node)) {
node = find_next_bit_wrap(node_possible_map.bits,
- nr_node_ids, node);
+ nr_futex_queues, node);
}
}
@@ -1987,6 +1989,10 @@ static int __init futex_init(void)
size = sizeof(struct futex_hash_bucket) * hashsize;
order = get_order(size);
+ nr_futex_queues = nr_node_ids;
+ futex_queues = kcalloc(nr_futex_queues, sizeof(*futex_queues), GFP_KERNEL);
+ BUG_ON(!futex_queues);
+
for_each_node(n) {
struct futex_hash_bucket *table;
base-commit: c42ba5a87bdccbca11403b7ca8bad1a57b833732
--
2.34.1