Message-ID: <20260128101358.20954-1-kprateek.nayak@amd.com>
Date: Wed, 28 Jan 2026 10:13:58 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Thomas Gleixner <tglx@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
<linux-kernel@...r.kernel.org>
CC: Peter Zijlstra <peterz@...radead.org>, Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>, André Almeida
<andrealmeid@...lia.com>, K Prateek Nayak <kprateek.nayak@....com>
Subject: [RFC PATCH] futex: Dynamically allocate futex_queues depending on nr_node_ids
CONFIG_NODES_SHIFT (which determines MAX_NUMNODES) is often configured
generously by distros, while the actual number of possible NUMA nodes
on most systems is much smaller.

Instead of reserving MAX_NUMNODES worth of space for futex_queues,
dynamically allocate it based on "nr_node_ids" at the time of
futex_init().
"nr_node_ids" at the time of futex_init() is cached as "nr_futex_queues"
to compensate for the extra dereference necessary to access the elements
of futex_queues which ends up in a different cacheline now.

Five runs of perf bench futex showed no measurable impact for any
variant on a dual-socket 3rd Generation AMD EPYC system (2 x 64C/128T):

    variant                locking/futex    base + patch    %diff
    futex/hash               1220783.2       1333296.2      (9%)
    futex/wake                 0.71186         0.72584      (2%)
    futex/wake-parallel        0.00624         0.00664      (6%)
    futex/requeue              0.25088         0.26102      (4%)
    futex/lock-pi                 57.6            57.8      (0%)

Note: futex/hash showed noticeable run-to-run variance on the test
machine.
"nr_node_ids" can rarely be larger than num_possible_nodes() but the
additional space allows for simpler handling of node index in presence
of sparse node_possible_map.
Reported-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
Sebastian,

Does this address your concerns about the large "MAX_NUMNODES" values
used by most distros? It does put the "queues" pointer into a separate
cacheline from the rest of __futex_data.
The other option is to dynamically allocate the entire __futex_data as:

    struct {
        unsigned long hashmask;
        unsigned int hashshift;
        unsigned int nr_queues;
        struct futex_hash_bucket *queues[] __counted_by(nr_queues);
    } *__futex_data __ro_after_init;

with a variable-length "queues" array at the end, if we want to ensure
everything lands in the same cacheline; but every __futex_data member
access would then go through a pointer dereference, which might not be
ideal.

Thoughts?
---
kernel/futex/core.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 125804fbb5cb..d8567c2ca72a 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -56,11 +56,13 @@
static struct {
unsigned long hashmask;
unsigned int hashshift;
- struct futex_hash_bucket *queues[MAX_NUMNODES];
+ unsigned int nr_queues;
+ struct futex_hash_bucket **queues;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_hashmask (__futex_data.hashmask)
#define futex_hashshift (__futex_data.hashshift)
+#define nr_futex_queues (__futex_data.nr_queues)
#define futex_queues (__futex_data.queues)
struct futex_private_hash {
@@ -439,10 +441,10 @@ __futex_hash(union futex_key *key, struct futex_private_hash *fph)
* NOTE: this isn't perfectly uniform, but it is fast and
* handles sparse node masks.
*/
- node = (hash >> futex_hashshift) % nr_node_ids;
+ node = (hash >> futex_hashshift) % nr_futex_queues;
if (!node_possible(node)) {
node = find_next_bit_wrap(node_possible_map.bits,
- nr_node_ids, node);
+ nr_futex_queues, node);
}
}
@@ -1987,6 +1989,10 @@ static int __init futex_init(void)
size = sizeof(struct futex_hash_bucket) * hashsize;
order = get_order(size);
+ nr_futex_queues = nr_node_ids;
+ futex_queues = kcalloc(nr_futex_queues, sizeof(*futex_queues), GFP_KERNEL);
+ BUG_ON(!futex_queues);
+
for_each_node(n) {
struct futex_hash_bucket *table;
base-commit: c42ba5a87bdccbca11403b7ca8bad1a57b833732
--
2.34.1