Message-ID: <587d45c3-2098-4914-9dfc-275b5d0b9bb7@linux.ibm.com>
Date: Wed, 26 Mar 2025 00:34:23 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: André Almeida <andrealmeid@...lia.com>,
Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Valentin Schneider <vschneid@...hat.com>,
Waiman Long <longman@...hat.com>, linux-kernel@...r.kernel.org,
"Nysal Jan K.A." <nysal@...ux.ibm.com>
Subject: Re: [PATCH v10 00/21] futex: Add support task local hash maps,
FUTEX2_NUMA and FUTEX2_MPOL
Hi Sebastian.
On 3/18/25 18:54, Shrikanth Hegde wrote:
>
>> The complete tree is at
>> https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=futex_local_v10
>> https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git futex_local_v10
>>
>
> Hi Sebastian. Thanks for working on this (along with bringing back
> FUTEX2_NUMA), which might help large systems with many futexes.
>
> I tried this on one of our systems (single NUMA node, 80 CPUs) and I see
> a significant reduction in futex/hash throughput.
> Maybe I am missing some config or doing something wrong w.r.t.
> benchmarking; I am still trying to understand this code.
>
> I ran "perf bench futex all" as-is; no changes were made to perf.
> =========================================
> Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
> # Running futex/hash benchmark...
> Run summary [PID 45758]: 80 threads, each operating on 1024 [private]
> futexes for 10 secs.
> Averaged 1556023 operations/sec (+- 0.08%), total secs = 10 <<--- 1.5M
>
> =========================================
> With the series: I had to change PR_FUTEX_HASH to 78, since 77 is
> already used for timers.
>
> # Running futex/hash benchmark...
> Run summary [PID 8644]: 80 threads, each operating on 1024 [private]
> futexes for 10 secs.
> Averaged 150382 operations/sec (+- 0.42%), total secs = 10 <<-- 0.15M,
> close to 10x down.
>
> =========================================
>
> I did a git bisect based on the futex/hash numbers, which narrowed it
> down to this one:
> first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex:
> Allow automatic allocation of process wide futex hash.
>
> Is this expected, given the complexity of the hash function change?
So, I did some more benchmarking using the same perf futex/hash benchmark.
perf creates N threads, binds each thread to a CPU, and then calls
futex_wait() in a way that never blocks: the call always returns
EWOULDBLOCK, so only the futex hashing path is exercised.
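The worker loop has roughly the shape below (paraphrased from
tools/perf/bench/futex-hash.c, not a verbatim copy; struct worker,
nfutexes and done are simplified stand-ins for the benchmark's internals):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct worker {
	unsigned int *futex;	/* per-thread array of futex words, all 0 */
	unsigned long ops;
};

static unsigned int nfutexes = 1024;
static volatile int done;

static void *workerfn(void *arg)
{
	struct worker *w = arg;

	do {
		for (unsigned int i = 0; i < nfutexes; i++) {
			/*
			 * The futex word is 0 but the expected value is
			 * 1234, so FUTEX_WAIT never blocks; it fails with
			 * EWOULDBLOCK and only the hashing path runs.
			 */
			syscall(SYS_futex, &w->futex[i], FUTEX_WAIT_PRIVATE,
				1234, NULL, NULL, 0);
			w->ops++;
		}
	} while (!done);

	return NULL;
}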
Numbers with different thread counts (private futexes):

threads     baseline    with series     ratio
      1      3386265        3266560      0.96
     10      1972069         821565      0.41
     40      1580497         277900      0.17
     80      1555482         150450      0.096
With shared futexes (-s option):

threads     baseline    with series     ratio
     80       590144         585067      0.99

(Presumably because shared futexes still go through the global hash, that
path is unchanged by the series.)
After looking into the code, and after some hacking, I could get the
performance back with the change below. It is likely not functionally
correct; the reasoning behind it is:

1. perf report showed significant time in futex_private_hash_put(), so I
   removed the rcuref usage for the users count. That alone brought some
   improvement, from 150k to 300k ops/sec. Is there a better way to
   protect the users count?

2. Since the default number of buckets is smaller now, there are more
   hash bucket collisions, which showed up as time in
   queued_spin_lock_slowpath(). Increasing the hash bucket count back to
   what it was before the series brought the numbers back to 1.5M. I
   guess this could also be achieved by calling the new prctl() from
   perf/bench/futex-hash.c (see the sketch after the note below).

Note: just increasing the hash bucket count without point 1 didn't matter
much.
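Re point 2, pre-sizing the hash from the benchmark could look roughly
like this (a sketch only, not a tested perf change; constants are from
this series as I tested it, with PR_FUTEX_HASH moved to 78, and arg4/arg5
assumed to be 0):

#include <sys/prctl.h>
#include <err.h>

#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH			78	/* 77 in v10, moved locally */
#define PR_FUTEX_HASH_SET_SLOTS		1
#endif

/*
 * Ask the kernel for a fixed number of private hash slots (a power of
 * two) before the worker threads are created.
 */
static void futex_hash_presize(unsigned long slots)
{
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, 0, 0))
		err(1, "prctl(PR_FUTEX_HASH)");
}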
-------
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 363a7692909d..7d01bf8caa13 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -65,7 +65,7 @@ static struct {
 #define futex_queues	(__futex_data.queues)
 
 struct futex_private_hash {
-	rcuref_t	users;
+	int		users;
 	unsigned int	hash_mask;
 	struct rcu_head	rcu;
 	void		*mm;
@@ -200,7 +200,7 @@ static bool __futex_pivot_hash(struct mm_struct *mm,
 	fph = rcu_dereference_protected(mm->futex_phash,
 					lockdep_is_held(&mm->futex_hash_lock));
 	if (fph) {
-		if (!rcuref_is_dead(&fph->users)) {
+		if (!(fph->users)) {
 			mm->futex_phash_new = new;
 			return false;
 		}
@@ -247,7 +247,7 @@ struct futex_private_hash *futex_private_hash(void)
 		if (!fph)
 			return NULL;
 
-		if (rcuref_get(&fph->users))
+		if ((fph->users))
 			return fph;
 	}
 	futex_pivot_hash(mm);
@@ -256,7 +256,7 @@ struct futex_private_hash *futex_private_hash(void)
 
 bool futex_private_hash_get(struct futex_private_hash *fph)
 {
-	return rcuref_get(&fph->users);
+	return !!(fph->users);
 }
 
 void futex_private_hash_put(struct futex_private_hash *fph)
@@ -265,7 +265,7 @@ void futex_private_hash_put(struct futex_private_hash *fph)
 	 * Ignore the result; the DEAD state is picked up
 	 * when rcuref_get() starts failing via rcuref_is_dead().
 	 */
-	if (rcuref_put(&fph->users))
+	if ((fph->users))
 		wake_up_var(fph->mm);
 }
@@ -1509,7 +1509,7 @@ void futex_hash_free(struct mm_struct *mm)
 	kvfree(mm->futex_phash_new);
 	fph = rcu_dereference_raw(mm->futex_phash);
 	if (fph) {
-		WARN_ON_ONCE(rcuref_read(&fph->users) > 1);
+		WARN_ON_ONCE((fph->users) > 1);
 		kvfree(fph);
 	}
 }
@@ -1524,7 +1524,7 @@ static bool futex_pivot_pending(struct mm_struct *mm)
 		return false;
 
 	fph = rcu_dereference(mm->futex_phash);
-	return !rcuref_read(&fph->users);
+	return !!(fph->users);
 }
 
 static bool futex_hash_less(struct futex_private_hash *a,
@@ -1576,7 +1576,7 @@ static int futex_hash_allocate(unsigned int hash_slots, bool custom)
 	if (!fph)
 		return -ENOMEM;
 
-	rcuref_init(&fph->users, 1);
+	fph->users = 1;
 	fph->hash_mask = hash_slots ? hash_slots - 1 : 0;
 	fph->custom = custom;
 	fph->mm = mm;
@@ -1671,6 +1671,8 @@ int futex_hash_allocate_default(void)
 	if (current_buckets >= buckets)
 		return 0;
 
+	buckets = 32768;
+
 	return futex_hash_allocate(buckets, false);
 }
@@ -1732,6 +1734,8 @@ static int __init futex_init(void)
 	hashsize = max(4, hashsize);
 	hashsize = roundup_pow_of_two(hashsize);
 #endif
+	hashsize = 32768;
+
 	futex_hashshift = ilog2(hashsize);
 	size = sizeof(struct futex_hash_bucket) * hashsize;
 	order = get_order(size);
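
To put point 1 in isolation: a standalone user-space loop (an
illustration only, not kernel code) where every iteration takes and drops
a reference on one shared counter shows the same kind of collapse,
because the cache line holding the counter bounces between all CPUs --
the same pattern as the per-op rcuref_get()/rcuref_put() on fph->users.

/* refbounce.c -- gcc -O2 -pthread refbounce.c -o refbounce
 * Usage: ./refbounce <nthreads>; prints ops completed in ~1 second. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static atomic_long users = 1;		/* one shared line, like fph->users */
static atomic_long total_ops;
static volatile int done;

static void *worker(void *arg)
{
	long ops = 0;

	while (!done) {
		atomic_fetch_add(&users, 1);	/* "get" on entry */
		atomic_fetch_sub(&users, 1);	/* "put" on exit */
		ops++;
	}
	atomic_fetch_add(&total_ops, ops);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 4;
	pthread_t tids[nthreads];

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	sleep(1);
	done = 1;
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	printf("%d threads: %ld ops\n", nthreads, atomic_load(&total_ops));
	return 0;
}

Per-thread throughput drops sharply as the thread count grows, which
matches the shape of the private-futex table above.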