Message-ID: <Z9A-TWTX9Yl6zlst@localhost.localdomain>
Date: Tue, 11 Mar 2025 14:44:45 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: LKML <linux-kernel@...r.kernel.org>,
	Anna-Maria Behnsen <anna-maria@...utronix.de>,
	Benjamin Segall <bsegall@...gle.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Andrey Vagin <avagin@...nvz.org>,
	Pavel Tikhomirov <ptikhomirov@...tuozzo.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Cyrill Gorcunov <gorcunov@...il.com>
Subject: Re: [patch V3 12/18] posix-timers: Improve hash table performance

On Sat, Mar 08, 2025 at 05:48:38PM +0100, Thomas Gleixner wrote:
> Eric and Ben reported a significant performance bottleneck on the global
> hash, which is used to store posix timers for lookup.
> 
> Eric tried to do a lockless validation of a new timer ID before trying to
> insert the timer, but that does not solve the problem.
> 
> For the non-contended case this is a pointless exercise and for the
> contended case this extra lookup just creates enough interleaving that all
> tasks can make progress.
> 
> There are actually two real solutions to the problem:
> 
>   1) Provide a per process (signal struct) xarray storage
> 
>   2) Implement a smarter hash like the one in the futex code
> 
> #1 works perfectly fine for most cases, but the fact that CRIU enforces a
>    linearly increasing timer ID when restoring timers makes this problematic.
> 
>    It's easy enough to create a sparse timer ID space, which very quickly
>    amounts to a large chunk of memory consumed for the xarray. 2048 timers
>    with an ID offset of 512 consume more than one megabyte of memory for the
>    xarray storage.
> 
> #2 The main advantage of the futex hash is that it uses per hash bucket
>    locks instead of a global hash lock. Aside from that, it is scaled
>    according to the number of CPUs at boot time (a minimal sketch follows
>    below).
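
For illustration, a minimal sketch of a boot-time scaled hash with per bucket
locks in the spirit of the futex hash. This is not the implementation from the
patch; the structure names, the sizing factor and the hash function below are
assumptions:

/* Hypothetical sketch: hash sized once at boot, one spinlock per bucket. */
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/hash.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/log2.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct timer_hash_bucket {
	spinlock_t		lock;	/* protects only this bucket's chain */
	struct hlist_head	head;
};

static struct timer_hash_bucket *timer_buckets;
static unsigned long timer_hashmask;

/* Scale the table with the number of possible CPUs, like the futex hash. */
static int __init posixtimer_hash_init(void)
{
	unsigned long i, size = roundup_pow_of_two(4UL * num_possible_cpus());

	timer_buckets = kcalloc(size, sizeof(*timer_buckets), GFP_KERNEL);
	if (!timer_buckets)
		return -ENOMEM;

	timer_hashmask = size - 1;
	for (i = 0; i < size; i++) {
		spin_lock_init(&timer_buckets[i].lock);
		INIT_HLIST_HEAD(&timer_buckets[i].head);
	}
	return 0;
}
core_initcall(posixtimer_hash_init);

/* Hash on (signal_struct, timer id) so lookups only contend per bucket. */
static struct timer_hash_bucket *hash_bucket(const void *sig, unsigned int id)
{
	return &timer_buckets[hash_32(hash_ptr(sig, 32) ^ id, 32) & timer_hashmask];
}

Insertion and lookup then take only the bucket lock, so unrelated processes no
longer serialize on a single global hash lock.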
> 
> Experiments with artificial benchmarks have shown that a scaled hash with
> per bucket locks comes pretty close to the xarray performance and in some
> scenarios it performs better.
> 
> Test 1:
> 
>      A single process creates 20000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them:
> 
>             mainline        Eric   newhash   xarray
> create         23 ms       23 ms      9 ms     8 ms
> getoverrun     14 ms       14 ms      5 ms     4 ms
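
A rough userspace reproduction of this single threaded test could look like the
sketch below; the timer count, the CLOCK_MONOTONIC clock and the signal based
sigevent are assumptions, and the two phases are timed separately:

/* Hypothetical benchmark: create NTIMERS timers, then query each one. */
/* Build with: gcc bench.c -o bench (older glibc needs -lrt). */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTIMERS 20000

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
	static timer_t ids[NTIMERS];
	struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGALRM };
	double t0;
	int i;

	t0 = now_ms();
	for (i = 0; i < NTIMERS; i++) {
		if (timer_create(CLOCK_MONOTONIC, &sev, &ids[i])) {
			perror("timer_create");
			exit(1);
		}
	}
	printf("create:     %.1f ms\n", now_ms() - t0);

	t0 = now_ms();
	for (i = 0; i < NTIMERS; i++)
		timer_getoverrun(ids[i]);
	printf("getoverrun: %.1f ms\n", now_ms() - t0);

	return 0;
}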
> 
> Test 2:
> 
>      A single process creates 50000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them:
> 
>             mainline        Eric   newhash   xarray
> create         98 ms      219 ms     20 ms    18 ms
> getoverrun     62 ms       62 ms     10 ms     9 ms
> 
> Test 3:
> 
>      A single process creates 100000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them:
> 
>             mainline        Eric   newhash   xarray
> create        313 ms      750 ms     48 ms    33 ms
> getoverrun    261 ms      260 ms     20 ms    14 ms
> 
> Eric's changes add considerable overhead in the create() path due to the
> double list walk, as the main issue according to perf is the list walk
> itself. With 100k timers each hash bucket contains ~200 timers, which in
> the worst case all need to be inspected. The same problem applies to
> getoverrun(), where the lookup has to walk the hash bucket chain to find
> the timer it is looking for.
> 
> The scaled hash obviously reduces hash collisions and lock contention
> significantly. This becomes more prominent with concurrency.
> 
> Test 4:
> 
>      A process creates 63 threads and all threads wait on a barrier before
>      each instance creates 20000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them. The threads are pinned on
>      separate CPUs to achieve maximum concurrency. The numbers are the
>      average times per thread:
> 
>             mainline        Eric   newhash   xarray
> create     180239 ms    38599 ms    579 ms   813 ms
> getoverrun   2645 ms     2642 ms     32 ms     7 ms
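
A rough sketch of the thread setup with CPU pinning and a start barrier; the
thread count, the CPU numbering and the per thread body (the loop from the
sketch above) are assumptions:

/* Hypothetical setup for the threaded variant of the benchmark. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define NTHREADS 63

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;

	/* Pin this thread to its own CPU for maximum concurrency. */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	pthread_barrier_wait(&barrier);	/* release all threads at once */
	/* ... per thread timer_create()/timer_getoverrun() loop ... */
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	long i;

	pthread_barrier_init(&barrier, NULL, NTHREADS);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, (void *)(i + 1));
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	return 0;
}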
> 
> Test 5:
> 
>      A process forks 63 times and all forks wait on a barrier before each
>      instance creates 20000 timers and afterwards invokes
>      timer_getoverrun(2) on each of them. The processes are pinned on
>      separate CPUs to achieve maximum concurrency. The numbers are the
>      average times per process:
> 
>             mainline        Eric   newhash   xarray
> create     157253 ms    40008 ms     83 ms    60 ms
> getoverrun   2611 ms     2614 ms     40 ms     4 ms
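
For the fork based variant the barrier has to live in shared memory and be
marked process shared; a rough sketch, with the task count and the CPU
assignment again being assumptions:

/* Hypothetical setup for the fork based variant of the benchmark. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NTASKS 63

int main(void)
{
	/* Process shared barrier in a MAP_SHARED mapping visible to all forks. */
	pthread_barrier_t *barrier = mmap(NULL, sizeof(*barrier),
					  PROT_READ | PROT_WRITE,
					  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	pthread_barrierattr_t attr;
	int i;

	pthread_barrierattr_init(&attr);
	pthread_barrierattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_barrier_init(barrier, &attr, NTASKS);

	for (i = 0; i < NTASKS; i++) {
		if (!fork()) {
			cpu_set_t set;

			/* Pin each child to its own CPU. */
			CPU_ZERO(&set);
			CPU_SET(i + 1, &set);
			sched_setaffinity(0, sizeof(set), &set);

			pthread_barrier_wait(barrier);	/* release all children at once */
			/* ... per process timer_create()/timer_getoverrun() loop ... */
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}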
> 
> So clearly the reduction of lock contention with Eric's changes makes a
> significant difference for the create() loop, but it does not mitigate the
> problem of long list walks, which is clearly visible on the getoverrun()
> side because that is purely dominated by the lookup itself. Once the timer
> is found, the syscall just reads from the timer structure with no other
> locks or code paths involved and returns.
> 
> The reason for the difference between the thread and the fork case for the
> new hash and the xarray is that in the thread case both suffer from
> contention on sighand::siglock, and the xarray additionally suffers from
> contention on the xarray lock on insertion.
> 
> The only case where the reworked hash slightly outperforms the xarray is a
> tight loop which creates and deletes timers.
> 
> Test 6:
> 
>      A process creates 63 threads and all threads wait on a barrier before
>      each instance runs a loop which creates and deletes a timer 100000
>      times in a row. The threads are pinned on separate CPUs to achieve
>      maximum concurrency. The numbers are the average times per thread:
> 
>             mainline        Eric   newhash   xarray
> loop         5917 ms     5897 ms   5473 ms  7846 ms
> 
> Test 7:
> 
>      A process forks 63 times and all forks wait on a barrier before each
>      instance runs a loop which creates and deletes a timer 100000
>      times in a row. The processes are pinned on separate CPUs to achieve
>      maximum concurrency. The numbers are the average times per process:
> 
>             mainline        Eric   newhash   xarray
> loop         5137 ms     7828 ms    891 ms   872 ms
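
The per instance loop of these last two tests boils down to something like the
following sketch (the sigevent setup and the function name are assumptions):

/* Hypothetical create/delete loop: allocate a timer ID, free it again. */
#include <signal.h>
#include <time.h>

static void create_delete_loop(int iterations)
{
	struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGALRM };
	int i;

	for (i = 0; i < iterations; i++) {
		timer_t id;

		if (timer_create(CLOCK_MONOTONIC, &sev, &id))
			break;
		timer_delete(id);
	}
}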
> 
> In both tests there is not much contention on the hash, but the ucount
> accounting for the signal and, in the thread case, the sighand::siglock
> contention (plus the xarray locking) dominate the overhead.
> 
> As the memory consumption of the xarray in the sparse ID case is
> significant, the scaled hash with per bucket locks seems to be the better
> overall option. While the xarray has faster lookup times for a large number
> of timers, the actual syscall usage, which requires the lookup, is not an
> extreme hotpath. Most applications rely on signal delivery, and all syscalls
> except timer_getoverrun(2) are anything but cheap anyway, so the lookup does
> not dominate their cost.
> 
> So implement a scaled hash with per bucket locks, which offers the best
> tradeoff between performance and memory consumption.
> 
> Reported-by: Eric Dumazet <edumazet@...gle.com>
> Reported-by: Benjamin Segall <bsegall@...gle.com>
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>

Acked-by: Frederic Weisbecker <frederic@...nel.org>
