Message-ID: <bcad4168-52a0-44cc-b0f0-9346f30d8d80@redhat.com>
Date: Thu, 31 Oct 2024 16:18:59 -0400
From: Waiman Long <llong@...hat.com>
To: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
linux-kernel@...r.kernel.org
Cc: André Almeida <andrealmeid@...lia.com>,
Darren Hart <dvhart@...radead.org>, Davidlohr Bueso <dave@...olabs.net>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [RFC v2 PATCH 0/4] futex: Add support task local hash maps.
On 10/31/24 11:56 AM, Sebastian Andrzej Siewior wrote:
> On 2024-10-28 13:13:54 [+0100], To linux-kernel@...r.kernel.org wrote:
>> Need to do
>> more testing.
> So there is "perf bench futex hash". On a 256 CPU NUMA box:
> perf bench futex hash -t 240 -m -s -b $hb
> and hb 2 … 131072 (moved the allocation to kvmalloc) I get the following
> (averaged over three runs)
>
> buckets       op/sec   change
>       2      9158.33
>       4     21665.66   + ~136%
>       8     44686.66   + ~106%
>      16     84144.33   + ~ 88%
>      32    139998.33   + ~ 66%
>      64    279957.0    + ~ 99%
>     128    509533.0    + ~100%
>     256   1019846.0    + ~100%
>     512   1634940.0    + ~ 60%
>    1024   1834859.33   + ~ 12%
>           1868129.33   (global hash, 65536 hash)
>    2048   1912071.33   + ~  4%
>    4096   1918686.66   + ~  0%
>    8192   1922285.66   + ~  0%
>   16384   1923017.0    + ~  0%
>   32768   1923319.0    + ~  0%
>   65536   1932906.0    + ~  0%
>  131072   2042571.33   + ~  5%
>
> By doubling the hash size the ops/sec almost double until 256 slots.
> After 2048 slots the increase is almost noise (except for the last
> entry).
>
> Pinning the bench to individual CPUs belonging to a NUMA node and
> running the same test with 110 threads only (avg over 5 runs):
>             ops/sec global   ops/sec local
> node 0           2278572.2       2534827.4
> node 1           2229838.6       2437498.8
> node 0+1         2542602.4       2535749.8

Looking at the performance data, we should probably use the global hash
table to maximize throughput if latency isn't important.

AFAICT, the reason patch 4 allocates a local hash when the first thread
is created is to avoid a race where the same futex is hashed in both
the local and global hash tables. Correct me if my understanding is
incorrect. That will force all multithreaded processes to use local
hash tables for private futexes even if they don't care about latency.

Maybe we should limit the automatic local hash table allocation to RT
processes. To avoid the race, we could add a flag indicating whether a
private futex has ever been hashed in the kernel, and skip local hash
creation in that case, probably also when prctl() is called to create
a local hash table.
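
To make the flag idea concrete, here is a rough user-space model of it,
not code from the patch set. All names (futex_mm_model, the
FUTEX_HASH_* states, both helpers) are made up for illustration, and I
folded the "ever hashed" flag and the "local hash exists" bit into one
atomic state word so that the two decisions cannot race with each
other. Actual table allocation and the real prctl() interface are left
out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-process state; not the layout used by the patches. */
enum futex_hash_state {
	FUTEX_HASH_UNDECIDED = 0,	/* no private futex hashed yet */
	FUTEX_HASH_GLOBAL_USED,		/* a private futex hit the global table */
	FUTEX_HASH_LOCAL,		/* process owns a local hash table */
};

struct futex_mm_model {
	_Atomic int state;
};

/*
 * Called when a private futex is about to be hashed.
 * Returns true if the local table must be used. The first hash in an
 * undecided process records that the global table is in use.
 */
static bool futex_private_uses_local(struct futex_mm_model *mm)
{
	int expected = FUTEX_HASH_UNDECIDED;

	if (atomic_compare_exchange_strong(&mm->state, &expected,
					   FUTEX_HASH_GLOBAL_USED))
		return false;
	/* On failure, 'expected' holds the state that was already set. */
	return expected == FUTEX_HASH_LOCAL;
}

/*
 * Called on first thread creation (is_rt) or on an explicit request
 * (via_prctl). Refuses once a private futex has been hashed globally.
 */
static bool futex_try_enable_local(struct futex_mm_model *mm,
				   bool is_rt, bool via_prctl)
{
	int expected = FUTEX_HASH_UNDECIDED;

	if (!is_rt && !via_prctl)
		return false;
	if (atomic_compare_exchange_strong(&mm->state, &expected,
					   FUTEX_HASH_LOCAL))
		return true;
	/* An already existing local table counts as success. */
	return expected == FUTEX_HASH_LOCAL;
}
```

With this model, a non-RT process that never asks for a local hash
stays on the global table, and a late prctl() request fails once any
private futex has been hashed there.
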
My 2 cents.
Cheers,