Message-ID: <f660749e-d515-4208-9610-ffc4155b4a0d@arm.com>
Date: Wed, 10 Sep 2025 13:42:08 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: "Huang, Ying" <ying.huang@...ux.alibaba.com>
Cc: Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
Mark Rutland <mark.rutland@....com>, James Morse <james.morse@....com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on
local CPU

On 10/09/2025 11:57, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@....com> writes:
>
>> Hi All,
>>
>> This is an RFC for my implementation of an idea from James Morse to avoid
>> broadcasting TLBIs to remote CPUs if it can be proven that no remote CPU could
>> have ever observed the pgtable entry for the TLB entry that is being
>> invalidated. It turns out that x86 does something similar in principle.
>>
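One way to picture the decision (glossing over all of the hard details) is:
keep a mask of the CPUs that may have observed the mm's page tables, and only
broadcast when that mask contains more than the current CPU. The sketch below
is purely illustrative - the names (mm_model, mm_activate(), tlbi_page()) are
invented here and are not the series' API:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One bit per CPU that may have walked this mm's page tables. */
struct mm_model {
	uint64_t cpus_seen;
};

/* Record that the mm has been switched in on a CPU. */
static void mm_activate(struct mm_model *mm, int cpu)
{
	mm->cpus_seen |= 1ull << cpu;
}

static bool mm_only_on(const struct mm_model *mm, int cpu)
{
	return mm->cpus_seen == (1ull << cpu);
}

/* Pick between a CPU-local and a broadcast invalidation. */
static void tlbi_page(const struct mm_model *mm, int this_cpu, unsigned long addr)
{
	if (mm_only_on(mm, this_cpu))
		printf("local tlbi for %#lx\n", addr);		/* e.g. tlbi vale1 */
	else
		printf("broadcast tlbi for %#lx\n", addr);	/* e.g. tlbi vale1is */
}

int main(void)
{
	struct mm_model mm = { 0 };

	mm_activate(&mm, 0);
	tlbi_page(&mm, 0, 0x1000);	/* mm only ever ran on CPU 0 -> local */

	mm_activate(&mm, 3);
	tlbi_page(&mm, 0, 0x2000);	/* CPU 3 may hold stale entries -> broadcast */
	return 0;
}

The interesting part is of course making that check race-free against the mm
becoming active on another CPU, which is exactly the correctness question
raised below.
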
>> The primary feedback I'm looking for is: is this actually correct and safe?
>> James and I both believe it to be, but it would be useful to get further
>> validation.
>>
>> Beyond that, the next question is: does it actually improve performance?
>> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
>> do a much better job of sustaining the overall number of "tlb shootdowns per
>> second" after the change:
>>
>> +------------+--------------------------+--------------------------+--------------------------+
>> |            |     Baseline (v6.15)     |        tlbi local        |       Improvement        |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | nr_threads |   ops/sec   |  ops/sec   |   ops/sec   |  ops/sec   |   ops/sec   |  ops/sec   |
>> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
>> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
>> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
>> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
>> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
>> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
>> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>>
>> But looking at real-world benchmarks, I haven't yet found anything where it
>> makes a huge difference. When compiling the kernel, it reduces kernel time by
>> ~2.2%, but overall wall time remains the same. I'd be interested in any
>> suggestions for workloads where this might prove valuable.
>>
>> All mm selftests have been run and no regressions are observed. Applies on
>> v6.17-rc3.
>
> I have used redis (a single-threaded in-memory database) to test the
> patchset on an ARM server. 32 redis-server processes are run on NUMA
> node 1 to magnify the overhead of TLBI broadcast, and 32 corresponding
> memtier-benchmark processes are run on NUMA node 0. Snapshots are
> triggered constantly in the redis-server: it fork()s, the child saves
> the in-memory database to disk and then exit()s, so COW in the
> redis-server triggers a large number of TLBIs. In short, this tests
> the performance of redis-server during snapshotting. The test runs for
> about 300s. The results show that the benchmark score improves by
> ~4.5% with the patchset.
>
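Nice - that snapshot workload is a good way to exercise this. For anyone
wanting to reproduce the pattern without redis, its shape is roughly the
sketch below (illustrative only: the file name, sizes and write loop are made
up here, and redis-server obviously does far more than this):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DB_SIZE (64UL * 1024 * 1024)

int main(void)
{
	char *db = malloc(DB_SIZE);
	if (!db)
		return 1;
	memset(db, 0xaa, DB_SIZE);		/* the in-memory "database" */

	pid_t pid = fork();			/* snapshot: pages now shared COW */
	if (pid == 0) {
		/* Child: dump the frozen copy to disk, then exit. */
		FILE *f = fopen("dump.rdb", "w");
		if (f) {
			fwrite(db, 1, DB_SIZE, f);
			fclose(f);
		}
		_exit(0);
	}

	/*
	 * Parent: keep "serving writes" while the child is dumping.  Each
	 * page touched here takes a COW fault, and breaking the shared
	 * read-only mapping is what generates the flood of TLBIs that the
	 * series tries to keep local.
	 */
	for (size_t i = 0; i < DB_SIZE; i += 4096)
		db[i]++;

	waitpid(pid, NULL, 0);
	free(db);
	return 0;
}

(Whether the parent's flushes can then stay local depends, as per the series,
on whether its mm has been active on more than the local CPU.)
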
> Feel free to add my
>
> Tested-by: Huang Ying <ying.huang@...ux.alibaba.com>
>
> in future versions.
Thanks for this - very useful!
>
> ---
> Best Regards,
> Huang, Ying