Message-ID: <06ca976e-03ff-4376-a29d-24993282dda0@arm.com>
Date: Tue, 2 Sep 2025 17:56:29 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: Will Deacon <will@...nel.org>, Mark Rutland <mark.rutland@....com>,
James Morse <james.morse@....com>, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active on
local CPU

On 02/09/2025 17:47, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 04:35:06PM +0100, Ryan Roberts wrote:
>> Beyond that, the next question is: does it actually improve performance?
>> stress-ng's --tlb-shootdown stressor suggests yes; as concurrency increases, we
>> do a much better job of sustaining the overall number of "tlb shootdowns per
>> second" after the change:
>>
>> +------------+--------------------------+--------------------------+--------------------------+
>> | | Baseline (v6.15) | tlbi local | Improvement |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | nr_threads | ops/sec | ops/sec | ops/sec | ops/sec | ops/sec | ops/sec |
>> | | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>> | 1 | 9109 | 2573 | 8903 | 3653 | -2% | 42% |
>> | 4 | 8115 | 1299 | 9892 | 1059 | 22% | -18% |
>> | 8 | 5119 | 477 | 11854 | 1265 | 132% | 165% |
>> | 16 | 4796 | 286 | 14176 | 821 | 196% | 187% |
>> | 32 | 1593 | 38 | 15328 | 474 | 862% | 1147% |
>> | 64 | 1486 | 19 | 8096 | 131 | 445% | 589% |
>> | 128 | 1315 | 16 | 8257 | 145 | 528% | 806% |
>> +------------+-------------+------------+-------------+------------+-------------+------------+
>>
>> But looking at real-world benchmarks, I haven't yet found anything where it
>> makes a huge difference: when compiling the kernel, it reduces kernel time by
>> ~2.2%, but overall wall time remains the same. I'd be interested in any
>> suggestions for workloads where this might prove valuable.
>
> I suspect it's highly dependent on hardware and how it handles the DVM
> messages. There were some old proposals from Fujitsu:
>
> https://lore.kernel.org/linux-arm-kernel/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
>
> Christoph Lameter (Ampere) also followed with some refactoring in this
> area to allow a boot-configurable way to do TLBI via IS ops or IPI:
>
> https://lore.kernel.org/linux-arm-kernel/20231207035703.158053467@gentwo.org/
>
> (for some reason, the patches did not make it to the list, I have them
> in my inbox if you are interested)
>
> I don't remember any real-world workload, more like hand-crafted
> mprotect() loops.
>
> Anyway, I think the approach in your series doesn't have downsides; it's
> fairly clean and addresses some low-hanging fruit. For multi-threaded
> workloads where a flush_tlb_mm() is cheaper than a series of per-page
> TLBIs, I think we can wait for that hardware to be phased out. The TLBI
> range operations should significantly reduce the DVM messages between
> CPUs.
I'll gather some more numbers and try to make a case for merging it then. I
don't really want to add complexity if there is no clear value.
Thanks for the review.
Thanks,
Ryan