Message-ID: <874itk1dy3.fsf@DESKTOP-5N7EMDA>
Date: Wed, 03 Sep 2025 10:12:52 +0800
From: "Huang, Ying" <ying.huang@...ux.alibaba.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Catalin Marinas <catalin.marinas@....com>, Will Deacon
<will@...nel.org>, Mark Rutland <mark.rutland@....com>, James Morse
<james.morse@....com>, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, Takao Indoh <indou.takao@...fujitsu.com>,
QI Fuli <qi.fuli@...itsu.com>, Andrea Arcangeli <aarcange@...hat.com>,
Rafael Aquini <aquini@...hat.com>
Subject: Re: [RFC PATCH v1 0/2] Don't broadcast TLBI if mm was only active
on local CPU

Hi, Ryan,

Ryan Roberts <ryan.roberts@....com> writes:

> Hi All,
>
> This is an RFC for my implementation of an idea from James Morse to avoid
> broadcasting TLBIs to remote CPUs if it can be proven that no remote CPU could
> have ever observed the pgtable entry for the TLB entry that is being
> invalidated. It turns out that x86 does something similar in principle.
>
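Conceptually, the local-vs-broadcast decision boils down to something
like the sketch below.  This is a minimal illustration of the idea, not
the actual patch: it assumes mm_cpumask() precisely tracks every CPU
the mm has ever been active on, and local_flush_tlb_mm() is a
hypothetical helper standing in for a non-broadcast invalidation.

	#include <linux/cpumask.h>
	#include <linux/mm_types.h>
	#include <linux/smp.h>
	#include <asm/tlbflush.h>

	/*
	 * Sketch only: callers must have preemption disabled so that
	 * smp_processor_id() is stable, and real code must order the
	 * cpumask read against concurrent switch_mm() on other CPUs;
	 * otherwise a remote CPU could start using the mm right after
	 * the check and miss the invalidation.
	 */
	static void flush_tlb_mm_maybe_local(struct mm_struct *mm)
	{
		/*
		 * If this mm has only ever been active on the current
		 * CPU, no remote TLB can hold an entry for it, so a
		 * local invalidation is sufficient.
		 */
		if (cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
			local_flush_tlb_mm(mm);	/* hypothetical: non-broadcast "nsh" TLBI */
		else
			flush_tlb_mm(mm);	/* existing API: broadcast "ish" TLBI */
	}

This is also roughly the sense in which x86 is "similar in principle":
x86 uses mm_cpumask() to send TLB-flush IPIs only to the CPUs that have
actually used the mm.
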
> The primary feedback I'm looking for is: is this actually correct and safe?
> James and I both believe it to be, but it would be useful to get further
> validation.
>
> Beyond that, the next question is: does it actually improve performance?
> stress-ng's --tlb-shootdown stressor suggests yes: as concurrency increases, we
> do a much better job of sustaining the overall number of "tlb shootdowns per
> second" after the change (an example invocation follows the table):
>
> +------------+--------------------------+--------------------------+--------------------------+
> |            |     Baseline (v6.15)     |        tlbi local        |       Improvement        |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> | nr_threads |   ops/sec   |  ops/sec   |   ops/sec   |  ops/sec   |  % change   |  % change  |
> |            | (real time) | (cpu time) | (real time) | (cpu time) | (real time) | (cpu time) |
> +------------+-------------+------------+-------------+------------+-------------+------------+
> |          1 |        9109 |       2573 |        8903 |       3653 |         -2% |        42% |
> |          4 |        8115 |       1299 |        9892 |       1059 |         22% |       -18% |
> |          8 |        5119 |        477 |       11854 |       1265 |        132% |       165% |
> |         16 |        4796 |        286 |       14176 |        821 |        196% |       187% |
> |         32 |        1593 |         38 |       15328 |        474 |        862% |      1147% |
> |         64 |        1486 |         19 |        8096 |        131 |        445% |       589% |
> |        128 |        1315 |         16 |        8257 |        145 |        528% |       806% |
> +------------+-------------+------------+-------------+------------+-------------+------------+
>
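A typical way to reproduce numbers like these, with the worker count
and timeout purely illustrative, is:

	stress-ng --tlb-shootdown 32 --timeout 60s --metrics-brief

where --metrics-brief reports the bogo ops/sec figures, in both
real-time and CPU-time forms, shown in the table above.
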
> But looking at real-world benchmarks, I haven't yet found anything where it
> makes a huge difference. When compiling the kernel, it reduces kernel time by
> ~2.2%, but overall wall time remains the same. I'd be interested in any
> suggestions for workloads where this might prove valuable.
>
> All mm selftests have been run and no regressions were observed. The series
> applies on top of v6.17-rc3.

Thanks for working on this.

Several TLBI broadcast optimizations have been tried before; I have
Cc'ed the original authors for discussion. Some workloads showed good
improvement:

https://lore.kernel.org/lkml/20190617143255.10462-1-indou.takao@jp.fujitsu.com/
https://lore.kernel.org/all/20200203201745.29986-1-aarcange@redhat.com/

See especially the following mail:

https://lore.kernel.org/all/20200314031609.GB2250@redhat.com/

---
Best Regards,
Huang, Ying