linux-kernel - Re: [linus:master] [x86/mm/tlb] 7e33001b8b: will-it-scale.per_thread

Message-ID: <e1d810d7c2bc77961828ee38ef322f5ec49181d8.camel@surriel.com>
Date: Sat, 30 Nov 2024 14:56:36 -0500
From: Rik van Riel <riel@...riel.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev, 
	lkp@...el.com, linux-kernel@...r.kernel.org, Ingo Molnar
 <mingo@...nel.org>,  Andy Lutomirski	 <luto@...nel.org>, Peter Zijlstra
 <peterz@...radead.org>
Subject: Re: [linus:master] [x86/mm/tlb] 7e33001b8b:
 will-it-scale.per_thread_ops 20.7% improvement

On Sat, 2024-11-30 at 09:54 -0800, Linus Torvalds wrote:
> On Sat, 30 Nov 2024 at 09:31, Rik van Riel <riel@...riel.com> wrote:
> > 
> > 1) Stop using the mm_cpumask altogether on x86
> 
> I think you would still want it as a "this is the upper bound" thing -
> exactly like your lazy code effectively does now.
> 
> It's not giving some precise "these are the CPUs that have TLB
> contents", but instead just a "these CPUs *might* have TLB
> contents".
> 
> But that's a *big* win for any single-threaded case, to not have to
> walk over potentially hundreds of CPUs when that thing has only ever
> actually been on one or two cores.
> 
> Because a lot of short-lived processes only ever live on a single
> CPU.
> 
Good point. We do want to keep the optimizations for
single-threaded processes in place.

> The benchmarks you are optimizing for - as well as the ones that
> regress - are
> 
>  (a) made up microbenchmark loads
> 
>  (b) ridiculously many threads
> 
> and I think you should take some of what they say with a big pinch of
> salt.
> 
> Those "20% difference" numbers aren't actually *real*, is what I'm
> saying.

Agreed that it won't be a 20% difference on real
workloads, but there are a few real world workloads
where these optimizations do make a fairly significant
difference.

For example, the change below made a 2% performance
difference for a memcache-style workload on 2-socket
systems back in 2018, when CPU counts were much smaller
than today:

e9d8c6155768 ("x86/mm/tlb: Skip atomic operations for 'init_mm' in
switch_mm_irqs_off()")

> 
> > 2) Instead, at context switch time just update
> >    per_cpu variables like cpu_tlbstate.loaded_mm
> >    and friends
> 
> See above. I think you'll still want to limit the actual real
> situation of "look, ma, I'm a single-threaded compiler".
> 
> > 3) At (much rarer) TLB flush time:
> >    - Iterate over all CPUs
> 
> Change this to "iterate over mm_cpumask", and I think it will work a
> whole lot better.
> 
> Because yes, clearly with just the *pure* lazy mm_cpumask, you won
> some at scheduling time, but you lost a *lot* by just forcing
> pointless stale IPIs instead.

I struggle to think of a way to synchronize clearing
bits from the mm_cpumask that does not involve IPIs,
but I suppose we could rate limit that clearing to
something like once a second?

The rest of the time we could check whether a CPU's
cpu_tlbstate.loaded_mm matches the target mm, and
skip sending an IPI to that CPU if it does not?

We already seem to be passing info through to
tlb_is_not_lazy, so the logic could all be implemented
inside there if we wanted to.
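
To make the idea concrete, here is a rough user-space model of it,
not kernel code. Every name in it (model_mm, model_loaded_mm,
model_switch_mm, model_trim_cpumask, model_flush_targets) is made up
for illustration, the cpumask is a plain 64-bit word, and all locking
and memory ordering is ignored. It also ignores how a CPU that was
skipped catches up on the flushes it missed. The mask only ever grows
at context switch time, gets trimmed at most once per interval, and
the flush path filters it by the per-CPU loaded mm:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS       8  /* hypothetical small machine */
#define MODEL_TRIM_INTERVAL 1  /* seconds, the "once a second" above */

/* Hypothetical stand-in for the parts of mm_struct that matter here. */
struct model_mm {
	uint64_t cpumask;    /* upper bound: CPUs that *might* hold TLB entries */
	uint64_t last_trim;  /* time of the last cpumask trim, in seconds */
};

/* Hypothetical stand-in for the per-CPU cpu_tlbstate.loaded_mm. */
static const struct model_mm *model_loaded_mm[MODEL_NR_CPUS];

/* Context switch: record the new mm and set our bit. Never clears bits. */
static void model_switch_mm(unsigned int cpu, struct model_mm *next)
{
	assert(cpu < MODEL_NR_CPUS);
	model_loaded_mm[cpu] = next;
	if (next)
		next->cpumask |= UINT64_C(1) << cpu;
}

/*
 * Rate-limited trim: at most once per interval, drop the bits of CPUs
 * that are no longer running this mm. Returns 1 if a trim happened.
 * The real thing would need synchronization that this model omits.
 */
static int model_trim_cpumask(struct model_mm *mm, uint64_t now)
{
	if (now - mm->last_trim < MODEL_TRIM_INTERVAL)
		return 0;

	for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (model_loaded_mm[cpu] != mm)
			mm->cpumask &= ~(UINT64_C(1) << cpu);
	}
	mm->last_trim = now;
	return 1;
}

/*
 * Flush time: walk only the CPUs in the (possibly stale) cpumask and
 * keep those whose loaded mm matches. The result is the IPI target set.
 */
static uint64_t model_flush_targets(const struct model_mm *mm)
{
	uint64_t targets = 0;

	for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(mm->cpumask & (UINT64_C(1) << cpu)))
			continue;  /* never ran here since the last trim */
		if (model_loaded_mm[cpu] != mm)
			continue;  /* stale bit: skip the IPI */
		targets |= UINT64_C(1) << cpu;
	}
	return targets;
}
```

In this model a single-threaded process that only ever ran on one CPU
still has a one-bit mask, so the flush loop has almost nothing to
filter, and a stale bit costs one comparison instead of one IPI.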

-- 
All Rights Reversed.