Message-ID: <e1d810d7c2bc77961828ee38ef322f5ec49181d8.camel@surriel.com>
Date: Sat, 30 Nov 2024 14:56:36 -0500
From: Rik van Riel <riel@...riel.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
lkp@...el.com, linux-kernel@...r.kernel.org, Ingo Molnar
<mingo@...nel.org>, Andy Lutomirski <luto@...nel.org>, Peter Zijlstra
<peterz@...radead.org>
Subject: Re: [linus:master] [x86/mm/tlb] 7e33001b8b:
will-it-scale.per_thread_ops 20.7% improvement
On Sat, 2024-11-30 at 09:54 -0800, Linus Torvalds wrote:
> On Sat, 30 Nov 2024 at 09:31, Rik van Riel <riel@...riel.com> wrote:
> >
> > 1) Stop using the mm_cpumask altogether on x86
>
> I think you would still want it as a "this is the upper bound" thing -
> exactly like your lazy code effectively does now.
>
> It's not giving some precise "these are the CPU's that have TLB
> contents", but instead just a "these CPU's *might* have TLB
> contents".
>
> But that's a *big* win for any single-threaded case, to not have to
> walk over potentially hundreds of CPUs when that thing has only ever
> actually been on one or two cores.
>
> Because a lot of short-lived processes only ever live on a single
> CPU.
>
Good point. We do want to keep optimizations for
single-threaded processes in place.

> The benchmarks you are optimizing for - as well as the ones that
> regress - are
>
> (a) made up microbenchmark loads
>
> (b) ridiculously many threads
>
> and I think you should take some of what they say with a big pinch of
> salt.
>
> Those "20% difference" numbers aren't actually *real*, is what I'm
> saying.

Agreed that it won't be a 20% difference on real
workloads, but there are a few real-world workloads
where these optimizations do make a fairly significant
difference.

For example, the change below made a 2% performance
difference for a memcache-style workload on 2-socket
systems back in 2018, when CPU counts were much smaller
than today:

e9d8c6155768 ("x86/mm/tlb: Skip atomic operations for 'init_mm' in
switch_mm_irqs_off()")
>
> > 2) Instead, at context switch time just update
> > per_cpu variables like cpu_tlbstate.loaded_mm
> > and friends
>
> See above. I think you'll still want to limit the actual real
> situation of "look, ma, I'm a single-threaded compiler".
>
> > 3) At (much rarer) TLB flush time:
> > - Iterate over all CPUs
>
> Change this to "iterate over mm_cpumask", and I think it will work a
> whole lot better.
>
> Because yes, clearly with just the *pure* lazy mm_cpumask, you won
> some at scheduling time, but you lost a *lot* by just forcing
> pointless stale IPIs instead.

I struggle to think of a way to synchronize clearing
bits from the mm_cpumask that does not involve IPIs,
but I suppose we could rate limit that clearing to
something like once a second?

The rest of the time we could compare whether a
CPU's cpu_tlbstate.loaded_mm matches the target mm, and
skip sending an IPI to that CPU?

We already seem to be passing info through to
tlb_is_not_lazy, so the logic could all be implemented
inside there if we wanted to.
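
To make the idea above a bit more concrete, here is a rough user-space
toy model, not kernel code. Every name in it (struct toy_mm,
toy_switch_mm(), toy_flush_targets(), toy_maybe_trim(), the 64-CPU
limit and the millisecond clock) is a hypothetical stand-in made up for
illustration. It ignores locking, races, lazy TLB mode and tlb_gen
tracking, all of which a real implementation would have to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS      64    /* assumption: mask fits in one uint64_t */
#define TRIM_INTERVAL_MS 1000  /* "something like once a second" */

/* Hypothetical stand-in for an mm; cpumask plays the role of mm_cpumask. */
struct toy_mm {
    uint64_t cpumask;       /* upper bound: CPUs that *might* hold TLB entries */
    uint64_t last_trim_ms;  /* when stale bits were last cleared */
};

/* Hypothetical stand-in for the per-CPU cpu_tlbstate.loaded_mm. */
static const struct toy_mm *toy_loaded_mm[TOY_NR_CPUS];

/* Context switch: record the loaded mm and set the mask bit, never clear it. */
static void toy_switch_mm(int cpu, struct toy_mm *next)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    toy_loaded_mm[cpu] = next;
    if (next)
        next->cpumask |= UINT64_C(1) << cpu;
}

/* Flush time: walk only the mask, and skip CPUs that have moved on. */
static uint64_t toy_flush_targets(const struct toy_mm *mm)
{
    uint64_t targets = 0;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        uint64_t bit = UINT64_C(1) << cpu;

        if (!(mm->cpumask & bit))
            continue;               /* mm never ran on this CPU */
        if (toy_loaded_mm[cpu] != mm)
            continue;               /* stale bit: no IPI needed */
        targets |= bit;
    }
    return targets;
}

/* Clear stale bits at most once per TRIM_INTERVAL_MS; true if trimmed. */
static bool toy_maybe_trim(struct toy_mm *mm, uint64_t now_ms)
{
    if (now_ms - mm->last_trim_ms < TRIM_INTERVAL_MS)
        return false;
    mm->cpumask = toy_flush_targets(mm);
    mm->last_trim_ms = now_ms;
    return true;
}
```

In this model a single-threaded process that only ever ran on one CPU
has a single bit set, so the flush path looks at one CPU's state instead
of all of them, and a stale bit costs a comparison rather than an IPI
until the next trim.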
--
All Rights Reversed.