Message-ID: <1df2caa3-4c5d-4bd9-88bb-66a07bf1eb65@intel.com>
Date: Fri, 8 Nov 2024 12:31:56 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: Rik van Riel <riel@...riel.com>, Dave Hansen <dave.hansen@...ux.intel.com>
Cc: Andy Lutomirski <luto@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
 Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, x86@...nel.org,
 "H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org,
 kernel-team@...a.com
Subject: Re: [PATCH] x86,tlb: update mm_cpumask lazily

On 11/8/24 11:31, Rik van Riel wrote:
> On busy multi-threaded workloads, there can be significant contention
> on the mm_cpumask at context switch time.
> 
> Reduce that contention by updating mm_cpumask lazily, setting the CPU bit
> at context switch time (if not already set), and clearing the CPU bit at
> the first TLB flush sent to a CPU where the process isn't running.
> 
> When a flurry of TLB flushes for a process happen, only the first one
> will be sent to CPUs where the process isn't running. The others will
> be sent to CPUs where the process is currently running.
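
For anyone skimming, the scheme above boils down to something like
this (heavily simplified sketch; only the cpumask_*()/mm_cpumask()
calls are real kernel API, the other helpers are made-up stand-ins):

	/* Context switch path: only ever *set* bits, and only if needed. */
	static void lazy_switch_mm(struct mm_struct *next, unsigned int cpu)
	{
		/* The outgoing mm's bit is deliberately left alone. */
		if (!cpumask_test_cpu(cpu, mm_cpumask(next)))
			cpumask_set_cpu(cpu, mm_cpumask(next));
	}

	/* Flush IPI handler: clear stale bits instead of flushing. */
	static void lazy_flush_tlb_func(struct mm_struct *mm, unsigned int cpu)
	{
		if (this_cpu_loaded_mm() != mm) {	/* made-up helper */
			/* First flush after migration lands here... */
			cpumask_clear_cpu(cpu, mm_cpumask(mm));
			/* ...and later flushes will skip this CPU. */
			return;
		}
		local_flush_tlb_for(mm);		/* made-up helper */
	}

The atomic clear moves off the hot context-switch path and onto the
(much rarer) first-flush-after-a-migration path.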

So I guess it comes down to balancing:

The cpumask_clear_cpu() happens on every mm switch, which can happen
thousands of times a second.  But it's _relatively_ cheap: dozens to a
couple hundred cycles.

with:

Skipping the cpumask_clear_cpu() will cause more TLB flushes. It can
cause at most one extra TLB flush for each time a process is migrated
off a CPU and never returns.  This is _relatively_ expensive: on the
order of thousands of cycles to send and receive an IPI.
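
Back-of-the-envelope, plugging made-up but plausible numbers out of
those two ranges:

  eager clears:  10,000 switches/sec * ~100 cycles/clear = ~1M cycles/sec
  lazy penalty:  N extra flushes/sec * ~2,000 cycles/IPI = ~2,000*N/sec

  break-even:    N ~= 500

So the lazy scheme comes out ahead unless migrate-and-never-return
events are generating hundreds of extra flush IPIs per second.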

Migrations are obviously the enemy here.  But they're the enemy for
lots of _other_ reasons too, so everything else is already trying to
minimize them, which is a really nice property.

The only thing I can think of that really worries me is some kind of
forked worker model where before this patch you would have:

 * fork()
 * run on CPU A
 * ... migrate to CPU B
 * malloc()/free(), needs to flush B only
 * exit()

and after:

 * fork()
 * run on CPU A
 * ... migrate to CPU B
 * malloc()/free(), needs to flush A+B, including IPI
 * exit()

Where that IPI wasn't needed at *all* before.  But that's totally contrived.
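
That contrived case is at least easy to spell out in userspace.  A
quick sketch (the CPU numbers are arbitrary, the explicit
mmap()/munmap() stands in for the malloc()/free() above, and error
handling is omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* pin this thread */
}

int main(void)
{
	if (fork() == 0) {
		void *buf;

		pin_to_cpu(0);			/* run on CPU A */
		buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		*(volatile char *)buf = 1;	/* TLB entry lives on A */
		pin_to_cpu(1);			/* ... migrate to CPU B */
		munmap(buf, 4096);	/* flush: B-only before, A+B after */
		_exit(0);
	}
	wait(NULL);
	return 0;
}

Whether any real allocator hits that pattern often enough to matter is
an empirical question.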

So I think this is the kind of thing we'd want to apply to -rc1 and let
the robots poke at it for a few weeks.  But it does seem like a sound
idea to me.
