linux-kernel - Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from tlbbatch code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <f5fab528-c8fd-47ea-ba25-6d59ff517b03@intel.com>
Date: Tue, 4 Mar 2025 09:51:48 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: Brendan Jackman <jackmanb@...gle.com>, Borislav Petkov <bp@...en8.de>
Cc: Rik van Riel <riel@...riel.com>, x86@...nel.org,
 linux-kernel@...r.kernel.org, peterz@...radead.org,
 dave.hansen@...ux.intel.com, zhengqi.arch@...edance.com,
 nadav.amit@...il.com, thomas.lendacky@....com, kernel-team@...a.com,
 linux-mm@...ck.org, akpm@...ux-foundation.org, jannh@...gle.com,
 mhklinux@...look.com, andrew.cooper3@...rix.com, Manali.Shukla@....com,
 mingo@...nel.org
Subject: Re: [PATCH v14 11/13] x86/mm: do targeted broadcast flushing from
 tlbbatch code

On 3/4/25 07:33, Brendan Jackman wrote:
> On Tue, Mar 04, 2025 at 03:11:34PM +0100, Borislav Petkov wrote:
>> On Tue, Mar 04, 2025 at 12:52:47PM +0000, Brendan Jackman wrote:
>>> https://lore.kernel.org/all/CA+i-1C31TrceZiizC_tng_cc-zcvKsfXLAZD_XDftXnp9B2Tdw@mail.gmail.com/
>>
>> Lemme try to understand what you're suggesting on that subthread:
>>
>>> static inline void arch_start_context_switch(struct task_struct *prev)
>>> {
>>>     arch_paravirt_start_context_switch(prev);
>>>     tlb_start_context_switch(prev);
>>> }
>>
>> This kinda makes sense to me...
> 
> Yeah so basically my concern here is that we are doing something
> that's about context switching, but we're doing it in mm-switching
> code, entangling an assumption that "context_switch() must either call
> this function or that function". Whereas if we just call it explicitly
> from context_switch() it will be much clearer.

I was coming to a similar conclusion. All of the nastiness here would
come from an operation like:

	INVLPGB
	=> get scheduled on another CPU
	TLBSYNC

But there's no nastiness with:

	INVLPGB
	=> switch to init_mm
	TLBSYNC

at *all*. Because the TLBSYNC still works just fine. In fact, it *has*
to work just fine because you can get an TLB flush IPI in that window
already.

>>> Now I think about it... if we always tlbsync() before a context switch, is the
>>> cant_migrate() above actually required? I think with that, even if we migrated
>>> in the middle of e.g.  broadcast_kernel_range_flush(), we'd be fine? (At
>>> least, from the specific perspective of the invplgb code, presumably having
>>> preemption on there would break things horribly in other ways).
>>
>> I think we still need it because you need to TLBSYNC on the same CPU you've
>> issued the INVLPGB and actually, you want all TLBs to have been synched
>> system-wide.
>>
>> Or am I misunderstanding it?
> 
> Er, I might be exposing my own ignorance here. I was thinking that you
> always go through context_switch() before you get migrated, but I
> might not understand hwo migration happens.

Let's take a step back. Most of our IPI-based TLB flushes end up in this
code:

        preempt_disable();
        smp_call_function_many_cond(...);
        preempt_enable();

We don't have any fanciness around to keep the initiating thread on the
same CPU or check for pending TLB flushes at context switch time or lazy
tlb entry. We don't send asynchronous IPIs from the tlbbatch code and
then check for completion at batch finish time.

There's a lot of complexity to stuffing these TLB flushes into a
microarchitectural buffer and making *sure* they are flushed out.
INVLPGB is not free. It's not clear at all to me that doing loads of
them in reclaim code is superior to doing loads of:

	inc_mm_tlb_gen(mm);
	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));

and then just zapping the whole TLB on the next context switch.

I think we should defer this for now. Let's look at it again when there
are some performance numbers to justify it.