Message-ID: <bf0a64788c5d42f06f3a61c996dc104af9e69206.camel@surriel.com>
Date: Mon, 13 Jan 2025 22:13:08 -0500
From: Rik van Riel <riel@...riel.com>
To: Nadav Amit <nadav.amit@...il.com>
Cc: the arch/x86 maintainers <x86@...nel.org>, Linux Kernel Mailing List
<linux-kernel@...r.kernel.org>, Borislav Petkov <bp@...en8.de>,
peterz@...radead.org, Dave Hansen <dave.hansen@...ux.intel.com>,
zhengqi.arch@...edance.com, thomas.lendacky@....com, kernel-team@...a.com,
"open list:MEMORY MANAGEMENT"
<linux-mm@...ck.org>, Andrew Morton <akpm@...ux-foundation.org>,
jannh@...gle.com
Subject: Re: [PATCH v4 09/12] x86/mm: enable broadcast TLB invalidation for
multi-threaded processes

On Mon, 2025-01-13 at 15:09 +0200, Nadav Amit wrote:
>
> > On 12 Jan 2025, at 17:53, Rik van Riel <riel@...riel.com> wrote:
> >
> > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> > + u16 global_asid;
> > + bool asid_transition;
>
> As I later note, there are various ordering issues between the two.
> Would it be
> just easier to combine them into one field? I know everybody hates
> bitfields so
> I don’t suggest it, but there are other ways...
It's certainly an option, but I think we have figured out
the ordering issues, so at this point documentation
and readability of the code might be more important
for future-proofing?
>
> > @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> > static inline void destroy_context(struct mm_struct *mm)
> > {
> > destroy_context_ldt(mm);
> > +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
>
> I’d prefer to use IS_ENABLED() and to have a stub for
> destroy_context_free_global_asid().
>
> > + if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > + destroy_context_free_global_asid(mm);
> > +#endif
> > }
I'll think about how to do this more cleanly.
I would like to keep the CPU feature test in the
inline function, so we don't do an unnecessary
function call on systems without INVLPGB.
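Something along these lines, perhaps (untested sketch; the
mm_free_global_asid() wrapper name is made up here, and it still
hides the #ifdef in the wrapper rather than using IS_ENABLED()):

#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
static inline void mm_free_global_asid(struct mm_struct *mm)
{
        /* Keep the feature test inline, so non-INVLPGB systems skip the call. */
        if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
                destroy_context_free_global_asid(mm);
}
#else
static inline void mm_free_global_asid(struct mm_struct *mm) { }
#endif

static inline void destroy_context(struct mm_struct *mm)
{
        destroy_context_ldt(mm);
        mm_free_global_asid(mm);
}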
> >
> > +static inline bool in_asid_transition(const struct flush_tlb_info *info)
> > +{
> > + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> > + return false;
> > +
> > + return info->mm && info->mm->context.asid_transition;
>
> READ_ONCE(context.asid_transition) ?
Changed for the next version.
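I.e. the check becomes roughly (sketch of just that change):

static inline bool in_asid_transition(const struct flush_tlb_info *info)
{
        if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
                return false;

        /* Pairs with the WRITE_ONCE() that sets asid_transition. */
        return info->mm && READ_ONCE(info->mm->context.asid_transition);
}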
>
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
>
> Having a VM_WARN_ON() here might be nice.
Added. Thank you.
>
> > +static u16 get_global_asid(void)
> > +{
> > + lockdep_assert_held(&global_asid_lock);
> > +
> > + do {
> > + u16 start = last_global_asid;
> > + u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > + if (asid >= MAX_ASID_AVAILABLE) {
> > + reset_global_asid_space();
> > + continue;
> > + }
> > +
> > + /* Claim this global ASID. */
> > + __set_bit(asid, global_asid_used);
> > + last_global_asid = asid;
> > + return asid;
> > + } while (1);
>
> This does not make me feel easy at all. I do not understand
> why it might happen. The caller should’ve already checked the global
> ASID
> is available under the lock. If it is not obvious from the code,
> perhaps
> refactoring is needed.
>
The caller checks whether a global ASID is available
anywhere in the namespace; this code claims a specific one.
I guess the global_asid_available-- line could be moved
into get_global_asid() to improve readability?
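Something like this (untested sketch of that refactor):

static u16 get_global_asid(void)
{
        lockdep_assert_held(&global_asid_lock);

        do {
                u16 start = last_global_asid;
                u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);

                if (asid >= MAX_ASID_AVAILABLE) {
                        reset_global_asid_space();
                        continue;
                }

                /* Claim this global ASID and account for it here. */
                __set_bit(asid, global_asid_used);
                last_global_asid = asid;
                global_asid_available--;
                return asid;
        } while (1);
}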
> > +static bool mm_active_cpus_exceeds(struct mm_struct *mm, int threshold)
> > +{
> > + int count = 0;
> > + int cpu;
> > +
> > + /* This quick check should eliminate most single threaded programs. */
> > + if (cpumask_weight(mm_cpumask(mm)) <= threshold)
> > + return false;
> > +
> > + /* Slower check to make sure. */
> > + for_each_cpu(cpu, mm_cpumask(mm)) {
> > + /* Skip the CPUs that aren't really running this process. */
> > + if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > + continue;
>
> Do you really want to make loaded_mm accessed from other cores? Does
> this
> really provide worthy benefit?
>
> Why not just use cpumask_weight() and be done with it? Anyhow it’s a
> heuristic.
We recently added some code to make mm_cpumask
maintenance a lot lazier, which can result in
more CPUs remaining in the bitmap while not
running the mm.
As for accessing loaded_mm from other CPUs, we
already have to do that in finish_asid_transition.
I don't see any good way around that, but I'm open
to suggestions :)
>
> > + /*
> > + * The transition from IPI TLB flushing, with a dynamic ASID,
> > + * and broadcast TLB flushing, using a global ASID, uses memory
> > + * ordering for synchronization.
> > + *
> > + * While the process has threads still using a dynamic ASID,
> > + * TLB invalidation IPIs continue to get sent.
> > + *
> > + * This code sets asid_transition first, before assigning the
> > + * global ASID.
> > + *
> > + * The TLB flush code will only verify the ASID transition
> > + * after it has seen the new global ASID for the process.
> > + */
> > + WRITE_ONCE(mm->context.asid_transition, true);
>
> I would prefer smp_wmb() and document where the matching smp_rmb()
> (or smp_mb) is.
>
> > + WRITE_ONCE(mm->context.global_asid, get_global_asid());
> > +
> > + global_asid_available--;
> > +}
> >
I would be happy with either style of ordering.
It's all a bit of a no-op anyway, because x86 does not
reorder stores with other stores.
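For completeness, the explicit-barrier variant you suggest would
look roughly like this (sketch only; on x86 smp_wmb()/smp_rmb()
are compiler barriers, so the difference is mostly documentation):

        /* Set the transition flag before publishing the global ASID. */
        WRITE_ONCE(mm->context.asid_transition, true);
        smp_wmb();
        WRITE_ONCE(mm->context.global_asid, get_global_asid());

        /*
         * The flush path would pair with this by loading the global
         * ASID first, then checking the transition flag:
         *
         *      asid = READ_ONCE(mm->context.global_asid);
         *      smp_rmb();
         *      transition = READ_ONCE(mm->context.asid_transition);
         */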
> > +static void finish_asid_transition(struct flush_tlb_info *info)
> > +{
> > + struct mm_struct *mm = info->mm;
> > + int bc_asid = mm_global_asid(mm);
> > + int cpu;
> > +
> > + if (!READ_ONCE(mm->context.asid_transition))
> > + return;
> > +
> > + for_each_cpu(cpu, mm_cpumask(mm)) {
> > + /*
> > + * The remote CPU is context switching. Wait for that to
> > + * finish, to catch the unlikely case of it switching to
> > + * the target mm with an out of date ASID.
> > + */
> > + while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
> > + cpu_relax();
>
> Although this code should rarely run, it seems bad for a couple of
> reasons:
>
> 1. It is a new busy-wait in a very delicate place. Lockdep is blind
> to this
> change.
This is true. However, if a CPU gets stuck in the middle
of switch_mm_irqs_off, won't we have a bigger problem?
>
> 2. cpu_tlbstate is supposed to be private for each core - that’s why
> there
> is cpu_tlbstate_shared. But I really think loaded_mm should be
> kept
> private.
>
> Can't we just do one TLB shootdown if
> cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids
That conflicts with a future optimization of simply
not maintaining the mm_cpumask at all for processes
that use broadcast TLB invalidation.
After all, once we no longer use the mm_cpumask for
anything, why incur the overhead of maintaining it?
I would like to understand more about the harm of
reading loaded_mm. How is reading loaded_mm worse
than reading other per-CPU variables that are written
from the same code paths?
>
> > + /* All the CPUs running this process are using the global ASID. */
>
> I guess it’s ordered with the flushes (the flushes must complete
> first).
>
If there are any flushes.
If all the CPUs we scanned are already using the
global ASID, we do not need any additional ordering
here, since any CPU switching to this mm afterward
will see the new global ASID.
> > + WRITE_ONCE(mm->context.asid_transition, false);
> > +}
> >
> > + } else do {
> > + /*
> > + * Calculate how many pages can be flushed at once; if the
> > + * remainder of the range is less than one page, flush one.
> > + */
> > + nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> > + nr = max(nr, 1);
> > +
> > + invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
> > + /* Do any CPUs supporting INVLPGB need PTI? */
> > + if (static_cpu_has(X86_FEATURE_PTI))
> > + invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
> > + addr += nr << info->stride_shift;
> > + } while (addr < info->end);
>
> I would have preferred for instead of while...
Changed that for the next version. Thank you.
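It ends up looking roughly like this (sketch, not necessarily the
exact next version; assumes addr starts at info->start and the
range is non-empty on entry):

        for (addr = info->start; addr < info->end; addr += nr << info->stride_shift) {
                /*
                 * Calculate how many pages can be flushed at once; if the
                 * remainder of the range is less than one page, flush one.
                 */
                nr = min(maxnr, (info->end - addr) >> info->stride_shift);
                nr = max(nr, 1);

                invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
                /* Do any CPUs supporting INVLPGB need PTI? */
                if (static_cpu_has(X86_FEATURE_PTI))
                        invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
        }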
> > @@ -1049,9 +1385,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> > * a local TLB flush is needed. Optimize this use-case by calling
> > * flush_tlb_func_local() directly in this case.
> > */
> > - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
>
> I think an smp_rmb() here would communicate the fact
> in_asid_transition() and
> mm_global_asid() must be ordered.
>
> > + if (mm_global_asid(mm)) {
> > + broadcast_tlb_flush(info);
We have the barrier already, in the form of
inc_mm_tlb_gen a few lines up.
Or are you talking about a barrier (or READ_ONCE?)
inside mm_global_asid(), to make sure the compiler
cannot reorder things around mm_global_asid()?
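If it is the latter, mm_global_asid() could just do the READ_ONCE
itself, roughly (sketch; the INVLPGB check is assumed from context):

static inline u16 mm_global_asid(struct mm_struct *mm)
{
        if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
                return 0;

        /* Pairs with the WRITE_ONCE() that assigns the global ASID. */
        return READ_ONCE(mm->context.global_asid);
}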
> > + } else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> > info->trim_cpumask = should_trim_cpumask(mm);
> > flush_tlb_multi(mm_cpumask(mm), info);
> > + consider_global_asid(mm);
> > } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> > lockdep_assert_irqs_enabled();
> > local_irq_disable();
> > --
> > 2.47.1
> >
>
--
All Rights Reversed.