Message-ID: <b067a9fc-ff5f-4baa-a1ff-3fa749ae4d44@intel.com>
Date: Fri, 14 Feb 2025 11:53:04 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: Rik van Riel <riel@...riel.com>, x86@...nel.org
Cc: linux-kernel@...r.kernel.org, bp@...en8.de, peterz@...radead.org,
dave.hansen@...ux.intel.com, zhengqi.arch@...edance.com,
nadav.amit@...il.com, thomas.lendacky@....com, kernel-team@...a.com,
linux-mm@...ck.org, akpm@...ux-foundation.org, jackmanb@...gle.com,
jannh@...gle.com, mhklinux@...look.com, andrew.cooper3@...rix.com,
Manali Shukla <Manali.Shukla@....com>
Subject: Re: [PATCH v11 09/12] x86/mm: enable broadcast TLB invalidation for
multi-threaded processes
On 2/13/25 08:14, Rik van Riel wrote:
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
Could we please just zap the "on AMD EPYC 3 and newer CPUs" from all of
these patches? It can be mentioned once in the cover letter or
something, but it doesn't need to be repeated.
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 4 or more CPUs.
Please no "we's". Use imperative voice. This also needs some fleshing
out. Perhaps:
There is not enough room in the 12-bit ASID address space to
hand out broadcast ASIDs to every process. Only hand out
broadcast ASIDs to processes when they are observed to be
simultaneously running on 4 or more CPUs.
Most importantly, this ensures that single threaded processes
continue to use the cheaper, legacy, local TLB invalidation
instructions like INVLPG.
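For illustration only -- the check below uses real cpumask helpers,
but consider_global_asid() is a made-up name, not something from this
series -- the policy boils down to:

	/*
	 * Sketch: promote an mm to a global ASID once it has been
	 * observed running on several CPUs at the same time.
	 */
	if (!mm_global_asid(mm) &&
	    cpumask_weight(mm_cpumask(mm)) >= 4)
		consider_global_asid(mm);	/* hypothetical */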
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> + u16 global_asid;
> + bool asid_transition;
> +#endif
> +
> } mm_context_t;
Please give these at least a line or two comment explaining what they do.
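Something along these lines (the wording is just a sketch):

	#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
		/*
		 * The process-wide ASID, or 0 while this mm is still
		 * using per-CPU dynamic ASIDs.
		 */
		u16 global_asid;

		/*
		 * Set while some CPUs may still run this mm with a
		 * per-CPU dynamic ASID after global_asid was assigned.
		 */
		bool asid_transition;
	#endif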
> #define INIT_MM_CONTEXT(mm) \
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 795fdd53bd0a..d670699d32c2 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>
> +extern void destroy_context_free_global_asid(struct mm_struct *mm);
> +
> /*
> * Init a new mm. Used on mm copies, like at fork()
> * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
> mm->context.execute_only_pkey = -1;
> }
> #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> + mm->context.global_asid = 0;
> + mm->context.asid_transition = false;
> + }
> +#endif
> +
> mm_reset_untag_mask(mm);
> init_new_context_ldt(mm);
> return 0;
> @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> static inline void destroy_context(struct mm_struct *mm)
> {
> destroy_context_ldt(mm);
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + destroy_context_free_global_asid(mm);
> +#endif
> }
>
> extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index bda7080dec83..3080cb8d21dc 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -6,6 +6,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/sched.h>
>
> +#include <asm/barrier.h>
> #include <asm/processor.h>
> #include <asm/cpufeature.h>
> #include <asm/special_insns.h>
> @@ -239,6 +240,78 @@ void flush_tlb_one_kernel(unsigned long addr);
> void flush_tlb_multi(const struct cpumask *cpumask,
> const struct flush_tlb_info *info);
>
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +static inline bool is_dyn_asid(u16 asid)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + return true;
> +
> + return asid < TLB_NR_DYN_ASIDS;
> +}
If possible, could you avoid double-defining the helpers that get
compiled both with and without CONFIG_X86_BROADCAST_TLB_FLUSH? Just put
the #ifdef around the ones that *need* it.
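For instance, is_dyn_asid() could be defined once and still compile
down to a plain "return true" when the option is off. A sketch:

	static inline bool is_dyn_asid(u16 asid)
	{
		if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) ||
		    !cpu_feature_enabled(X86_FEATURE_INVLPGB))
			return true;

		return asid < TLB_NR_DYN_ASIDS;
	}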
...
> +static inline bool in_asid_transition(struct mm_struct *mm)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + return false;
> +
> + return mm && READ_ONCE(mm->context.asid_transition);
> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> + u16 asid;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + return 0;
> +
> + asid = smp_load_acquire(&mm->context.global_asid);
> +
> + /* mm->context.global_asid is either 0, or a global ASID */
> + VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
> +
> + return asid;
> +}
Yay, some kind of custom lock. :)
Could you give us a little idea of what the locking rules are here and
why this needs the READ_ONCE() and smp_load_acquire()?
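Here is my guess at the intended pairing, just to show what I think
the comment needs to spell out (the writer side is my assumption, not
something quoted from this hunk):

	/* Writer: publish the ASID only after the mm is set up for it. */
	smp_store_release(&mm->context.global_asid, asid);

	/* Reader: a nonzero ASID implies the writer's setup is visible. */
	asid = smp_load_acquire(&mm->context.global_asid);

If that is the rule, please write it down next to the fields.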
...
> + /*
> + * TLB consistency for global ASIDs is maintained with broadcast TLB
> + * flushing. The TLB is never outdated, and does not need flushing.
> + */
This is another case where I think using the word "broadcast" is not
helping.
Here's the problem: INVLPGB is an "INVLPG" that's broadcast. So the name
INVLPGB is fine. INVLPGB is *a* way to broadcast INVLPG, which is *a*
kind of TLB invalidation.
But, to me, "broadcast TLB flushing" is a broad term. It arguably
includes INVLPGB and normal IPI-based flushing. Just like the function
naming in the earlier patch, I think we need a better term here.
> + if (static_cpu_has(X86_FEATURE_INVLPGB)) {
> + u16 global_asid = mm_global_asid(next);
> +
> + if (global_asid) {
> + *new_asid = global_asid;
> + *need_flush = false;
> + return;
> + }
> + }
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
How cleanly could we throw this hunk in a new file? I always dislike big
#ifdefs like this. They seem like magnets for causing weird compile
problems.
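Say, with a hypothetical file name, in arch/x86/mm/Makefile:

	obj-$(CONFIG_X86_BROADCAST_TLB_FLUSH)	+= tlb_broadcast.o

plus static inline stubs in a header for the =n case, instead of one
big #ifdef section in the middle of tlb.c.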
> +/*
> + * Logic for broadcast TLB invalidation.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);
The global lock definitely needs some discussion in the changelog.
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
Isn't the initialization to all 0's superfluous for a global variable?
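In other words, this should be enough, since statics start out zeroed:

	static DECLARE_BITMAP(global_asid_used,  MAX_ASID_AVAILABLE);
	static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE);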
> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_global_asid_space(void)
> +{
> + lockdep_assert_held(&global_asid_lock);
> +
> + /*
> + * A global TLB flush guarantees that any stale entries from
> + * previously freed global ASIDs get flushed from the TLB
> + * everywhere, making these global ASIDs safe to reuse.
> + */
> + invlpgb_flush_all_nonglobals();
Ugh, my suggestion to use the term "global ASID" really doesn't work
here, does it?
Also, isn't an invlpgb_flush_all_nonglobals() _relatively_ slow? It has
to go out and talk over the fabric to every CPU, right? And it is done
while holding a global lock.
That seems worrisome.
> + /*
> + * Clear all the previously freed global ASIDs from the
> + * global_asid_used bitmap, now that the global TLB flush
> + * has made them actually available for re-use.
> + */
> + bitmap_andnot(global_asid_used, global_asid_used,
> + global_asid_freed, MAX_ASID_AVAILABLE);
> + bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> + /*
> + * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
> + * assignments, for tasks doing IPI based TLB shootdowns.
> + * Restart the search from the start of the global ASID space.
> + */
> + last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_global_asid(void)
> +{
> + u16 asid;
> +
> + lockdep_assert_held(&global_asid_lock);
> +
> + /* The previously allocated ASID is at the top of the address space. */
> + if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
> + reset_global_asid_space();
> +
> + asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
> +
> + if (asid >= MAX_ASID_AVAILABLE) {
> + /* This should never happen. */
> + VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
> + global_asid_available);
> + return 0;
> + }
> +
> + /* Claim this global ASID. */
> + __set_bit(asid, global_asid_used);
> + last_global_asid = asid;
> + global_asid_available--;
> + return asid;
> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> + u16 global_asid = mm_global_asid(next);
> +
> + /* Process is transitioning to a global ASID */
> + if (global_asid && prev_asid != global_asid)
> + return true;
> +
> + /* Transition from global->local ASID does not currently happen. */
> + if (!global_asid && is_global_asid(prev_asid))
> + return true;
> +
> + return false;
> +}
I'm going to throw in the towel at this point. This patch needs to get
broken up. It's more at once than my poor little brain can handle.
The _least_ it can do is introduce the stub functions and the hooks into
the existing code first. Then, in a second patch, introduce the
real implementation.
I also suspect that a big chunk of the ASID allocator could be broken
out and introduced separately.
Another example is broadcast_tlb_flush(). To reduce complexity in _this_
patch, it could do something suboptimal like always doing an
invlpgb_flush_all_nonglobals() regardless of the kind of flush it gets.
Then, in a later patch, you could optimize it.
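Roughly this, as a sketch built on the helper this series already
adds (the signature is my guess):

	static void broadcast_tlb_flush(struct flush_tlb_info *info)
	{
		/*
		 * Deliberately suboptimal for now: ignore info->start
		 * and info->end and flush all non-global entries on
		 * all CPUs.
		 */
		invlpgb_flush_all_nonglobals();
	}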