Message-ID: <b067a9fc-ff5f-4baa-a1ff-3fa749ae4d44@intel.com>
Date: Fri, 14 Feb 2025 11:53:04 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: Rik van Riel <riel@...riel.com>, x86@...nel.org
Cc: linux-kernel@...r.kernel.org, bp@...en8.de, peterz@...radead.org,
dave.hansen@...ux.intel.com, zhengqi.arch@...edance.com,
nadav.amit@...il.com, thomas.lendacky@....com, kernel-team@...a.com,
linux-mm@...ck.org, akpm@...ux-foundation.org, jackmanb@...gle.com,
jannh@...gle.com, mhklinux@...look.com, andrew.cooper3@...rix.com,
Manali Shukla <Manali.Shukla@....com>
Subject: Re: [PATCH v11 09/12] x86/mm: enable broadcast TLB invalidation for
multi-threaded processes
On 2/13/25 08:14, Rik van Riel wrote:
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
Could we please just zap the "on AMD EPYC 3 and newer CPUs" from all of
these patches? It can be mentioned once in the cover letter or
something, but it doesn't need to be repeated.
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 4 or more CPUs.
Please no "we's". Use imperative voice. This also needs some fleshing
out. Perhaps:
There is not enough room in the 12-bit ASID address space to
hand out broadcast ASIDs to every process. Only hand out
broadcast ASIDs to processes when they are observed to be
simultaneously running on 4 or more CPUs.
Most importantly, this ensures that single threaded processes
continue to use the cheaper, legacy, local TLB invalidation
instructions like INVLPG.
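For illustration only -- the check below uses real cpumask helpers,
but consider_global_asid() is a made-up name, not something from this
series -- the policy boils down to:

	/*
	 * Sketch: promote an mm to a global ASID once it has been
	 * observed running on several CPUs at the same time.
	 */
	if (!mm_global_asid(mm) &&
	    cpumask_weight(mm_cpumask(mm)) >= 4)
		consider_global_asid(mm);	/* hypothetical */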
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> + u16 global_asid;
> + bool asid_transition;
> +#endif
> +
> } mm_context_t;
Please give these at least a line or two comment explaining what they do.
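Something along these lines (the wording is just a sketch):

	#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
		/*
		 * The process-wide ASID, or 0 while this mm is still
		 * using per-CPU dynamic ASIDs.
		 */
		u16 global_asid;

		/*
		 * Set while some CPUs may still run this mm with a
		 * per-CPU dynamic ASID after global_asid was assigned.
		 */
		bool asid_transition;
	#endif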
> #define INIT_MM_CONTEXT(mm) \
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 795fdd53bd0a..d670699d32c2 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -139,6 +139,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>
> +extern void destroy_context_free_global_asid(struct mm_struct *mm);
> +
> /*
> * Init a new mm. Used on mm copies, like at fork()
> * and on mm's that are brand-new, like at execve().
> @@ -161,6 +163,14 @@ static inline int init_new_context(struct task_struct *tsk,
> mm->context.execute_only_pkey = -1;
> }
> #endif
> +
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> + mm->context.global_asid = 0;
> + mm->context.asid_transition = false;
> + }
> +#endif
> +
> mm_reset_untag_mask(mm);
> init_new_context_ldt(mm);
> return 0;
> @@ -170,6 +180,10 @@ static inline int init_new_context(struct task_struct *tsk,
> static inline void destroy_context(struct mm_struct *mm)
> {
> destroy_context_ldt(mm);
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + destroy_context_free_global_asid(mm);
> +#endif
> }
>
> extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index bda7080dec83..3080cb8d21dc 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -6,6 +6,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/sched.h>
>
> +#include <asm/barrier.h>
> #include <asm/processor.h>
> #include <asm/cpufeature.h>
> #include <asm/special_insns.h>
> @@ -239,6 +240,78 @@ void flush_tlb_one_kernel(unsigned long addr);
> void flush_tlb_multi(const struct cpumask *cpumask,
> const struct flush_tlb_info *info);
>
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
> +static inline bool is_dyn_asid(u16 asid)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + return true;
> +
> + return asid < TLB_NR_DYN_ASIDS;
> +}
If possible, could you avoid double-defining the helpers that get
compiled both with and without CONFIG_X86_BROADCAST_TLB_FLUSH? Just put
the #ifdef around the ones that *need* it.
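For instance, is_dyn_asid() could be defined once and still compile
down to a plain "return true" when the option is off. A sketch:

	static inline bool is_dyn_asid(u16 asid)
	{
		if (!IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) ||
		    !cpu_feature_enabled(X86_FEATURE_INVLPGB))
			return true;

		return asid < TLB_NR_DYN_ASIDS;
	}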
...
> +static inline bool in_asid_transition(struct mm_struct *mm)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + return false;
> +
> + return mm && READ_ONCE(mm->context.asid_transition);
> +}
> +
> +static inline u16 mm_global_asid(struct mm_struct *mm)
> +{
> + u16 asid;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
> + return 0;
> +
> + asid = smp_load_acquire(&mm->context.global_asid);
> +
> + /* mm->context.global_asid is either 0, or a global ASID */
> + VM_WARN_ON_ONCE(asid && is_dyn_asid(asid));
> +
> + return asid;
> +}
Yay, some kind of custom lock. :)
Could you give us a little idea of what the locking rules are here and
why this needs the READ_ONCE() and smp_load_acquire()?
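Here is my guess at the intended pairing, just to show what I think
the comment needs to spell out (the writer side is my assumption, not
something quoted from this hunk):

	/* Writer: publish the ASID only after the mm is set up for it. */
	smp_store_release(&mm->context.global_asid, asid);

	/* Reader: a nonzero ASID implies the writer's setup is visible. */
	asid = smp_load_acquire(&mm->context.global_asid);

If that is the rule, please write it down next to the fields.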
...
> + /*
> + * TLB consistency for global ASIDs is maintained with broadcast TLB
> + * flushing. The TLB is never outdated, and does not need flushing.
> + */
This is another case where I think using the word "broadcast" is not
helping.
Here's the problem: INVLPGB is an "INVLPG" that's broadcast. So the name
INVLPGB is fine. INVLPGB is *a* way to broadcast INVLPG, which is *a*
kind of TLB invalidation.
But, to me, "broadcast TLB flushing" is a broad term. It arguably
includes INVLPGB and normal IPI-based flushing. Just like the function
naming in the earlier patch, I think we need a better term here.
> + if (static_cpu_has(X86_FEATURE_INVLPGB)) {
> + u16 global_asid = mm_global_asid(next);
> +
> + if (global_asid) {
> + *new_asid = global_asid;
> + *need_flush = false;
> + return;
> + }
> + }
> +#ifdef CONFIG_X86_BROADCAST_TLB_FLUSH
How cleanly could we throw this hunk in a new file? I always dislike big
#ifdefs like this. They seem like magnets for causing weird compile
problems.
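Say, with a hypothetical file name, in arch/x86/mm/Makefile:

	obj-$(CONFIG_X86_BROADCAST_TLB_FLUSH)	+= tlb_broadcast.o

plus static inline stubs in a header for the =n case, instead of one
big #ifdef section in the middle of tlb.c.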
> +/*
> + * Logic for broadcast TLB invalidation.
> + */
> +static DEFINE_RAW_SPINLOCK(global_asid_lock);
The global lock definitely needs some discussion in the changelog.
> +static u16 last_global_asid = MAX_ASID_AVAILABLE;
> +static DECLARE_BITMAP(global_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE) = { 0 };
Isn't the initialization to all 0's superfluous for a global variable?
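In other words, this should be enough, since statics start out zeroed:

	static DECLARE_BITMAP(global_asid_used,  MAX_ASID_AVAILABLE);
	static DECLARE_BITMAP(global_asid_freed, MAX_ASID_AVAILABLE);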
> +static int global_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_global_asid_space(void)
> +{
> + lockdep_assert_held(&global_asid_lock);
> +
> + /*
> + * A global TLB flush guarantees that any stale entries from
> + * previously freed global ASIDs get flushed from the TLB
> + * everywhere, making these global ASIDs safe to reuse.
> + */
> + invlpgb_flush_all_nonglobals();
Ugh, my suggestion to use the term "global ASID" really doesn't work
here, does it?
Also, isn't an invlpgb_flush_all_nonglobals() _relatively_ slow? It has
to go out and talk over the fabric to every CPU, right? And it is done
while holding a global lock.
That seems worrisome.
> + /*
> + * Clear all the previously freed global ASIDs from the
> + * global_asid_used bitmap, now that the global TLB flush
> + * has made them actually available for re-use.
> + */
> + bitmap_andnot(global_asid_used, global_asid_used,
> + global_asid_freed, MAX_ASID_AVAILABLE);
> + bitmap_clear(global_asid_freed, 0, MAX_ASID_AVAILABLE);
> +
> + /*
> + * ASIDs 0-TLB_NR_DYN_ASIDS are used for CPU-local ASID
> + * assignments, for tasks doing IPI based TLB shootdowns.
> + * Restart the search from the start of the global ASID space.
> + */
> + last_global_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_global_asid(void)
> +{
> + u16 asid;
> +
> + lockdep_assert_held(&global_asid_lock);
> +
> + /* The previously allocated ASID is at the top of the address space. */
> + if (last_global_asid >= MAX_ASID_AVAILABLE - 1)
> + reset_global_asid_space();
> +
> + asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, last_global_asid);
> +
> + if (asid >= MAX_ASID_AVAILABLE) {
> + /* This should never happen. */
> + VM_WARN_ONCE(1, "Unable to allocate global ASID despite %d available\n",
> + global_asid_available);
> + return 0;
> + }
> +
> + /* Claim this global ASID. */
> + __set_bit(asid, global_asid_used);
> + last_global_asid = asid;
> + global_asid_available--;
> + return asid;
> +}
> +
> +/*
> + * Returns true if the mm is transitioning from a CPU-local ASID to a global
> + * (INVLPGB) ASID, or the other way around.
> + */
> +static bool needs_global_asid_reload(struct mm_struct *next, u16 prev_asid)
> +{
> + u16 global_asid = mm_global_asid(next);
> +
> + /* Process is transitioning to a global ASID */
> + if (global_asid && prev_asid != global_asid)
> + return true;
> +
> + /* Transition from global->local ASID does not currently happen. */
> + if (!global_asid && is_global_asid(prev_asid))
> + return true;
> +
> + return false;
> +}
I'm going to throw in the towel at this point. This patch needs to get
broken up. It's more at once than my poor little brain can handle.
The _least_ it can do is introduce the stub functions and the hooks into
the existing code first. Then, in a second patch, introduce the
real implementation.
I also suspect that a big chunk of the ASID allocator could be broken
out and introduced separately.
Another example is broadcast_tlb_flush(). To reduce complexity in _this_
patch, it could do something suboptimal like always doing an
invlpgb_flush_all_nonglobals() regardless of the kind of flush it gets.
Then, in a later patch, you could optimize it.
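Roughly this, as a sketch built on the helper this series already
adds (the signature is my guess):

	static void broadcast_tlb_flush(struct flush_tlb_info *info)
	{
		/*
		 * Deliberately suboptimal for now: ignore info->start
		 * and info->end and flush all non-global entries on
		 * all CPUs.
		 */
		invlpgb_flush_all_nonglobals();
	}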