Message-ID: <e9858d9d264cbdb27db7f3df237ba2410522150e.camel@surriel.com>
Date: Mon, 20 Jan 2025 11:09:19 -0500
From: Rik van Riel <riel@...riel.com>
To: Nadav Amit <nadav.amit@...il.com>, x86@...nel.org
Cc: linux-kernel@...r.kernel.org, bp@...en8.de, peterz@...radead.org,
dave.hansen@...ux.intel.com, zhengqi.arch@...edance.com,
thomas.lendacky@....com, kernel-team@...a.com, linux-mm@...ck.org,
akpm@...ux-foundation.org, jannh@...gle.com, mhklinux@...look.com,
andrew.cooper3@...rix.com
Subject: Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for
multi-threaded processes
On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote:
>
>
> On 20/01/2025 4:40, Rik van Riel wrote:
> >
> > +static inline void broadcast_tlb_flush(struct flush_tlb_info *info)
> > +{
> > +	VM_WARN_ON_ONCE(1);
>
> Not sure why not to use VM_WARN_ONCE() instead, with some more
> informative message (a string is allocated for it anyhow).
>
VM_WARN_ON_ONCE only has a condition, not a message.
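For comparison (the macros are the ones from include/linux/mmdebug.h;
the condition and message text here are just illustrative):

	VM_WARN_ON_ONCE(1);				/* condition only */
	VM_WARN_ONCE(1, "%s should never be called", __func__);	/* condition plus message */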
> >
> > +static u16 get_global_asid(void)
> > +{
> > +	lockdep_assert_held(&global_asid_lock);
> > +
> > +	do {
> > +		u16 start = last_global_asid;
> > +		u16 asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
> > +
> > +		if (asid >= MAX_ASID_AVAILABLE) {
> > +			reset_global_asid_space();
> > +			continue;
> > +		}
>
> I think that unless something is awfully wrong, you are supposed to
> call reset_global_asid_space() at most once. So if that's the case,
> why not do it this way?
>
> Instead, you can get rid of the loop and just do:
>
> 	asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE, start);
>
> If you want, you can warn if asid >= MAX_ASID_AVAILABLE and have some
> fallback. But the loop is just confusing, in my opinion, for no
> reason.
I can get rid of the loop. You're right that the code
can just call find_next_zero_bit after calling
reset_global_asid_space.
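Something like this untested sketch, reusing the names from the quoted
patch (the VM_WARN_ON_ONCE() fallback and the bookkeeping at the end
are my guesses at what the surrounding code does):

	static u16 get_global_asid(void)
	{
		u16 asid;

		lockdep_assert_held(&global_asid_lock);

		asid = find_next_zero_bit(global_asid_used, MAX_ASID_AVAILABLE,
					  last_global_asid);

		if (asid >= MAX_ASID_AVAILABLE) {
			/* Wrap around, and retry exactly once. */
			reset_global_asid_space();
			asid = find_next_zero_bit(global_asid_used,
						  MAX_ASID_AVAILABLE,
						  last_global_asid);
			/* Running out right after a reset would be a real bug. */
			VM_WARN_ON_ONCE(asid >= MAX_ASID_AVAILABLE);
		}

		__set_bit(asid, global_asid_used);
		last_global_asid = asid;
		return asid;
	}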
>
> > +	/* Slower check to make sure. */
> > +	for_each_cpu(cpu, mm_cpumask(mm)) {
> > +		/* Skip the CPUs that aren't really running this process. */
> > +		if (per_cpu(cpu_tlbstate.loaded_mm, cpu) != mm)
> > +			continue;
>
> Then perhaps at least add a comment next to loaded_mm, saying that it
> is not private per se, but rarely accessed by other cores?
>
I don't see any comment in struct tlb_state that
suggests it was ever private to begin with.
Which comment are you referring to that should
be edited?
> >
> > +
> > +	/*
> > +	 * The transition between IPI TLB flushing, with a dynamic ASID,
> > +	 * and broadcast TLB flushing, using a global ASID, uses memory
> > +	 * ordering for synchronization.
> > +	 *
> > +	 * While the process has threads still using a dynamic ASID,
> > +	 * TLB invalidation IPIs continue to get sent.
> > +	 *
> > +	 * This code sets asid_transition first, before assigning the
> > +	 * global ASID.
> > +	 *
> > +	 * The TLB flush code will only verify the ASID transition
> > +	 * after it has seen the new global ASID for the process.
> > +	 */
> > +	WRITE_ONCE(mm->context.asid_transition, true);
> > +	WRITE_ONCE(mm->context.global_asid, get_global_asid());
>
> I know it is likely correct in practice (due to the TSO memory
> model), but it is not clear, at least to me, how this write ordering
> affects the rest of the code. I managed to figure out how it relates
> to the reads in flush_tlb_mm_range() and native_flush_tlb_multi(),
> but I wouldn't say it is trivial, and it is worth a comment (or an
> smp_wmb()/smp_rmb() pair).
>
What kind of wording should we add here to make it
easier to understand?
"The TLB invalidation code reads these variables in
the opposite order in which they are written" ?
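Maybe something like that, or spelling the pairing out with explicit
barriers (a sketch only; on x86 these compile down to compiler
barriers, so they would mostly serve as documentation):

	/* Writer, when switching the process to broadcast flushing: */
	WRITE_ONCE(mm->context.asid_transition, true);
	smp_wmb();	/* flag becomes visible before the global ASID */
	WRITE_ONCE(mm->context.global_asid, get_global_asid());

	/* Readers in the flush paths check in the opposite order: */
	asid = READ_ONCE(mm->context.global_asid);
	smp_rmb();	/* pairs with the smp_wmb() above */
	in_transition = READ_ONCE(mm->context.asid_transition);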
> > +		/*
> > +		 * If at least one CPU is not using the global ASID yet,
> > +		 * send a TLB flush IPI. The IPI should cause stragglers
> > +		 * to transition soon.
> > +		 *
> > +		 * This can race with the CPU switching to another task;
> > +		 * that results in a (harmless) extra IPI.
> > +		 */
> > +		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
> > +			flush_tlb_multi(mm_cpumask(info->mm), info);
> > +			return;
>
> I am trying to figure out why we return here. The transition might
> not be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it
> unconditionally?
The transition _should_ be over, but what if another
CPU got an NMI while in the middle of switch_mm_irqs_off,
and set its own bit in the mm_cpumask after we send this
IPI?
On the other hand, if it sets its mm_cpumask bit after
this point, it will also load the mm->context.global_asid
after this point, and should definitely get the new ASID.
I think we are probably fine to set asid_transition to
false here, but I've had to tweak this code so much over
the past months that I don't feel super confident any more :)
>
> > +	/*
> > +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> > +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> > +	 * before these TLB flushes happen.
> > +	 */
> > +	if (info->end == TLB_FLUSH_ALL) {
> > +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
> > +	} else for (; addr < info->end; addr += nr << info->stride_shift) {
>
> I guess I was wrong, and do-while was cleaner here.
>
> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT...
>
We set maxnr to 1 for larger stride shifts at the top of the function:

	/* Flushing multiple pages at once is not supported with 1GB pages. */
	if (info->stride_shift > PMD_SHIFT)
		maxnr = 1;
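With that clamp in place, the do-while version would look roughly like
this (a sketch only; invlpgb_flush_user_nr_nosync() and its argument
list are copied loosely from the patch, not a final signature):

	do {
		unsigned long nr = 1;

		if (info->stride_shift <= PMD_SHIFT)
			nr = min(maxnr, (info->end - addr) >> info->stride_shift);

		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
		if (static_cpu_has(X86_FEATURE_PTI))
			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);

		addr += nr << info->stride_shift;
	} while (addr < info->end);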
> [ I guess the cleanest way would be to change get_flush_tlb_info to
> mask the low bits of start and end based on ((1ull << stride_shift) -
> 1). But whatever... ]
I'll change it back :)
I'm just happy this code is getting lots of attention,
and we're improving it with time.
> > @@ -573,6 +874,23 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> > 			      !cpumask_test_cpu(cpu, mm_cpumask(next))))
> > 			cpumask_set_cpu(cpu, mm_cpumask(next));
> >
> > +		/*
> > +		 * Check if the current mm is transitioning to a new ASID.
> > +		 */
> > +		if (needs_global_asid_reload(next, prev_asid)) {
> > +			next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > +
> > +			choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> > +			goto reload_tlb;
>
> Not a fan of gotos when they are not really needed, and I don't
> think one is really needed here. Especially since the name of the
> tag "reload_tlb" does not really convey that the page tables are
> reloaded at that point.
In this particular case, the CPU continues running with
the same page tables, but with a different PCID.
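For this case, what the code at the reload_tlb label boils down to is
(a sketch based on the existing switch_mm_irqs_off(); the
load_new_mm_cr3() argument list varies a bit between kernel versions):

	reload_tlb:
		if (need_flush) {
			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
			/* Same next->pgd as before; only the ASID/PCID changes. */
			load_new_mm_cr3(next->pgd, new_asid, new_lam, true);
		}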
>
> > +	}
> > +
> > +	/*
> > +	 * Broadcast TLB invalidation keeps this PCID up to date
> > +	 * all the time.
> > +	 */
> > +	if (is_global_asid(prev_asid))
> > +		return;
>
> Hard for me to convince myself
When a process uses a global ASID, we always send
out TLB invalidations using INVLPGB.
The global ASID should always be up to date.
>
> > @@ -769,6 +1092,16 @@ static void flush_tlb_func(void *info)
> > 	if (unlikely(loaded_mm == &init_mm))
> > 		return;
> >
> > +	/* Reload the ASID if transitioning into or out of a global ASID */
> > +	if (needs_global_asid_reload(loaded_mm, loaded_mm_asid)) {
> > +		switch_mm_irqs_off(NULL, loaded_mm, NULL);
>
> I understand you want to reuse that logic, but it doesn't seem
> reasonable to me. It both doesn't convey what you want to do, and can
> lead to undesired operations - cpu_tlbstate_update_lam() for
> instance.
> Probably the impact on performance is minor, but it is an opening for
> future mistakes.
My worry with having a separate code path here is
that the separate code path could bit rot, and we
could introduce bugs that way.
I would rather have a tiny performance impact in
what is a rare code path, than a rare (and hard
to track down) memory corruption due to bit rot.
>
> > +		loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> > +	}
> > +
> > +	/* Broadcast ASIDs are always kept up to date with INVLPGB. */
> > +	if (is_global_asid(loaded_mm_asid))
> > +		return;
>
> The comment does not make it clear to me, and I cannot quite explain
> to myself, why it is guaranteed that all the IPI TLB flushes that
> were potentially issued before the transition are not needed.
>
IPI TLB flushes that were issued before the transition went
to the CPUs while they were still using dynamic ASIDs (numbers 1-5).

Reloading the TLB with a different PCID, even one pointing at the
same page tables, means the TLB must load those translations fresh
from the page tables, and cannot reuse entries it previously
loaded under a different PCID.
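In CR3 terms, that is roughly (a sketch; build_cr3() is the existing
helper in arch/x86/mm/tlb.c, and its LAM argument depends on kernel
version):

	/*
	 * Writing CR3 without the NOFLUSH bit (bit 63) set tells the CPU
	 * not to reuse anything cached under this PCID, and TLB entries
	 * tagged with other PCIDs never match this PCID to begin with.
	 */
	write_cr3(build_cr3(loaded_mm->pgd, new_asid, new_lam));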
--
All Rights Reversed.