Message-ID: <855298e6e981378c3afeab93b8c3cb821a7a5b88.camel@surriel.com>
Date: Thu, 09 Jan 2025 15:16:57 -0500
From: Rik van Riel <riel@...riel.com>
To: Dave Hansen <dave.hansen@...el.com>, x86@...nel.org
Cc: linux-kernel@...r.kernel.org, kernel-team@...a.com, 
	dave.hansen@...ux.intel.com, luto@...nel.org, peterz@...radead.org, 
	tglx@...utronix.de, mingo@...hat.com, bp@...en8.de, hpa@...or.com, 
	akpm@...ux-foundation.org, nadav.amit@...il.com,
	zhengqi.arch@...edance.com, linux-mm@...ck.org
Subject: Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes

On Mon, 2025-01-06 at 09:21 -0800, Dave Hansen wrote:
> On 12/30/24 09:53, Rik van Riel wrote:
> 
> 
> > +static void broadcast_kernel_range_flush(unsigned long start, unsigned long end)
> > +{
> > +	unsigned long addr;
> > +	unsigned long maxnr = invlpgb_count_max;
> > +	unsigned long threshold = tlb_single_page_flush_ceiling * maxnr;
> 
> The 'tlb_single_page_flush_ceiling' value was determined by looking at
> _local_ invalidation cost. Could you talk a bit about why it's also a
> good value to use for remote invalidations? Does it hold up for INVLPGB
> the same way it did for good ol' INVLPG? Has there been any explicit
> testing here to find a good value?
> 
> I'm also confused by the multiplication here. Let's say
> invlpgb_count_max==20 and tlb_single_page_flush_ceiling==30.
> 
> You would need to switch away from single-address invalidation when the
> number of addresses is >20 for INVLPGB functional reasons. But you'd
> also need to switch away when >30 for performance reasons
> (tlb_single_page_flush_ceiling).
> 
> But I don't understand how that would make the threshold 20*30=600
> invalidations.

I have not done any measurement to see how
flushing with INVLPGB stacks up versus
local TLB flushes.

What makes INVLPGB potentially slower:
- These flushes are done globally

What makes INVLPGB potentially faster:
- Multiple flushes can be pending simultaneously,
  and executed in any convenient order by the CPUs.
- Wait once on completion of all the queued flushes
  (see the sketch below).
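
In code, those two properties boil down to the
pattern used further down in this patch; roughly
(a sketch using the helpers this series introduces):

        /* Queue any number of broadcast flushes; the remote
         * CPUs process them whenever convenient. */
        for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
                nr = min((end - addr) >> PAGE_SHIFT, invlpgb_count_max);
                invlpgb_flush_addr(addr, nr);
        }

        /* Wait once for all of the queued flushes to complete. */
        tlbsync();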

Another factor that makes this interesting is
the TLB entry coalescing done by AMD CPUs.

When multiple pages are both virtually and physically
contiguous in memory (which is fairly common), the
CPU can use a single TLB entry to map up to 8 of them.

That means if we issue e.g. 20 INVLPGB flushes for
8 4kB pages each, instead of the CPUs needing to
remove 160 TLB entries, there might only be 50.

I just guessed at the numbers used here, while
trying to sort out the details elsewhere in
the code.

How should we go about measuring the tradeoff
between invalidation time and the time spent in
TLB misses from flushing unnecessary stuff?

> 
> > +	/*
> > +	 * TLBSYNC only waits for flushes originating on the same CPU.
> > +	 * Disabling migration allows us to wait on all flushes.
> > +	 */
> 
> Imperative voice here too, please:
> 
> 	Disable migration to wait on all flushes.
> 
> > +	guard(preempt)();
> > +
> > +	if (end == TLB_FLUSH_ALL ||
> > +	    (end - start) > threshold << PAGE_SHIFT) {
> 
> This is basically a copy-and-paste of the "range vs. global" flush
> logic, but taking 'invlpgb_count_max' into account.
> 
> It would be ideal if those limit checks could be consolidated. I
> suspect that when the 'threshold' calculation above gets clarified,
> they may be easier to consolidate.

Maybe?

I implemented another suggestion in the code,
and the start of flush_tlb_kernel_range() now
looks like this:

void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
        if (broadcast_kernel_range_flush(start, end))
                return;

If we are to consolidate the limit check, it
should probably go into a helper function somewhere,
rather than be done by spreading the broadcast
flush calls around.
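
Something like this, purely as a sketch (the
helper name and exact shape here are made up,
not part of the series):

        /* Hypothetical helper: true if a ranged flush stays under the ceiling. */
        static bool kernel_flush_is_ranged(unsigned long start, unsigned long end,
                                           unsigned long ceiling)
        {
                if (end == TLB_FLUSH_ALL)
                        return false;
                return (end - start) <= (ceiling << PAGE_SHIFT);
        }

The INVLPGB path would pass in
tlb_single_page_flush_ceiling * invlpgb_count_max,
and the IPI path tlb_single_page_flush_ceiling
itself.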

> 
> BTW, what is a typical value for 'invlpgb_count_max'? Is it more or
> less than the typical value for 'tlb_single_page_flush_ceiling'?
> 
> Maybe we should just lower 'tlb_single_page_flush_ceiling' if
> 'invlpgb_count_max' falls below it so we only have _one_ runtime
> value to consider.

The value for invlpgb_count_max on both Milan
and Bergamo CPUs appears to be 8. That is, the
CPU reports we can flush 7 additional pages
beyond a single page.

This matches the number of PTEs that can be 
cached in one TLB entry if they are contiguous
and aligned, and matches one cache line worth
of PTE entries.
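
If we do want a single runtime value, the clamp
Dave suggests could be as small as something like
this at INVLPGB setup time (illustrative only,
not part of this series):

        /* Keep one knob: cap the single-page flush ceiling at
         * what a single INVLPGB can cover. */
        if (cpu_feature_enabled(X86_FEATURE_INVLPGB) &&
            invlpgb_count_max < tlb_single_page_flush_ceiling)
                tlb_single_page_flush_ceiling = invlpgb_count_max;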

> 
> 
> > +		invlpgb_flush_all();
> > +	} else {
> > +		unsigned long nr;
> > +		for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
> > +			nr = min((end - addr) >> PAGE_SHIFT, maxnr);
> > +			invlpgb_flush_addr(addr, nr);
> > +		}
> > +	}
> > +
> > +	tlbsync();
> > +}
> > +
> >  static void do_kernel_range_flush(void *info)
> >  {
> >  	struct flush_tlb_info *f = info;
> > @@ -1089,6 +1115,11 @@ static void do_kernel_range_flush(void *info)
> >  
> >  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> >  {
> > +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> > +		broadcast_kernel_range_flush(start, end);
> > +		return;
> > +	}
> > +
> >  	/* Balance as user space task's flush, a bit conservative */
> >  	if (end == TLB_FLUSH_ALL ||
> >  	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
> 
> I also wonder if this would all get simpler if we give in and *always*
> call get_flush_tlb_info(). That would provide a nice single place to
> consolidate the "all vs. ranged" flush logic.

Possibly. That might be a good way to unify that
threshold check?

That should probably be a separate patch, though.


-- 
All Rights Reversed.
