linux-kernel - Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <234e853b-0c35-45de-b40d-2058976e49e0@www.fastmail.com>
Date:   Mon, 06 Jun 2022 13:48:00 -0700
From:   "Andy Lutomirski" <luto@...nel.org>
To:     "Nadav Amit" <nadav.amit@...il.com>,
        "Dave Hansen" <dave.hansen@...ux.intel.com>
Cc:     "Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>,
        "Nadav Amit" <namit@...are.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        "Ingo Molnar" <mingo@...nel.org>,
        "Thomas Gleixner" <tglx@...utronix.de>,
        "the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible

On Mon, Jun 6, 2022, at 11:01 AM, Nadav Amit wrote:
> From: Nadav Amit <namit@...are.com>
>
> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
> contended and reading it should (arguably) be avoided as much as
> possible.
>
> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
> even when it is not necessary (e.g., the mm was already switched).
> This is wasteful.
>
> Moreover, one of the existing optimizations is to read mm's tlb_gen to
> see if there are additional in-flight TLB invalidations and flush the
> entire TLB in such a case. However, if the request's tlb_gen was already
> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
> by the overhead of the check itself.
>
> Running will-it-scale with tlb_flush1_threads show a considerable
> benefit on 56-core Skylake (up to +24%):

Acked-by: Andy Lutomirski <luto@...nel.org>

But...

I'm suspicious that the analysis is missing something.  Under this kind of workload, there are a whole bunch of flushes being initiated, presumably in parallel.  Each flush does an RMW on mm_tlb_gen (which will make the cacheline exclusive on the initiating CPU).  And each flush sends out an IPI, and the IPI handler reads mm_tlb_gen (which makes the cacheline shared) when it updates the local tlb_gen.  So you're doing (at least!) an E->S and S->E transition per flush.  Your patch doesn't change this.

But your patch does add a whole new case in which the IPI handler simply doesn't flush!  I think it takes either quite a bit of racing or a well-timed context switch to hit that case, but, if you hit it, then you skip a flush and you skip the read of mm_tlb_gen.

Have you tested what happens if you do something like your patch but you also make the mm_tlb_gen read unconditional?  I'm curious if there's more to the story than you're seeing.

You could also contemplate a somewhat evil hack in which you don't read mm_tlb_gen even if you *do* flush and instead use f->new_tlb_gen.  That would potentially do a bit of extra flushing but would avoid the flush path causing the E->S transition.  (Which may be of dubious value for real workloads, since I don't think there's a credible way to avoid having context switches read mm_tlb_gen.)

--Andy