linux-kernel - Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Mon, 6 Jun 2022 21:06:47 +0000
From:   Nadav Amit <namit@...are.com>
To:     Andy Lutomirski <luto@...nel.org>
CC:     Dave Hansen <dave.hansen@...ux.intel.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        the arch/x86 maintainers <x86@...nel.org>,
        Nadav Amit <nadav.amit@...il.com>
Subject: Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible

On Jun 6, 2022, at 1:48 PM, Andy Lutomirski <luto@...nel.org> wrote:

> ⚠ External Email
> 
> On Mon, Jun 6, 2022, at 11:01 AM, Nadav Amit wrote:
>> From: Nadav Amit <namit@...are.com>
>> 
>> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
>> contended and reading it should (arguably) be avoided as much as
>> possible.
>> 
>> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
>> even when it is not necessary (e.g., the mm was already switched).
>> This is wasteful.
>> 
>> Moreover, one of the existing optimizations is to read mm's tlb_gen to
>> see if there are additional in-flight TLB invalidations and flush the
>> entire TLB in such a case. However, if the request's tlb_gen was already
>> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
>> by the overhead of the check itself.
>> 
>> Running will-it-scale with tlb_flush1_threads show a considerable
>> benefit on 56-core Skylake (up to +24%):
> 
> Acked-by: Andy Lutomirski <luto@...nel.org>
> 
> But...
> 
> I'm suspicious that the analysis is missing something. Under this kind of workload, there are a whole bunch of flushes being initiated, presumably in parallel. Each flush does an RMW on mm_tlb_gen (which will make the cacheline exclusive on the initiating CPU). And each flush sends out an IPI, and the IPI handler reads mm_tlb_gen (which makes the cacheline shared) when it updates the local tlb_gen. So you're doing (at least!) an E->S and S->E transition per flush. Your patch doesn't change this.
> 
> But your patch does add a whole new case in which the IPI handler simply doesn't flush! I think it takes either quite a bit of racing or a well-timed context switch to hit that case, but, if you hit it, then you skip a flush and you skip the read of mm_tlb_gen.
> 
> Have you tested what happens if you do something like your patch but you also make the mm_tlb_gen read unconditional? I'm curious if there's more to the story than you're seeing.
> 
> You could also contemplate a somewhat evil hack in which you don't read mm_tlb_gen even if you *do* flush and instead use f->new_tlb_gen. That would potentially do a bit of extra flushing but would avoid the flush path causing the E->S transition. (Which may be of dubious value for real workloads, since I don't think there's a credible way to avoid having context switches read mm_tlb_gen.)

Thanks Andy. I still think that the performance comes from saving cache
accesses, which are skipped in certain cases in this workload. I would note
that this patch comes from me profiling will-it-scale, after Dave complained
that I ruined the performance in some other patch. So this is not a random
“I tried something and it’s better”.

I vaguely remember profiling the number of cache-[something] and seeing an
effect, and I cannot explain such performance improvement by just skipping a
flush. But...

Having said all of that, I will run at least the first experiment that you
asked for. I was considering skipping reading mm_tlb_gen completely, but for
the reasons that you mentioned considered it as something that might
introduce performance regression for workloads that are more important than
will-it-scale.

I would also admit that I am not sure how to completely prevent speculative
read of mm->tlb_gen. I guess a serializing instruction is out of the
question, so this optimization is a best-effort.