linux-kernel - Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <8F37526F-8189-483A-A16E-E0EB8662AD98@amacapital.net>
Date:   Mon, 1 Feb 2021 16:14:58 -0800
From:   Andy Lutomirski <luto@...capital.net>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Linux-MM <linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>,
        Andy Lutomirski <luto@...nel.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        X86 ML <x86@...nel.org>
Subject: Re: [RFC 15/20] mm: detect deferred TLB flushes in vma granularity


> On Feb 1, 2021, at 2:04 PM, Nadav Amit <nadav.amit@...il.com> wrote:
> 
> 
>> 
>> On Jan 30, 2021, at 4:11 PM, Nadav Amit <nadav.amit@...il.com> wrote:
>> 
>> From: Nadav Amit <namit@...are.com>
>> 
>> Currently, deferred TLB flushes are detected in the mm granularity: if
>> there is any deferred TLB flush in the entire address space due to NUMA
>> migration, pte_accessible() in x86 would return true, and
>> ptep_clear_flush() would require a TLB flush. This would happen even if
>> the PTE resides in a completely different vma.
> 
> [ snip ]
> 
>> +static inline void read_defer_tlb_flush_gen(struct mmu_gather *tlb)
>> +{
>> +    struct mm_struct *mm = tlb->mm;
>> +    u64 mm_gen;
>> +
>> +    /*
>> +     * Any change of PTE before calling __track_deferred_tlb_flush() must be
>> +     * performed using RMW atomic operation that provides a memory barriers,
>> +     * such as ptep_modify_prot_start().  The barrier ensure the PTEs are
>> +     * written before the current generation is read, synchronizing
>> +     * (implicitly) with flush_tlb_mm_range().
>> +     */
>> +    smp_mb__after_atomic();
>> +
>> +    mm_gen = atomic64_read(&mm->tlb_gen);
>> +
>> +    /*
>> +     * This condition checks for both first deferred TLB flush and for other
>> +     * TLB pending or executed TLB flushes after the last table that we
>> +     * updated. In the latter case, we are going to skip a generation, which
>> +     * would lead to a full TLB flush. This should therefore not cause
>> +     * correctness issues, and should not induce overheads, since anyhow in
>> +     * TLB storms it is better to perform full TLB flush.
>> +     */
>> +    if (mm_gen != tlb->defer_gen) {
>> +        VM_BUG_ON(mm_gen < tlb->defer_gen);
>> +
>> +        tlb->defer_gen = inc_mm_tlb_gen(mm);
>> +    }
>> +}
> 
> Andy’s comments managed to make me realize this code is wrong. We must
> call inc_mm_tlb_gen(mm) every time.
> 
> Otherwise, a CPU that saw the old tlb_gen and updated it in its local
> cpu_tlbstate on a context-switch. If the process was not running when the
> TLB flush was issued, no IPI will be sent to the CPU. Therefore, later
> switch_mm_irqs_off() back to the process will not flush the local TLB.
> 
> I need to think if there is a better solution. Multiple calls to
> inc_mm_tlb_gen() during deferred flushes would trigger a full TLB flush
> instead of one that is specific to the ranges, once the flush actually takes
> place. On x86 it’s practically a non-issue, since anyhow any update of more
> than 33-entries or so would cause a full TLB flush, but this is still ugly.
> 

What if we had a per-mm ring buffer of flushes?  When starting a flush, we would stick the range in the ring buffer and, when flushing, we would read the ring buffer to catch up.  This would mostly replace the flush_tlb_info struct, and it would let us process multiple partial flushes together.