linux-kernel - Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

Message-Id: <B6566AA4-2E8A-43BD-A224-0F6D4747B8FB@surriel.com>
Date:   Wed, 18 Jul 2018 16:58:28 -0400
From:   Rik van Riel <riel@...riel.com>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Mike Galbraith <efault@....de>,
        kernel-team <kernel-team@...com>, Ingo Molnar <mingo@...nel.org>,
        Dave Hansen <dave.hansen@...el.com>
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier



> On Jul 17, 2018, at 4:04 PM, Andy Lutomirski <luto@...nel.org> wrote:
> 
> 
> I think you've introduced a minor-ish performance regression due to
> changing the old (admittedly terribly documented) control flow a bit.
> Before, if real_prev == next, we would skip:
> 
> load_mm_cr4(next);
> switch_ldt(real_prev, next);
> 
> Now we don't any more.  I think you should reinstate that
> optimization.  It's probably as simple as wrapping them in an if
> (real_prev != next) with a comment like /* Remote changes that would
> require a cr4 or ldt reload will unconditionally send an IPI even to
> lazy CPUs.  So, if we aren't changing our mm, we don't need to refresh
> cr4 or the ldt */

Looks like switch_ldt already skips reloading the LDT in the common
case where neither prev nor next has an LDT at all:

        if (unlikely((unsigned long)prev->context.ldt |
                     (unsigned long)next->context.ldt))
                load_mm_ldt(next);
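
Reading that condition literally, it ORs the two ldt pointers together,
so the reload is skipped exactly when both are NULL. A rough user-space
sketch of just that test, to make the cases explicit. The struct names
and ldt_reload_needed() are made up for illustration and are not kernel
code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; only the ldt
 * pointer matters for this check. */
struct ldt_struct { int nr_entries; };
struct mm_context { struct ldt_struct *ldt; };
struct mm { struct mm_context context; };

/* Returns true when the condition quoted above would end up calling
 * load_mm_ldt(next), i.e. when either mm has a non-NULL ldt pointer. */
static bool ldt_reload_needed(const struct mm *prev, const struct mm *next)
{
        return ((uintptr_t)prev->context.ldt |
                (uintptr_t)next->context.ldt) != 0;
}
```

With this model, two mms without an LDT (including prev == next) skip
the reload, while any pair where at least one side has an LDT does not.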

It appears that the cr4 bits have a similar optimization:

static inline void cr4_set_bits(unsigned long mask)
{
        unsigned long cr4, flags;

        local_irq_save(flags);
        cr4 = this_cpu_read(cpu_tlbstate.cr4);
        if ((cr4 | mask) != cr4)
                __cr4_set(cr4 | mask);
        local_irq_restore(flags);
}
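
The same pattern as a stand-alone sketch, in case it helps to see the
effect in isolation: a cached copy of the register is compared first,
and the (expensive) write only happens when a bit actually changes.
This is user-space illustration only; cr4_shadow, cr4_write_count,
fake_cr4_write() and sketch_cr4_set_bits() are invented names standing
in for cpu_tlbstate.cr4 and __cr4_set(), and the irq save/restore is
left out:

```c
/* Stand-in for the per-CPU shadow copy (cpu_tlbstate.cr4 above). */
static unsigned long cr4_shadow;

/* Counts how often the simulated register write really happens. */
static unsigned int cr4_write_count;

/* Stand-in for __cr4_set(): update the shadow and "write" the register. */
static void fake_cr4_write(unsigned long val)
{
        cr4_shadow = val;
        cr4_write_count++;
}

/* Same shape as cr4_set_bits(): skip the write when every bit in
 * mask is already set in the shadow copy. */
static void sketch_cr4_set_bits(unsigned long mask)
{
        unsigned long cr4 = cr4_shadow;

        if ((cr4 | mask) != cr4)
                fake_cr4_write(cr4 | mask);
}
```

Calling sketch_cr4_set_bits() twice with the same mask bumps
cr4_write_count only once; the second call falls through the
comparison without touching the register.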

> 
> Hmm.  load_mm_cr4() should bypass itself when mm == &init_mm.  Want to
> fix that part or should I?
> 
Looks like there might not be anything to do here, after all.

On to the lazy TLB mm_struct refcounting stuff :)