linux-kernel - Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Tue, 17 Jul 2018 13:04:07 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Rik van Riel <riel@...riel.com>
Cc:     LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Andrew Lutomirski <luto@...nel.org>,
        Mike Galbraith <efault@....de>,
        kernel-team <kernel-team@...com>, Ingo Molnar <mingo@...nel.org>,
        Dave Hansen <dave.hansen@...el.com>
Subject: Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

On Mon, Jul 16, 2018 at 12:03 PM, Rik van Riel <riel@...riel.com> wrote:
> Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
> when all it really needs to do is reload %CR3 at the next context switch,
> assuming no page table pages got freed.
>
> Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
> which checks whether .tlb_gen changed, and the TLB invalidation code, which
> increments .tlb_gen whenever page table entries get invalidated.
>
> The atomic increment in inc_mm_tlb_gen is its own barrier; the context
> switch code adds an explicit barrier between reading tlbstate.is_lazy and
> next->context.tlb_gen.
>
> Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
> are not removed from the mm_cpumask(mm), since that would prevent the TLB
> flush IPIs at page table free time from being sent to all the CPUs
> that need them.
>
> This patch reduces total CPU use in the system by about 1-2% for a
> memcache workload on two socket systems, and by about 1% for a heavily
> multi-process netperf between two systems.
>

I'm not 100% certain I'm replying to the right email, and I haven't
gotten the tip-bot notification at all, but:

I think you've introduced a minor-ish performance regression due to
changing the old (admittedly terribly documented) control flow a bit.
Before, if real_prev == next, we would skip:

load_mm_cr4(next);
switch_ldt(real_prev, next);

Now we don't any more.  I think you should reinstate that
optimization.  It's probably as simple as wrapping them in an if
(real_priv != next) with a comment like /* Remote changes that would
require a cr4 or ldt reload will unconditionally send an IPI even to
lazy CPUs.  So, if we aren't changing our mm, we don't need to refresh
cr4 or the ldt */

Hmm.  load_mm_cr4() should bypass itself when mm == &init_mm.  Want to
fix that part or should I?

--Andy