Message-ID: <1527915842.7898.93.camel@surriel.com>
Date:   Sat, 02 Jun 2018 01:04:02 -0400
From:   Rik van Riel <riel@...riel.com>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     Mike Galbraith <efault@....de>,
        LKML <linux-kernel@...r.kernel.org>, songliubraving@...com,
        kernel-team <kernel-team@...com>, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm

On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
> On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@...riel.com> wrote:
> > 
> > On Fri, 1 Jun 2018 14:21:58 -0700
> > Andy Lutomirski <luto@...nel.org> wrote:
> > 
> > > Hmm.  I wonder if there's a more clever data structure than
> > > a bitmap that we could be using here.  Each CPU only ever
> > > needs to be in one mm's cpumask, and each cpu only ever
> > > changes its own state in the bitmask.  And writes are much
> > > less common than reads for most workloads.
> > 
> > It would be easy enough to add an mm_struct pointer to the
> > per-cpu tlbstate struct, and iterate over those.
> > 
> > However, that would be an orthogonal change to optimizing
> > lazy TLB mode.
> > 
> > Does the (untested) patch below make sense as a potential
> > improvement to the lazy TLB heuristic?
> > 
> > ---8<---
> > Subject: x86,tlb: workload dependent per CPU lazy TLB switch
> > 
> > Lazy TLB mode is a tradeoff between flushing the TLB and touching
> > the mm_cpumask(&init_mm) at context switch time, versus potentially
> > incurring a remote TLB flush IPI while in lazy TLB mode.
> > 
> > Whether this pays off is likely to be workload dependent more than
> > anything else. However, the current heuristic keys off hardware
> > type.
> > 
> > This patch changes the lazy TLB mode heuristic to a dynamic,
> > per-CPU decision, dependent on whether we recently received a
> > remote TLB shootdown while in lazy TLB mode.
> > 
> > This is a very simple heuristic. When a CPU receives a remote TLB
> > shootdown IPI while in lazy TLB mode, a counter in the same cache
> > line is set to 16. Every time we skip lazy TLB mode, the counter
> > is decremented.
> > 
> > While the counter is zero (no recent TLB flush IPIs), allow lazy
> > TLB mode.
> 
> Hmm, cute.  That's not a bad idea at all.  It would be nice to get
> some kind of real benchmark on both PCID and !PCID.  If nothing
> else, I would expect the threshold (16 in your patch) to want to
> be lower on PCID systems.

That depends on how well we manage to get rid of
the cpumask manipulation overhead. On the PCID
system where we first found this issue, the atomic
accesses to the mm_cpumask took about 4x as much
CPU time as the TLB invalidation itself.

That kinda limits how much the cheaper TLB
flushes actually help :)

I agree this code should get some testing.
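
To make the heuristic concrete, it boils down to roughly the
(untested, illustrative) sketch below; the struct, field, and
helper names are placeholders for this email, not the ones used
in the actual patch:

#include <linux/percpu.h>
#include <linux/types.h>

#define LAZY_TLB_COOLDOWN	16

/* Illustrative stand-in for the extra counter in the per-cpu tlbstate. */
struct lazy_tlb_sketch {
	unsigned int shootdown_cooldown; /* kept in the same cache line */
};
static DEFINE_PER_CPU(struct lazy_tlb_sketch, lazy_tlb_sketch);

/* Remote TLB shootdown IPI received while this CPU was in lazy TLB mode. */
static void lazy_tlb_note_shootdown(void)
{
	this_cpu_write(lazy_tlb_sketch.shootdown_cooldown, LAZY_TLB_COOLDOWN);
}

/* Context switch time: should this CPU use lazy TLB mode right now? */
static bool lazy_tlb_allowed(void)
{
	unsigned int cooldown = this_cpu_read(lazy_tlb_sketch.shootdown_cooldown);

	/* No recent shootdown IPIs: lazy TLB mode is likely to pay off. */
	if (cooldown == 0)
		return true;

	/* Recently hit by an IPI: skip lazy TLB mode and decay the counter. */
	this_cpu_write(lazy_tlb_sketch.shootdown_cooldown, cooldown - 1);
	return false;
}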

-- 
All Rights Reversed.
