Message-ID: <CALCETrWMouz7VzYgwC=Ys3-C2O7ZVCn4FNwesH1np4c9iG1g_A@mail.gmail.com>
Date: Fri, 1 Jun 2018 14:21:58 -0700
From: Andy Lutomirski <luto@...nel.org>
To: riel@...riel.com
Cc: Andrew Lutomirski <luto@...nel.org>,
Mike Galbraith <efault@....de>,
LKML <linux-kernel@...r.kernel.org>, songliubraving@...com,
kernel-team <kernel-team@...com>, Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
On Fri, Jun 1, 2018 at 1:35 PM Rik van Riel <riel@...riel.com> wrote:
>
> On Fri, 2018-06-01 at 13:03 -0700, Andy Lutomirski wrote:
> > Mike, you never did say: do you have PCID on your CPU? Also, what is
> > your workload doing to cause so many switches back and forth between
> > init_mm and a task?
> >
> > The point of the optimization is that switching to init_mm should be
> > fairly fast on a PCID system, whereas an IPI to do the deferred flush
> > is very expensive regardless of PCID.
>
> While I am sure that bit is true, Song and I
> observed about 4x as much CPU use in the atomic
> operations in cpumask_clear_cpu and cpumask_set_cpu
> (inside switch_mm_irqs_off) as in the %cr3 reload
> itself.
>
> Given how expensive those cpumask updates are,
> lazy TLB mode might always be worth it, especially
> on larger systems.
>
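For reference, the pattern Rik is describing is the pair of locked
read-modify-writes on the shared mm cpumask that every mm switch does.
A rough userspace model of that cost (just a sketch using C11 atomics
and made-up names, not the actual switch_mm_irqs_off code):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Model of an mm's cpumask: one shared word updated with atomic RMWs,
 * standing in for cpumask_set_cpu()/cpumask_clear_cpu() on the mm. */
struct mm_model {
	_Atomic uint64_t cpumask;
};

/* Every switch does two locked RMWs on cache lines shared by all CPUs
 * touching the same mm; that is the contention being measured. */
static void model_switch_mm(struct mm_model *prev, struct mm_model *next,
			    int cpu)
{
	atomic_fetch_and_explicit(&prev->cpumask, ~(UINT64_C(1) << cpu),
				  memory_order_relaxed);
	atomic_fetch_or_explicit(&next->cpumask, UINT64_C(1) << cpu,
				 memory_order_relaxed);
}

int main(void)
{
	struct mm_model init_mm = { .cpumask = 0 };
	struct mm_model task_mm = { .cpumask = 0 };

	/* CPU 3 goes to init_mm (idle/lazy) and then back to the task. */
	model_switch_mm(&task_mm, &init_mm, 3);
	model_switch_mm(&init_mm, &task_mm, 3);

	printf("task_mm cpumask: %#llx\n",
	       (unsigned long long)atomic_load(&task_mm.cpumask));
	return 0;
}
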
Hmm. I wonder if there's a more clever data structure than a bitmap
that we could be using here. Each CPU only ever needs to be in one
mm's cpumask, and each CPU only ever changes its own bit in that
bitmap. And writes are much less common than reads for most
workloads.
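
One possible shape (purely a sketch with invented names, not a proposal
for the actual kernel data structure): give each CPU a per-CPU slot
recording which mm it currently has loaded, so the switch path does a
plain store to its own cache line and the flush/IPI side scans the
slots to build its target mask. Modeled in userspace:

#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 64

/* Stand-in for an mm; only its identity (address) matters here. */
struct mm_model { int dummy; };

/* One slot per CPU, written only by that CPU, so a context switch needs
 * no atomic read-modify-write on a shared cache line. */
static struct mm_model *cpu_loaded_mm[NR_CPUS];

static void model_switch_mm(struct mm_model *next, int cpu)
{
	/* Plain store to this CPU's own slot; the real thing would need
	 * the right memory ordering versus the flush side. */
	cpu_loaded_mm[cpu] = next;
}

/* Build the set of CPUs currently running @mm, as a TLB flush would
 * before sending IPIs. */
static uint64_t model_mm_cpus(struct mm_model *mm)
{
	uint64_t mask = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_loaded_mm[cpu] == mm)
			mask |= UINT64_C(1) << cpu;
	return mask;
}

int main(void)
{
	struct mm_model init_mm, task_mm;

	model_switch_mm(&task_mm, 3);	/* CPU 3 runs the task */
	model_switch_mm(&init_mm, 3);	/* CPU 3 goes lazy/idle */
	model_switch_mm(&task_mm, 5);	/* CPU 5 runs the task */

	printf("CPUs in task_mm: %#llx\n",
	       (unsigned long long)model_mm_cpus(&task_mm));
	return 0;
}

The write side gets cheap, but answering "which CPUs are running this
mm" becomes an O(NR_CPUS) scan, so whether something like this wins
depends on exactly that read/write ratio.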