linux-kernel - Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CALCETrVqHDLo09HcaoeOoAVK8w+cNWkSNTLkDDU=evUhaXkyhQ@mail.gmail.com>
Date:   Fri, 10 Jul 2020 10:04:54 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Nicholas Piggin <npiggin@...il.com>
Cc:     linux-arch <linux-arch@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Arnd Bergmann <arnd@...db.de>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
        Linux-MM <linux-mm@...ck.org>, Anton Blanchard <anton@...abs.org>
Subject: Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@...il.com> wrote:
>
> And get rid of the generic sync_core_before_usermode facility.
>
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
>
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also a exit_lazy_tlb case that is not covered by this call, which
> could be a bugs (kthread use mm the membarrier process's mm then context
> switch back to the process without switching mm or lazy mm switch).
>
> This makes lazy tlb code a bit more modular.

The mm-switching and TLB-management has often had the regrettable
property that it gets wired up in a way that seems to work at the time
but doesn't have clear semantics, and I'm a bit concerned that this
patch is in that category.  If I'm understanding right, you're trying
to enforce the property that exiting lazy TLB mode will promise to
sync the core eventually.  But this has all kinds of odd properties:

 - Why is exit_lazy_tlb() getting called at all in the relevant cases?
 When is it permissible to call it?  I look at your new code and see:

> +/*
> + * Ensure that a core serializing instruction is issued before returning
> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to
> + * user-space through sysexit, sysrel, and sysretq, which are not core
> + * serializing.
> + *
> + * See the membarrier comment in finish_task_switch as to why this is done
> + * in exit_lazy_tlb.
> + */
> +#define exit_lazy_tlb exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> +{
> +       /* Switching mm is serializing with write_cr3 */
> +        if (tsk->mm != mm)
> +                return;

And my brain says WTF?  Surely you meant something like if
(WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened?  how do
we try to recover well enough to get a crashed logged at least? */ }

So this needs actual documentation, preferably in comments near the
function, of what the preconditions are and what this mm parameter is.
Once that's done, then we could consider whether it's appropriate to
have this function promise to sync the core under some conditions.

 - This whole structure seems to rely on the idea that switching mm
syncs something.  I periodically ask chip vendor for non-serializing
mm switches.  Specifically, in my dream world, we have totally
separate user and kernel page tables.  Changing out the user tables
doesn't serialize or even create a fence.  Instead it creates the
minimum required pipeline hazard such that user memory access and
switches to usermode will make sure they access memory through the
correct page tables.  I haven't convinced a chip vendor yet, but there
are quite a few hundreds of cycles to be saved here.  With this in
mind, I see the fencing aspects of the TLB handling code as somewhat
of an accident.  I'm fine with documenting them and using them to
optimize other paths, but I think it should be explicit.  For example:

/* Also does a full barrier?  (Or a sync_core()-style barrier.)
However, if you rely on this, you must document it in a comment where
you call this function. *?
void switch_mm_irqs_off()
{
}

This is kind of like how we strongly encourage anyone using smp_?mb()
to document what they are fencing against.

Also, as it stands, I can easily see in_irq() ceasing to promise to
serialize.  There are older kernels for which it does not promise to
serialize.  And I have plans to make it stop serializing in the
nearish future.

--Andy