linux-kernel - Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <YFzhUjvs4eKiwHtA@google.com>
Date:   Thu, 25 Mar 2021 19:15:30 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Ben Gardon <bgardon@...gle.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding
 during NX zapping

On Tue, Mar 23, 2021, Ben Gardon wrote:
> On Tue, Mar 23, 2021 at 11:58 AM Sean Christopherson <seanjc@...gle.com> wrote:
> >
> > On Tue, Mar 23, 2021, Ben Gardon wrote:
> > > On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@...gle.com> wrote:
> > > >
> > > > On Mon, Mar 22, 2021, Ben Gardon wrote:
> > > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> > > > > yielding. Since we should only need to zap one SPTE, the yield should
> > > > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> > > > > that only one SPTE is zapped we would have to specify the root though.
> > > > > Otherwise we could end up zapping all the entries for the same GFN
> > > > > range under an unrelated root.
> > > >
> > > > Hmm, I originally did exactly that, but changed my mind because this zaps far
> > > > more than 1 SPTE.  This is zapping a SP that could be huge, but is not, which
> > > > means it's guaranteed to have a non-zero number of child SPTEs.  The worst case
> > > > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs.
> > >
> > > It's true that there are potentially 512^2 child sptes, but the code
> > > to clear those after the single PUD spte is cleared doesn't yield
> > > anyway. If the TDP MMU is only  operating with one root (as we would
> > > expect in most cases), there should only be one chance for it to
> > > yield.
> >
> > Ah, right, I was thinking all the iterative flows yielded.  Disallowing
> > kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best
> > fix.  Any objection to me sending v2 with that?
> 
> That sounds good to me.

Ewww.  This analysis isn't 100% accurate.  It's actually impossible for
zap_gfn_range() to yield in this case.  Even though it may walk multiple roots
and levels, "yielded_gfn == next_last_level_gfn" will hold true until the iter
attempts to step sideways.  When the iter attempts to step sideways, it will
always do so at the level that matches the zapping level, and so will always
step past "end".  Thus, tdp_root_for_each_pte() will break without ever
yielding.

That being said, I'm still going to send a patch to explicitly prevent this path
from yielding.  Relying on the above is waaaay too subtle and fragile.

> > > I've considered how we could allow the recursive changed spte handlers
> > > to yield, but it gets complicated quite fast because the caller needs
> > > to know if it yielded and reset the TDP iterator to the root, and
> > > there are some cases (mmu notifiers + vCPU path) where yielding is not
> > > desirable.
> >
> > Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy MMU
> > iterators.
> >
> > > >
> > > > But, I didn't consider the interplay between invalid_list and the TDP MMU
> > > > yielding.  Hrm.