linux-kernel - Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YPCuTiNET/hJHqOY@google.com>
Date:   Thu, 15 Jul 2021 21:53:18 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Brijesh Singh <brijesh.singh@....com>
Cc:     Dave Hansen <dave.hansen@...el.com>, x86@...nel.org,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        linux-efi@...r.kernel.org, platform-driver-x86@...r.kernel.org,
        linux-coco@...ts.linux.dev, linux-mm@...ck.org,
        linux-crypto@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Joerg Roedel <jroedel@...e.de>,
        Tom Lendacky <thomas.lendacky@....com>,
        "H. Peter Anvin" <hpa@...or.com>, Ard Biesheuvel <ardb@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Andy Lutomirski <luto@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Sergio Lopez <slp@...hat.com>, Peter Gonda <pgonda@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        David Rientjes <rientjes@...gle.com>,
        Dov Murik <dovmurik@...ux.ibm.com>,
        Tobin Feldman-Fitzthum <tobin@....com>,
        Borislav Petkov <bp@...en8.de>,
        Michael Roth <michael.roth@....com>,
        Vlastimil Babka <vbabka@...e.cz>, tony.luck@...el.com,
        npmccallum@...hat.com, brijesh.ksingh@...il.com
Subject: Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the
 RMP fault for user address

On Mon, Jul 12, 2021, Brijesh Singh wrote:
> 
> 
> On 7/12/21 11:29 AM, Dave Hansen wrote:
> > On 7/12/21 9:24 AM, Brijesh Singh wrote:
> > > Apologies if I was not clear in the messaging, that's exactly what I
> > > mean that we don't feed RMP entries during the page state change.
> > > 
> > > The sequence of the operation is:
> > > 
> > > 1. Guest issues a VMGEXIT (page state change) to add a page in the RMP
> > > 2. Hyperivosr adds the page in the RMP table.
> > > 
> > > The check will be inside the hypervisor (#2), to query the backing page
> > > type, if the backing page is from the hugetlbfs, then don't add the page
> > > in the RMP, and fail the page state change VMGEXIT.
> > 
> > Right, but *LOOOOOONG* before that, something walked the page tables and
> > stuffed the PFN into the NPT (that's the AMD equivalent of EPT, right?).
> >   You could also avoid this whole mess by refusing to allow hugetblfs to
> > be mapped into the guest in the first place.
> > 
> 
> Ah, that should be doable. For SEV stuff, we require the VMM to register the
> memory region to the hypervisor during the VM creation time. I can check the
> hugetlbfs while registering the memory region and fail much earlier.

That's technically unnecessary, because this patch is working on the wrong set of
page tables when handling faults from KVM.

The host page tables constrain KVM's NPT, but the two are not mirrors of each
other.  Specifically, KVM cannot exceed the size of the host page tables because
that would give the guest access to memory it does not own, but KVM isn't required
to use the same size as the host.  E.g. a 1gb page in the host can be 1gb, 2mb, or
4kb in the NPT.

The code "works" because the size contraints mean it can't get false negatives,
only false positives, false positives will never be fatal, e.g. the fault handler
may unnecessarily demote a 1gb, and demoting a host page will further constrain
KVM's NPT.

The distinction matters because it changes our options.  For RMP violations on
NPT due to page size mismatches, KVM can and should handle the fault without
consulting the primary MMU, i.e. by demoting the NPT entry.  That means KVM does
not need to care about hugetlbfs or any other backing type that cannot be split
since KVM will never initiate a host page split in response to a #NPT RMP violation.

That doesn't mean that hugetlbfs will magically work since e.g. get/put_user()
will fault and fail, but that's a generic non-KVM problem since nothing prevents
remapping and/or accessing the page(s) outside of KVM context.

The other reason to not disallow hugetlbfs and co. is that a guest that's
enlightened to operate at 2mb granularity, e.g. always do page state changes on
2mb chunks, can play nice with hugetlbfs without ever hitting an RMP violation.

Last thought, have we taken care in the guest side of things to work at 2mb
granularity when possible?  AFAICT, PSMASH is effectively a one-way street since
RMPUPDATE to restore a 2mb RMP is destructive, i.e. requires PVALIDATE on the
entire 2mb chunk, and the guest can't safely do that without reinitializing the
whole page, e.g. would either lose data or have to save/init/restore.