lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230828161304.dolvx6bgnxrqilmj@box.shutemov.name>
Date:   Mon, 28 Aug 2023 19:13:04 +0300
From:   "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>
To:     "Michael Kelley (LINUX)" <mikelley@...rosoft.com>
Cc:     KY Srinivasan <kys@...rosoft.com>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>,
        Dexuan Cui <decui@...rosoft.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "bp@...en8.de" <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "luto@...nel.org" <luto@...nel.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "thomas.lendacky@....com" <thomas.lendacky@....com>,
        "sathyanarayanan.kuppuswamy@...ux.intel.com" 
        <sathyanarayanan.kuppuswamy@...ux.intel.com>,
        "seanjc@...gle.com" <seanjc@...gle.com>,
        "rick.p.edgecombe@...el.com" <rick.p.edgecombe@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>
Subject: Re: [RFC PATCH 1/1] x86/mm: Mark CoCo VM pages invalid while moving
 between private and shared

On Mon, Aug 28, 2023 at 02:22:11PM +0000, Michael Kelley (LINUX) wrote:
> From: Michael Kelley (LINUX) <mikelley@...rosoft.com> Sent: Tuesday, August 15, 2023 7:54 PM
> > 
> > From: kirill.shutemov@...ux.intel.com <kirill.shutemov@...ux.intel.com> Sent: Sunday,
> > August 6, 2023 3:20 PM
> > >
> > > On Thu, Jul 06, 2023 at 09:41:59AM -0700, Michael Kelley wrote:
> > > > In a CoCo VM when a page transitions from private to shared, or vice
> > > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > > be notified of the change. Because there are two separate steps, there's
> > > > a window where the settings are inconsistent.  Normally the code that
> > > > initiates the transition (via set_memory_decrypted() or
> > > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > > during a transition, so the window of inconsistency is not a problem.
> > > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > > pages at arbitrary times, which could access a transitioning page during
> > > > the window.  In such a case, CoCo VM specific exceptions are taken
> > > > (depending on the CoCo architecture in use).  Current code in those
> > > > exception handlers recovers and does "fixup" on the result returned by
> > > > load_unaligned_zeropad().  Unfortunately, this exception handling and
> > > > fixup code is tricky and somewhat fragile.  At the moment, it is
> > > > broken for both TDX and SEV-SNP.
> > >
> > 
> > Thanks for looking at this.  I'm finally catching up after being out on
> > vacation for a week.
> > 
> > > I believe it is not fixed for TDX. Is it still a problem for SEV-SNP?
> > 
> > I presume you meant "now fixed for TDX", which I agree with.  Tom
> > Lendacky has indicated that there's still a problem with SEV-SNP.   He
> > could fix that problem, but this approach of marking the pages
> > invalid obviates the need for Tom's fix.
> > 
> > >
> > > > There's also a problem with the current code in paravisor scenarios:
> > > > TDX Partitioning and SEV-SNP in vTOM mode. The exceptions need
> > > > to be forwarded from the paravisor to the Linux guest, but there
> > > > are no architectural specs for how to do that.
> > 
> > The TD Partitioning case (and the similar SEV-SNP vTOM mode case) is
> > what doesn't have a solution.  To elaborate, with TD Partitioning, #VE
> > is sent to the containing VM, not the main Linux guest VM.  For
> > everything except an EPT violation, the containing VM can handle
> > the exception on behalf of the main Linux guest by doing the
> > appropriate emulation.  But for an EPT violation, the containing
> > VM can only terminate the guest.  It doesn't have sufficient context
> > to handle a "valid" #VE with EPT violation generated due to
> > load_unaligned_zeropad().  My proposed scheme of marking the
> > pages invalid avoids generating those #VEs and lets TD Partitioning
> > (and similarly for SEV-SNP vTOM) work as intended with a paravisor.
> > 
> > > >
> > > > To avoid these complexities of the CoCo exception handlers, change
> > > > the core transition code in __set_memory_enc_pgtable() to do the
> > > > following:
> > > >
> > > > 1.  Remove aliasing mappings
> > > > 2.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > > > 3.  Flush the TLB globally
> > > > 4.  Flush the data cache if needed
> > > > 5.  Set/clear the encryption attribute as appropriate
> > > > 6.  Notify the hypervisor of the page status change
> > > > 7.  Add back the PRESENT bit
> > >
> > > Okay, looks safe.
> > >
> > > > With this approach, load_unaligned_zeropad() just takes its normal
> > > > page-fault-based fixup path if it touches a page that is transitioning.
> > > > As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> > > > are completely decoupled.  CoCo VM page transitions can proceed
> > > > without needing to handle architecture-specific exceptions and fix
> > > > things up. This decoupling reduces the complexity due to separate
> > > > TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> > > > introduce new capabilities in future versions of the TDX and SEV-SNP
> > > > architectures. Paravisor scenarios work properly without needing
> > > > to forward exceptions.
> > > >
> > > > This approach may make __set_memory_enc_pgtable() slightly slower
> > > > because of touching the PTEs three times instead of just once. But
> > > > the run time of this function is already dominated by the hypercall
> > > > and the need to flush the TLB at least once and maybe twice. In any
> > > > case, this function is only used for CoCo VM page transitions, and
> > > > is already unsuitable for hot paths.
> > > >
> > > > The architecture specific callback function for notifying the
> > > > hypervisor typically must translate guest kernel virtual addresses
> > > > into guest physical addresses to pass to the hypervisor.  Because
> > > > the PTEs are invalid at the time of callback, the code for doing the
> > > > translation needs updating.  virt_to_phys() or equivalent continues
> > > > to work for direct map addresses.  But vmalloc addresses cannot use
> > > > vmalloc_to_page() because that function requires the leaf PTE to be
> > > > valid. Instead, slow_virt_to_phys() must be used. Both functions
> > > > manually walk the page table hierarchy, so performance is the same.
> > >
> > > Uhmm.. But why do we expected slow_virt_to_phys() to work on non-present
> > > page table entries? It seems accident for me that it works now. Somebody
> > > forgot pte_present() check.
> > 
> > I didn't research the history of slow_virt_to_phys(), but I'll do so.
> > 
> 
> The history of slow_virt_to_phys() doesn't show any discussion of
> whether it is expected to work for non-present PTEs.  However, the
> page table walking is done by lookup_address(), which explicitly
> *does* work for non-present PTEs.  For example, lookup_address()
> is used in cpa_flush() to find a PTE.  cpa_flush() then checks to see if
> the PTE is present.

Which is totally fine thing to do. Present bit is the only info you can
always rely to be valid for non-present PTE.

But it is nitpicking on my side. Here you control lifecycle of the PTE, so
you know that PFN will stay valid for the PTE.

I guess the right thing to do is to use lookup_address() and get pfn from
the PTE with the comment why it is valid.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ