Message-ID: <59bdfee24a9c0f7656f7c83e65789d72ab203edc.camel@intel.com>
Date: Thu, 7 Dec 2023 12:58:53 +0000
From: "Huang, Kai" <kai.huang@...el.com>
To: "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
"mhkelley58@...il.com" <mhkelley58@...il.com>,
"Cui, Dexuan" <decui@...rosoft.com>,
"jpiotrowski@...ux.microsoft.com" <jpiotrowski@...ux.microsoft.com>
CC: "cascardo@...onical.com" <cascardo@...onical.com>,
"tim.gardner@...onical.com" <tim.gardner@...onical.com>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"thomas.lendacky@....com" <thomas.lendacky@....com>,
"roxana.nicolescu@...onical.com" <roxana.nicolescu@...onical.com>,
"stable@...r.kernel.org" <stable@...r.kernel.org>,
"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"kys@...rosoft.com" <kys@...rosoft.com>,
"stefan.bader@...onical.com" <stefan.bader@...onical.com>,
"nik.borisov@...e.com" <nik.borisov@...e.com>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"hpa@...or.com" <hpa@...or.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
"sashal@...nel.org" <sashal@...nel.org>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"bp@...en8.de" <bp@...en8.de>, "x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early
TDX init
>
> > I think we are lacking background on this usage model and how it works. For
> > instance, typically L2 is created by L1, and L1 is responsible for L2's device
> > I/O emulation. I don't quite understand how L0 could emulate L2's device I/O.
> >
> > Can you provide more information?
>
> Let's differentiate between fast and slow I/O. The whole point of the paravisor in
> L1 is to provide device emulation for slow I/O: TPM, RTC, NVRAM, IO-APIC, serial ports.
>
> But fast I/O is designed to bypass it and go straight to L0. Hyper-V uses paravirtual
> vmbus devices for fast I/O (net/block). The vmbus protocol has awareness of page
> visibility built-in and uses native (GHCI on TDX, GHCB on SNP) mechanisms for
> notifications. So once everything is set up (rings/buffers in swiotlb), the I/O for
> fast devices does not involve L1. This is only possible when the VM manages the
> C-bit itself.
Yeah that makes sense. Thanks for the info.
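
(As an aside, for anyone following along: the one-time setup amounts to flipping
the buffer pages to host-visible. Below is a minimal sketch of the idea --
alloc_shared_ring() is illustrative, not the actual vmbus ring-buffer code,
though set_memory_decrypted() is the real kernel primitive:)

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/set_memory.h>

/*
 * Illustrative sketch only: allocate a buffer and make it host-visible
 * once, so subsequent fast I/O through it needs no further conversions
 * and no exits to L1.
 */
static void *alloc_shared_ring(unsigned int order)
{
        struct page *pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);

        if (!pages)
                return NULL;

        /*
         * On TDX this flips the shared bit in the guest page tables and
         * triggers the private -> shared conversion for these pages.
         */
        if (set_memory_decrypted((unsigned long)page_address(pages),
                                 1 << order)) {
                __free_pages(pages, order);
                return NULL;
        }

        return page_address(pages);
}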
>
> I think the same thing could work for virtio if someone were to "enlighten" the
> vring notification calls (instead of using I/O or MMIO instructions).
>
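
(A rough sketch of what such an enlightened notify hook might look like. Nothing
below is existing kernel code; tdx_notify_l0() is purely hypothetical, standing
in for whatever direct guest->L0 doorbell mechanism would be used:)

#include <linux/virtio.h>

/* Hypothetical helper: a direct guest -> L0 doorbell (e.g. a TDVMCALL). */
int tdx_notify_l0(unsigned int queue_index);

/*
 * Hypothetical "enlightened" vring notify callback: instead of an I/O or
 * MMIO write that L1 would have to intercept and forward, notify L0
 * directly.
 */
static bool enlightened_vring_notify(struct virtqueue *vq)
{
        return tdx_notify_l0(vq->index) == 0;
}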
> >
> > >
> > > >
> > > > >
> > > > > What's missing is that the tdx_guest flag is not exposed to userspace in
> > > > > /proc/cpuinfo, and as a result dmesg does not currently display:
> > > > > "Memory Encryption Features active: Intel TDX".
> > > > >
> > > > > That's what I set out to correct.
> > > > >
> > > > > > So far I see that you try to make the kernel think that it runs as a TDX
> > > > > > guest, but not really. This is not a very convincing model.
> > > > > >
> > > > >
> > > > > No, that's not accurate at all. The kernel is running as a TDX guest, so I
> > > > > want the kernel to know that.
> > > > >
> > > >
> > > > But it isn't. It runs on a hypervisor which is a TDX guest, but that doesn't
> > > > make it a TDX guest itself.
> > >
> > > That depends on your definition of "TDX guest". The TDX 1.5 TD partitioning spec
> > > talks of TDX-enlightened L1 VMM, (optionally) TDX-enlightened L2 VM and Unmodified
> > > Legacy L2 VM. Here we're dealing with a TDX-enlightened L2 VM.
> > >
> > > If a guest runs inside an Intel TDX protected TD, is aware of memory encryption and
> > > issues TDVMCALLs - to me that makes it a TDX guest.
> >
> > The thing I don't quite understand is which enlightenment(s) require L2 to
> > issue TDVMCALLs and know the "encryption bit".
> >
> > The reason that I can think of is:
> >
> > If device I/O emulation of L2 is done by L0, then I guess it's reasonable to
> > make L2 aware of the "encryption bit", because L0 can only write emulated data
> > to a shared buffer. The shared buffer must initially be converted by L2 using
> > the MAP_GPA TDVMCALL to L0 (to zap private pages in the S-EPT etc.), and L2
> > needs to know the "encryption bit" to set up its page tables properly. L1 must
> > be aware of such private <-> shared conversions too, to set up its page tables
> > properly, so L1 must also be notified.
>
> Your description is correct, except that L2 uses a hypercall (hv_mark_gpa_visibility())
> to notify L1 and L1 issues the MAP_GPA TDVMCALL to L0.
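
(To spell out the flow described above -- a simplified sketch with error
handling omitted; hv_mark_gpa_visibility() is the real Hyper-V helper, but the
wrapper below is made up for illustration:)

#include <asm/mshyperv.h>

/*
 * Simplified sketch of the conversion flow:
 *
 *   L2: hv_mark_gpa_visibility() -> hypercall to the L1 paravisor
 *   L1: updates its own view, then issues the MAP_GPA TDVMCALL so that
 *       L0 converts the pages (zapping the S-EPT entries).
 */
static int l2_share_pages(const u64 *pfns, u16 count)
{
        /* L2 side: ask the paravisor to make the pages host-visible. */
        return hv_mark_gpa_visibility(count, pfns,
                                      VMBUS_PAGE_VISIBLE_READ_WRITE);
}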
In TDX partitioning, IIUC, L1 and L2 use different secure-EPT page tables for
mapping L1's and L2's GPAs. Therefore, the entries of both secure-EPT tables
that map the "to be converted page" need to be zapped.

I am not entirely sure whether using hv_mark_gpa_visibility() suffices: if the
MAP_GPA came from L1, I am not sure it is easy for L0 to zap the secure-EPT
entries for L2.

But anyway, these are details we probably don't need to consider here.
>
> C-bit awareness is necessary to setup the whole swiotlb pool to be host visible for
> DMA.
Agreed.
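
For reference, the swiotlb side is essentially a one-time flip at boot. A
condensed sketch of the idea (make_swiotlb_pool_shared() is illustrative; the
real kernel does this in swiotlb_update_mem_attributes()):

#include <linux/set_memory.h>
#include <asm/page.h>

/*
 * Condensed sketch: once the C-bit is known, the whole swiotlb pool is
 * flipped to shared in one pass at boot, and all bounce-buffered DMA
 * after that is host-visible without further conversions.
 */
static void make_swiotlb_pool_shared(void *vstart, unsigned long bytes)
{
        unsigned long npages = PAGE_ALIGN(bytes) >> PAGE_SHIFT;

        set_memory_decrypted((unsigned long)vstart, npages);
}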
>
> >
> > The concern I am having is whether there are other usage model(s) that we need
> > to consider. For instance, running both an unmodified L2 and an enlightened L2.
> > Or some L2 that only needs the TDVMCALL enlightenment but no "encryption bit".
> >
>
> Presumably unmodified L2 and enlightened L2 are already covered by the current
> code but require excessive trapping to L1.
>
> I can't see a use case for TDVMCALLs but no "encryption bit".
>
> > In other words, that seems pretty much L1 hypervisor/paravisor implementation
> > specific. I am wondering whether we can hide the enlightenment(s) logic
> > entirely in hypervisor/paravisor-specific code, rather than generically
> > marking L2 as a TDX guest while still needing to disable TDCALL and the like.
>
> That's how it currently works - all the enlightenments are in
> hypervisor/paravisor-specific code in arch/x86/hyperv and drivers/hv, and the
> VM is not marked with X86_FEATURE_TDX_GUEST.
And I believe there's a reason that the VM is not marked as a TDX guest.
>
> But without X86_FEATURE_TDX_GUEST, userspace has no unified way to discover
> that an environment is protected by TDX, and the VM also gets classified as
> "AMD SEV" in dmesg. This is due to CC_ATTR_GUEST_MEM_ENCRYPT being set but
> X86_FEATURE_TDX_GUEST not.
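
(For context, the dmesg classification comes from logic along these lines -- a
simplified paraphrase of the banner code in arch/x86/mm/mem_encrypt.c, not the
exact source:)

#include <linux/printk.h>
#include <asm/cpufeature.h>

/*
 * Simplified paraphrase: without X86_FEATURE_TDX_GUEST the code falls
 * through to the AMD branch, hence the "AMD SEV" classification.
 */
static void print_mem_encrypt_feature_info(void)
{
        pr_info("Memory Encryption Features active:");

        if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
                pr_cont(" Intel TDX\n");
                return;
        }

        /* Reached when only the AMD-style encryption state is set up. */
        pr_cont(" AMD SEV\n");
}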
Can you provide more information about what _userspace_ does here?

What's the difference to userspace between seeing a TDX guest and a normal
non-coco guest in /proc/cpuinfo?

It looks like the whole purpose of this series is to make userspace happy by
advertising the TDX guest in /proc/cpuinfo. But if we do that, we get bad
side effects in the kernel, which is why we need the changes in your patch 2/3.

That doesn't seem very convincing. Is there anything else that userspace could
use instead, e.g., any HV hypervisor/paravisor-specific attributes that are
exposed to userspace?
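
(For completeness, the kind of userspace check in question would be little more
than the below -- scanning /proc/cpuinfo for the "tdx_guest" flag:)

#include <stdio.h>
#include <string.h>

/* Minimal userspace probe: does /proc/cpuinfo advertise "tdx_guest"? */
int main(void)
{
        char line[4096];
        FILE *f = fopen("/proc/cpuinfo", "r");

        if (!f)
                return 2;

        while (fgets(line, sizeof(line), f)) {
                if (strstr(line, "tdx_guest")) {
                        fclose(f);
                        puts("TDX guest");
                        return 0;
                }
        }

        fclose(f);
        puts("not a TDX guest");
        return 1;
}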