Message-ID: <20250514230502.6b64da7f.zhiw@nvidia.com>
Date: Wed, 14 May 2025 23:05:02 +0300
From: Zhi Wang <zhiw@...dia.com>
To: Xu Yilun <yilun.xu@...ux.intel.com>
CC: Jason Gunthorpe <jgg@...dia.com>, Alexey Kardashevskiy <aik@....com>,
	<kvm@...r.kernel.org>, <dri-devel@...ts.freedesktop.org>,
	<linux-media@...r.kernel.org>, <linaro-mm-sig@...ts.linaro.org>,
	<sumit.semwal@...aro.org>, <christian.koenig@....com>, <pbonzini@...hat.com>,
	<seanjc@...gle.com>, <alex.williamson@...hat.com>,
	<vivek.kasireddy@...el.com>, <dan.j.williams@...el.com>,
	<yilun.xu@...el.com>, <linux-coco@...ts.linux.dev>,
	<linux-kernel@...r.kernel.org>, <lukas@...ner.de>, <yan.y.zhao@...el.com>,
	<daniel.vetter@...ll.ch>, <leon@...nel.org>, <baolu.lu@...ux.intel.com>,
	<zhenzhong.duan@...el.com>, <tao1.su@...el.com>
Subject: Re: [RFC PATCH 00/12] Private MMIO support for private assigned dev

On Wed, 14 May 2025 17:47:12 +0800
Xu Yilun <yilun.xu@...ux.intel.com> wrote:

> On Tue, May 13, 2025 at 01:03:15PM +0300, Zhi Wang wrote:
> > On Mon, 12 May 2025 11:06:17 -0300
> > Jason Gunthorpe <jgg@...dia.com> wrote:
> > 
> > > On Mon, May 12, 2025 at 07:30:21PM +1000, Alexey Kardashevskiy
> > > wrote:
> > > 
> > > > > > I'm surprised by this.. iommufd shouldn't be doing PCI
> > > > > > stuff, it is just about managing the translation control of
> > > > > > the device.
> > > > > 
> > > > > I have a little difficulty to understand. Is TSM bind PCI
> > > > > stuff? To me it is. Host sends PCI TDISP messages via PCI DOE
> > > > > to put the device in TDISP LOCKED state, so that device
> > > > > behaves differently from before. Then why put it in IOMMUFD?
> > > > 
> > > > 
> > > > "TSM bind" sets up the CPU side of it, it binds a VM to a piece
> > > > of IOMMU on the host CPU. The device does not know about the
> > > > VM, it just enables/disables encryption by a request from the
> > > > CPU (those start/stop interface commands). And IOMMUFD won't be
> > > > doing DOE, the platform driver (such as AMD CCP) will. Nothing
> > > > to do for VFIO here.
> > > > 
> > > > > We probably should notify VFIO about the state transition but I
> > > > > do not know what VFIO would want to do in response.
> > > 
> > > We have an awkward fit for what CCA people are doing to the
> > > various Linux APIs. Looking somewhat maximally across all the
> > > arches a "bind" for a CC vPCI device creation operation does:
> > > 
> > >  - Setup the CPU page tables for the VM to have access to the MMIO
> > >  - Revoke hypervisor access to the MMIO
> > >  - Setup the vIOMMU to understand the vPCI device
> > > >  - Take over control of some of the IOVA translation, at least for
> > > > T=1, and route to the vIOMMU
> > >  - Register the vPCI with any attestation functions the VM might
> > > use
> > > >  - Do some DOE stuff to manage/validate TDISP/etc
> > > 
> > > So we have interactions of things controlled by PCI, KVM, VFIO,
> > > and iommufd all mushed together.
> > > 
> > > iommufd is the only area that already has a handle to all the
> > > required objects:
> > >  - The physical PCI function
> > >  - The CC vIOMMU object
> > >  - The KVM FD
> > >  - The CC vPCI object
> > > 
> > > Which is why I have been thinking it is the right place to manage
> > > this.
> > > 
> > > It doesn't mean that iommufd is suddenly doing PCI stuff, no, that
> > > stays in VFIO.
> > > 
> > > > > > So your issue is you need to shoot down the dmabuf during
> > > > > > vPCI device destruction?
> > > > > 
> > > > > I assume "vPCI device" refers to assigned device in both
> > > > > > > shared mode & private mode. So no, I need to shoot down the
> > > > > dmabuf during TSM unbind, a.k.a. when assigned device is
> > > > > converting from private to shared. Then recover the dmabuf
> > > > > after TSM unbind. The device could still work in VM in shared
> > > > > mode.
> > > 
> > > What are you trying to protect with this? Is there some intelism
> > > where you can't have references to encrypted MMIO pages?
> > > 
> > 
> > I think it is a matter of design choice. The encrypted MMIO page is
> > related to the TDI context and secure second level translation table
> > > (S-EPT), and the S-EPT is related to the confidential VM's context.
> > 
> > > AMD and ARM have another level of HW control which, together
> > > with a TSM-owned meta table, can simply mask out the access to those
> > > encrypted MMIO pages. Thus, the life cycle of the encrypted
> > > mappings in the second level translation table can be de-coupled
> > > from the TDI unbind. They can be reaped harmlessly later by the
> > > hypervisor in another path.
> > 
> > > The Intel platform, by contrast, doesn't have that additional level
> > > of HW control by design. Thus, the cleanup of the encrypted MMIO page
> > > mapping in the S-EPT has to be coupled tightly with TDI context
> > > destruction in the TDI unbind process.
> 
> Thanks for the accurate explanation. Yes, in TDX, the
> references/mapping to the encrypted MMIO page means a CoCo-VM owns
> > the MMIO page. So TDX firmware won't allow the CC vPCI device (which
> > physically owns the MMIO page) to be unbound/freed from a CoCo-VM while
> > the VM still has the S-EPT mapping.
> 
> AMD doesn't use KVM page table to track CC ownership, so no need to
> interact with KVM.
> 

IMHO, it might be helpful if you could lay out the minimum requirements
(function/life cycle) that you place on the current IOMMUFD TSM bind
architecture:

1. Host tsm_bind (preparation) is in IOMMUFD, triggered by QEMU handling
the TVM-HOST call.
2. TDI acceptance is handled in guest_request(), which accepts the TDI
after the validation in the TVM.

and which parts of the current architecture need to be modified to get
there. Try to fold in vendor-specific knowledge as much as possible, but
still keep it modular in the TSM driver, and let's see how it looks.
Maybe some example TSM driver code to demonstrate it, together with the
VFIO dma-buf patch.

If something in the TSM driver turns out to be extremely hacky, let's see
how it can be lifted to the upper level, or how the upper-level call can
pass more parameters down to it.
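
As a rough illustration of folding a vendor constraint behind a modular
interface, here is a minimal user-space C sketch. It is not kernel code and
not the proposed design: every name in it (struct tdi, struct tsm_ops,
unbind_needs_mmio_unmap, revoke_private_mmio, tsm_unbind, strict_tsm,
relaxed_tsm) is hypothetical. The "strict" ops model the behavior described
above for TDX, where unbind is refused while a private MMIO mapping still
exists; the "relaxed" ops model the decoupled behavior described for AMD and
ARM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model only: none of these names exist in the kernel. */

struct tdi {
	bool bound;          /* TDI is bound to a confidential VM */
	bool private_mapped; /* private MMIO is mapped in the secure 2nd-level table */
};

struct tsm_ops {
	/* Vendor constraint: unbind fails while private MMIO is still mapped. */
	bool unbind_needs_mmio_unmap;
	int (*bind)(struct tdi *tdi);
	int (*unbind)(struct tdi *tdi);
};

static int vendor_bind(struct tdi *tdi)
{
	tdi->bound = true;
	return 0;
}

/* Models firmware that refuses unbind while the VM still maps the MMIO. */
static int strict_unbind(struct tdi *tdi)
{
	if (tdi->private_mapped)
		return -1;
	tdi->bound = false;
	return 0;
}

/* Models a platform where stale mappings can be reaped later. */
static int relaxed_unbind(struct tdi *tdi)
{
	tdi->bound = false;
	return 0;
}

const struct tsm_ops strict_tsm = {
	.unbind_needs_mmio_unmap = true,
	.bind = vendor_bind,
	.unbind = strict_unbind,
};

const struct tsm_ops relaxed_tsm = {
	.unbind_needs_mmio_unmap = false,
	.bind = vendor_bind,
	.unbind = relaxed_unbind,
};

/* Stand-in for the cross-module step that shoots down the private mapping. */
static void revoke_private_mmio(struct tdi *tdi)
{
	tdi->private_mapped = false;
}

/* Generic layer: the vendor constraint is a flag, not an open-coded check. */
int tsm_unbind(const struct tsm_ops *ops, struct tdi *tdi)
{
	if (ops->unbind_needs_mmio_unmap)
		revoke_private_mmio(tdi);
	return ops->unbind(tdi);
}
```

In this sketch the generic caller never tests for a specific vendor. It
only consults the flag the ops structure exposes, which is one possible
way to keep the ordering rule visible at the upper level while the
vendor details stay in the driver.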

Z.

> Thanks,
> Yilun
> 
> > 
> > > If the TDI unbind is triggered in VFIO/IOMMUFD, there has to be a
> > > cross-module notification to KVM to do cleanup in the S-EPT.
> > > 
> > > So shooting down the DMABUF object (encrypted MMIO page) means
> > > shooting down the S-EPT mapping, and recovering the DMABUF object
> > > means re-constructing the non-encrypted MMIO mapping in the EPT after
> > > the TDI is unbound.
> > 
> > Z.
> > 
> > > > > > What I really want is one SW component to manage MMIO dmabuf,
> > > > > > secure iommu & TSM bind/unbind. That makes it easier to
> > > > > > coordinate these 3 operations, because they are interconnected
> > > > > > according to the secure firmware's requirements.
> > > >
> > > > This SW component is QEMU. It knows about FLRs and other config
> > > > space things, it can destroy all these IOMMUFD objects and talk
> > > > to VFIO too, I've tried, so far it is looking easier to manage.
> > > > Thanks,
> > > 
> > > Yes, qemu should be sequencing this. The kernel only needs to
> > > enforce any rules required to keep the system from crashing.
> > > 
> > > Jason
> > > 
> > 
> 

