Message-ID: <688d2f7ac39ce_cff9910024@dwillia2-xfh.jf.intel.com.notmuch>
Date: Fri, 1 Aug 2025 14:19:54 -0700
From: <dan.j.williams@...el.com>
To: Jason Gunthorpe <jgg@...pe.ca>, <dan.j.williams@...el.com>
CC: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@...nel.org>,
<linux-coco@...ts.linux.dev>, <kvmarm@...ts.linux.dev>,
<linux-pci@...r.kernel.org>, <linux-kernel@...r.kernel.org>, <aik@....com>,
<lukas@...ner.de>, Samuel Ortiz <sameo@...osinc.com>, Xu Yilun
<yilun.xu@...ux.intel.com>, Suzuki K Poulose <Suzuki.Poulose@....com>,
"Steven Price" <steven.price@....com>, Catalin Marinas
<catalin.marinas@....com>, "Marc Zyngier" <maz@...nel.org>, Will Deacon
<will@...nel.org>, Oliver Upton <oliver.upton@...ux.dev>
Subject: Re: [RFC PATCH v1 00/38] ARM CCA Device Assignment support
Jason Gunthorpe wrote:
> On Thu, Jul 31, 2025 at 07:07:17PM -0700, dan.j.williams@...el.com wrote:
> > Aneesh Kumar K.V (Arm) wrote:
> > > Host:
> > > step 1.
> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> > > echo vfio-pci > /sys/bus/pci/devices/${DEVICE}/driver_override
> > > echo ${DEVICE} > /sys/bus/pci/drivers_probe
> > >
> > > step 2.
> > > echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/connect
> >
> > Just for my own understanding... presumably there is no ordering
> > constraint for ARM CCA between step1 and step2, right? I.e. The connect
> > state is independent of the bind state.
> >
> > In the v4 PCI/TSM scheme the connect command is now:
> >
> > echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect
>
> What does this do on the host? It seems to somehow prep it for VM
> assignment? Seems pretty strange this is here in sysfs and not part of
> creating the vPCI function in the VM through VFIO and iommufd?
vPCI is out of the picture at this phase.
On the host this establishes an SPDM session and sets up link encryption
(IDE) with the physical device. Leaving VMs out of the picture, this
capability in isolation is a useful property. It addresses a threat model
similar to the one that Intel Total Memory Encryption (TME) or AMD Secure
Memory Encryption (SME) go after, i.e. an interposer on a physical link
capturing data in flight.
With that established, one can then go further to do the full TDISP dance.
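For reference, the host-side flow with the v4 attributes looks roughly like
the below; the tsm0 name is only an assumption for however the platform TSM
driver ends up enumerating:

    # assumption: the platform TSM driver enumerates as /sys/class/tsm/tsm0
    tsm_dev=tsm0
    # establish the SPDM session and set up IDE with the physical device
    echo $tsm_dev > /sys/bus/pci/devices/$DEVICE/tsm/connect
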
> Frankly, I'm nervous about making any uAPI whatsoever for the
> hypervisor side at this point. I don't think we have enough of the
> solution even in draft format. I'd really like your first merged TSM
> series to only have uAPI for the guest side where things are hopefully
> closer to complete.
Aligned. I am not comfortable merging any of this until we have that end
to end reliably stable for a kernel cycle or two. The proposal is to soak all
the vendor solutions together in tsm.git#staging.
Now, if the guest side graduates out of that staging before the host
side, I am ok with that.
> > > step 1:
> > > echo ${DEVICE} > /sys/bus/pci/devices/${DEVICE}/driver/unbind
> > >
> > > step 2: Move the device to TDISP LOCK state
> > > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/lock
> >
> > Ok, so my stance has recently picked up some nuance here. As Jason
> > mentions here:
> >
> > http://lore.kernel.org/20250410235008.GC63245@ziepe.ca
> >
> > "However it works, it should be done before the driver is probed and
> > remain stable for the duration of the driver attachment. From the
> > iommu side the correct iommu domain, on the correct IOMMU instance to
> > handle the expected traffic should be setup as the DMA API's iommu
> > domain."
>
> I think it is not just the dma api, but also the MMIO registers may
> move location (from shared to protected IPA space for
> example). Meaning any attached driver is completely wrecked.
True.
> > I agree with that up until the point where the implication is userspace
> > control of the UNLOCKED->LOCKED transition. That transition requires
> > enabling bus-mastering (BME),
>
> Why? That's sad. BME should be controlled by the VM driver not the
> TSM, and it should be set only when a VM driver is probed to the RUN
> state device?
To me it is an unfortunate PCI specification wrinkle that writing to the
command register drops the device from RUN to ERROR. So you can LOCK
without setting BME, but then no DMA.
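In shell terms the ordering wrinkle looks something like this (illustrative
only; the exact COMMAND value is device/topology dependent, and "lock" is
the attribute as proposed in this RFC):

    # memory decode + bus mastering must be on *before* lock, because a
    # config write to COMMAND afterwards trips the device to ERROR
    setpci -s $DEVICE COMMAND=0x0006
    echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/lock
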
> > and *then* locking the device. That means userspace is blindly
> > hoping that the device is in a state where it will remain quiet on the
> > bus between BME and LOCKED, and that the previous unbind left the device
> > in a state where it is prepared to be locked again.
>
> Yes, but we broadly assume this already in Linux. Drivers assume their
> devices are quiet when they are bound the first time, and we expect that
> on unbinding a driver quiets the device before removing it.
>
> So broadly I think you can assume that a device with no driver is
> quiet regardless of BME.
>
> > 2 potential ways to solve this, but open to other ideas:
> >
> > - Userspace only picks the iommu domain context for the device not the
> > lock state. Something like:
> >
> > private > /sys/bus/pci/devices/${DEVICE}/tsm/domain
> >
> > ...where the default is "shared" and from that point the device can
> > not issue DMA until a driver attaches. Driver controls
> > UNLOCKED->LOCKED->RUN.
>
> What? Gross, no way can we let userspace control such intimate details
> of the kernel. The kernel must auto set based on what T=x mode the
> device driver binds into.
Flummoxed. Any way this gets sliced, userspace is asking for "private
world attach" because it alone knows that this device is acceptable, and
devices need to arrive in "shared world attach" mode.
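Concretely, that first option would look something like the below, where
the "domain" attribute is only the proposal above, not an existing
interface, and the driver does UNLOCKED->LOCKED->RUN from ->probe():

    echo $DEVICE > /sys/bus/pci/devices/$DEVICE/driver/unbind
    # userspace only selects the DMA context, no BME / lock control
    echo private > /sys/bus/pci/devices/$DEVICE/tsm/domain
    # driver probe does the device-specific prep, LOCK, waits for
    # attestation, then RUN
    echo $DEVICE > /sys/bus/pci/drivers_probe
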
> > - Userspace is not involved in this transition and the dma mapping API
> > is updated to allow a driver to switch the iommu domain at runtime,
> > but only if the device has no outstanding mappings and the transition
> > can only happen from ->probe() context. Driver controls joining
> > secure-world-DMA and UNLOCKED->LOCKED->RUN.
>
> I don't see why it is so complicated. The driver is unbound before it
> reaches T=1 so we expect the device to be quiet (bigger problems if
> not). When the PCI core reaches T=1 it tells the DMA API to
> reconfigure things for the unbound struct device. Then we bind a
> driver as normal.
>
> Driver controls nothing. All existing T=0 drivers "just work" with no
> source changes in T=1 mode. DMA API magically hides the bounce
> buffering. Surely this should be the baseline target functionality
> from a Linux perspective?
I started this project with "all existing T=0 drivers 'just work'" as a
goal and a virtue. I have been begrudgingly pulled away from it by the
slow drip of complexity it appears to push into the PCI core.
Now, I suspect the number of devices that are willing to spend gates and
firmware on TDISP capabilities in the near term is small. The "just
works" case is saved for either an L1 VMM to hide all this from an L2
guest, or a simplified TDISP specification that actually allows an OS
PCI core to handle these details in a standard way.
> So we should not have "driver controls" statements at all. Userspace
> prepares the PCI device, driver probes onto a T=1 environment and just
> works.
The concern is that neither userspace nor the PCI core has everything it
needs to get the device to T=1. The PCI core knows that the device is T=1
capable, but does not know how to preconfigure the device-specific lock
state and needs to wait for attestation. Userspace knows how to
attest/verify the device, but really has no business running the device
outside of binding a driver, and cannot rely on the PCI core to have
prepped the device's device-specific lock state.
Userspace might be able to bind a new driver that leaves the device in a
lockable state on unbind, but that is not "just works"; that is,
"introduce a new concept of skinny TDISP setup drivers that leave
devices in the LOCKED state on driver unbind, so that userspace can do the
work to verify the device and move it to RUN before loading the main
driver, which expects the device to arrive already running. Also, that
main driver needs to be careful not to trigger typically benign actions,
like touching the command register, that trip the device into the ERROR
state, or any device-specific actions that trip ERROR but would otherwise
be benign outside of TDISP."
If locking the device was just a toggle it would be possible. As far as
I can see it is a "prep+toggle" where "prep" needs a driver.
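Spelled out, that setup-driver flow would be something like the below, all
hypothetical, with "tdisp_prep" as a stand-in name:

    # hypothetical setup driver: does the device-specific prep and
    # leaves the device LOCKED on unbind
    echo tdisp_prep > /sys/bus/pci/devices/$DEVICE/driver_override
    echo $DEVICE > /sys/bus/pci/drivers_probe
    echo $DEVICE > /sys/bus/pci/devices/$DEVICE/driver/unbind
    # userspace verifies the evidence, then moves the device LOCKED->RUN
    echo 1 > /sys/bus/pci/devices/$DEVICE/tsm/accept
    # finally bind the real driver, which must expect an already-RUN device
    echo > /sys/bus/pci/devices/$DEVICE/driver_override
    echo $DEVICE > /sys/bus/pci/drivers_probe
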
> > > step 3: Moves the device to TDISP RUN state
> > > echo 1 > /sys/bus/pci/devices/${DEVICE}/tsm/accept
> >
> > This has the same concern from me about userspace being in control of
> > BME. It feels like a departure from typical expectations.
>
> It is, it is architecturally broken for BME to be controlled by the
> TSM. BME is controlled by the guest OS driver only.
Agree. That "accept" attribute does not belong with TSM. That is where
Aneesh has it in this RFC. "Accept" as an action is the combination of
device entered the LOCKED state in a configuration the verifier is
willing to accept and the mechanics of triggering the LOCKED->RUN
transition.
> IMHO if this is a real worry (and I don't think it is) then the right
> answer is for physical BME to be set on during locking, but VIRTUAL
> BME is left off. Virtual BME is created by the hypervisor/tsm by
> telling the IOMMU to block DMA.
>
> The Guest OS should not participate in this broken design, the
> hypervisor can set pBME automatically when the lock request comes in,
> and the quality of vBME emulation is left up to the implementation,
> but the implementation must provide at least a NOP vBME once locked.
I can let go of the "BME without driver" worry, but that does nothing to
solve the "device specific configuration required before lock" problem.
> > Now, the nice thing about the scheme as proposed in this set is that
> > userspace has all the time in the world between "lock" and "accept" to
> > talk to a verifier.
>
> Seems right to me. There should be NO trusted kernel driver bound
> until the verifier accepts the attestation. Anything else allows
> unaccepted devices to attack the kernel drivers. Few kernel drivers
> today distrust their HW interfaces as hostile actors and security
> defend against them. Therefore we should be very reluctant to bind
> drivers to anything.
>
> Arguably a CC secure kernel should have an allow list of audited
> secure drivers that can autoprobe and all other drivers must be
> approved by userspace in some way, either through T=1 and attestation
> or some customer-aware risk assumption.
Yes, today, where nothing is T=1 capable for an L1 guest*, the onus is
100% on the distribution, not the kernel. I.e. trim kernel config and
set modprobe policy to prevent unwanted drivers.
* For L2 there are proposals like [1], where if you already trust your
paravisor you also pre-trust all the devices it tells you to trust.
[1]: http://lore.kernel.org/20250714221545.5615-1-romank@linux.microsoft.com
> From that principle the kernel should NOT auto probe drivers to T=0
> devices that can be made T=1. Userspace should handle attaching HW to
> such devices, and userspace can sequence whatever is required,
> including the attestation and verifying.
Agree, for PCI it would be simple to set a no-auto-probe policy for T=1
capable devices.
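With today's knobs that can only be approximated bus-wide; a per-device
policy keyed on TEE capability would be new core code:

    # stop the PCI core from auto-binding drivers
    echo 0 > /sys/bus/pci/drivers_autoprobe
    # ... connect / lock / attest / accept ...
    # only then hand the device to a driver
    echo $DEVICE > /sys/bus/pci/drivers_probe
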
> Otherwise, if you say, have a TDISP capable mlx5 device and boot up
> the cVM in a compromised host the host can probably completely hack
> your cVM by exploiting the mlx5 drivers's total trust in the HW
> interface while running in T=0 mode.
>
> You must attest it and switch to T=1 before binding any driver if you
> care about mitigating this risk.
Yes, userspace must have a chance to say "no" before a driver attempts
to launch DMA to private memory after secrets have been deployed to the
TVM.
> > With the driver in control there would need to be something like a
> > usermodehelper to notify userspace that the device is in the locked
> > state and to go ahead and run the attestation while the driver waits*.
>
> It doesn't make sense to require modification to all existing drivers
> in Linux!
I do not want to burden the PCI core with TDISP compatibility hacks and
workarounds if it turns out only a small handful of devices ever deploy
a first generation TDISP Device Security Manager (DSM). L1 aiding L2, or
TDISP simplicity improvements to allow the PCI core to handle this in a
non-broken way, are what I expect if secure device assignment takes off.
> The starting point must have the core code do this sequence
> for every driver. Once that is working we can talk about if other
> flows are needed.
Do you agree that "device-specific-prep+lock" is the problem to solve?
> > > step 4: Load the driver again.
> > > echo ${DEVICE} > /sys/bus/pci/drivers_probe
> >
> > TIL drivers_probe
> >
> > Maybe want to recommend:
> >
> > echo ${DEVICE} > /sys/bus/pci/drivers/${DRIVER}/bind
> >
> > ...to users just in case there are multiple drivers loaded for the
> > device for the "shared" vs "private" case?
>
> Generic userspace will have a hard time to know what the driver names
> are..
>
> The driver_probe option looks good to me as the default.
>
> I'm not sure how generic code can handle "multiple drivers".. Most
> devices will be able to work just fine with T=0 mode with bounce
> buffers so we should generally not encourage people to make completely
> different drivers for T=0/T=1 mode.
>
> I think what is needed is some way for userspace to trigger the
> "locking configuration" you mentioned, that may need a special driver,
> but ONLY if the userspace is sequencing the device to T=1 mode. Not
> sure how to make that generic, but I think so long as userspace is
> explicitly controlling driver binding we can punt on that solution to
> the userspace project :)
>
> The real nastiness is RAS - what do you do when the device falls out
> of RUN, the kernel driver should pretty much explode. But lots of
> people would like the kernel driver to stay alive and somehow we FLR,
> re-attest and "resume" the kernel driver without allowing any T=0
> risks. For instance you can keep your netdev and just see a lot of
> lost packets while the driver thrashes.
Ideally the RUN->ERROR->UNLOCKED->LOCKED->RUN recovery can fit into the
existing 'struct pci_error_handlers' regime in some farther out future.
It was a "fun" discovery to see that virtual AER injection does not
exist in QEMU (at least last time I checked) and assigned devices that
throw physical AER events just kill the VM.
> But I think we can start with the idea that such RAS failures have to
> reload the driver too and work on improvements. Realistically few
> drivers have the sort of RAS features to consume this anyhow and maybe
> we introduce some "enhanced" driver mode to opt-into down the road.
Hmm, having trouble not reading that back as supporting my argument above:
realistically, few devices support TDISP, so let's require enhanced
drivers to opt into TDISP for the time being.