linux-kernel - Re: [RFC PATCH v1 00/38] ARM CCA Device Assignment support

Message-ID: <20250803222631.GN26511@ziepe.ca>
Date: Sun, 3 Aug 2025 19:26:31 -0300
From: Jason Gunthorpe <jgg@...pe.ca>
To: dan.j.williams@...el.com
Cc: "Aneesh Kumar K.V (Arm)" <aneesh.kumar@...nel.org>,
	linux-coco@...ts.linux.dev, kvmarm@...ts.linux.dev,
	linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
	aik@....com, lukas@...ner.de, Samuel Ortiz <sameo@...osinc.com>,
	Xu Yilun <yilun.xu@...ux.intel.com>,
	Suzuki K Poulose <Suzuki.Poulose@....com>,
	Steven Price <steven.price@....com>,
	Catalin Marinas <catalin.marinas@....com>,
	Marc Zyngier <maz@...nel.org>, Will Deacon <will@...nel.org>,
	Oliver Upton <oliver.upton@...ux.dev>
Subject: Re: [RFC PATCH v1 00/38] ARM CCA Device Assignment support

On Sat, Aug 02, 2025 at 04:50:50PM -0700, dan.j.williams@...el.com wrote:
> > Do you have some examples? I don't really see what complexity there is
> > if the solution is simply to not auto bind any drivers to TDISP capable
> > devices and userspace is responsible to manually bind a driver once it
> > has reached T=1.
> 
> The example I have front of mind (confirmed by 2 vendors) is deferring
> the loading of guest-side device/state security capable firmware to the
> guest driver when the full device is assigned. In that scenario default
> device power-on firmware is capable of link/transport security, enough
> to get the device assigned. Guest needs to get the device/state security
> firmware loaded before TDISP state transitions are possible.

Yeah, those are the only cases I know of too, and IMHO, they are just
early devices. Clearly the clean answer is to put enough boot FW on
the device's flash to get to T=1 mode, then have the trusted OS driver
load the operating firmware from the trusted OS filesystem through the
trusted bootloader T=1 device.

You effectively attest the bootloader, and then if you trust the
bootloader you know that when the device gets to T=1 it can be trusted
to properly run the FW the trusted driver provides.

Think about this more broadly: does the prep FW load idea make sense
for something like SRIOV? No, it really doesn't. The hypervisor-loaded
FW that is running the PF should definitely be strong enough to get to
T=1 on the VM/VF side as well.

The non-SRIOV cases are quite often whole machine assignment
scenarios. But I'm sensing a lot of that space is moving toward bare
metal machines instead of VMs.

I wonder if you can use all the CC machinery to attest and secure a
bare metal host?

> I do think RAS recovery needs it too, but like you say below that should
> come with conditions.

Especially RAS becomes simple because it basically follows the normal
flows that existed prior to TDISP, with the exception of needing some
attestation step.

I don't know a lot about CC attestation, but maybe we can have
userspace provide the kernel with the accepted measurement and then
for RAS the kernel can FLR, remeasure and if the measurement is
exactly the same go back into T=1 automatically as part of the PCI
core FLR logic.
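
To make that flow concrete, here is a rough userspace simulation in C. It
is only a sketch: `struct tdi_sim`, `tdi_set_accepted()`, `tdi_flr()` and
`tdi_ras_recover()` are made-up names for illustration, not existing
kernel APIs, and the "measurement" is just a byte array standing in for
whatever the attestation flow would really produce.

```c
#include <stdbool.h>
#include <string.h>

#define MEAS_LEN 32

/* Hypothetical per-device record; not an existing kernel structure. */
struct tdi_sim {
	bool t1;                                /* true while the device is in T=1 */
	bool have_accepted;                     /* userspace supplied a measurement */
	unsigned char accepted[MEAS_LEN];       /* measurement userspace accepted */
	unsigned char fw_measurement[MEAS_LEN]; /* what the device reports now */
};

/* Userspace hands the kernel the measurement it has attested and accepted. */
void tdi_set_accepted(struct tdi_sim *t, const unsigned char *m)
{
	memcpy(t->accepted, m, MEAS_LEN);
	t->have_accepted = true;
}

/* Simulated FLR: in this model the device always drops out of T=1. */
void tdi_flr(struct tdi_sim *t)
{
	t->t1 = false;
}

/*
 * RAS recovery: FLR, remeasure, and go back to T=1 only if the new
 * measurement is exactly the one userspace accepted earlier.
 * Returns true if the device is back in T=1.
 */
bool tdi_ras_recover(struct tdi_sim *t)
{
	tdi_flr(t);

	if (!t->have_accepted)
		return false;	/* no policy from userspace: stay in T=0 */

	if (memcmp(t->fw_measurement, t->accepted, MEAS_LEN) != 0)
		return false;	/* measurement changed: userspace must re-attest */

	t->t1 = true;
	return true;
}
```

The point of the sketch is that the kernel never decides what is
trustworthy; it only compares against what userspace already accepted,
and any mismatch leaves the device in T=0 for userspace to sort out.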

> I do think userspace can / must deal with it. Let me come back with
> actual patches and a sample test case. I see a potential path to support
> the above "prep" scenario without the mess of TDISP setup drivers, or
> the ugly complexity of driver toggles or a usermodehelper.

I don't see how. Something nasty has to be done in the kernel to allow
an attached driver to switch between T=1 and T=0 "views" of the device
and lockstep those changes with userspace. This is not so simple and
it is really basically exactly the same as driver binding.

I don't think we should be afraid of T=0 prep drivers in these early
days.

Something more complex could come later if it is really warranted and
people really insist on continuing this unclean device design
strategy.

> Yeah, that is the nightmare I had last night. I completed the thought
> exercise about driver toggle and said, "whoops, nope, Jason is right, we
> can't design for that without leaving a permanent mess to cleanup".
> The end goal needs to look like straight line typical driver probe path
> for TDISP capable devices.

Yeah, maybe it is worthwhile to someday try to figure out an
alternative - keep in mind that critically this requires someone to
also come with an in-tree driver that will use all these new APIs and
capabilities!!!

So let's get walking first and then someone can come with some
proposal, complete with a driver implementing it, and it can be
judged. This project is already so big, and I'm pretty sure if you
start to also need entirely new operating modes for drivers the basics
will just get bogged down in that discussion, and very likely killed
anyhow due to a lack of users.

Even if we decide that is preferred it is better to separate it and
discuss it after the basics are merged. At least where I sit getting
basic guest support is a big priority so I strongly want to strip it
down to as minimal as possible to make consistent progress steps.

> True. Although, now I am going back on my PCI core burden concern to
> wonder if *it* should handle a vBME on behalf of the driver if only
> because it may want to force the device out of the RUN state on driver
> unbind to meet typical pci_disable_device() expectations.

Hiding some vBME in the PCI core might make sense if we can't get the
VMM owners to agree to do it on the hypervisor side. It works better
on the VMM side because there is always an IOMMU and the VMM can
emulate BME by blocking DMA with the IOMMU.
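
As a rough illustration of the VMM-side option, the sketch below models a
trapped guest write to the PCI Command register. `struct vdev_sim`,
`vmm_command_write()` and `vdev_dma_allowed()` are hypothetical names, and
the `iommu_dma_blocked` flag is only a stand-in for whatever the VMM would
really do to the IOMMU mapping; the one real constant is the Bus Master
Enable bit (0x4) of the Command register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bus Master Enable bit of the PCI Command register. */
#define PCI_COMMAND_MASTER 0x4

/* Hypothetical VMM view of one assigned device. */
struct vdev_sim {
	uint16_t virt_command;   /* Command register value the guest sees */
	bool iommu_dma_blocked;  /* stand-in for blocking DMA at the IOMMU */
};

/*
 * Guest wrote the virtual Command register. The VMM records the value
 * and emulates BME purely by blocking or unblocking DMA at the IOMMU.
 */
void vmm_command_write(struct vdev_sim *d, uint16_t val)
{
	d->virt_command = val;
	d->iommu_dma_blocked = !(val & PCI_COMMAND_MASTER);
}

/* Would a DMA from the device currently get through? */
bool vdev_dma_allowed(const struct vdev_sim *d)
{
	return !d->iommu_dma_blocked;
}
```

In this model the guest kernel needs no TDISP-specific vBME code at all;
clearing bus mastering simply stops DMA from the guest's point of view.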

But I would not allow/expect kernel device drivers to have anything to
do with the TDISP states. Getting into RUN is fully sequenced by
userspace, getting out of RUN should also be sequenced only by
userspace.

Removing a driver does not change the trust state of the PCI device,
so it shouldn't drop out of RUN. If userspace wishes to FLR the device
after userspace asked to unbind it can, there are already sysfs
controls for this IIRC.
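
A toy model of that split of responsibilities, as a sketch only. All the
names (`struct tdev_sim`, `us_enter_run()`, `us_exit_run()`, `drv_bind()`,
`drv_unbind()`) are invented for illustration and do not correspond to any
existing kernel interface:

```c
#include <stdbool.h>

/* Hypothetical model of one assigned device; not a kernel structure. */
struct tdev_sim {
	bool in_run;        /* trust state, owned by userspace only */
	bool driver_bound;  /* a kernel driver is attached */
	bool dma_active;    /* the driver has DMA in flight */
};

/* Only userspace moves the device into or out of RUN. */
void us_enter_run(struct tdev_sim *d) { d->in_run = true; }
void us_exit_run(struct tdev_sim *d)  { d->in_run = false; }

/* Driver probe: starts DMA, never touches the trust state. */
void drv_bind(struct tdev_sim *d)
{
	d->driver_bound = true;
	d->dma_active = true;
}

/* Driver remove: quiets all DMA, and again leaves in_run alone. */
void drv_unbind(struct tdev_sim *d)
{
	d->dma_active = false;
	d->driver_bound = false;
}
```

After `drv_unbind()` the device is idle but still in RUN; dropping it out
of RUN (or resetting it) is a separate, explicit userspace step.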

Basically, all this says that Linux drivers that want to be used with
T=1 should be well behaved, fully quiet all their DMA on remove, and
have no *functional* need for BME to do anything. We pretty much
already expect this of drivers today, so I don't see an issue with
strongly requiring it for T=1.

Keep in mind the flip side, almost no drivers are structured properly
to forcibly quiet any DMA before pci_enable_device(). Some HW, like
mlx5, can't do this at all without either using DMA to send a reset
command or through FLR.

> > I would be comfortable if hitless RAS recovery for TDISP devices
> > requires some kernel opt-in. But also I'm not sure how this should
> > work from a security perspective. Should userspace also have to
> > re-attest before allowing back to RUN? Clearly this is complicated.
> > 
> > Also, I would be comfortable to support this only for devices that do
> > not require pre-configuration.
> 
> That seems reasonable. You want hitless RAS? Give us hitless init.

Yeah.. Realistically there are few drivers that can even do this
today, mlx5 for example has such code (and it is hard!).

There is a lot of investment required in the driver's core subsystem to
make this work. netdev and RDMA can support a 'rebirth' sort of flow
where the driver can disconnect the SW APIs, FLR the device, then
reconnect in some way. However, for example, I recently had a
discussion with DRM guys about RAS and they are not even doing the
basic locking/etc to be able to do this. :\

Jason