linux-kernel - Re: [PATCH RFC 20/25] PCI/LUO: Avoid write to liveupdate devices at boot

Message-ID: <CAF8kJuPQSHdh_ybGt1N2Tr_keqfGHikXeJj=XMR9H_Xh8SV5tA@mail.gmail.com>
Date: Tue, 29 Jul 2025 21:13:27 -0700
From: Chris Li <chrisl@...nel.org>
To: Jason Gunthorpe <jgg@...pe.ca>
Cc: Thomas Gleixner <tglx@...utronix.de>, Bjorn Helgaas <bhelgaas@...gle.com>, 
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>, "Rafael J. Wysocki" <rafael@...nel.org>, 
	Danilo Krummrich <dakr@...nel.org>, Len Brown <lenb@...nel.org>, linux-kernel@...r.kernel.org, 
	linux-pci@...r.kernel.org, linux-acpi@...r.kernel.org, 
	David Matlack <dmatlack@...gle.com>, Pasha Tatashin <tatashin@...gle.com>, 
	Jason Miu <jasonmiu@...gle.com>, Vipin Sharma <vipinsh@...gle.com>, 
	Saeed Mahameed <saeedm@...dia.com>, Adithya Jayachandran <ajayachandra@...dia.com>, 
	Parav Pandit <parav@...dia.com>, William Tu <witu@...dia.com>, Mike Rapoport <rppt@...nel.org>, 
	Leon Romanovsky <leon@...nel.org>, Samiullah Khawaja <skhawaja@...gle.com>
Subject: Re: [PATCH RFC 20/25] PCI/LUO: Avoid write to liveupdate devices at boot

On Mon, Jul 28, 2025 at 4:50 PM Jason Gunthorpe <jgg@...pe.ca> wrote:
> > Then you sprinkle this stuff into files, which have completely different
> > purposes, without any explanation for the particular instances why they
> > are supposed to be correct and how this works.
>
> Yeah, everything needs to be very carefully explained.

Agreed. I gave some explanation in my last reply to Thomas, and I will
add a document in the next version.

>
> For instance I'm not sure we should be doing *anything* to the
> MSI. Why did you think so?
>
> MSI should be fully cleared by the new kernel and the new VFIO should
> re-establish all the MSI routing from scratch as part of adopting the
> device. We already accept that any interrupts are lost during the
> kexec process so what reason is there to do anything except start up the
> new kernel with a fully disabled MSI and cleared MSI?

The current approach is that we inject a spurious interrupt for the
device, so that the device driver has a chance to process any action
still pending for that interrupt. It is also possible that the driver
has nothing to do, because no interrupt was ever triggered in the
kexec window. We expect the driver to tolerate that spurious
interrupt.

The alternative is to (partially) process the interrupt during kexec,
e.g. remember which IRQs had an interrupt triggered. That would make
things much more complicated. Invoking an interrupt handler in the
early boot stage, before the IOMMU is up, is very tricky.
>
> If otherwise it should be explained why we can't work this way - and
> then explain how the new kernel will adopt the inherited operating MSI
> (hint: I doubt it can) without disrupting it.

Agree.

>
> Same remark for everything. Explain in the commits and perhaps a well
> placed comment why anything needs to be done and why exactly we can't
> use the cold boot flow for each item.

We certainly can do that.

I am trying to see if we can agree on the model for a VFIO_PCI device
used by a VM: we don't want any config space register to change during
the liveupdate kexec (before finish). We could certainly debate which
config space register changes might or might not break things, but it
is going to be very hard to test and verify what breaks once we allow
changes.

If we can draw a line and say there are no config space writes to the
device between freeze and finish, it is much easier to reason from the
device's point of view: the device should continue working. The device
has no way of knowing that the host kernel has changed. It has only a
limited view: its config space and the DMA area it can read and write.
If we preserve enough state, the device should continue working. For
most devices, we can reason with the model that keeping the status quo
will not break things.

There is an obvious exception: a device with a watchdog timer that
must be kicked at regular intervals, where that interval is shorter
than the kexec cycle. That should be pretty rare, and we can deal with
such devices when we actually encounter one.

>
> eg "we can't use the cold boot flow for BAR sizing because BAR sizing
> requires changing the BAR register and that will break ongoing P2P
> DMAs"
>
> "we can't use the cold boot flow for bridge windows because changing
> the bridge windows in any way will break ongoing P2P DMAs" (though you
> also need to explain why the cold boot flow would change the bridge
> windows)
>
> etc etc.

For some config space registers it will be hard to be sure whether
changing them breaks things or not. E.g. the BAR base register: if we
move it to a new memory region, and all follow-up writes to the device
use the new BAR address, should things continue working? There will be
a lot of corner cases like this. It is much easier to just avoid
changing anything, to keep things consistent.

>
> There is also some complication here as the iommu driver technically
> owns some of the PCI state, and we really don't want the PCI Core to
> change it, but we do need the iommu driver to affirm what the in-use
> state should be because it is responsible to clean it up.

Yes, there is overlap between PCI and IOMMU, more than just config
space writes. The IOMMU needs to know which PCI devices participate
and which set of groups it needs to save. CC Samiullah here, he knows
more about the IOMMU side of liveupdate than I do.

> This may actually require some restructuring of the iommu driver/pci
> core interfaces to switch from an enable/disable language to a 'target
> state' language. Ie "ATS shall be on and ATS page size shall be X".
>
Ack.
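
To make the 'target state' idea concrete, here is a minimal userspace
sketch in C. It is not kernel code: struct ats_state, struct ats_dev
and ats_set_target() are hypothetical names made up for illustration,
and the "hardware" is just a struct. The caller states what the final
ATS state shall be, and only the fields that differ are written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of "ATS shall be on, page size shall be X". */
struct ats_state {
	bool enabled;
	uint8_t page_shift;	/* log2 of the ATS page size */
};

/* Simulated device: current state plus a count of simulated writes. */
struct ats_dev {
	struct ats_state hw;
	unsigned int writes;
};

/*
 * Move the device to the target state, touching only what differs.
 * Returns the number of simulated register writes performed.
 */
static unsigned int ats_set_target(struct ats_dev *dev, struct ats_state target)
{
	unsigned int before;

	assert(dev != NULL);
	before = dev->writes;

	if (dev->hw.page_shift != target.page_shift) {
		dev->hw.page_shift = target.page_shift;
		dev->writes++;
	}
	if (dev->hw.enabled != target.enabled) {
		dev->hw.enabled = target.enabled;
		dev->writes++;
	}
	return dev->writes - before;
}
```

With this shape, a device inherited in the desired state sees zero
writes, which an enable/disable call sequence cannot guarantee.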

I have some ideas to make the PCI initialization cleaner for this
usage as well. Instead of directly initializing and turning on
features as they are found, we can do it in 3 stages:
1) Enumerate the PCI capabilities and get the list of capabilities
available, but don't turn them on yet.
2) Determine which capabilities need to be turned on/off. For normal
initialization without liveupdate, the current behavior mostly turns on
whatever can be turned on. For liveupdate devices, the on/off state
would be inherited from what the previous kernel hands off to the new
kernel, by either a) reading the device state (assuming reading the
state is possible and does not change device state) or b) having the
previous kernel save the state into a preserved folio, which the new
kernel then reads.
3) Perform the action to turn capabilities on/off according to the
result from 2). For liveupdate devices the most common case is to skip
the write, so this will be a noop. For normal initialization without
liveupdate, it will turn on the capability.
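
The three stages above can be sketched as a small userspace simulation
in C. This is only an illustration under stated assumptions: the
capability bits, struct sim_dev and all function names are
hypothetical and do not come from the patch series or the PCI core,
and the "preserved folio" is modeled as an optional pointer to a saved
bitmask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits, not the kernel's real definitions. */
enum {
	CAP_MSI   = 1u << 0,
	CAP_ATS   = 1u << 1,
	CAP_PASID = 1u << 2,
};

struct sim_dev {
	uint32_t present;	/* capabilities the device implements */
	uint32_t enabled;	/* capabilities currently on in "hardware" */
	bool liveupdate;	/* handed over by the previous kernel */
	unsigned int writes;	/* simulated config space writes */
};

/* Stage 1: enumerate only. Reads the device, never writes. */
static uint32_t stage1_enumerate(const struct sim_dev *dev)
{
	assert(dev != NULL);
	return dev->present;
}

/* Stage 2: decide the target on/off state without touching the device. */
static uint32_t stage2_decide(const struct sim_dev *dev, uint32_t found,
			      const uint32_t *saved)
{
	if (!dev->liveupdate)
		return found;			/* cold boot: enable what exists */
	if (saved != NULL)
		return *saved & found;		/* b) state saved by old kernel */
	return dev->enabled & found;		/* a) read back device state */
}

/* Stage 3: apply. Writes happen only for bits that actually differ. */
static void stage3_apply(struct sim_dev *dev, uint32_t target)
{
	uint32_t diff = dev->enabled ^ target;
	uint32_t bit;

	for (bit = 1; diff != 0; bit <<= 1) {
		if (diff & bit) {
			dev->enabled ^= bit;
			dev->writes++;
			diff &= ~bit;
		}
	}
}

static void sim_init_caps(struct sim_dev *dev, const uint32_t *saved)
{
	uint32_t found = stage1_enumerate(dev);
	uint32_t target = stage2_decide(dev, found, saved);

	stage3_apply(dev, target);
}
```

In this model a cold-boot device with MSI and ATS present ends up with
both enabled after two writes, while a liveupdate device that was
handed over with only ATS enabled sees no writes at all.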

> This series is very big, so I would probably try to break it up into
> smaller chunks. Like you don't need to preserve bridge windows and
> BARs if you don't support P2P. You don't need to worry about ATS and
> PASID if you don't support those, etc, etc.

Yes, I can break it into smaller chunks.

One of the deliverables of this patch series is that I can test
liveupdate with the pci-lu-stub and pci-lub-stub-pf drivers, with an
additional patch to verify that no PCI config space write has been
performed on the requested PCI device during shutdown and kexec boot
up.
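
As an illustration of that kind of check, here is a minimal userspace
sketch in C of a write guard between freeze and finish. It is not the
verification patch from the series: struct sim_cfg_dev and the sim_*
helpers are hypothetical, and config space is modeled as a plain byte
array. Writes attempted while the device is frozen are refused and
counted, so a test can assert that the count stays at zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_CFG_SIZE 256

/* Simulated device with a config space and a liveupdate freeze flag. */
struct sim_cfg_dev {
	uint8_t cfg[SIM_CFG_SIZE];
	bool frozen;			/* true between freeze and finish */
	unsigned int blocked_writes;	/* writes attempted while frozen */
};

static void sim_freeze(struct sim_cfg_dev *dev)
{
	dev->frozen = true;
}

static void sim_finish(struct sim_cfg_dev *dev)
{
	dev->frozen = false;
}

/* Reads are always allowed; they are assumed not to change state. */
static uint8_t sim_cfg_read8(const struct sim_cfg_dev *dev, size_t off)
{
	assert(dev != NULL && off < SIM_CFG_SIZE);
	return dev->cfg[off];
}

/* Returns 0 on success, -1 if the write was refused while frozen. */
static int sim_cfg_write8(struct sim_cfg_dev *dev, size_t off, uint8_t val)
{
	assert(dev != NULL && off < SIM_CFG_SIZE);
	if (dev->frozen) {
		dev->blocked_writes++;
		return -1;
	}
	dev->cfg[off] = val;
	return 0;
}
```

A test would freeze the device, run the shutdown and boot paths, and
then check that blocked_writes is still zero.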

> Yes, in the end all needs to be supported, but going bit by bit will
> be easier for people to understand. Basic VFIO support with a basic
> IOMMU using basic PCI with no P2P is the simplest thing you can do,
> and I think it needs surprisingly little preservation.

Yes, that is certainly possible ;-)

Because I am working on the PCI side of liveupdate, there are other
developers working on VFIO and IOMMU who depend on my PCI changes.
From the project development point of view the PCI change needs to
happen first, to unblock others. That is how I got here.

I can certainly break it down into smaller chunks.

Chris