linux-kernel - Re: [PATCH RFC 20/25] PCI/LUO: Avoid write to liveupdate devices at boot

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87zfconsaw.ffs@tglx>
Date: Mon, 28 Jul 2025 19:23:03 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Chris Li <chrisl@...nel.org>, Bjorn Helgaas <bhelgaas@...gle.com>, Greg
 Kroah-Hartman <gregkh@...uxfoundation.org>, "Rafael J. Wysocki"
 <rafael@...nel.org>, Danilo Krummrich <dakr@...nel.org>, Len Brown
 <lenb@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
 linux-acpi@...r.kernel.org, David Matlack <dmatlack@...gle.com>, Pasha
 Tatashin <tatashin@...gle.com>, Jason Miu <jasonmiu@...gle.com>, Vipin
 Sharma <vipinsh@...gle.com>, Saeed Mahameed <saeedm@...dia.com>, Adithya
 Jayachandran <ajayachandra@...dia.com>, Parav Pandit <parav@...dia.com>,
 William Tu <witu@...dia.com>, Mike Rapoport <rppt@...nel.org>, Chris Li
 <chrisl@...nel.org>, Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky
 <leon@...nel.org>
Subject: Re: [PATCH RFC 20/25] PCI/LUO: Avoid write to liveupdate devices at
 boot

On Mon, Jul 28 2025 at 01:24, Chris Li wrote:
> The liveupdate devices are already initialized by the kernel before the
> kexec. During the kexec the device is still running. Avoid write to the
> liveupdate devices during the new kernel boot up.

This change log is way too meager for this kind of change.

 1) You want to explain in detail how this works.

    "initialized by the kernel before the kexec" is as vague as it gets.

 2) Avoid write ....

    Again this lacks any information how this is supposed to work correctly.

>  drivers/pci/ats.c            |  7 ++--
>  drivers/pci/iov.c            | 58 ++++++++++++++++++------------
>  drivers/pci/msi/msi.c        | 32 ++++++++++++-----
>  drivers/pci/msi/pcidev_msi.c |  4 +--
>  drivers/pci/pci-acpi.c       |  3 ++
>  drivers/pci/pci.c            | 85 +++++++++++++++++++++++++++++---------------
>  drivers/pci/pci.h            |  9 ++++-
>  drivers/pci/pcie/aspm.c      |  7 ++--
>  drivers/pci/pcie/pme.c       | 11 ++++--
>  drivers/pci/probe.c          | 43 +++++++++++++++-------
>  drivers/pci/setup-bus.c      | 10 +++++-

Then you sprinkle this stuff into files, which have completely different
purposes, without any explanation for the particular instances why they
are supposed to be correct and how this works.

I'm just looking at the MSI parts, as I have no expertise with the rest.

> diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> index 6ede55a7c5e652c80b51b10e58f0290eb6556430..7c40fde1ba0f89ad1d72064ac9e80696faeab426 100644
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -113,7 +113,8 @@ static int pci_setup_msi_context(struct pci_dev *dev)
>  
>  void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
>  {
> -	raw_spinlock_t *lock = &to_pci_dev(desc->dev)->msi_lock;
> +	struct pci_dev *pci_dev = to_pci_dev(desc->dev);
> +	raw_spinlock_t *lock = &pci_dev->msi_lock;
>  	unsigned long flags;
>  
>  	if (!desc->pci.msi_attrib.can_mask)
> @@ -122,8 +123,9 @@ void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 set)
>  	raw_spin_lock_irqsave(lock, flags);
>  	desc->pci.msi_mask &= ~clear;
>  	desc->pci.msi_mask |= set;
> -	pci_write_config_dword(msi_desc_to_pci_dev(desc), desc->pci.mask_pos,
> -			       desc->pci.msi_mask);
> +	if (!pci_lu_adopt(pci_dev))
> +		pci_write_config_dword(pci_dev, desc->pci.mask_pos,
> +				       desc->pci.msi_mask);

This results in inconsistent state, which is a bad idea to begin
with. How is cached software state and hardware state going to be
brought in sync at some point?

If you analyzed all places, which actually depend on hardware state and
make decisions based on it, for correctness, then you failed to provide
that analysis. If not, no comment.

>  	raw_spin_unlock_irqrestore(lock, flags);
>  }
>  
> @@ -190,6 +192,9 @@ static inline void pci_write_msg_msi(struct pci_dev *dev, struct msi_desc *desc,
>  	int pos = dev->msi_cap;
>  	u16 msgctl;
>  
> +	if (pci_lu_adopt(dev))
> +		return;
> +
>  	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
>  	msgctl &= ~PCI_MSI_FLAGS_QSIZE;
>  	msgctl |= FIELD_PREP(PCI_MSI_FLAGS_QSIZE, desc->pci.msi_attrib.multiple);
> @@ -214,6 +219,8 @@ static inline void pci_write_msg_msix(struct msi_desc *desc, struct msi_msg *msg
>  
>  	if (desc->pci.msi_attrib.is_virtual)
>  		return;
> +	if (pci_lu_adopt(to_pci_dev(desc->dev)))
> +		return;

So you don't allow the new kernel to write the MSI message, but the
interrupt subsystem has this new message and there are places which
utilize that cached message. How is this supposed to work?

>  	/*
>  	 * The specification mandates that the entry is masked
>  	 * when the message is modified:
> @@ -279,7 +286,8 @@ static void pci_msi_set_enable(struct pci_dev *dev, int enable)
>  	control &= ~PCI_MSI_FLAGS_ENABLE;
>  	if (enable)
>  		control |= PCI_MSI_FLAGS_ENABLE;
> -	pci_write_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, control);
> +	if (!pci_lu_adopt(dev))
> +		pci_write_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, control);

The placement of these conditionals is arbitrary. Some are the begin of
a function, others just block the write. Is that based on some logic or
were the places selected by shabby AI queries?

>  static int msi_setup_msi_desc(struct pci_dev *dev, int nvec,
> @@ -553,6 +561,7 @@ static void pci_msix_clear_and_set_ctrl(struct pci_dev *dev, u16 clear, u16 set)
>  {
>  	u16 ctrl;
>  
> +	BUG_ON(pci_lu_adopt(dev));

Not going to happen. BUG() is only appropriate when there is absolutely
no way to handle a situation. This is as undocumented as everything else
here.

>  	pci_read_config_word(dev, dev->msix_cap + PCI_MSIX_FLAGS, &ctrl);
>  	ctrl &= ~clear;
>  	ctrl |= set;
> @@ -720,8 +729,9 @@ static int msix_capability_init(struct pci_dev *dev, struct msix_entry *entries,
>  	 * registers can be accessed.  Mask all the vectors to prevent
>  	 * interrupts coming in before they're fully set up.
>  	 */
> -	pci_msix_clear_and_set_ctrl(dev, 0, PCI_MSIX_FLAGS_MASKALL |
> -				    PCI_MSIX_FLAGS_ENABLE);
> +	if (!pci_lu_adopt(dev))
> +		pci_msix_clear_and_set_ctrl(dev, 0, PCI_MSIX_FLAGS_MASKALL |
> +					    PCI_MSIX_FLAGS_ENABLE);

And for enhanced annoyance you sprinkle this condition everywhere into
the code and then BUG() when you missed an instance. Because putting it
into the function which is invoked a gazillion of times would be too
obvious, right? That would at least be tasteful, but that's not the
primary problem of all this.

Sprinkling these conditionals all over the place is absolutely
unmaintainable, error prone and burdens everyone with this insanity and
the related hard to chase bugs.

Especially as there is no concept behind this and zero documentation how
any of this should work or even be remotely correct.

Before you start the next hackery, please sit down and write up coherent
explanations:

  What is the general concept of this?

  What is the exact state in which a device is left when the old kernel
  jumps into the new kernel?

  What is the state of the MSI[-X] or legacy PCI interrupts at this
  point?

  Can the device raise interrupts during the transition from the old to
  the new kernel?

  How is the "live" state of the device reflected and restored
  throughout the interrupt subsystem?

  How is the device driver supposed to attach to the same interrupt
  state as before?

  How are the potentially different Linux interrupt numbers mapped to
  the previous state?

Before this materializes and is agreed on, this is not going anywhere.

Thanks,

        tglx