lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250618191530.GA1218109@bhelgaas>
Date: Wed, 18 Jun 2025 14:15:30 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Timothy Pearson <tpearson@...torengineering.com>
Cc: linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	linux-pci <linux-pci@...r.kernel.org>,
	Madhavan Srinivasan <maddy@...ux.ibm.com>,
	Michael Ellerman <mpe@...erman.id.au>,
	christophe leroy <christophe.leroy@...roup.eu>,
	Naveen N Rao <naveen@...nel.org>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Shawn Anastasio <sanastasio@...torengineering.com>
Subject: Re: [PATCH v2 5/6] pci/hotplug/pnv_php: Fix surprise plug detection
 and

On Wed, Jun 18, 2025 at 11:58:23AM -0500, Timothy Pearson wrote:
>  recovery

Same weird subject/commit wrapping.

> The existing PowerNV hotplug code did not handle suprise plug events
> correctly, leading to a complete failure of the hotplug system after
> device removal and a required reboot to detect new devices.

s/suprise/surprise/ (also below)

> This comes down to two issues:
> 1.) When a device is suprise removed, oftentimes the bridge upstream
>     port will cause a PE freeze on the PHB.  If this freeze is not
>     cleared, the MSI interrupts from the bridge hotplug notification
>     logic will not be received by the kernel, stalling all plug events
>     on all slots associated with the PE.

I guess you mean the bridge *downstream* port that leads to the slot?

> 2.) When a device is removed from a slot, regardless of suprise or
>     programmatic removal, the associated PHB/PE ls left frozen.
>     If this freeze is not cleared via a fundamental reset, skiboot
>     is unable to clear the freeze and cannot retrain / rescan the
>     slot.  This also requires a reboot to clear the freeze and redetect
>     the device in the slot.
> 
> Issue the appropriate unfreeze and rescan commands on hotplug events,
> and don't oops on hotplug if pci_bus_to_OF_node() returns NULL.
> 
> Signed-off-by: Timothy Pearson <tpearson@...torengineering.com>
> ---
>  arch/powerpc/kernel/pci-hotplug.c |  3 ++
>  drivers/pci/hotplug/pnv_php.c     | 53 ++++++++++++++++++++++++++++++-
>  2 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c
> index 9ea74973d78d..6f444d0822d8 100644
> --- a/arch/powerpc/kernel/pci-hotplug.c
> +++ b/arch/powerpc/kernel/pci-hotplug.c
> @@ -141,6 +141,9 @@ void pci_hp_add_devices(struct pci_bus *bus)
>  	struct pci_controller *phb;
>  	struct device_node *dn = pci_bus_to_OF_node(bus);
>  
> +	if (!dn)
> +		return;
> +
>  	phb = pci_bus_to_host(bus);
>  
>  	mode = PCI_PROBE_NORMAL;
> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> index bac8af3df41a..0ceb4a2c3c79 100644
> --- a/drivers/pci/hotplug/pnv_php.c
> +++ b/drivers/pci/hotplug/pnv_php.c
> @@ -10,6 +10,7 @@
>  #include <linux/libfdt.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
> +#include <linux/delay.h>
>  #include <linux/pci_hotplug.h>
>  #include <linux/of_fdt.h>
>  
> @@ -474,7 +475,7 @@ static int pnv_php_enable(struct pnv_php_slot *php_slot, bool rescan)
>  	struct hotplug_slot *slot = &php_slot->slot;
>  	uint8_t presence = OPAL_PCI_SLOT_EMPTY;
>  	uint8_t power_status = OPAL_PCI_SLOT_POWER_ON;
> -	int ret;
> +	int ret, i;
>  
>  	/* Check if the slot has been configured */
>  	if (php_slot->state != PNV_PHP_STATE_REGISTERED)
> @@ -532,6 +533,27 @@ static int pnv_php_enable(struct pnv_php_slot *php_slot, bool rescan)
>  
>  	/* Power is off, turn it on and then scan the slot */
>  	ret = pnv_php_set_slot_power_state(slot, OPAL_PCI_SLOT_POWER_ON);
> +	if (ret) {
> +		SLOT_WARN(php_slot, "PCI slot activation failed with error code %d, possible frozen PHB", ret);
> +		SLOT_WARN(php_slot, "Attempting complete PHB reset before retrying slot activation\n");
> +		for (i = 0; i < 3; i++) {
> +			/* Slot activation failed, PHB may be fenced from a prior device failure
> +			 * Use the OPAL fundamental reset call to both try a device reset and clear
> +			 * any potentially active PHB fence / freeze
> +			 */
> +			SLOT_WARN(php_slot, "Try %d...\n", i + 1);
> +			pci_set_pcie_reset_state(php_slot->pdev, pcie_warm_reset);
> +			msleep(250);

What is the source of the 250 value?  Is there a spec you can cite for
this?  Maybe add a #define if it makes sense?

> +			pci_set_pcie_reset_state(php_slot->pdev, pcie_deassert_reset);
> +
> +			ret = pnv_php_set_slot_power_state(slot, OPAL_PCI_SLOT_POWER_ON);

Wrap the comment and non-printk lines to fit in 80 columns like the
rest of the file.  Preserve the messages as-is so grep finds them
easily.

Usual multi-line comment style is:

  /*
   * Text ...
   */

Possibly factor this warn/reset code into a helper function to
unclutter pnv_php_enable()?

> +			if (!ret)
> +				break;
> +		}
> +
> +		if (i >= 3)
> +			SLOT_WARN(php_slot, "Failed to bring slot online, aborting!\n");
> +	}
>  	if (ret)
>  		return ret;
>  
> @@ -841,12 +863,41 @@ static void pnv_php_event_handler(struct work_struct *work)
>  	struct pnv_php_event *event =
>  		container_of(work, struct pnv_php_event, work);
>  	struct pnv_php_slot *php_slot = event->php_slot;
> +	struct pci_dev *pdev = php_slot->pdev;
> +	struct eeh_dev *edev;
> +	struct eeh_pe *pe;
> +	int i, rc;
>  
>  	if (event->added)
>  		pnv_php_enable_slot(&php_slot->slot);
>  	else
>  		pnv_php_disable_slot(&php_slot->slot);
>  
> +	if (!event->added) {
> +		/* When a device is surprise removed from a downstream bridge slot, the upstream bridge port
> +		 * can still end up frozen due to related EEH events, which will in turn block the MSI interrupts
> +		 * for slot hotplug detection.  Detect and thaw any frozen upstream PE after slot deactivation...
> +		 */

Restyle and wrap comment.

s/upstream bridge port/bridge downstream port/ to avoid confusion.

> +		edev = pci_dev_to_eeh_dev(pdev);
> +		pe = edev ? edev->pe : NULL;
> +		rc = eeh_pe_get_state(pe);
> +		if ((rc == -ENODEV) || (rc == -ENOENT)) {
> +			SLOT_WARN(php_slot, "Upstream bridge PE state unknown, hotplug detect may fail\n");
> +		}
> +		else {
> +			if (pe->state & EEH_PE_ISOLATED) {
> +				SLOT_WARN(php_slot, "Upstream bridge PE %02x frozen, thawing...\n", pe->addr);
> +				for (i = 0; i < 3; i++)
> +					if (!eeh_unfreeze_pe(pe))
> +						break;
> +				if (i >= 3)
> +					SLOT_WARN(php_slot, "Unable to thaw PE %02x, hotplug detect will fail!\n", pe->addr);
> +				else
> +					SLOT_WARN(php_slot, "PE %02x thawed successfully\n", pe->addr);
> +			}
> +		}
> +	}

Possibly factor this out, too.  Then pnv_php_event_handler() could
look simpler:

  if (event->added) {
    pnv_php_enable_slot(&php_slot->slot);
  } else {
    pnv_php_disable_slot(&php_slot->slot);
    <new helper to check for surprise removal>
  }

  kfree(event);

>  	kfree(event);
>  }
>  
> -- 
> 2.39.5

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ