linux-kernel - Re: [PATCH 1/1] powerpc/eeh: fix recursive pci_lock_rescan

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <nc4udvo6zlszyo7n5roavkjnfpwa2owqpwkov5bl5frmkvclrd@plttt3brumjp>
Date: Tue, 18 Nov 2025 14:26:18 +0530
From: Mahesh J Salgaonkar <mahesh@...ux.ibm.com>
To: Narayana Murty N <nnmlinux@...ux.ibm.com>
Cc: oohall@...il.com, maddy@...ux.ibm.com, mpe@...erman.id.au,
        npiggin@...il.com, christophe.leroy@...roup.eu, bhelgaas@...gle.com,
        tpearson@...torengineering.com, linuxppc-dev@...ts.ozlabs.org,
        linux-kernel@...r.kernel.org, vaibhav@...ux.ibm.com,
        sbhat@...ux.ibm.com, ganeshgr@...ux.ibm.com
Subject: Re: [PATCH 1/1] powerpc/eeh: fix recursive pci_lock_rescan_remove
 locking in EEH event handling

On 2025-11-05 00:40:52 Wed, Narayana Murty N wrote:
> The recent commit 1010b4c012b0 ("powerpc/eeh: Make EEH driver device
> hotplug safe") restructured the EEH driver to improve synchronization
> with the PCI hotplug layer.
> 
> However, it inadvertently moved pci_lock_rescan_remove() outside its
> intended scope in eeh_handle_normal_event(), leading to broken PCI
> error reporting and improper EEH event triggering. Specifically,
> eeh_handle_normal_event() acquired pci_lock_rescan_remove() before
> calling eeh_pe_bus_get(), but eeh_pe_bus_get() itself attempts to
> acquire the same lock internally, causing nested locking and disrupting
> normal EEH event handling paths.
> 
> This patch adds a boolean parameter do_lock to _eeh_pe_bus_get(),
> with two public wrappers:
>     eeh_pe_bus_get() with locking enabled.
>     eeh_pe_bus_get_nolock() that skips locking.
> 
> Callers that already hold pci_lock_rescan_remove() now use
> eeh_pe_bus_get_nolock() to avoid recursive lock acquisition.
> 
> Additionally, pci_lock_rescan_remove() calls are restored to the correct
> position—after eeh_pe_bus_get() and immediately before iterating affected
> PEs and devices. This ensures EEH-triggered PCI removes occur under proper
> bus rescan locking without recursive lock contention.
> 
[...]
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index ef78ff77cf8f..492fae5e3823 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -812,6 +812,35 @@ static void eeh_clear_slot_attention(struct pci_dev *pdev)
>  	ops->set_attention_status(slot->hotplug, 0);
>  }
>  
> +static const char *eeh_pe_get_loc(struct eeh_pe *pe)
> +{

So it is duplicate of eeh_pe_loc_get() with nolock variant. Instead, can
we split original function eeh_pe_loc_get() ? Move the while (bus) loop
into another function with name eeh_bus_loc_get(bus) which will be
nolock variant and use that here ?

> +	struct pci_bus *bus = eeh_pe_bus_get_nolock(pe);
> +	struct device_node *dn;
> +	const char *location = NULL;
> +
> +	while (bus) {
> +		dn = pci_bus_to_OF_node(bus);
> +		if (!dn) {
> +			bus = bus->parent;
> +			continue;
> +		}
> +
> +		if (pci_is_root_bus(bus))
> +			location = of_get_property(dn, "ibm,io-base-loc-code",
> +						   NULL);
> +		else
> +			location = of_get_property(dn, "ibm,slot-location-code",
> +						   NULL);
> +
> +		if (location)
> +			return location;
> +
> +		bus = bus->parent;
> +	}
> +
> +	return "N/A";
> +}
> +
>  /**
>   * eeh_handle_normal_event - Handle EEH events on a specific PE
>   * @pe: EEH PE - which should not be used after we return, as it may
> @@ -846,7 +875,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  
>  	pci_lock_rescan_remove();
>  
> -	bus = eeh_pe_bus_get(pe);
> +	bus = eeh_pe_bus_get_nolock(pe);
>  	if (!bus) {
>  		pr_err("%s: Cannot find PCI bus for PHB#%x-PE#%x\n",
>  			__func__, pe->phb->global_number, pe->addr);
> @@ -886,14 +915,14 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  	/* Log the event */
>  	if (pe->type & EEH_PE_PHB) {
>  		pr_err("EEH: Recovering PHB#%x, location: %s\n",
> -			pe->phb->global_number, eeh_pe_loc_get(pe));
> +			pe->phb->global_number, eeh_pe_get_loc(pe));
>  	} else {
>  		struct eeh_pe *phb_pe = eeh_phb_pe_get(pe->phb);
>  
>  		pr_err("EEH: Recovering PHB#%x-PE#%x\n",
>  		       pe->phb->global_number, pe->addr);
>  		pr_err("EEH: PE location: %s, PHB location: %s\n",
> -		       eeh_pe_loc_get(pe), eeh_pe_loc_get(phb_pe));
> +		       eeh_pe_get_loc(pe), eeh_pe_get_loc(phb_pe));
>  	}
>  
Thanks,
-Mahesh.