[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1a685858-833a-4ccf-93b2-d878eee25722@huawei.com>
Date: Fri, 24 Oct 2025 15:21:00 +0800
From: Jijie Shao <shaojijie@...wei.com>
To: Jacob Keller <jacob.e.keller@...el.com>, <davem@...emloft.net>,
	<edumazet@...gle.com>, <kuba@...nel.org>, <pabeni@...hat.com>,
	<andrew+netdev@...n.ch>, <horms@...nel.org>
CC: <shaojijie@...wei.com>, <shenjian15@...wei.com>, <liuyonglong@...wei.com>,
	<chenhao418@...wei.com>, <lantao5@...wei.com>,
	<huangdonghua3@...artners.com>, <yangshuaisong@...artners.com>,
	<jonathan.cameron@...wei.com>, <salil.mehta@...wei.com>,
	<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net 3/3] net: hibmcge: fix the inappropriate
 netif_device_detach()
on 2025/10/24 9:05, Jacob Keller wrote:
>
> On 10/21/2025 7:00 AM, Jijie Shao wrote:
>> current, driver will call netif_device_detach() in
>> pci_error_handlers.error_detected() and do reset in
>> pci_error_handlers.slot_reset().
>> However, if pci_error_handlers.slot_reset() is not called
>> after pci_error_handlers.error_detected(),
>> driver will be detached and unable to recover.
>>
>> drivers/pci/pcie/err.c/report_error_detected() says:
>>    If any device in the subtree does not have an error_detected
>>    callback, PCI_ERS_RESULT_NO_AER_DRIVER prevents subsequent
>>    error callbacks of any device in the subtree, and will
>>    exit in the disconnected error state.
>>
>> Therefore, when the hibmcge device and other devices that do not
>> support the error_detected callback are under the same subtree,
>> hibmcge will be unable to do slot_reset.
>>
> Hmm.
>
> In the example case, the slot_reset never happens, but the PCI device is
> still in an error state, which means that the device is not functional..
>
> In that case detaching the netdev and remaining detached seems like an
> expected outcome?
>
> I guess I don't fully understand the setup in this scenario.
We have encountered some non-fatal errors, such as the SMMU event 0x10,
which triggered the PCIe RAS and caused the network port to become unusable.
the event does not significantly affect the normal use of the network port,
so it is unreasonable to say that the network port cannot be used.
We do our best to ensure the normal use of the network ports;
otherwise, unless there is a serial port,
it will no longer be possible to connect to the board.
To prevent such issues, I move netif_device_detach() from error_detected() to slot_reset().
Either do netif_device_detach() and error_detected(), otherwise do nothing.
>
>> This path move netif_device_detach from error_detected to slot_reset,
>> ensuring that detach and reset are always executed together.
>>
>> Fixes: fd394a334b1c ("net: hibmcge: Add support for abnormal irq handling feature")
>> Signed-off-by: Jijie Shao <shaojijie@...wei.com>
>> ---
>>   drivers/net/ethernet/hisilicon/hibmcge/hbg_err.c | 10 ++++++----
>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/hisilicon/hibmcge/hbg_err.c b/drivers/net/ethernet/hisilicon/hibmcge/hbg_err.c
>> index 83cf75bf7a17..e11495b7ee98 100644
>> --- a/drivers/net/ethernet/hisilicon/hibmcge/hbg_err.c
>> +++ b/drivers/net/ethernet/hisilicon/hibmcge/hbg_err.c
>> @@ -136,12 +136,11 @@ static pci_ers_result_t hbg_pci_err_detected(struct pci_dev *pdev,
>>   {
>>   	struct net_device *netdev = pci_get_drvdata(pdev);
>>   
>> -	netif_device_detach(netdev);
>> -
>> -	if (state == pci_channel_io_perm_failure)
>> +	if (state == pci_channel_io_perm_failure) {
>> +		netif_device_detach(netdev);
>>   		return PCI_ERS_RESULT_DISCONNECT;
>> +	}
>>   
>> -	pci_disable_device(pdev);
>>   	return PCI_ERS_RESULT_NEED_RESET;
>>   }
>>   
>> @@ -150,6 +149,9 @@ static pci_ers_result_t hbg_pci_err_slot_reset(struct pci_dev *pdev)
>>   	struct net_device *netdev = pci_get_drvdata(pdev);
>>   	struct hbg_priv *priv = netdev_priv(netdev);
>>   
>> +	netif_device_detach(netdev);
>> +	pci_disable_device(pdev);
>> +
>>   	if (pci_enable_device(pdev)) {
>>   		dev_err(&pdev->dev,
>>   			"failed to re-enable PCI device after reset\n");
> Here, we disable the device only to immediately attempt to re-enable it?
Yes, the driver attempts to re-enable the PCIe device.
Thanks,
Jijie Shao
Powered by blists - more mailing lists
 
