netdev - Re: [PATCH net-next] net: phy: avoid kernel warning dump when stopping an errored PHY

Message-ID: <8e7e02d8-2b2a-8619-e607-fbac50706252@huawei.com>
Date: Mon, 4 Sep 2023 17:50:32 +0800
From: Jijie Shao <shaojijie@...wei.com>
To: <f.fainelli@...il.com>, Andrew Lunn <andrew@...n.ch>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <hkallweit1@...il.com>,
	<kuba@...nel.org>, <netdev@...r.kernel.org>, <pabeni@...hat.com>,
	<rmk+kernel@...linux.org.uk>, "shenjian15@...wei.com"
	<shenjian15@...wei.com>, "liuyonglong@...wei.com" <liuyonglong@...wei.com>,
	<wangjie125@...wei.com>, <chenhao418@...wei.com>, Hao Lan
	<lanhao@...wei.com>, <shaojijie@...wei.com>, "wangpeiyang1@...wei.com"
	<wangpeiyang1@...wei.com>
Subject: Re: [PATCH net-next] net: phy: avoid kernel warning dump when
 stopping an errored PHY

Hi all,
We recently hit an issue when resetting our netdevice, and it seems to be
related to this patch.

In our reset flow we stop the PHY first and call phy_start() later.
phy_check_link_status() returns an error because the MDIO read fails: the
cmdq is unusable while the reset is in progress, so we cannot access MDIO.

The sequence and logs are shown below:
Process:
reset process       |    phy_state_machine           |  phy_state
==========================================================================
                     | mutex_lock(&phydev->lock);     | PHY_RUNNING
                     | ...                            |
                     | case PHY_RUNNING:              |
                     | err = phy_check_link_status()  | PHY_RUNNING
                     | ...                            |
                     | mutex_unlock(&phydev->lock)    | PHY_RUNNING
  phy_stop()         |                                |
    ...              |                                |
    mutex_lock()     |                                | PHY_RUNNING
    ...              |                                |
    phydev->state =  |                                |
      PHY_HALTED;    |                                |  PHY_HALTED
    ...              |                                |
    mutex_unlock()   |                                |  PHY_HALTED
                     | phy_error_precise():           |
                     |   mutex_lock(&phydev->lock);   | PHY_HALTED
                     |   phydev->state = PHY_ERROR;   | PHY_ERROR
                     |   mutex_unlock(&phydev->lock); | PHY_ERROR
                     |                                |
  phy_start()        |                                |  PHY_ERROR
   ...               |                                |
Logs:
[ 2622.146721] hns3 0000:35:00.0 eth1: Setting reset type 6
[ 2622.155182] hns3 0000:35:00.0: received reset event, reset type is 6
[ 2622.171641] hns3 0000:35:00.0: global reset requested
[ 2622.181867] hns3 0000:35:00.0: global reset interrupt
[ 2623.351382] ------------[ cut here ]------------
[ 2623.358012] phy_check_link_status+0x0/0xe0: returned: -16
[ 2623.370106] hns3 0000:35:00.0 eth1: net stop
[ 2623.377599] WARNING: CPU: 0 PID: 10 at drivers/net/phy/phy.c:1211 phy_state_machine+0xac/0x2b8
[ 2623.386026] RTL8211F Gigabit Ethernet mii-0000:35:00.0:02: PHY state change RUNNING -> HALTED
                 ...
[ 2623.540165] Call trace:
[ 2623.543034]  phy_state_machine+0xac/0x2b8
[ 2623.548028]  process_one_work+0x1ec/0x478
[ 2623.552732]  worker_thread+0x74/0x448
[ 2623.556855]  kthread+0x120/0x130
[ 2623.560920]  ret_from_fork+0x10/0x20
[ 2623.565355] ---[ end trace 0000000000000000 ]---
[ 2623.577722] RTL8211F Gigabit Ethernet mii-0000:35:00.0:02: PHY state change RUNNING -> ERROR
[ 2623.590490] hns3 0000:35:00.0 eth1: link down
[ 2623.707230] hns3 0000:35:00.0: prepare wait ok
[ 2624.169139] hns3 0000:35:00.0: The firmware version is 3.10.11.25
[ 2624.501223] hns3 0000:35:00.0: phc initializes ok!
[ 2624.553486] hns3 0000:35:00.0: Reset done, hclge driver initialization finished.
[ 2625.586470] ------------[ cut here ]------------
[ 2625.593882] called from state ERROR
[ 2625.600677] WARNING: CPU: 1 PID: 352 at drivers/net/phy/phy.c:1392 phy_start+0x50/0xc8
                 ...
[ 2625.750077] Call trace:
[ 2625.752799]  phy_start+0x50/0xc8
[ 2625.756974]  hclge_mac_start_phy+0x34/0x50 [hclge]
                 ...
[ 2625.831224] ---[ end trace 0000000000000000 ]---
[ 2625.843790] hns3 0000:35:00.0 eth1: net open

We expected phy_start() to succeed after calling phy_stop(). However, the
PHY state is PHY_ERROR. As the sequence above shows,
phy_check_link_status() is called before phy_stop(), but the final PHY
state is set afterwards, based on the error it returned. Because our
netdevice was reset successfully, the PHY should not be in PHY_ERROR when
we call phy_start(), so we suspect this is a bug.
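
To make the interleaving concrete, here is a small user-space C model of
it. This is only a sketch: everything with the model_ prefix is made up
for illustration and is not a kernel API, the locking is omitted because
the steps are replayed in a fixed order, and the guarded variant is just
one hypothetical way to close the window, not a proposed or upstream fix.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's enum phy_state; only the states
 * that appear in the trace above are modelled. */
enum model_phy_state {
	MODEL_PHY_RUNNING,
	MODEL_PHY_HALTED,
	MODEL_PHY_ERROR,
};

struct model_phy {
	enum model_phy_state state;
};

/* The MDIO read fails while the cmdq is down. The log shows -16,
 * which is -EBUSY on Linux. */
static int model_check_link_status(const struct model_phy *phy)
{
	(void)phy;
	return -EBUSY;
}

/* Reset path: phy_stop() moves the PHY to HALTED. */
static void model_phy_stop(struct model_phy *phy)
{
	phy->state = MODEL_PHY_HALTED;
}

/* Error transition as seen in the trace: the state is overwritten
 * no matter what it currently is. */
static void model_phy_error(struct model_phy *phy)
{
	phy->state = MODEL_PHY_ERROR;
}

/* Hypothetical guard (an assumption for illustration): ignore a stale
 * error if the PHY was halted in the meantime. */
static void model_phy_error_guarded(struct model_phy *phy)
{
	if (phy->state != MODEL_PHY_HALTED)
		phy->state = MODEL_PHY_ERROR;
}

/* Replays the interleaving from the table and returns the state that
 * phy_start() would observe. */
enum model_phy_state model_replay(bool guarded)
{
	struct model_phy phy = { .state = MODEL_PHY_RUNNING };
	int err;

	/* phy_state_machine: lock, check link, unlock */
	err = model_check_link_status(&phy);

	/* reset path: phy_stop() runs in the window after the unlock */
	model_phy_stop(&phy);

	/* phy_state_machine: acts on the error collected before phy_stop() */
	if (err < 0) {
		if (guarded)
			model_phy_error_guarded(&phy);
		else
			model_phy_error(&phy);
	}

	return phy.state;
}
```

Calling model_replay(false) ends in MODEL_PHY_ERROR, which matches what we
see before phy_start(). Calling model_replay(true) leaves the model in
MODEL_PHY_HALTED, the state we expected after phy_stop().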

Additionally, what can we do if the phy is in PHY_ERROR?

Thanks!
Jijie Shao