netdev - Re: [PATCH] phy: added a PHY_BUSY state into phy_state


Message-ID: <cd7591e7-4a4a-b5a6-21af-045acd89923d@gmail.com>
Date:   Mon, 8 Jul 2019 08:03:10 +0200
From:   Heiner Kallweit <hkallweit1@...il.com>
To:     Florian Fainelli <f.fainelli@...il.com>,
        "kwangdo.yi" <kwangdo.yi@...il.com>, netdev@...r.kernel.org,
        Andrew Lunn <andrew@...n.ch>
Subject: Re: [PATCH] phy: added a PHY_BUSY state into phy_state_machine

On 08.07.2019 05:07, Florian Fainelli wrote:
> +Andrew, Heiner (please CC PHY library maintainers).
> 
> On 7/7/2019 3:32 PM, kwangdo.yi wrote:
>> When the MDIO driver polls the PHY state in phy_state_machine(),
>> the read sometimes fails with -ETIMEDOUT and the link is taken down,
>> even though the PHY is still alive and merely missed the polling
>> deadline. Closing the PHY link in this case seems too drastic.
>> Missing the deadline is very rare: it shows up when a stress test
>> runs for tens of hours on multiple target boards (Xilinx Zynq7000
>> with Marvell 88E1512 PHY, Xilinx custom EMAC IP). This patch gives
>> phy_state_machine() another chance when a polling timeout happens.
>> Only two consecutive missed deadlines are treated as a real PHY
>> halt that closes the connection.
> 
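For reference, the behaviour the patch description aims for can be modelled
roughly as below. This is only a userspace sketch: the sim_* names are
hypothetical, PHY_NOLINK and the other states are left out, and a successful
read is assumed to return to the running state.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced state set; not the kernel's enum phy_state. */
enum sim_state { SIM_RUNNING, SIM_BUSY, SIM_HALTED };

/*
 * One pass of the simplified state machine.
 * read_err is the result of the link status read (0 or a negative errno).
 */
static enum sim_state sim_step(enum sim_state old_state, int read_err)
{
	assert(read_err <= 0);

	if (old_state == SIM_HALTED)
		return SIM_HALTED;	/* stays halted in this model */
	if (read_err == 0)
		return SIM_RUNNING;	/* read succeeded */
	if (read_err == -ETIMEDOUT && old_state == SIM_RUNNING)
		return SIM_BUSY;	/* first timeout is tolerated */
	return SIM_HALTED;		/* second timeout in a row, or other error */
}
```

A single timeout moves the model from SIM_RUNNING to SIM_BUSY, and only a
second consecutive timeout ends in SIM_HALTED.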
In addition to what Florian said already, there is at least one
concrete issue, apart from the approach being quite hacky in general.

Let's say we are in interrupt mode and the timeout happens when
reading the PHY status after a link-down interrupt. If the error
is ignored, we miss the transition and phylib will report a wrong
link status.
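
To make the scenario concrete, here is a small userspace model of it. It is
only a sketch under simplifying assumptions: the sim_* names are hypothetical
and none of this is phylib code. The point it shows is that in interrupt mode
nothing re-reads the status later, so a swallowed timeout leaves the reported
link state stale.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified model; not the phylib data structures. */
struct sim_phy {
	bool hw_link;		/* what the PHY hardware really sees */
	bool reported_link;	/* what the upper layer believes */
	int fail_reads;		/* number of upcoming reads that time out */
};

/* Models one status read over MDIO; may time out. */
static int sim_read_link(struct sim_phy *phy, bool *link)
{
	assert(phy && link);

	if (phy->fail_reads > 0) {
		phy->fail_reads--;
		return -ETIMEDOUT;
	}
	*link = phy->hw_link;
	return 0;
}

/*
 * Models the work done after a link-change interrupt. In interrupt mode
 * this runs only when the PHY raises an interrupt, so there is no later
 * poll that would retry a failed read.
 */
static int sim_handle_irq(struct sim_phy *phy, bool ignore_timeout)
{
	bool link;
	int err = sim_read_link(phy, &link);

	if (err == -ETIMEDOUT && ignore_timeout)
		return 0;	/* error swallowed, reported_link left stale */
	if (err)
		return err;	/* error propagated to the caller */

	phy->reported_link = link;
	return 0;
}
```

With hw_link set to false and one failing read queued, calling
sim_handle_irq(&phy, true) returns 0 while reported_link stays true.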

I would also prefer to first find the root cause and try to fix
it, before adding hacks to upper layers that ignore errors.

> How about simply increasing the MDIO polling timeout in the Xilinx EMAC
> driver instead? Or if the PHY is where the timeout needs to be
> increased, allow the PHY device drivers to advertise min/max timeouts
> such that the MDIO bus layer can use that information?
> 
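A rough userspace sketch of what Florian suggests: the MDIO layer waits for
the bus to go idle with a default poll budget, and a PHY driver can ask for a
larger one. All names here (sim_mdio, sim_phy_timing, max_polls,
SIM_MDIO_DEFAULT_POLLS) are hypothetical and exist in neither the Xilinx EMAC
driver nor phylib.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical MDIO controller model: idle after `busy_polls` status reads. */
struct sim_mdio {
	unsigned int busy_polls;
};

static bool sim_mdio_idle(struct sim_mdio *bus)
{
	if (bus->busy_polls == 0)
		return true;
	bus->busy_polls--;
	return false;
}

/* Hypothetical per-PHY hint advertised by a PHY driver. */
struct sim_phy_timing {
	unsigned int max_polls;	/* 0 means "use the bus default" */
};

#define SIM_MDIO_DEFAULT_POLLS 100u

/*
 * Wait until the bus is idle. The poll budget is the bus default unless
 * the PHY hint asks for a larger one.
 */
static int sim_mdio_wait_idle(struct sim_mdio *bus,
			      const struct sim_phy_timing *hint)
{
	unsigned int budget = SIM_MDIO_DEFAULT_POLLS;

	assert(bus);

	if (hint && hint->max_polls > budget)
		budget = hint->max_polls;

	while (budget--) {
		if (sim_mdio_idle(bus))
			return 0;
	}
	return -ETIMEDOUT;
}
```

In this model a slow PHY that needs 150 polls times out with the default
budget but succeeds once its hint raises the budget to 200, and no error has
to be ignored in the state machine.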
>>
>>
>> Signed-off-by: kwangdo.yi <kwangdo.yi@...il.com>
>> ---
>>  drivers/net/phy/phy.c | 6 ++++++
>>  include/linux/phy.h   | 1 +
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index e888542..9e8138b 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -919,7 +919,13 @@ void phy_state_machine(struct work_struct *work)
>>  		break;
>>  	case PHY_NOLINK:
>>  	case PHY_RUNNING:
>> +	case PHY_BUSY:
>>  		err = phy_check_link_status(phydev);
>> +		if (err == -ETIMEDOUT && old_state == PHY_RUNNING) {
>> +			phydev->state = PHY_BUSY;
>> +			err = 0;
>> +
>> +		}
>>  		break;
>>  	case PHY_FORCING:
>>  		err = genphy_update_link(phydev);
>> diff --git a/include/linux/phy.h b/include/linux/phy.h
>> index 6424586..4a49401 100644
>> --- a/include/linux/phy.h
>> +++ b/include/linux/phy.h
>> @@ -313,6 +313,7 @@ enum phy_state {
>>  	PHY_RUNNING,
>>  	PHY_NOLINK,
>>  	PHY_FORCING,
>> +	PHY_BUSY,
>>  };
>>  
>>  /**
>>
>