lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <DB8PR04MB6795D52C6FA54D17D43D1E99E6769@DB8PR04MB6795.eurprd04.prod.outlook.com>
Date:   Tue, 6 Apr 2021 10:07:51 +0000
From:   Joakim Zhang <qiangqing.zhang@....com>
To:     Heiner Kallweit <hkallweit1@...il.com>,
        "christian.melki@...ata.com" <christian.melki@...ata.com>,
        "andrew@...n.ch" <andrew@...n.ch>,
        "linux@...linux.org.uk" <linux@...linux.org.uk>,
        "davem@...emloft.net" <davem@...emloft.net>,
        "kuba@...nel.org" <kuba@...nel.org>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        dl-linux-imx <linux-imx@....com>
Subject: RE: [PATCH] net: phy: fix PHY possibly unwork after MDIO bus resume
 back


> -----Original Message-----
> From: Heiner Kallweit <hkallweit1@...il.com>
> Sent: 2021年4月6日 14:29
> To: Joakim Zhang <qiangqing.zhang@....com>; christian.melki@...ata.com;
> andrew@...n.ch; linux@...linux.org.uk; davem@...emloft.net;
> kuba@...nel.org
> Cc: netdev@...r.kernel.org; linux-kernel@...r.kernel.org; dl-linux-imx
> <linux-imx@....com>
> Subject: Re: [PATCH] net: phy: fix PHY possibly unwork after MDIO bus resume
> back
> 
> On 06.04.2021 04:07, Joakim Zhang wrote:
> >
> > Hi Heiner,
> >
> >> -----Original Message-----
> >> From: Heiner Kallweit <hkallweit1@...il.com>
> >> Sent: 2021年4月5日 20:10
> >> To: christian.melki@...ata.com; Joakim Zhang
> >> <qiangqing.zhang@....com>; andrew@...n.ch; linux@...linux.org.uk;
> >> davem@...emloft.net; kuba@...nel.org
> >> Cc: netdev@...r.kernel.org; linux-kernel@...r.kernel.org;
> >> dl-linux-imx <linux-imx@....com>
> >> Subject: Re: [PATCH] net: phy: fix PHY possibly unwork after MDIO bus
> >> resume back
> >>
> >> On 05.04.2021 10:43, Christian Melki wrote:
> >>> On 4/5/21 12:48 AM, Heiner Kallweit wrote:
> >>>> On 04.04.2021 16:09, Heiner Kallweit wrote:
> >>>>> On 04.04.2021 12:07, Joakim Zhang wrote:
> >>>>>> commit 4c0d2e96ba055 ("net: phy: consider that suspend2ram may
> >>>>>> cut off PHY power") invokes phy_init_hw() when MDIO bus resume,
> >>>>>> it will soft reset PHY if PHY driver implements soft_reset callback.
> >>>>>> commit 764d31cacfe4 ("net: phy: micrel: set soft_reset callback
> >>>>>> to genphy_soft_reset for KSZ8081") adds soft_reset for KSZ8081.
> >>>>>> After these two patches, I found i.MX6UL 14x14 EVK which
> >>>>>> connected to KSZ8081RNB doesn't work any more when system
> resume
> >>>>>> back, MAC
> >> driver is fec_main.c.
> >>>>>>
> >>>>>> It's obvious that initializing PHY hardware when MDIO bus resume
> >>>>>> back would introduce some regression when PHY implements
> >>>>>> soft_reset. When I
> >>>>>
> >>>>> Why is this obvious? Please elaborate on why a soft reset should
> >>>>> break something.
> >>>>>
> >>>>>> am debugging, I found PHY works fine if MAC doesn't support
> >>>>>> suspend/resume or phy_stop()/phy_start() doesn't been called
> >>>>>> during suspend/resume. This let me realize, PHY state machine
> >>>>>> phy_state_machine() could do something breaks the PHY.
> >>>>>>
> >>>>>> As we known, MAC resume first and then MDIO bus resume when
> >>>>>> system resume back from suspend. When MAC resume, usually it will
> >>>>>> invoke
> >>>>>> phy_start() where to change PHY state to PHY_UP, then trigger the
> >>>>>> stat> machine to run now. In phy_state_machine(), it will
> >>>>>> start/config auto-nego, then change PHY state to PHY_NOLINK, what
> >>>>>> to next is periodically check PHY link status. When MDIO bus
> >>>>>> resume, it will initialize PHY hardware, including soft_reset,
> >>>>>> what would soft_reset affect seems various from different PHYs.
> >>>>>> For KSZ8081RNB, when it in PHY_NOLINK state and then perform a
> >>>>>> soft reset,
> >> it will never complete auto-nego.
> >>>>>
> >>>>> Why? That would need to be checked in detail. Maybe chip errata
> >>>>> documentation provides a hint.
> >>>>>
> >>>>
> >>>> The KSZ8081 spec says the following about bit BMCR_PDOWN:
> >>>>
> >>>> If software reset (Register 0.15) is used to exit power-down mode
> >>>> (Register 0.11 = 1), two software reset writes (Register 0.15 = 1)
> >>>> are required. The first write clears power-down mode; the second
> >>>> write resets the chip and re-latches the pin strapping pin values.
> >>>>
> >>>> Maybe this causes the issue you see and genphy_soft_reset() isn't
> >>>> appropriate for this PHY. Please re-test with the KSZ8081 soft
> >>>> reset following the spec comment.
> >>>>
> >>>
> >>> Interesting. Never expected that behavior.
> >>> Thanks for catching it. Skimmed through the datasheets/erratas.
> >>> This is what I found (micrel.c):
> >>>
> >>> 10/100:
> >>> 8001 - Unaffected?
> >>> 8021/8031 - Double reset after PDOWN.
> >>> 8041 - Errata. PDOWN broken. Recommended do not use. Unclear if
> >>> reset solves the issue since errata says no error after reset but is
> >>> also claiming that only toggling PDOWN (may) or power will help.
> >>> 8051 - Double reset after PDOWN.
> >>> 8061 - Double reset after PDOWN.
> >>> 8081 - Double reset after PDOWN.
> >>> 8091 - Double reset after PDOWN.
> >>>
> >>> 10/100/1000:
> >>> Nothing in gigabit afaics.
> >>>
> >>> Switches:
> >>> 8862 - Not affected?
> >>> 8863 - Errata. PDOWN broken. Reset will not help. Workaround exists.
> >>> 8864 - Not affected?
> >>> 8873 - Errata. PDOWN broken. Reset will not help. Workaround exists.
> >>> 9477 - Errata. PDOWN broken. Will randomly cause link failure on
> >>> adjacent links. Do not use.
> >>>
> >>> This certainly explains a lot.
> >>>
> >>>>>>
> >>>>>> This patch changes PHY state to PHY_UP when MDIO bus resume back,
> >>>>>> it should be reasonable after PHY hardware re-initialized. Also
> >>>>>> give state machine a chance to start/config auto-nego again.
> >>>>>>
> >>>>>
> >>>>> If the MAC driver calls phy_stop() on suspend, then
> >>>>> phydev->suspended is true and mdio_bus_phy_may_suspend() returns
> >>>>> false. As a consequence
> >>>>> phydev->suspended_by_mdio_bus is false and mdio_bus_phy_resume()
> >>>>> skips the PHY hw initialization.
> >>>>> Please also note that mdio_bus_phy_suspend() calls
> >>>>> phy_stop_machine() that sets the state to PHY_UP.
> >>>>>
> >>>>
> >>>> Forgot that MDIO bus suspend is done before MAC driver suspend.
> >>>> Therefore disregard this part for now.
> >>>>
> >>>>> Having said that the current argumentation isn't convincing. I'm
> >>>>> not aware of such issues on other systems, therefore it's likely
> >>>>> that something is system-dependent.
> >>>>>
> >>>>> Please check the exact call sequence on your system, maybe it
> >>>>> provides a hint.
> >>>>>
> >>>>>> Signed-off-by: Joakim Zhang <qiangqing.zhang@....com>
> >>>>>> ---
> >>>>>>  drivers/net/phy/phy_device.c | 7 +++++++
> >>>>>>  1 file changed, 7 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/net/phy/phy_device.c
> >>>>>> b/drivers/net/phy/phy_device.c index cc38e326405a..312a6f662481
> >>>>>> 100644
> >>>>>> --- a/drivers/net/phy/phy_device.c
> >>>>>> +++ b/drivers/net/phy/phy_device.c
> >>>>>> @@ -306,6 +306,13 @@ static __maybe_unused int
> >> mdio_bus_phy_resume(struct device *dev)
> >>>>>>  	ret = phy_resume(phydev);
> >>>>>>  	if (ret < 0)
> >>>>>>  		return ret;
> >>>>>> +
> >>>>>> +	/* PHY state could be changed to PHY_NOLINK from MAC controller
> >> resume
> >>>>>> +	 * rounte with phy_start(), here change to PHY_UP after
> >> re-initializing
> >>>>>> +	 * PHY hardware, let PHY state machine to start/config
> >>>>>> +auto-nego
> >> again.
> >>>>>> +	 */
> >>>>>> +	phydev->state = PHY_UP;
> >>>>>> +
> >>>>>>  no_resume:
> >>>>>>  	if (phydev->attached_dev && phydev->adjust_link)
> >>>>>>  		phy_start_machine(phydev);
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >> This is a quick draft of the modified soft reset for KSZ8081.
> >> Some tests would be appreciated.
> >>
> >
> > I test this patch at my side, unfortunately, it still can't work.
> >
> > I have not found what soft reset would affect in 8081 spec, is there
> > ang common Description for different PHYs?
> >
> 
> You can check the clause 22 spec: 802.3 22.2.4.1.1
> 
> Apart from that you can check BMCR value and which modes your PHY
> advertises after a soft reset. If the PHY indeed wouldn't restart aneg after a
> soft reset, then it would be the only one with this behavior I know. And I'd
> wonder why there aren't more such reports.

Hi Heiner,

We have reasons to believe it is a weird behavior of KSZ8081. I have another two PHYs at my side, AR8031 and RTL8211FD,
they can work fine if soft_reset implemented.

As commit message described, phy_start() would change PHY state to PHY_UP finally to PHY_NOLINK when MAC resume.
After MDIO bus resume back, it will periodically check link status. I did below code change to dump all PHY registers. 

--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -35,7 +35,7 @@
 #include <net/genetlink.h>
 #include <net/sock.h>

-#define PHY_STATE_TIME HZ
+#define PHY_STATE_TIME (10 * HZ)

 #define PHY_STATE_STR(_state)                  \
        case PHY_##_state:                      \
@@ -738,6 +738,28 @@ static int phy_check_link_status(struct phy_device *phydev)
        if (err)
                return err;

+       printk("offset 0x00 data = %0x\n", phy_read(phydev, 0x00));
+       printk("offset 0x01 data = %0x\n", phy_read(phydev, 0x01));
+       printk("offset 0x02 data = %0x\n", phy_read(phydev, 0x02));
+       printk("offset 0x03 data = %0x\n", phy_read(phydev, 0x03));
+       printk("offset 0x04 data = %0x\n", phy_read(phydev, 0x04));
+       printk("offset 0x05 data = %0x\n", phy_read(phydev, 0x05));
+       printk("offset 0x06 data = %0x\n", phy_read(phydev, 0x06));
+       printk("offset 0x07 data = %0x\n", phy_read(phydev, 0x07));
+       printk("offset 0x08 data = %0x\n", phy_read(phydev, 0x08));
+       printk("offset 0x09 data = %0x\n", phy_read(phydev, 0x09));
+       printk("offset 0x10 data = %0x\n", phy_read(phydev, 0x10));
+       printk("offset 0x11 data = %0x\n", phy_read(phydev, 0x11));
+       printk("offset 0x15 data = %0x\n", phy_read(phydev, 0x15));
+       printk("offset 0x16 data = %0x\n", phy_read(phydev, 0x16));
+       printk("offset 0x17 data = %0x\n", phy_read(phydev, 0x17));
+       printk("offset 0x18 data = %0x\n", phy_read(phydev, 0x18));
+       printk("offset 0x1b data = %0x\n", phy_read(phydev, 0x1b));
+       printk("offset 0x1c data = %0x\n", phy_read(phydev, 0x1c));
+       printk("offset 0x1d data = %0x\n", phy_read(phydev, 0x1d));
+       printk("offset 0x1e data = %0x\n", phy_read(phydev, 0x1e));
+       printk("offset 0x1f data = %0x\n", phy_read(phydev, 0x1f));
+       printk("\n\n");
        if (phydev->link && phydev->state != PHY_RUNNING) {
                phy_check_downshift(phydev);
                phydev->state = PHY_RUNNING;

After MDIO bus resume back, results as below:

[  119.545383] offset 0x00 data = 1100
[  119.549237] offset 0x01 data = 7849
[  119.563125] offset 0x02 data = 22
[  119.566810] offset 0x03 data = 1560
[  119.570597] offset 0x04 data = 85e1
[  119.588016] offset 0x05 data = 0
[  119.592891] offset 0x06 data = 4
[  119.596452] offset 0x07 data = 2001
[  119.600233] offset 0x08 data = 0
[  119.617864] offset 0x09 data = 0
[  119.625990] offset 0x10 data = 0
[  119.629576] offset 0x11 data = 0
[  119.642735] offset 0x15 data = 0
[  119.646332] offset 0x16 data = 202
[  119.650030] offset 0x17 data = 5c02
[  119.668054] offset 0x18 data = 801
[  119.672997] offset 0x1b data = 0
[  119.676565] offset 0x1c data = 0
[  119.680084] offset 0x1d data = 0
[  119.698031] offset 0x1e data = 20
[  119.706246] offset 0x1f data = 8190
[  119.709984]
[  119.709984]
[  120.182120] offset 0x00 data = 100
[  120.185894] offset 0x01 data = 784d
[  120.189681] offset 0x02 data = 22
[  120.206319] offset 0x03 data = 1560
[  120.210171] offset 0x04 data = 8061
[  120.225353] offset 0x05 data = 0
[  120.228948] offset 0x06 data = 4
[  120.242936] offset 0x07 data = 2001
[  120.246792] offset 0x08 data = 0
[  120.250313] offset 0x09 data = 0
[  120.266748] offset 0x10 data = 0
[  120.270335] offset 0x11 data = 0
[  120.284167] offset 0x15 data = 0
[  120.287760] offset 0x16 data = 202
[  120.301031] offset 0x17 data = 3c02
[  120.313209] offset 0x18 data = 801
[  120.316983] offset 0x1b data = 0
[  120.320513] offset 0x1c data = 1
[  120.336589] offset 0x1d data = 0
[  120.340184] offset 0x1e data = 115
[  120.355357] offset 0x1f data = 8190
[  120.359112]
[  120.359112]
[  129.785396] offset 0x00 data = 1100
[  129.789252] offset 0x01 data = 7849
[  129.809951] offset 0x02 data = 22
[  129.815018] offset 0x03 data = 1560
[  129.818845] offset 0x04 data = 85e1
[  129.835808] offset 0x05 data = 0
[  129.839398] offset 0x06 data = 4
[  129.854514] offset 0x07 data = 2001
[  129.858371] offset 0x08 data = 0
[  129.873357] offset 0x09 data = 0
[  129.876954] offset 0x10 data = 0
[  129.880472] offset 0x11 data = 0
[  129.896450] offset 0x15 data = 0
[  129.900038] offset 0x16 data = 202
[  129.915041] offset 0x17 data = 5c02
[  129.918889] offset 0x18 data = 801
[  129.932735] offset 0x1b data = 0
[  129.946830] offset 0x1c data = 0
[  129.950424] offset 0x1d data = 0
[  129.964585] offset 0x1e data = 0
[  129.968192] offset 0x1f data = 8190
[  129.972938]
[  129.972938]
[  130.425125] offset 0x00 data = 100
[  130.428889] offset 0x01 data = 784d
[  130.442671] offset 0x02 data = 22
[  130.446360] offset 0x03 data = 1560
[  130.450142] offset 0x04 data = 8061
[  130.467207] offset 0x05 data = 0
[  130.470789] offset 0x06 data = 4
[  130.485071] offset 0x07 data = 2001
[  130.488934] offset 0x08 data = 0
[  130.504320] offset 0x09 data = 0
[  130.507911] offset 0x10 data = 0
[  130.520950] offset 0x11 data = 0
[  130.532865] offset 0x15 data = 0
[  130.536461] offset 0x16 data = 202
[  130.540156] offset 0x17 data = 3c02
[  130.557218] offset 0x18 data = 801
[  130.560987] offset 0x1b data = 0
[  130.575158] offset 0x1c data = 1
[  130.578754] offset 0x1d data = 0
[  130.593543] offset 0x1e data = 115
[  130.597312] offset 0x1f data = 8190

We can see that BMCR and BMSR changes from 0x1100,0x7849 to 0x100,0x784d (BMCR[12] bit and BMSR[2]), and loop it.
Above process is auto-nego enabled, link is down, auto-nego is disabled, link is up. And auto-nego complete bit is always 0.

PHY seems become unstable if soft reset after start/config auto-nego. I also want to fix it in micrel driver, but failed.

Do you have any other insights that can help me further locate the issue? Thanks.

Best Regards,
Joakim Zhang

[...]

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ