[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <DB8PR04MB6795AA7CF279E3249EABE614E67F9@DB8PR04MB6795.eurprd04.prod.outlook.com>
Date: Fri, 24 Dec 2021 02:29:31 +0000
From: Joakim Zhang <qiangqing.zhang@....com>
To: Kegl Rohit <keglrohit@...il.com>
CC: netdev <netdev@...r.kernel.org>, Andrew Lunn <andrew@...n.ch>
Subject: RE: net: fec: memory corruption caused by err_enet_mii_probe error
path
Hi Kegl,
> -----Original Message-----
> From: Kegl Rohit <keglrohit@...il.com>
> Sent: 2021年12月23日 21:35
> To: Joakim Zhang <qiangqing.zhang@....com>
> Cc: netdev <netdev@...r.kernel.org>; Andrew Lunn <andrew@...n.ch>
> Subject: Re: net: fec: memory corruption caused by err_enet_mii_probe
> error path
>
> Thanks for the hint to use of_mdiobus_register instead of the old way with
> fec { phy-reset-gpios = ....}
>
> This change could solve another issue with the smsc phys.
> I think phy_reset_after_clk_enable does only something if the phy was
> specified in the DT with reset-gpio = < Coming from an older kernel I was
> using the fecs phy-reset-gpios and mdio auto probing and therefore
> phy_reset_after_clk_enable did nothing.
> And my smsc phys need this reset after enet_out enable
> (PHY_RST_AFTER_CLK_EN).
>
> int phy_reset_after_clk_enable(struct phy_device *phydev) { if (!phydev
> || !phydev->drv) return -ENODEV;
>
> if (phydev->drv->flags & PHY_RST_AFTER_CLK_EN)
> { phy_device_reset(phydev, 1); phy_device_reset(phydev, 0); return 1; }
> return 0; }
Yes, please use mdio or phy reset, instead of mac reset, which marked as deprecated in DT-binding.
> So I adapted the DT to specify the reset gpios.
> BUT this solved not my issue with the not detected phy.
> So i changed the generic
> compatible = "ethernet-phy-ieee802.3-c22"; to the exact phy id compatible =
> "ethernet-phy-id0007.c0f0";
>
> => Now the phy is detected on every boot
>
> if (!is_c45 && !of_get_phy_id(child, &phy_id)) phy =
> phy_device_create(mdio, addr, phy_id, 0, NULL); else phy =
> get_phy_device(mdio, addr, is_c45); if (IS_ERR(phy)) { if (mii_ts)
> unregister_mii_timestamper(mii_ts);
> return PTR_ERR(phy);
> }
> rc = of_mdiobus_phy_device_register(mdio, phy, child, addr);
>
> With a given phy id mdio probe will issue phy_device_create; otherwise
> get_phy_device to read the id via mdio.
>
> ethernet-phy@0 { reset-gpio = ....} gpio does not get asserted/deasserted
> before the get_phy_device call.
> Only the global mdio { reset-gpio = .... } gpio is asserted/deasserted before
> the get_phy_device call.
> So the phy does not get its reset after enet_out clock enable => phy is in bad
> state => get_phy_device does not find the phy => no
> of_mdiobus_phy_device_register => missing phy
It sounds like that phylib would play a mdio reset very earlier, before we accessing PHY registers.
Yes, generic compatible and exact phy-id compatible indeed different at this point, you caught the root case, also
let me learn more :)
> I would need something like this in addition to ethernet-phy@0 { reset-gpio
> = ....} mdio { reset-gpio = <&gpio4 13 GPIO_ACTIVE_LOW &gpio2 7
> GPIO_ACTIVE_LOW} But mdio { reset-gpio } does only support a single gpio.
>
> For now I am ok with the fixed phy ids.
>
>
> &mac0 {
> phy-mode = "rmii";
> phy-handle = <&mac0_phy>;
> phy-supply = <®_etn_3v3>;
> pinctrl-names = "default";
> pinctrl-0 = <&mac0_pins_a>, <&mac0_phy_reset_pin>,
> <&mac1_phy_reset_pin>; status = "okay";
>
>
> mdio {
> #address-cells = <1>;
> #size-cells = <0>;
>
> mac0_phy: ethernet-phy@0 {
> reg = <0>;
> compatible = "ethernet-phy-id0007.c0f0"; reset-gpios = <&gpio4 13
> GPIO_ACTIVE_LOW>; reset-assert-us = <10000>; reset-deassert-us =
> <30000>; smsc,disable-energy-detect; };
>
> mac1_phy: ethernet-phy@1 {
> reg = <1>;
> compatible = "ethernet-phy-id0007.c0f0"; reset-gpios = <&gpio2 7
> GPIO_ACTIVE_LOW>; reset-assert-us = <10000>; reset-deassert-us =
> <30000>; smsc,disable-energy-detect; }; }; };
Best Regards,
Joakim Zhang
> On Thu, Dec 23, 2021 at 2:53 AM Joakim Zhang <qiangqing.zhang@....com>
> wrote:
> >
> >
> > Hi Kegl,
> >
> > > -----Original Message-----
> > > From: Kegl Rohit <keglrohit@...il.com>
> > > Sent: 2021年12月22日 19:06
> > > To: Joakim Zhang <qiangqing.zhang@....com>
> > > Cc: netdev <netdev@...r.kernel.org>; Andrew Lunn <andrew@...n.ch>
> > > Subject: Re: net: fec: memory corruption caused by
> > > err_enet_mii_probe error path
> > >
> > > On Wed, Dec 22, 2021 at 9:01 AM Joakim Zhang
> > > <qiangqing.zhang@....com>
> > > wrote:
> > >
> > > > > This error path frees the DMA buffers, BUT as far I could see it
> > > > > does not stop the DMA engines.
> > > > > => open() fails => frees buffers => DMA still active => MAC
> > > > > receives network packet => DMA starts => random memory
> > > > > corruption (use after
> > > > > free) => random kernel panics
> > > >
> > > > A question here, why receive path still active? MAC has not
> > > > connected to
> > > PHY when this failure happened, should not see network activities.
> > >
> > > It is a imx.28 platform using the fec for dual ethernet eth0 & eth1.
> > >
> > > One of our devices (out of 10) eth1 did not detect a phy for eth1 on
> > > ifup ( fec_open() ) => mdio error path => random system crashes But
> > > the phy is there and the link is good, only MDIO access failed.
> > > phys have autoneg activated after reset. So the RX path is active,
> > > even if the fec driver says that there is no phy.
> > >
> > > Without attached ethernet cable the phy was also not detected, BUT
> > > the system did not crash. => Because no packets will arrive without
> > > attached cable.
> > >
> > > So the main issue on our side is the not detected phy. And the other
> > > issue is the use after free in the error path.
> >
> > Thanks for you detailed explanation :)
> >
> > > I think the main issue has something to do with phy reset handling.
> > > On a cold boot the eth1 phy is detected successfully. A warm restart
> > > ( reboot -f ) will always lead to a not detected eth1 phy. The eth0
> > > phy is always detected.
> > > From past experience the dual ethernet implementation in the driver
> > > was not the most stable one. Maybe because of a smaller user base.
> > > In our setup both phy reset lines are connected to gpios with the
> > > correct entry in the mac0 & mac1 DT entry.
> > > Revert of
> > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgi
> > > thu
> > >
> b.com%2Ftorvalds%2Flinux%2Fcommit%2F7705b5ed8adccd921423019e870df
> > >
> 672fa423279%23diff-655f306656e7bccbec8fe6ebff2adf466bb8133f5bcb55111
> > >
> 2d3fe37e50b1a15&data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C
> > >
> 2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4c6fa92cd99c5c301635
> > > %7C0%7C1%7C637757679518224181%7CUnknown%7CTWFpbGZsb3d8eyJ
> WIj
> > >
> oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1
> > >
> 000&sdata=yIQBX3v6PRvPerVBhdDcRJpjt1aIiq%2BtL8RXfNmKs%2FA%3
> > > D&reserved=0
> > > seems to get the phys in a correct state before the fec driver is
> > > probed and both phys are detected.
> >
> > You mean it always can detect both eth0 and eth1 PHY on a code boot,
> > but eth1 PHY failed when do warm reset, it may deserve a further analyze
> the different for these two scenarios.
> >
> > I am not sure when you will hardware reset PHY on you platforms. For
> > FEC driver, it seems only hardware reset PHY when probe MAC (please
> > note PHY reset has moved to phylib, not recommend to do it in MAC
> > driver),
> >
> fec_probe->of_mdiobus_register->of_mdiobus_register_phy->of_mdiobus
> _ph
> > y_device_register->phy_device_register
> >
> > Generally, a hardware reset will trigger an auto-nego together, so I
> > think PHY should in a correct state when you ifup the device,
> > although, it will also trigger auto-nego when connected PHY. What I want to
> say is that you may need to check if the hardware reset at you side is suitable?
> Not sure you use single GPIO to control two PHYs reset?
> >
> > Just some analysis, hope this can give you some hints.
> >
> > > > > So maybe fec_stop() as counterpart to fec_restart() is missing
> > > > > before freeing the buffers?
> > > > > err_enet_mii_probe:
> > > > > fec_stop(ndev);
> > > > > fec_enet_free_buffers(ndev);
> > > > >
> > > > > Issue happend with 5.10.83 and should also also happen with
> > > > > current
> > > master.
> > > >
> > > > It's fine for me, please see if anyone else has some comments. If
> > > > not,
> > > please cook a formal patch, thanks.
> > > So fec_stop is the right guess to stop the rx/tx dma rings.
> > >
> > > Other paths use netif_tx_lock_bh before calling fec_stop().
> > > I don't think this is necessary because netif_tx_start_all_queues()
> > > was not called before in this error path case?
> > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgi
> > > thu
> > >
> b.com%2Ftorvalds%2Flinux%2Fblob%2Fmaster%2Fdrivers%2Fnet%2Fethern
> > >
> et%2Ffreescale%2Ffec_main.c%23L3207&data=04%7C01%7Cqiangqing.z
> > >
> hang%40nxp.com%7C2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4
> > >
> c6fa92cd99c5c301635%7C0%7C1%7C637757679518224181%7CUnknown%7CT
> > >
> WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLC
> > >
> JXVCI6Mn0%3D%7C1000&sdata=KfDdWkJ76Tvsd7nmXpNhdrY%2Fz9x24
> > > UYJl6fxjwvqtYA%3D&reserved=0
> >
> > ACK, please submit a patch if you test there is no regressions.
> >
> > Best Regards,
> > Joakim Zhang
> > > index 1b1f7f2a6..8f208b4a9 100644
> > > --- a/drivers/net/ethernet/freescale/fec_main.c
> > > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > > @@ -3211,8 +3211,9 @@ fec_enet_open(struct net_device *ndev)
> > >
> > > return 0;
> > >
> > > err_enet_mii_probe:
> > > + fec_stop(ndev);
> > > fec_enet_free_buffers(ndev);
> > > err_enet_alloc:
> > > fec_enet_clk_enable(ndev, false);
> > > clk_enable:
>
> On Thu, Dec 23, 2021 at 2:53 AM Joakim Zhang <qiangqing.zhang@....com>
> wrote:
> >
> >
> > Hi Kegl,
> >
> > > -----Original Message-----
> > > From: Kegl Rohit <keglrohit@...il.com>
> > > Sent: 2021年12月22日 19:06
> > > To: Joakim Zhang <qiangqing.zhang@....com>
> > > Cc: netdev <netdev@...r.kernel.org>; Andrew Lunn <andrew@...n.ch>
> > > Subject: Re: net: fec: memory corruption caused by
> > > err_enet_mii_probe error path
> > >
> > > On Wed, Dec 22, 2021 at 9:01 AM Joakim Zhang
> > > <qiangqing.zhang@....com>
> > > wrote:
> > >
> > > > > This error path frees the DMA buffers, BUT as far I could see it
> > > > > does not stop the DMA engines.
> > > > > => open() fails => frees buffers => DMA still active => MAC
> > > > > receives network packet => DMA starts => random memory
> > > > > corruption (use after
> > > > > free) => random kernel panics
> > > >
> > > > A question here, why receive path still active? MAC has not
> > > > connected to
> > > PHY when this failure happened, should not see network activities.
> > >
> > > It is a imx.28 platform using the fec for dual ethernet eth0 & eth1.
> > >
> > > One of our devices (out of 10) eth1 did not detect a phy for eth1 on
> > > ifup ( fec_open() ) => mdio error path => random system crashes But
> > > the phy is there and the link is good, only MDIO access failed.
> > > phys have autoneg activated after reset. So the RX path is active,
> > > even if the fec driver says that there is no phy.
> > >
> > > Without attached ethernet cable the phy was also not detected, BUT
> > > the system did not crash. => Because no packets will arrive without
> > > attached cable.
> > >
> > > So the main issue on our side is the not detected phy. And the other
> > > issue is the use after free in the error path.
> >
> > Thanks for you detailed explanation :)
> >
> > > I think the main issue has something to do with phy reset handling.
> > > On a cold boot the eth1 phy is detected successfully. A warm restart
> > > ( reboot -f ) will always lead to a not detected eth1 phy. The eth0
> > > phy is always detected.
> > > From past experience the dual ethernet implementation in the driver
> > > was not the most stable one. Maybe because of a smaller user base.
> > > In our setup both phy reset lines are connected to gpios with the
> > > correct entry in the mac0 & mac1 DT entry.
> > > Revert of
> > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgi
> > > thu
> > >
> b.com%2Ftorvalds%2Flinux%2Fcommit%2F7705b5ed8adccd921423019e870df
> > >
> 672fa423279%23diff-655f306656e7bccbec8fe6ebff2adf466bb8133f5bcb55111
> > >
> 2d3fe37e50b1a15&data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C
> > >
> 2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4c6fa92cd99c5c301635
> > > %7C0%7C1%7C637757679518224181%7CUnknown%7CTWFpbGZsb3d8eyJ
> WIj
> > >
> oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1
> > >
> 000&sdata=yIQBX3v6PRvPerVBhdDcRJpjt1aIiq%2BtL8RXfNmKs%2FA%3
> > > D&reserved=0
> > > seems to get the phys in a correct state before the fec driver is
> > > probed and both phys are detected.
> >
> > You mean it always can detect both eth0 and eth1 PHY on a code boot,
> > but eth1 PHY failed when do warm reset, it may deserve a further analyze
> the different for these two scenarios.
> >
> > I am not sure when you will hardware reset PHY on you platforms. For
> > FEC driver, it seems only hardware reset PHY when probe MAC (please
> > note PHY reset has moved to phylib, not recommend to do it in MAC
> > driver),
> >
> fec_probe->of_mdiobus_register->of_mdiobus_register_phy->of_mdiobus
> _ph
> > y_device_register->phy_device_register
> >
> > Generally, a hardware reset will trigger an auto-nego together, so I
> > think PHY should in a correct state when you ifup the device,
> > although, it will also trigger auto-nego when connected PHY. What I want to
> say is that you may need to check if the hardware reset at you side is suitable?
> Not sure you use single GPIO to control two PHYs reset?
> >
> > Just some analysis, hope this can give you some hints.
> >
> > > > > So maybe fec_stop() as counterpart to fec_restart() is missing
> > > > > before freeing the buffers?
> > > > > err_enet_mii_probe:
> > > > > fec_stop(ndev);
> > > > > fec_enet_free_buffers(ndev);
> > > > >
> > > > > Issue happend with 5.10.83 and should also also happen with
> > > > > current
> > > master.
> > > >
> > > > It's fine for me, please see if anyone else has some comments. If
> > > > not,
> > > please cook a formal patch, thanks.
> > > So fec_stop is the right guess to stop the rx/tx dma rings.
> > >
> > > Other paths use netif_tx_lock_bh before calling fec_stop().
> > > I don't think this is necessary because netif_tx_start_all_queues()
> > > was not called before in this error path case?
> > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgi
> > > thu
> > >
> b.com%2Ftorvalds%2Flinux%2Fblob%2Fmaster%2Fdrivers%2Fnet%2Fethern
> > >
> et%2Ffreescale%2Ffec_main.c%23L3207&data=04%7C01%7Cqiangqing.z
> > >
> hang%40nxp.com%7C2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4
> > >
> c6fa92cd99c5c301635%7C0%7C1%7C637757679518224181%7CUnknown%7CT
> > >
> WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLC
> > >
> JXVCI6Mn0%3D%7C1000&sdata=KfDdWkJ76Tvsd7nmXpNhdrY%2Fz9x24
> > > UYJl6fxjwvqtYA%3D&reserved=0
> >
> > ACK, please submit a patch if you test there is no regressions.
> >
> > Best Regards,
> > Joakim Zhang
> > > index 1b1f7f2a6..8f208b4a9 100644
> > > --- a/drivers/net/ethernet/freescale/fec_main.c
> > > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > > @@ -3211,8 +3211,9 @@ fec_enet_open(struct net_device *ndev)
> > >
> > > return 0;
> > >
> > > err_enet_mii_probe:
> > > + fec_stop(ndev);
> > > fec_enet_free_buffers(ndev);
> > > err_enet_alloc:
> > > fec_enet_clk_enable(ndev, false);
> > > clk_enable:
Powered by blists - more mailing lists