[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <DB8PR04MB6795B2D48F4B5FDC8BA4EBC0E67E9@DB8PR04MB6795.eurprd04.prod.outlook.com>
Date: Thu, 23 Dec 2021 01:53:53 +0000
From: Joakim Zhang <qiangqing.zhang@....com>
To: Kegl Rohit <keglrohit@...il.com>
CC: netdev <netdev@...r.kernel.org>, Andrew Lunn <andrew@...n.ch>
Subject: RE: net: fec: memory corruption caused by err_enet_mii_probe error
path
Hi Kegl,
> -----Original Message-----
> From: Kegl Rohit <keglrohit@...il.com>
> Sent: 2021年12月22日 19:06
> To: Joakim Zhang <qiangqing.zhang@....com>
> Cc: netdev <netdev@...r.kernel.org>; Andrew Lunn <andrew@...n.ch>
> Subject: Re: net: fec: memory corruption caused by err_enet_mii_probe
> error path
>
> On Wed, Dec 22, 2021 at 9:01 AM Joakim Zhang <qiangqing.zhang@....com>
> wrote:
>
> > > This error path frees the DMA buffers, BUT as far I could see it
> > > does not stop the DMA engines.
> > > => open() fails => frees buffers => DMA still active => MAC receives
> > > network packet => DMA starts => random memory corruption (use after
> > > free) => random kernel panics
> >
> > A question here, why receive path still active? MAC has not connected to
> PHY when this failure happened, should not see network activities.
>
> It is a imx.28 platform using the fec for dual ethernet eth0 & eth1.
>
> One of our devices (out of 10) eth1 did not detect a phy for eth1 on ifup
> ( fec_open() ) => mdio error path => random system crashes But the phy is
> there and the link is good, only MDIO access failed.
> phys have autoneg activated after reset. So the RX path is active, even if the
> fec driver says that there is no phy.
>
> Without attached ethernet cable the phy was also not detected, BUT the
> system did not crash. => Because no packets will arrive without attached
> cable.
>
> So the main issue on our side is the not detected phy. And the other issue is
> the use after free in the error path.
Thanks for you detailed explanation :)
> I think the main issue has something to do with phy reset handling.
> On a cold boot the eth1 phy is detected successfully. A warm restart ( reboot
> -f ) will always lead to a not detected eth1 phy. The eth0 phy is always
> detected.
> From past experience the dual ethernet implementation in the driver was
> not the most stable one. Maybe because of a smaller user base.
> In our setup both phy reset lines are connected to gpios with the correct
> entry in the mac0 & mac1 DT entry.
> Revert of
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> b.com%2Ftorvalds%2Flinux%2Fcommit%2F7705b5ed8adccd921423019e870df
> 672fa423279%23diff-655f306656e7bccbec8fe6ebff2adf466bb8133f5bcb55111
> 2d3fe37e50b1a15&data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C
> 2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4c6fa92cd99c5c301635
> %7C0%7C1%7C637757679518224181%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1
> 000&sdata=yIQBX3v6PRvPerVBhdDcRJpjt1aIiq%2BtL8RXfNmKs%2FA%3
> D&reserved=0
> seems to get the phys in a correct state before the fec driver is probed and
> both phys are detected.
You mean it always can detect both eth0 and eth1 PHY on a code boot, but eth1 PHY failed when do
warm reset, it may deserve a further analyze the different for these two scenarios.
I am not sure when you will hardware reset PHY on you platforms. For FEC driver, it seems only hardware
reset PHY when probe MAC (please note PHY reset has moved to phylib, not recommend to do it in MAC driver),
fec_probe->of_mdiobus_register->of_mdiobus_register_phy->of_mdiobus_phy_device_register->phy_device_register
Generally, a hardware reset will trigger an auto-nego together, so I think PHY should in a correct state when you ifup the device,
although, it will also trigger auto-nego when connected PHY. What I want to say is that you may need to check if the hardware reset
at you side is suitable? Not sure you use single GPIO to control two PHYs reset?
Just some analysis, hope this can give you some hints.
> > > So maybe fec_stop() as counterpart to fec_restart() is missing
> > > before freeing the buffers?
> > > err_enet_mii_probe:
> > > fec_stop(ndev);
> > > fec_enet_free_buffers(ndev);
> > >
> > > Issue happend with 5.10.83 and should also also happen with current
> master.
> >
> > It's fine for me, please see if anyone else has some comments. If not,
> please cook a formal patch, thanks.
> So fec_stop is the right guess to stop the rx/tx dma rings.
>
> Other paths use netif_tx_lock_bh before calling fec_stop().
> I don't think this is necessary because netif_tx_start_all_queues() was not
> called before in this error path case?
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> b.com%2Ftorvalds%2Flinux%2Fblob%2Fmaster%2Fdrivers%2Fnet%2Fethern
> et%2Ffreescale%2Ffec_main.c%23L3207&data=04%7C01%7Cqiangqing.z
> hang%40nxp.com%7C2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4
> c6fa92cd99c5c301635%7C0%7C1%7C637757679518224181%7CUnknown%7CT
> WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLC
> JXVCI6Mn0%3D%7C1000&sdata=KfDdWkJ76Tvsd7nmXpNhdrY%2Fz9x24
> UYJl6fxjwvqtYA%3D&reserved=0
ACK, please submit a patch if you test there is no regressions.
Best Regards,
Joakim Zhang
> index 1b1f7f2a6..8f208b4a9 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -3211,8 +3211,9 @@ fec_enet_open(struct net_device *ndev)
>
> return 0;
>
> err_enet_mii_probe:
> + fec_stop(ndev);
> fec_enet_free_buffers(ndev);
> err_enet_alloc:
> fec_enet_clk_enable(ndev, false);
> clk_enable:
Powered by blists - more mailing lists