netdev - Re: net: fec: memory corruption caused by err_enet_mii

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMeyCbhTRuic6U8xnOg+oWJ_cSXduYZaV14vu8+ZWd_4XaXNaA@mail.gmail.com>
Date:   Thu, 23 Dec 2021 14:35:12 +0100
From:   Kegl Rohit <keglrohit@...il.com>
To:     Joakim Zhang <qiangqing.zhang@....com>
Cc:     netdev <netdev@...r.kernel.org>, Andrew Lunn <andrew@...n.ch>
Subject: Re: net: fec: memory corruption caused by err_enet_mii_probe error path

Thanks for the hint to use of_mdiobus_register instead of the old way
with fec { phy-reset-gpios = ....}

This change could solve another issue with the smsc phys.
I think phy_reset_after_clk_enable does only something if the phy was
specified in the DT with reset-gpio = <
Coming from an older kernel I was using the fecs phy-reset-gpios and
mdio auto probing and therefore phy_reset_after_clk_enable did
nothing.
And my smsc phys need this reset after enet_out enable (PHY_RST_AFTER_CLK_EN).

int phy_reset_after_clk_enable(struct phy_device *phydev)
{
if (!phydev || !phydev->drv)
return -ENODEV;

if (phydev->drv->flags & PHY_RST_AFTER_CLK_EN) {
phy_device_reset(phydev, 1);
phy_device_reset(phydev, 0);
return 1;
}
return 0;
}

So I adapted the DT to specify the reset gpios.
BUT this solved not my issue with the not detected phy.
So i changed the generic
compatible = "ethernet-phy-ieee802.3-c22";
to the exact phy id
compatible = "ethernet-phy-id0007.c0f0";

=> Now the phy is detected on every boot

if (!is_c45 && !of_get_phy_id(child, &phy_id))
phy = phy_device_create(mdio, addr, phy_id, 0, NULL);
else
phy = get_phy_device(mdio, addr, is_c45);
if (IS_ERR(phy)) {
if (mii_ts)
unregister_mii_timestamper(mii_ts);
return PTR_ERR(phy);
}
rc = of_mdiobus_phy_device_register(mdio, phy, child, addr);

With a given phy id mdio probe will issue phy_device_create; otherwise
get_phy_device to read the id via mdio.

ethernet-phy@0 { reset-gpio = ....} gpio does not get
asserted/deasserted before the get_phy_device call.
Only the global mdio { reset-gpio = .... } gpio is asserted/deasserted
before the get_phy_device call.
So the phy does not get its reset after enet_out clock enable => phy
is in bad state => get_phy_device does not find the phy => no
of_mdiobus_phy_device_register => missing phy

I would need something like this in addition to ethernet-phy@0 {
reset-gpio = ....}
mdio { reset-gpio = <&gpio4 13 GPIO_ACTIVE_LOW &gpio2 7 GPIO_ACTIVE_LOW}
But mdio { reset-gpio } does only support a single gpio.

For now I am ok with the fixed phy ids.


&mac0 {
phy-mode = "rmii";
phy-handle = <&mac0_phy>;
phy-supply = <&reg_etn_3v3>;
pinctrl-names = "default";
pinctrl-0 = <&mac0_pins_a>, <&mac0_phy_reset_pin>, <&mac1_phy_reset_pin>;
status = "okay";


mdio {
#address-cells = <1>;
#size-cells = <0>;

mac0_phy: ethernet-phy@0 {
reg = <0>;
compatible = "ethernet-phy-id0007.c0f0";
reset-gpios = <&gpio4 13 GPIO_ACTIVE_LOW>;
reset-assert-us = <10000>;
reset-deassert-us = <30000>;
smsc,disable-energy-detect;
};

mac1_phy: ethernet-phy@1 {
reg = <1>;
compatible = "ethernet-phy-id0007.c0f0";
reset-gpios = <&gpio2 7 GPIO_ACTIVE_LOW>;
reset-assert-us = <10000>;
reset-deassert-us = <30000>;
smsc,disable-energy-detect;
};
};
};

On Thu, Dec 23, 2021 at 2:53 AM Joakim Zhang <qiangqing.zhang@....com> wrote:
>
>
> Hi Kegl,
>
> > -----Original Message-----
> > From: Kegl Rohit <keglrohit@...il.com>
> > Sent: 2021年12月22日 19:06
> > To: Joakim Zhang <qiangqing.zhang@....com>
> > Cc: netdev <netdev@...r.kernel.org>; Andrew Lunn <andrew@...n.ch>
> > Subject: Re: net: fec: memory corruption caused by err_enet_mii_probe
> > error path
> >
> > On Wed, Dec 22, 2021 at 9:01 AM Joakim Zhang <qiangqing.zhang@....com>
> > wrote:
> >
> > > > This error path frees the DMA buffers, BUT as far I could see it
> > > > does not stop the DMA engines.
> > > > => open() fails => frees buffers => DMA still active => MAC receives
> > > > network packet => DMA starts => random memory corruption (use after
> > > > free) => random kernel panics
> > >
> > > A question here, why receive path still active? MAC has not connected to
> > PHY when this failure happened, should not see network activities.
> >
> > It is a imx.28 platform using the fec for dual ethernet eth0 & eth1.
> >
> > One of our devices (out of 10) eth1 did not detect a phy for eth1 on ifup
> > ( fec_open() ) => mdio error path => random system crashes But the phy is
> > there and the link is good, only MDIO access failed.
> > phys have autoneg activated after reset. So the RX path is active, even if the
> > fec driver says that there is no phy.
> >
> > Without attached ethernet cable the phy was also not detected, BUT the
> > system did not crash. => Because no packets will arrive without attached
> > cable.
> >
> > So the main issue on our side is the not detected phy. And the other issue is
> > the use after free in the error path.
>
> Thanks for you detailed explanation :)
>
> > I think the main issue has something to do with phy reset handling.
> > On a cold boot the eth1 phy is detected successfully. A warm restart ( reboot
> > -f ) will always lead to a not detected eth1 phy. The eth0 phy is always
> > detected.
> > From past experience the dual ethernet implementation in the driver was
> > not the most stable one. Maybe because of a smaller user base.
> > In our setup both phy reset lines are connected to gpios with the correct
> > entry in the mac0 & mac1 DT entry.
> > Revert of
> > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> > b.com%2Ftorvalds%2Flinux%2Fcommit%2F7705b5ed8adccd921423019e870df
> > 672fa423279%23diff-655f306656e7bccbec8fe6ebff2adf466bb8133f5bcb55111
> > 2d3fe37e50b1a15&amp;data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C
> > 2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4c6fa92cd99c5c301635
> > %7C0%7C1%7C637757679518224181%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> > oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1
> > 000&amp;sdata=yIQBX3v6PRvPerVBhdDcRJpjt1aIiq%2BtL8RXfNmKs%2FA%3
> > D&amp;reserved=0
> > seems to get the phys in a correct state before the fec driver is probed and
> > both phys are detected.
>
> You mean it always can detect both eth0 and eth1 PHY on a code boot, but eth1 PHY failed when do
> warm reset, it may deserve a further analyze the different for these two scenarios.
>
> I am not sure when you will hardware reset PHY on you platforms. For FEC driver, it seems only hardware
> reset PHY when probe MAC (please note PHY reset has moved to phylib, not recommend to do it in MAC driver),
> fec_probe->of_mdiobus_register->of_mdiobus_register_phy->of_mdiobus_phy_device_register->phy_device_register
>
> Generally, a hardware reset will trigger an auto-nego together, so I think PHY should in a correct state when you ifup the device,
> although, it will also trigger auto-nego when connected PHY. What I want to say is that you may need to check if the hardware reset
> at you side is suitable? Not sure you use single GPIO to control two PHYs reset?
>
> Just some analysis, hope this can give you some hints.
>
> > > > So maybe fec_stop() as counterpart to fec_restart() is missing
> > > > before freeing the buffers?
> > > > err_enet_mii_probe:
> > > > fec_stop(ndev);
> > > > fec_enet_free_buffers(ndev);
> > > >
> > > > Issue happend with 5.10.83 and should also also happen with current
> > master.
> > >
> > > It's fine for me, please see if anyone else has some comments. If not,
> > please cook a formal patch, thanks.
> > So fec_stop is the right guess to stop the rx/tx dma rings.
> >
> > Other paths use netif_tx_lock_bh before calling fec_stop().
> > I don't think this is necessary because netif_tx_start_all_queues() was not
> > called before in this error path case?
> > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> > b.com%2Ftorvalds%2Flinux%2Fblob%2Fmaster%2Fdrivers%2Fnet%2Fethern
> > et%2Ffreescale%2Ffec_main.c%23L3207&amp;data=04%7C01%7Cqiangqing.z
> > hang%40nxp.com%7C2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4
> > c6fa92cd99c5c301635%7C0%7C1%7C637757679518224181%7CUnknown%7CT
> > WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLC
> > JXVCI6Mn0%3D%7C1000&amp;sdata=KfDdWkJ76Tvsd7nmXpNhdrY%2Fz9x24
> > UYJl6fxjwvqtYA%3D&amp;reserved=0
>
> ACK, please submit a patch if you test there is no regressions.
>
> Best Regards,
> Joakim Zhang
> > index 1b1f7f2a6..8f208b4a9 100644
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -3211,8 +3211,9 @@ fec_enet_open(struct net_device *ndev)
> >
> >         return 0;
> >
> >  err_enet_mii_probe:
> > +       fec_stop(ndev);
> >         fec_enet_free_buffers(ndev);
> >  err_enet_alloc:
> >         fec_enet_clk_enable(ndev, false);
> >  clk_enable:

On Thu, Dec 23, 2021 at 2:53 AM Joakim Zhang <qiangqing.zhang@....com> wrote:
>
>
> Hi Kegl,
>
> > -----Original Message-----
> > From: Kegl Rohit <keglrohit@...il.com>
> > Sent: 2021年12月22日 19:06
> > To: Joakim Zhang <qiangqing.zhang@....com>
> > Cc: netdev <netdev@...r.kernel.org>; Andrew Lunn <andrew@...n.ch>
> > Subject: Re: net: fec: memory corruption caused by err_enet_mii_probe
> > error path
> >
> > On Wed, Dec 22, 2021 at 9:01 AM Joakim Zhang <qiangqing.zhang@....com>
> > wrote:
> >
> > > > This error path frees the DMA buffers, BUT as far I could see it
> > > > does not stop the DMA engines.
> > > > => open() fails => frees buffers => DMA still active => MAC receives
> > > > network packet => DMA starts => random memory corruption (use after
> > > > free) => random kernel panics
> > >
> > > A question here, why receive path still active? MAC has not connected to
> > PHY when this failure happened, should not see network activities.
> >
> > It is a imx.28 platform using the fec for dual ethernet eth0 & eth1.
> >
> > One of our devices (out of 10) eth1 did not detect a phy for eth1 on ifup
> > ( fec_open() ) => mdio error path => random system crashes But the phy is
> > there and the link is good, only MDIO access failed.
> > phys have autoneg activated after reset. So the RX path is active, even if the
> > fec driver says that there is no phy.
> >
> > Without attached ethernet cable the phy was also not detected, BUT the
> > system did not crash. => Because no packets will arrive without attached
> > cable.
> >
> > So the main issue on our side is the not detected phy. And the other issue is
> > the use after free in the error path.
>
> Thanks for you detailed explanation :)
>
> > I think the main issue has something to do with phy reset handling.
> > On a cold boot the eth1 phy is detected successfully. A warm restart ( reboot
> > -f ) will always lead to a not detected eth1 phy. The eth0 phy is always
> > detected.
> > From past experience the dual ethernet implementation in the driver was
> > not the most stable one. Maybe because of a smaller user base.
> > In our setup both phy reset lines are connected to gpios with the correct
> > entry in the mac0 & mac1 DT entry.
> > Revert of
> > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> > b.com%2Ftorvalds%2Flinux%2Fcommit%2F7705b5ed8adccd921423019e870df
> > 672fa423279%23diff-655f306656e7bccbec8fe6ebff2adf466bb8133f5bcb55111
> > 2d3fe37e50b1a15&amp;data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C
> > 2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4c6fa92cd99c5c301635
> > %7C0%7C1%7C637757679518224181%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> > oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1
> > 000&amp;sdata=yIQBX3v6PRvPerVBhdDcRJpjt1aIiq%2BtL8RXfNmKs%2FA%3
> > D&amp;reserved=0
> > seems to get the phys in a correct state before the fec driver is probed and
> > both phys are detected.
>
> You mean it always can detect both eth0 and eth1 PHY on a code boot, but eth1 PHY failed when do
> warm reset, it may deserve a further analyze the different for these two scenarios.
>
> I am not sure when you will hardware reset PHY on you platforms. For FEC driver, it seems only hardware
> reset PHY when probe MAC (please note PHY reset has moved to phylib, not recommend to do it in MAC driver),
> fec_probe->of_mdiobus_register->of_mdiobus_register_phy->of_mdiobus_phy_device_register->phy_device_register
>
> Generally, a hardware reset will trigger an auto-nego together, so I think PHY should in a correct state when you ifup the device,
> although, it will also trigger auto-nego when connected PHY. What I want to say is that you may need to check if the hardware reset
> at you side is suitable? Not sure you use single GPIO to control two PHYs reset?
>
> Just some analysis, hope this can give you some hints.
>
> > > > So maybe fec_stop() as counterpart to fec_restart() is missing
> > > > before freeing the buffers?
> > > > err_enet_mii_probe:
> > > > fec_stop(ndev);
> > > > fec_enet_free_buffers(ndev);
> > > >
> > > > Issue happend with 5.10.83 and should also also happen with current
> > master.
> > >
> > > It's fine for me, please see if anyone else has some comments. If not,
> > please cook a formal patch, thanks.
> > So fec_stop is the right guess to stop the rx/tx dma rings.
> >
> > Other paths use netif_tx_lock_bh before calling fec_stop().
> > I don't think this is necessary because netif_tx_start_all_queues() was not
> > called before in this error path case?
> > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> > b.com%2Ftorvalds%2Flinux%2Fblob%2Fmaster%2Fdrivers%2Fnet%2Fethern
> > et%2Ffreescale%2Ffec_main.c%23L3207&amp;data=04%7C01%7Cqiangqing.z
> > hang%40nxp.com%7C2f10c3ff43de4ccccc8708d9c53b037e%7C686ea1d3bc2b4
> > c6fa92cd99c5c301635%7C0%7C1%7C637757679518224181%7CUnknown%7CT
> > WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLC
> > JXVCI6Mn0%3D%7C1000&amp;sdata=KfDdWkJ76Tvsd7nmXpNhdrY%2Fz9x24
> > UYJl6fxjwvqtYA%3D&amp;reserved=0
>
> ACK, please submit a patch if you test there is no regressions.
>
> Best Regards,
> Joakim Zhang
> > index 1b1f7f2a6..8f208b4a9 100644
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -3211,8 +3211,9 @@ fec_enet_open(struct net_device *ndev)
> >
> >         return 0;
> >
> >  err_enet_mii_probe:
> > +       fec_stop(ndev);
> >         fec_enet_free_buffers(ndev);
> >  err_enet_alloc:
> >         fec_enet_clk_enable(ndev, false);
> >  clk_enable: