Message-ID: <addb830174704f3c9dcfea1323ed8ec8@chdua14.duagon.ads>
Date: Fri, 17 Sep 2021 06:07:00 +0000
From: Walter Stoll <Walter.Stoll@...gon.com>
To: Andrew Lunn <andrew@...n.ch>
CC: "f.fainelli@...il.com" <f.fainelli@...il.com>,
"hkallweit1@...il.com" <hkallweit1@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: AW: [BUG] net/phy: ethtool versus phy_state_machine race condition
> From: Andrew Lunn <andrew@...n.ch>
> Sent: Friday, 17 September 2021 00:59
> To: Walter Stoll <Walter.Stoll@...gon.com>
> Cc: f.fainelli@...il.com; hkallweit1@...il.com; netdev@...r.kernel.org
> Subject: Re: [BUG] net/phy: ethtool versus phy_state_machine race condition
>
> On Thu, Sep 16, 2021 at 01:08:21PM +0000, Walter Stoll wrote:
> > Effect observed
> > ---------------
> >
> > During final test of one of our products, we use ethtool to set the mode of
> > the ethernet port eth0 as follows:
> >
> > ethtool -s eth0 speed 100 duplex full autoneg off
> >
> > However, very rarely the command does not work as expected. Even though the
> > command executes without error, the port is not set accordingly. As a result,
> > the test fails.
> >
> > We observed the effect with kernel version v5.4.138-rt62. However, we think
> > that the most recent kernel exhibits the same behavior because the structure of
> > the sources in question (see below) did not change. This also holds for the
> > non-realtime kernel.
> >
> >
> > Root cause
> > ----------
> >
> > We found that there exists a race condition between ethtool and the PHY
> > state machine.
> >
> > Execution of the ethtool command involves the phy_ethtool_ksettings_set()
> > function being executed, see
> > https://elixir.bootlin.com/linux/v5.4.138/source/drivers/net/phy/phy.c#L315
> >
> > Here the mode is stored in the phydev structure. The phy_start_aneg() function
> > then takes the mode from the phydev structure and finally stores the mode into
> > the PHY.
> >
> > However, the phy_ethtool_ksettings_set() function can be interrupted by the
> > phy_state_machine() worker thread. If this happens just before the
> > phy_start_aneg() function is called, then the new settings are overwritten by
> > the current settings of the PHY. This is because the phy_state_machine()
> > function reads back the PHY settings, see
> > https://elixir.bootlin.com/linux/v5.4.138/source/drivers/net/phy/phy.c#L918
> >
> > This is exactly what happens in our case. We were able to prove this by
> > inserting various dev_info() calls in the code.
>
> Hi Walter
>
> This makes sense. We have a similar problem with MAC code calling
> phy_read_status() without holding the PHY lock as well. I have some
> patches for that, which I need to rebase. I will see if your proposed
> fix can be added to that, or if it should be a separate series.
>
> Andrew
Hi Andrew

Thanks a lot for your immediate response. Please note that I am not a kernel
developer, so I expect that the patch eventually applied will look different
from what I proposed. Please let me know whenever you have something to test.

Walter
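
For readers following the thread, the interleaving described in the root
cause section can be sketched as a small user-space model. This is not kernel
code: struct sim_phy, struct link_mode and all sim_*() functions are
hypothetical stand-ins for phydev, phy_ethtool_ksettings_set(),
phy_start_aneg() and phy_state_machine(). The "locked" variant assumes the
fix is to hold a per-PHY lock across both steps of the ethtool path, which is
only a guess based on the remark about the PHY lock above. The preemption is
replayed deterministically in a single thread instead of relying on timing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the link settings. */
struct link_mode {
    int speed;          /* Mbit/s */
    bool full_duplex;
    bool autoneg;
};

/* Hypothetical stand-in for the PHY plus its software state. */
struct sim_phy {
    struct link_mode hw;     /* what the PHY itself currently holds */
    struct link_mode cached; /* software copy, like the fields in phydev */
    bool locked;             /* models a per-PHY lock being held */
};

/* Step 1 of the ethtool path: store the request in the software copy. */
static void sim_store_request(struct sim_phy *phy, struct link_mode req)
{
    phy->cached = req;
}

/* Step 2 of the ethtool path: program the PHY from the software copy. */
static void sim_start_aneg(struct sim_phy *phy)
{
    phy->hw = phy->cached;
}

/*
 * State machine poll: read the PHY settings back into the software copy.
 * Returns false without touching anything while the lock is held. A real
 * worker would block and run later; skipping models that it cannot run
 * inside the critical section.
 */
static bool sim_state_machine_poll(struct sim_phy *phy)
{
    if (phy->locked)
        return false;
    phy->cached = phy->hw;
    return true;
}

/* Unprotected path: the poll lands between step 1 and step 2. */
static struct link_mode sim_set_racy(struct sim_phy *phy, struct link_mode req)
{
    sim_store_request(phy, req);
    sim_state_machine_poll(phy); /* overwrites the request just stored */
    sim_start_aneg(phy);         /* writes the old settings back */
    return phy->hw;
}

/* Protected path: the same poll attempt is kept out by the lock. */
static struct link_mode sim_set_locked(struct sim_phy *phy, struct link_mode req)
{
    assert(!phy->locked);        /* the modelled lock is not recursive */
    phy->locked = true;
    sim_store_request(phy, req);
    sim_state_machine_poll(phy); /* refused, the request survives */
    sim_start_aneg(phy);
    phy->locked = false;
    return phy->hw;
}
```

Starting from a PHY at 1000 Mbit/s with autonegotiation on and requesting
100 Mbit/s full duplex with autonegotiation off, sim_set_racy() leaves the PHY
unchanged although no error is reported, which matches the observed effect.
sim_set_locked() applies the requested mode.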