[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z3fuP47lueQpzMst@shell.armlinux.org.uk>
Date: Fri, 3 Jan 2025 14:03:43 +0000
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: Niklas Cassel <cassel@...nel.org>
Cc: Andrew Lunn <andrew@...n.ch>, Francesco Valla <francesco@...la.it>,
Andrew Lunn <andrew+netdev@...n.ch>,
Heiner Kallweit <hkallweit1@...il.com>, netdev@...r.kernel.org,
Anand Moon <linux.amoon@...il.com>
Subject: Re: [PATCH] net: phy: don't issue a module request if a driver is
available
On Fri, Jan 03, 2025 at 01:49:30PM +0100, Niklas Cassel wrote:
> On Fri, Jan 03, 2025 at 12:08:43PM +0000, Russell King (Oracle) wrote:
> > On Fri, Jan 03, 2025 at 12:25:52PM +0100, Niklas Cassel wrote:
> > > I'm trying to enable async probe for my PCIe controller (pcie-dw-rockchip),
> > > which on the radxa rock5b has a RTL8125 NIC connected to it.
> > >
> > > By enabling async probe for the PCIe driver I get the same splat as Francesco.
> > >
> > > Looking at the prints, it is trying to load a module for PHY ID: 0x1cc840
> > > This PHY ID is defined in: drivers/net/phy/realtek.c.
> > >
> > > Looking at my .config I have:
> > > CONFIG_REALTEK_PHY=y
> > >
> > > So this is not built as a module, so I am a bit surprised to see this
> > > splat (since the driver is built as built-in).
> > >
> > >
> > > I think it would be nice if the phylib core could be fixed so that
> > > it does not try to load modules for drivers which are built as built-in.
> > >
> > >
> > > Also see this old thread that tries to enable async probe by default on
> > > DT systems:
> > > https://lore.kernel.org/linux-kernel//d5796286-ec24-511a-5910-5673f8ea8b10@samsung.com/T/#u
> > >
> > > AFAICT, it seems that the phylib core is one of the biggest blockers from
> > > being able to enable async probe by default on DT systems.
> >
> > Yes, we accept that phylib is incompatible with async probing. I don't
> > think that's going to change, because it's fundamentally baked in with
> > the way the whole fallback driver stuff works.
> >
> > We *certainly* don't want to move the request_module() into
> > phy_attach*() (which is the point where we require the driver to be
> > bound or we fallback to the generic feature-reduced driver). First,
> > that *will* break SFP modules, no ifs or buts.
> >
> > Second, moving it there would mean calling request_module() in many
> > cases with the RTNL held, which blocks things like new connections
> > network establishing while the module is requested (I've run into this
> > problem when the TI Wilink driver locks up holding the RTNL lock making
> > the platform impossible to remotely resolve if there isn't an already
> > open SSH connection.) We certainly don't want userspace to be doing
> > stuff while holding the "big" RTNL that affects much of networking.
> >
> > Third, I suspect phylib already has a race between the PHY driver /
> > driver core binding the appropriate driver and phy_attach_direct()
> > attaching the fallback generic driver to the driverless PHY device,
> > and making this more "async" is going to open that race possibly to
> > the point where it becomes a problem. (At the moment, it doesn't
> > seem to cause any issue, so is theoretical right now - but if one
> > reads the code, it's obvious that there is no locking that prevents
> > a race there.)
> >
> > What saves phylib right now is that by issueing the request_module(),
> > that will wait for the module to be loaded and initialised. The
> > initialisation function will register the PHY drivers in this module.
> > As this is synchronous, it will happen before request_module() returns,
> > and thus before phy_device_create() returns. Thus, if there is a module
> > available for the PHY, it will be loaded and available to be bound
> > to the PHY device by the time phy_device_register() is called. This
> > ensures that - in the case of an auto-loaded module, the race will
> > never happen.
> >
> > Yes, it's weak. A scenario that could trigger this is loading PHY
> > driver modules in parallel with a call to the phy_attach*() functions,
> > e.g. when bringing up a network interface where the network driver
> > calls through to phy_attach*() from its .ndo_open() method. If we
> > simply make phylib's request_module() async, then this race will be
> > opened for auto-loaded modules as well.
> >
> > Closing this race to give consistent results is impossible, even if
> > we add locking. If phy_attach*() were to complete first, the generic
> > driver would be used despite the PHY specific driver module being
> > loaded. Alternatively, if the PHY specific driver module finishes
> > being loaded before phy_attach*() is called, then the PHY specific
> > driver will be used for the device. So... it needs to be synchronous.
> >
> > I also don't think "make a list of built-in drivers and omit the
> > request module" is an acceptable workaround - it's a sticky plaster
> > for the problem. If the PHY driver isn't built-in, then you have the
> > same problem with request_module() being issued. You could work around
> > that by ensuring that the PHY driver is built-in, but then we're
> > relying on multiple different things all being correct in diverse
> > areas, which is fragile.
>
> FWIW, the patch in $subject does make the splat go away for me.
> (I have the PHY driver built as built-in).
This is not about "making the splat go away". Making something go away
is not solving the problem if it's a fragile sticky plaster. Such things
come back to bite.
> The patch in $subject does "Add a list of registered drivers and check
> if one is already available before resorting to call request_module();
> in this way, if the PHY driver is already there, the MDIO bus can perform
> the async probe."
This is the sticky plaster. I covered that in my message to which you're
replying. See the last paragraph from my reply quoted above.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Powered by blists - more mailing lists