netdev - Re: [PATCH net] net: phy: make phy_error() report which PHY has failed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191218220908.GX25745@shell.armlinux.org.uk>
Date:   Wed, 18 Dec 2019 22:09:08 +0000
From:   Russell King - ARM Linux admin <linux@...linux.org.uk>
To:     Heiner Kallweit <hkallweit1@...il.com>
Cc:     Andrew Lunn <andrew@...n.ch>,
        Florian Fainelli <f.fainelli@...il.com>,
        "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org
Subject: Re: [PATCH net] net: phy: make phy_error() report which PHY has
 failed

On Wed, Dec 18, 2019 at 09:54:32PM +0100, Heiner Kallweit wrote:
> On 18.12.2019 00:34, Russell King - ARM Linux admin wrote:
> > On Tue, Dec 17, 2019 at 10:41:34PM +0100, Heiner Kallweit wrote:
> >> On 17.12.2019 13:53, Russell King wrote:
> >>> phy_error() is called from phy_interrupt() or phy_state_machine(), and
> >>> uses WARN_ON() to print a backtrace. The backtrace is not useful when
> >>> reporting a PHY error.
> >>>
> >>> However, a system may contain multiple ethernet PHYs, and phy_error()
> >>> gives no clue which one caused the problem.
> >>>
> >>> Replace WARN_ON() with a call to phydev_err() so that we can see which
> >>> PHY had an error, and also inform the user that we are halting the PHY.
> >>>
> >>> Fixes: fa7b28c11bbf ("net: phy: print stack trace in phy_error")
> >>> Signed-off-by: Russell King <rmk+kernel@...linux.org.uk>
> >>> ---
> >>> There is another related problem in this area. If an error is detected
> >>> while the PHY is running, phy_error() moves to PHY_HALTED state. If we
> >>> try to take the network device down, then:
> >>>
> >>> void phy_stop(struct phy_device *phydev)
> >>> {
> >>>         if (!phy_is_started(phydev)) {
> >>>                 WARN(1, "called from state %s\n",
> >>>                      phy_state_to_str(phydev->state));
> >>>                 return;
> >>>         }
> >>>
> >>> triggers, and we never do any of the phy_stop() cleanup. I'm not sure
> >>> what the best way to solve this is - introducing a PHY_ERROR state may
> >>> be a solution, but I think we want some phy_is_started() sites to
> >>> return true for it and others to return false.
> >>>
> >>> Heiner - you introduced the above warning, could you look at improving
> >>> this case so we don't print a warning and taint the kernel when taking
> >>> a network device down after phy_error() please?
> >>>
> >> I think we need both types of information:
> >> - the affected PHY device
> >> - the stack trace to see where the issue was triggered
> > 
> > Can you please explain why the stack trace is useful.  For the paths
> > that are reachable, all it tells you is whether it was reached via
> > the interrupt or the workqueue.
> > 
> > If it's via the interrupt, the rest of the backtrace beyond that is
> > irrelevant.  If it's the workqueue, the backtrace doesn't go back
> > very far, and doesn't tell you what operation triggered it.
> > 
> > If it's important to see where or why phy_error() was called, there
> > are much better ways of doing that, notably passing a string into
> > phy_error() to describe the actual error itself.  That would convey
> > way more useful information than the backtrace does.
> > 
> > I have been faced with these backtraces, and they have not been at
> > all useful for diagnosing the problem.
> > 
> "The problem" comes in two flavors:
> 1. The problem that caused the PHY error
> 2. The problem caused by the PHY error (if we decide to not
>    always switch to HALTED state)
> 
> We can't do much for case 1, maybe we could add an errno argument
> to phy_error(). To facilitate analyzing case 2 we'd need to change
> code pieces like the following.
> 
> case a:
> err = f1();
> case b:
> err = f2();
> 
> if (err)
> 	phy_error()
> 
> For my understanding: What caused the PHY error in your case(s)?
> Which info would have been useful for analyzing the error?

Errors reading/writing from the PHY.

The problem with a backtrace from phy_error() is it doesn't tell you
where the error actually occurred, it only tells you where the error
is reported - which is one of two different paths at the moment.
That can be achieved with much more elegance and simplicity by
passing a string into phy_error() to describe the call site if that's
even relevant.

I would say, however, that knowing where the error occurred would be
far better information.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up