netdev - Re: [PATCH stable-4.19.y] net: phy: reschedule state machine if AN has not completed in PHY

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200531001849.GG1551@shell.armlinux.org.uk>
Date:   Sun, 31 May 2020 01:18:49 +0100
From:   Russell King - ARM Linux admin <linux@...linux.org.uk>
To:     Vladimir Oltean <olteanv@...il.com>
Cc:     stable@...r.kernel.org, gregkh@...uxfoundation.org,
        netdev@...r.kernel.org, andrew@...n.ch, f.fainelli@...il.com,
        hkallweit1@...il.com, davem@...emloft.net, kuba@...nel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH stable-4.19.y] net: phy: reschedule state machine if AN
 has not completed in PHY_AN state

On Sun, May 31, 2020 at 12:43:15AM +0300, Vladimir Oltean wrote:
> From: Vladimir Oltean <vladimir.oltean@....com>
> 
> In kernel 4.19 (and probably earlier too) there are issues surrounding
> the PHY_AN state.
> 
> For example, if a PHY is in PHY_AN state and AN has not finished, then
> what is supposed to happen is that the state machine gets rescheduled
> until it is, or until the link_timeout reaches zero which triggers an
> autoneg restart process.
> 
> But actually the rescheduling never works if the PHY uses interrupts,
> because the condition under which rescheduling occurs is just if
> phy_polling_mode() is true. So basically, this whole rescheduling
> functionality works for AN-not-yet-complete just by mistake. Let me
> explain.
> 
> Most of the time the AN process manages to finish by the time the
> interrupt has triggered. One might say "that should always be the case,
> otherwise the PHY wouldn't raise the interrupt, right?".
> Well, some PHYs implement an .aneg_done method which allows them to tell
> the state machine when the AN is really complete.
> The AR8031/AR8033 driver (at803x.c) is one such example. Even when
> copper autoneg completes, the driver still keeps the "aneg_done"
> variable unset until in-band SGMII autoneg finishes too (there is no
> interrupt for that). So we have the premises of a race condition.

Why do we care whether SGMII autoneg has completed - is that not the
domain of the MAC side of the link?

It sounds like things are a little confused.  The PHY interrupt is
signalling that the copper side has completed its autoneg.  If we're
in SGMII mode, the PHY can now start the process of informing the
MAC about the negotiation results across the SGMII link.  When the
MAC receives those results, and sends the acknowledgement back to the
PHY, is it not the responsibility of the MAC to then say "the link is
now up" ?

That's how we deal with it elsewhere with phylink integration, which
is what has to be done when you have to cope with PHYs that switch
their host interface mode between SGMII, 2500BASE-X, 5GBASE-R and
10GBASE-R - the MAC side needs to be dynamically reconfigured depending
on the new host-side operating mode of the PHY.  Only when the MAC
subsequently reports that the link has been established is the whole
link from the MAC to the media deemed to be operational.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC for 0.8m (est. 1762m) line in suburbia: sync at 13.1Mbps down 424kbps up