netdev - Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0UeY1xHvxXm8YDJ6MArgwcgu2Dkw3PBTg=m0XK7Hi-XrXw@mail.gmail.com>
Date: Wed, 16 Apr 2025 15:58:33 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: "Russell King (Oracle)" <linux@...linux.org.uk>
Cc: netdev@...r.kernel.org, andrew@...n.ch, hkallweit1@...il.com, 
	davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com
Subject: Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing
 w/ BMC present

On Wed, Apr 16, 2025 at 1:05 PM Russell King (Oracle)
<linux@...linux.org.uk> wrote:
>
> On Wed, Apr 16, 2025 at 08:19:38PM +0100, Russell King (Oracle) wrote:
> > So, when a Linux network driver probes, it starts out in administrative
> > state *DOWN*. When the administrator configures the network driver,
> > .ndo_open is called, which is expected to configure the network adapter.
> >
> > Part of that process is to call phylink_start() as one of the last
> > steps, which detects whether the link is up or not. If the link is up,
> > then phylink will call netif_carrier_on() and .mac_link_up(). This
> > tells the core networking layers that the network interface is now
> > ready to start sending packets, and it will begin queueing packets for
> > the network driver to process - not before.
> >
> > Prior to .ndo_open being called, the networking layers do not expect
> > traffic from the network device no matter what physical state the
> > media link is in. If .ndo_open fails, the same applies - no traffic is
> > expected to be passed to the core network layers from the network
> > layers because as far as the network stack is concerned, the interface
> > is still administratively down.
> >
> > Thus, the fact that your BMC thinks that the link is up is irrelevant.
> >
> > So, start off in a state that packet activity is suspended even if the
> > link is up at probe time. Only start packet activity (reception and
> > transmission) once .mac_link_up() has been called. Stop that activity
> > when .mac_link_down() is subsequently called.
> >
> > There have been lots of NICs out there where the physical link doesn't
> > follow the adminstrative state of the network interface. This is not a
> > problem. It may be desirable that it does, but a desire is not the same
> > as "it must".
>
> Let me be crystal clear on this.
>
> Phylink has a contract with all existing users. That contract is:
>
> Initial state: link down.
>
> Driver calls phylink_start() in its .ndo_open method.
>
> Phylink does configuration of the PHY and link according to the
> chosen link parameters by calling into the MAC, PCS, and phylib as
> appropriate.
>
> If the link is then discovered to be up (it might have been already
> up before phylink_start() was called), phylink will call the various
> components such as PCS and MAC to inform them that the link is now up.
> This will mean calling the .mac_link_up() method. Otherwise (if the
> link is discovered to be down when the interface is brought up) no
> call to either .mac_link_up() nor .mac_link_down() will be made.
>
> If the link _subsequently_ goes down, then phylink deals with that
> and calls .mac_link_down() - only if .mac_link_up() was previously
> called (that's one of the bugs you discovered, that on resume it
> gets called anyway. I've submitted a fix for that contract breach,
> which only affects a very small number of drivers - stmmac, ucc_geth
> and your fbnic out of 22 total ethernet users plus however many DSA
> users we have.)
>
> Only if .mac_link_down() has been called, if the link subsequently
> comes back up, then the same process happens as before resulting in
> .mac_link_up() being called.
>
> If the interface is taken down, then .mac_link_down() will be called
> if and only if .mac_link_up() had been called.
>
> The ordering of .mac_link_up() / .mac_link_down() is a strict
> contract term with phylink users.
>
> The reason for this contract: phylink users may have ordering
> requirements.
>
> For example, on mac_link_down(), they may wait for packet activity to
> stop, and then place the MAC in reset. If called without a previous
> .mac_link_up call, the wait stage may time out due to the MAC being
> in reset. (Ocelot may suffer with this.)
>
> Another example is fs_enet which also relies on this strict ordering
> as described above.
>
> There could be others - there are some that call into firmware on
> calls to .mac_link_up() / .mac_link_down() and misordering those
> depends on what the firmware is doing, which we have no visibility
> of.
>
> As I stated, this is the contract that phylink gave to users, and
> the contract still stands, and can't be broken to behave differently
> (e.g. calling .mac_link_down() after phylink_start() without an
> intervening call to .mac_link_up()) otherwise existing users will
> break. Bugs that go against that contract will be fixed, but the
> contract will not be intentionally broken.

The issue is as that stands the contract is inherently broken if a BMC
is present.

1. There is still the link loss during the phylink_start issue which
will essentially leave the MAC up if the link fails. This will cause
the Tx FIFO on the NIC to essentially be left to hang without flushing
out the stale packets that may have queued up when the link dropped.
This is one of the reasons why the mac_link_down call still needs to
propagate all the way back down to the hardware.

2. We already have precedent for the link being up when WOL is in use.
One concern I would have with your patch is if it will impact that or
not. I suspected part of the mac_link_down is related to cleaning up
something related to the link setup for WOL configuring the link for
some speed.

3. While it may be a part of the contract, isn't there some way we can
"renegotiate the terms"? The fact is in the case of our driver we are
essentially doing a hand-off from FW to the OS for the link when we
call ndo_open. When we are done in ndo_stop we hand ownership back
over to the firmware. Those hand-offs need to be free of link bounces.
This pattern looks very much like the WOL setup with the only
exception being trying to avoid these link bounces. I'm just wondering
if there isn't some way we can add a phylink_config value indicating
that there is a BMC present on the link so we can handle those cases.
The general idea would be to update my patch so instead of
pl->link_balanced being just set to false in suspend and at creation
time we could essentially just set it to "pl->link_balanced =
!pl->config.mac_bmc_handoff".

The fact is the code with the diff I provided did everything I needed.
I could load/unload and ifconfig up/down the driver all day and it
didn't drop a packet. As it stood the definition of the mac_config
code more or less called out that we weren't supposed to change things
unless we needed to and I had followed that. I suspect the overall
change should be small, smaller now following your fix of the message
issue, and it shouldn't impact anything other than our driver if I add
the config flag checks.