netdev - Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0UcXY3y3=0AnbbbRH75gh2ciBKhQj2tzQAbcHW_acKeoQw@mail.gmail.com>
Date: Sun, 20 Apr 2025 11:18:39 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: netdev@...r.kernel.org, linux@...linux.org.uk, hkallweit1@...il.com, 
	davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com
Subject: Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap

On Sat, Apr 19, 2025 at 11:11 AM Andrew Lunn <andrew@...n.ch> wrote:
>
> On Wed, Apr 16, 2025 at 08:28:46AM -0700, Alexander Duyck wrote:
> > Address two issues found in the phylink code.
> >
> > The first issue is the fact that there were unused defines that were
> > referencing deprecated macros themselves. Since they aren't used we might
> > as well drop them.
> >
> > The second issue which is more the main reason for this submission is the
> > fact that the BMC was losing link when we would call phylink_resume. This
> > is fixed by adding a new boolean value link_balanced which will allow us
> > to avoid doing an immediate force of the link up/down and instead defer it
> > until after we have checked the actual link state.
>
> I'm wondering if we have jumped straight into the weeds without having
> a good overall big picture of what we are trying to achieve. But maybe
> it is just me, and this is just for my edification...
>
> As i've said a few times we don't have a good story around networking
> and BMCs. Traditionally, all the details have been hidden away in the
> NIC firmware, and linux is pretty much unaware it is going on, at
> least from the Host side. fbnic is changing that, and we need
> Linux/phylink to understand this.
>
> Since this is all pretty new to me, i went and quickly read:
>
> https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
>
> Hopefully i now have a better big picture.
>
> Figure 2 answers a few questions for me. One was, do we actually have
> a three port switch in here? And i would say no. We have something
> similar, but not a switch. There is no back to back MAC on the host
> PCI interface. We do have back to back MAC on the NC-SI port, but it
> appears Linux has no knowledge of the NIC NC-SI MAC, and the BMC is
> controlling BMC NC-SI MAC.
>
> Not having a switch means when we are talking about the MAC, PCS, PHY
> etc, we are talking about the media side MAC, PCS, PHY. Given that
> phylink is just as often used with switches with a conduit interface
> and switch ports, that is an important point.
>
> Figure 2 also hints at there being three different life cycles all
> interacting with each other. Our normal model in phylink is that the
> Network Controller is passive, it is told what to do by
> Linux/phylink. However, in this setup, that is not true. The Network
> Controller is active, it has firmware running on it. The Network
> Controller and the Management Controller life cycle probably starts at
> about the same time, when the PSU starts generating standby power. The
> host life cycle starts later, when the BMC decides to power up the
> host.
>
> The NC-SI protocol defines messages between the Management Controller
> and the Network Controller. One of these messages is how to configure
> the media side. See section 8.4.21. It lists different networks speeds
> which can be negotiated, duplex, and pause, and if to use
> auto-neg. There is not enough details to fully specific link modes
> above 1000BaseT, all you can request for example is 40G, but you
> cannot say CR4, KR4, SR4, or LR4. There also does not appear to be a
> way to ask the network controller what it actually supports. So i
> guess you normally just ask for everything up to 100G, and you are
> happy when Get Link Status response command says the link it 10BaseT
> Half.

In our case this is one of the reasons why the firmware knows what the
interface and media type are before we do. It basically has it coded
into the EEPROM so that it will identify the type as CR and normally
it informs the BMC and the host of the speed/FEC of the port as well.
So the BMC could configure this, but it is normally getting this info
from the NIC as it is setup to bring the link up before the BMC even
gets a chance to ask to setup the config.

> The Network Controller needs to be smart enough to get the link up and
> running. So it basically has a phylink implementation, with a PCS
> driver, 0 or more PHY drivers, SFP cage driver, SFP driver etc.
>
> Some text from the document, which is pretty relevant to the
> discussion.
>
>   The Set Link command may be used by the Management Controller to
>   configure the external network interface associated with the channel
>   by using the provided settings. Upon receiving this command, while
>   the host NC driver is not operational, the channel shall attempt to
>   set the link to the configuration specified by the parameters. Upon
>   successful completion of this command, link settings specified in
>   the command should be used by the network controller as long as the
>   host NC driver does not overwrite the link settings.
>
>   In the absence of an operational host NC driver, the NC should
>   attempt to make the requested link state change even if it requires
>   the NC to drop the current link. The channel shall send a response
>   packet to the Management Controller within the required response
>   time. However, the requested link state changes may take an
>   unspecified amount of time to complete.
>
>   The actual link settings are controlled by the host NC driver when
>   it is operational. When the host NC driver is operational, link
>   settings specified by the MC using the Set Link command may be
>   overwritten by the host NC driver. The link settings are not
>   restored by the NC if the host NC driver becomes non
>   operational.
>
> There is a very clear indication that the host is in control, or the
> host is not in control. So one obvious question to me is, should
> phylink have ops into the MAC driver to say it is taking over control,
> and relinquishing control? The Linux model is that when the interface
> is admin down, you can use ethtool to preconfigure things, but they
> don't take affect until the link is admin up. So with admin down, we
> have a host NC driver, but it is not operational, hence the Network
> Controller is in control of the link at the Management Controllers
> bequest. It is only with admin up that phylink takes control of the
> Network Controller, and it releases it with admin down. Having these
> ops would also help with suspend/resume. Suspend does not change the
> admin up/down status, but the host clearly needs to hand over control
> of the media to the Network Controller, and take it back again on
> resume.

Yes, this more-or-less describes the current setup in fbnic. The only
piece that is probably missing would be the heartbeat we maintain so
that the NIC doesn't revoke access due to the OS/driver potentially
being hung. The other thing involved in this that you didn't mention
is that the MC is also managing the Rx filter configuration. So when
we take ownership it is both the Rx Filters and MAC/PCS/PHY that we
are taking ownership of.

The current pattern in fbnic is for us to do most of this on the tail
end of __fbnic_open and unwind it near the start of fbnic_stop.
Essentially the pattern is xmit_ownership, init heartbeat, init PTP,
start phylink, configure Rx filters. In the case of close it is the
reverse with us tearing down the filters, disabling phylink, disabling
PTP, and then releasing ownership.

> Also, if we have these ops, we know that admin down/suspend does not
> mean media down. The presence of these ops triggers different state
> transitions in the phylink state machine so that it simply hands off
> control of the media, but otherwise leaves it alone.
>
> With this in place, i think we can avoid all the unbalanced state?

As I understand it right now the main issue is that Phylink assumes
that it has to take the link down in order to perform a major
configuration in phylink_start/phylink_resume. However in the case of
fbnic what I do a check in mac_prepare to see if I even really need to
take the link down or not to support the requested mode. If not then I
skip the destructive settings update and tweak a few minor things that
can be updated without taking the link down. So I think the change I
am making is that mac_prepare before assumed it was just disabling
things from the SerDes down, and I think I pulled that up to the MAC
layer which may be a departure from previous expectations.

> What is potentially more interesting is when phylink takes control. Do
> we have enough information about the system to say its current
> configuration is the wanted configuration? Or are we forced to do a
> ground up reconfiguration, which will include a media down/up? I had a
> quick scan of the document and i did not find anything which says the
> host is not allowed/is allowed to do disruptive things, but the text
> quoted above says 'The actual link settings are controlled by the host
> NC driver when it is operational'. Controlling the link settings is a
> disruptive operation, so the management controller needs to be
> tolerant to such changes.

In the case of fbnic we have a test we can do to verify we are linked
at the correct fec/modulation/lane width via a check of signal outputs
from the PCS. Basically we have one master bit that says link is
present, but then we can inspect the individual values on a per
FEC/lane/modulation basis. So we can do a link check for a specific
interface type and verify if the link is up and configured for the
settings we want.

Historically speaking, drivers in the past did disruptive things while
the BMC was up. That is one of the things that caught me off-guard
with Meta's requirements. I had been used to dealing with 1G BMC on
most enterprise hardware and in the case of those we end up bouncing
the link state when the link is configured. I had gotten quite used to
seeing that on igb some time ago as I have had a 1G system with igb
and a BMC for years and got used to the SoL console stalling as the
igb reconfigured the link.

The requirement that the BMC not lose link comes more out of the
multi-host setups that have been in place in the data center
environment for the last decade or so where there was only one link
but multiple systems all sharing that link, including the BMC. So it
is not strictly a BMC requirement, but more of a multi-host
requirement.

> So, can we ignore the weeds for the moment, and think about the big
> picture?

So big picture wise we really have 2 issues:
1. The BMC handling doesn't currently exist, so we need to extend
handling/hand-off for link up before we start, and link up after we
stop.
2. Expectations for our 25G+ interfaces to behave like multi-host NICs
that are sharing a link via firmware. Specifically that
loading/unloading the driver or ifconfig up/down on the host interface
should not cause the link to bounce and/or drop packets for any other
connections, which in this case includes the BMC.