Message-ID: <de130c97-c344-42ee-b3bc-0ca5f9dc36df@lunn.ch>
Date: Sat, 19 Apr 2025 20:11:19 +0200
From: Andrew Lunn <andrew@...n.ch>
To: Alexander Duyck <alexander.duyck@...il.com>
Cc: netdev@...r.kernel.org, linux@...linux.org.uk, hkallweit1@...il.com,
	davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com
Subject: Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap

On Wed, Apr 16, 2025 at 08:28:46AM -0700, Alexander Duyck wrote:
> Address two issues found in the phylink code.
> 
> The first issue is the fact that there were unused defines that were
> referencing deprecated macros themselves. Since they aren't used we might
> as well drop them.
> 
> The second issue which is more the main reason for this submission is the
> fact that the BMC was losing link when we would call phylink_resume. This
> is fixed by adding a new boolean value link_balanced which will allow us
> to avoid doing an immediate force of the link up/down and instead defer it
> until after we have checked the actual link state.

I'm wondering if we have jumped straight into the weeds without having
a good overall big picture of what we are trying to achieve. But maybe
it is just me, and this is just for my edification...

As i've said a few times, we don't have a good story around networking
and BMCs. Traditionally, all the details have been hidden away in the
NIC firmware, and Linux is pretty much unaware it is going on, at
least from the host side. fbnic is changing that, and we need
Linux/phylink to understand this.

Since this is all pretty new to me, i went and quickly read:

https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf

Hopefully i now have a better big picture.

Figure 2 answers a few questions for me. One was, do we actually have
a three port switch in here? And i would say no. We have something
similar, but not a switch. There is no back to back MAC on the host
PCI interface. We do have back to back MAC on the NC-SI port, but it
appears Linux has no knowledge of the NIC NC-SI MAC, and the BMC is
controlling the BMC NC-SI MAC.

Not having a switch means when we are talking about the MAC, PCS, PHY
etc, we are talking about the media side MAC, PCS, PHY. Given that
phylink is just as often used with switches with a conduit interface
and switch ports, that is an important point.

Figure 2 also hints at there being three different life cycles all
interacting with each other. Our normal model in phylink is that the
Network Controller is passive, it is told what to do by
Linux/phylink. However, in this setup, that is not true. The Network
Controller is active, it has firmware running on it. The Network
Controller and the Management Controller life cycle probably starts at
about the same time, when the PSU starts generating standby power. The
host life cycle starts later, when the BMC decides to power up the
host.

The NC-SI protocol defines messages between the Management Controller
and the Network Controller. One of these messages is how to configure
the media side. See section 8.4.21. It lists different network speeds
which can be negotiated, duplex, and pause, and whether to use
auto-neg. There is not enough detail to fully specify link modes
above 1000BaseT: all you can request, for example, is 40G, but you
cannot say CR4, KR4, SR4, or LR4. There also does not appear to be a
way to ask the network controller what it actually supports. So i
guess you normally just ask for everything up to 100G, and you are
happy when the Get Link Status response says the link is 10BaseT
Half.
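
To make the ambiguity concrete, here is a minimal sketch in C. The
enum values, struct, and table are hypothetical and are not the
DSP0222 encoding; they only illustrate that a speed-only request for
40G maps onto several distinct media types.

```c
#include <stddef.h>

/* Hypothetical speed classes that a speed-only request can name.
 * Illustrative values only, NOT the DSP0222 bit encoding. */
enum demo_req_speed {
	DEMO_REQ_10G,
	DEMO_REQ_40G,
};

struct demo_link_mode {
	const char *name;            /* concrete media type */
	enum demo_req_speed speed;   /* speed class it falls under */
};

/* Illustrative table, not a list of what any real NIC supports. */
static const struct demo_link_mode demo_modes[] = {
	{ "10GBASE-T",   DEMO_REQ_10G },
	{ "40GBASE-CR4", DEMO_REQ_40G },
	{ "40GBASE-KR4", DEMO_REQ_40G },
	{ "40GBASE-SR4", DEMO_REQ_40G },
	{ "40GBASE-LR4", DEMO_REQ_40G },
};

/* Count the concrete link modes that satisfy a speed-only request. */
size_t demo_modes_matching(enum demo_req_speed speed)
{
	size_t i, n = 0;

	for (i = 0; i < sizeof(demo_modes) / sizeof(demo_modes[0]); i++)
		if (demo_modes[i].speed == speed)
			n++;
	return n;
}
```

With this table a request for 40G matches four candidates, so the
Network Controller has to pick the media type itself.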

The Network Controller needs to be smart enough to get the link up and
running. So it basically has a phylink implementation, with a PCS
driver, 0 or more PHY drivers, SFP cage driver, SFP driver etc.

Some text from the document, which is pretty relevant to the
discussion.

  The Set Link command may be used by the Management Controller to
  configure the external network interface associated with the channel
  by using the provided settings. Upon receiving this command, while
  the host NC driver is not operational, the channel shall attempt to
  set the link to the configuration specified by the parameters. Upon
  successful completion of this command, link settings specified in
  the command should be used by the network controller as long as the
  host NC driver does not overwrite the link settings.

  In the absence of an operational host NC driver, the NC should
  attempt to make the requested link state change even if it requires
  the NC to drop the current link. The channel shall send a response
  packet to the Management Controller within the required response
  time. However, the requested link state changes may take an
  unspecified amount of time to complete.

  The actual link settings are controlled by the host NC driver when
  it is operational. When the host NC driver is operational, link
  settings specified by the MC using the Set Link command may be
  overwritten by the host NC driver. The link settings are not
  restored by the NC if the host NC driver becomes non
  operational.

There is a very clear indication that the host is in control, or the
host is not in control. So one obvious question to me is, should
phylink have ops into the MAC driver to say it is taking over control,
and relinquishing control? The Linux model is that when the interface
is admin down, you can use ethtool to preconfigure things, but they
don't take effect until the link is admin up. So with admin down, we
have a host NC driver, but it is not operational, hence the Network
Controller is in control of the link at the Management Controller's
behest. It is only with admin up that phylink takes control of the
Network Controller, and it releases it with admin down. Having these
ops would also help with suspend/resume. Suspend does not change the
admin up/down status, but the host clearly needs to hand over control
of the media to the Network Controller, and take it back again on
resume.

Also, if we have these ops, we know that admin down/suspend does not
mean media down. The presence of these ops triggers different state
transitions in the phylink state machine so that it simply hands off
control of the media, but otherwise leaves it alone.
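
As a rough sketch of what such ops could look like, here is a small
userspace model in C. Everything in it is hypothetical: the demo_*
names, the two callbacks, and the state struct are assumptions for
illustration and are not part of the existing phylink_mac_ops. It only
models who owns the media and whether the media is forced down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_owner { DEMO_OWNER_NC_FIRMWARE, DEMO_OWNER_HOST };

struct demo_port {
	enum demo_owner owner;  /* who currently controls the media */
	bool admin_up;
	bool media_up;
};

/* Hypothetical handover ops; a MAC driver without a BMC passes NULL. */
struct demo_handover_ops {
	void (*take_control)(struct demo_port *p);
	void (*release_control)(struct demo_port *p);
};

static void demo_take(struct demo_port *p)    { p->owner = DEMO_OWNER_HOST; }
static void demo_release(struct demo_port *p) { p->owner = DEMO_OWNER_NC_FIRMWARE; }

const struct demo_handover_ops demo_ops = {
	.take_control = demo_take,
	.release_control = demo_release,
};

/* Give up the media: hand it off if ops exist, else force it down. */
static void demo_let_go(struct demo_port *p, const struct demo_handover_ops *ops)
{
	if (ops)
		ops->release_control(p);   /* media left untouched */
	else
		p->media_up = false;       /* classic model: media goes down */
}

void demo_admin_up(struct demo_port *p, const struct demo_handover_ops *ops)
{
	assert(p != NULL);
	p->admin_up = true;
	if (ops)
		ops->take_control(p);
}

void demo_admin_down(struct demo_port *p, const struct demo_handover_ops *ops)
{
	assert(p != NULL);
	p->admin_up = false;
	demo_let_go(p, ops);
}

/* Suspend keeps admin_up as it was, but still gives up the media. */
void demo_suspend(struct demo_port *p, const struct demo_handover_ops *ops)
{
	assert(p != NULL);
	demo_let_go(p, ops);
}

void demo_resume(struct demo_port *p, const struct demo_handover_ops *ops)
{
	assert(p != NULL);
	if (ops && p->admin_up)
		ops->take_control(p);
}
```

In this model a port with the ops keeps media_up across admin down and
suspend, and only the owner changes. A port without them falls back to
forcing the media down.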

With this in place, i think we can avoid all the unbalanced state?

What is potentially more interesting is when phylink takes control. Do
we have enough information about the system to say its current
configuration is the wanted configuration? Or are we forced to do a
ground up reconfiguration, which will include a media down/up? I had a
quick scan of the document and i did not find anything which says the
host is not allowed/is allowed to do disruptive things, but the text
quoted above says 'The actual link settings are controlled by the host
NC driver when it is operational'. Controlling the link settings is a
disruptive operation, so the management controller needs to be
tolerant of such changes.
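
The takeover decision could be sketched roughly as below. This is a
hypothetical userspace model in C, not existing phylink code: the
struct, its fields, and the function names are all assumptions. The
idea is that a media down/up is only needed when the current
configuration is unknown or differs from the wanted one.

```c
#include <stdbool.h>

/* Hypothetical summary of the media-side link configuration. */
struct demo_link_cfg {
	int speed_mbps;
	bool full_duplex;
	bool autoneg;
	bool pause;
};

static bool demo_cfg_equal(const struct demo_link_cfg *a,
			   const struct demo_link_cfg *b)
{
	return a->speed_mbps == b->speed_mbps &&
	       a->full_duplex == b->full_duplex &&
	       a->autoneg == b->autoneg &&
	       a->pause == b->pause;
}

/* Decide whether taking control requires bouncing the media.
 * current_known is false when the host cannot read back what the
 * Network Controller firmware has configured. */
bool demo_needs_media_bounce(const struct demo_link_cfg *current,
			     bool current_known,
			     const struct demo_link_cfg *wanted)
{
	if (!current_known)
		return true;    /* forced into a ground-up reconfiguration */
	return !demo_cfg_equal(current, wanted);
}
```

If the host can read back the current settings and they already match,
this model takes over without disturbing the link the BMC is using.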

So, can we ignore the weeds for the moment, and think about the big
picture?

	Andrew
