Message-ID: <CAKgT0UfDWP91rH1G70+pYL2HbMdjgr46h3X+uufL42xmXVi=cg@mail.gmail.com>
Date: Tue, 22 Apr 2025 16:06:19 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org, linux@...linux.org.uk,
hkallweit1@...il.com, davem@...emloft.net, pabeni@...hat.com
Subject: Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
On Tue, Apr 22, 2025 at 3:26 PM Andrew Lunn <andrew@...n.ch> wrote:
>
> On Tue, Apr 22, 2025 at 02:29:48PM -0700, Alexander Duyck wrote:
> > On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@...n.ch> wrote:
> > >
> > > > > The whole concept of a multi-host NIC is new to me. So i at least need
> > > > > to get up to speed with it. I've no idea if Russell has come across it
> > > > > before, since it is not a SoC concept.
> > > > >
> > > > > I don't really want to agree to anything until i do have that concept
> > > > > understood. That is part of why i asked about a standard. It is a
> > > > > dense document answering a lot of questions. Without a standard, i
> > > > > need to ask a lot of questions.
> > > >
> > > > Don't hesitate to ask the questions, your last reply contains no
> > > > question marks :)
> > >
> > > O.K. Let's start with the basics. I assume the NIC has a PCIe connector
> > > something like a 4.0 x4? Each of the four hosts in the system
> > > contributes one PCIe lane. So from the host side it looks like a 4.0 x1
> > > NIC?
> >
> > More like a 5.0 x16 split into four 5.0 x4 NICs.
>
> O.K. Same thing, different scale.
Agreed.
> > > There are not 4 host MACs connected to a 5 port switch. Rather, each
> > > host gets its own subset of queues, DMA engines etc, for one shared
> > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> > > interface.
> >
> > Yeah, that is the setup so far. Basically we are using one QSFP cable
> > and slicing it up. So instead of having a 100CR4 connection we might
> > have 2x50CR2 operating on the same cable, or 4x25CR.
>
> But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs?
Yes. Part of the confusion may be that our hardware always has 4
MAC/PCS/PMA setups, one for each host. Depending on the NIC
configuration we may have either 4 hosts present, or 2 hosts with the
other 2 disabled. In the 2-host case, 2 lanes from the QSFP are routed
to one host and the other 2 to the second. So with a QSFP28 or QSFP+
we can only support 2 hosts, and with a QSFP-DD we can support 4.
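Just to spell out the combinations I'm describing, here is a quick
summary in code form. The struct and names are made up purely for
illustration, not anything from our driver, and I'm assuming the usual
4-lane QSFP+/QSFP28 and 8-lane QSFP-DD lane counts:

#include <stdio.h>

/* Hypothetical summary of the cable-to-host splits described above. */
struct nic_split {
	const char *cable;	/* module type (lane count assumed) */
	int hosts;		/* hosts that can be active */
	int lanes_per_host;	/* QSFP lanes routed to each active host */
};

static const struct nic_split splits[] = {
	{ "QSFP+/QSFP28 (4 lanes)", 2, 2 },	/* 2 hosts active, 2 disabled */
	{ "QSFP-DD (8 lanes)",      4, 2 },	/* all 4 hosts active */
};

int main(void)
{
	for (size_t i = 0; i < sizeof(splits) / sizeof(splits[0]); i++)
		printf("%s: %d hosts x %d lanes each\n", splits[i].cable,
		       splits[i].hosts, splits[i].lanes_per_host);
	return 0;
}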
> Or is there always 4 MACs, each MAC has its own queues, and you need
> to place frames into the correct queue, and with a 2x50CR2 you also
> need to load balance across those two queues?
Are you familiar with the concept of QSFP breakout cables? The general
idea is that one end of the cable is a QSFP connection and it will
break out into 4 SFP connections on the other end. That is actually
pretty close to the concept behind our NIC. We essentially have an
internalized breakout where the QSFP connection comes in, but we break
it into either 2 or 4 connections on our end. Our limit is 2 lanes per
host.
I did a quick search and came up with the following link to a Cisco
whitepaper that sort of explains the breakout cable concept. I will
try to see if I can find a spec somewhere that defines how to handle a
breakout cable:
https://www.cisco.com/c/en/us/products/collateral/interfaces-modules/transceiver-modules/whitepaper-c11-744077.html
> I guess the queuing does not matter much to phylink, but how do you
> represent multiple PCS lanes to phylink? Up until now, one netdev has
> had one PCS lane. It now has 1, 2, or 4 lanes. None of the
> phylink_pcs_op have a lane indicator.
The PCS isn't really much of a problem: there is only one PCS per
host. Where things get messier is that the PMA/PMD setup is per lane.
Our PCS has vendor registers for setting up the PMA side of things,
and we have to program them for 2 devices instead of just one.
Likewise, we have to pass a lane mask to the PMD to tell it which
lanes are being configured for which modulation and which lanes are
disabled.
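To make the lane-mask part concrete, here is a rough sketch of the
sort of call I mean. Everything in it is hypothetical (the
pmd_config_lanes() name, the modulation enum, the 2-lane limit baked
into the struct); it is not phylink or our actual register interface,
it just shows a lane mask plus per-lane modulation being handed down
to the PMD:

#include <stdint.h>
#include <stdio.h>

enum pmd_modulation {
	PMD_MOD_OFF,	/* lane disabled */
	PMD_MOD_NRZ,	/* e.g. 25G per lane */
	PMD_MOD_PAM4,	/* e.g. 50G per lane */
};

struct pmd_lane_cfg {
	uint8_t lane_mask;		/* which of this host's (up to) 2 lanes to touch */
	enum pmd_modulation mod[2];	/* requested modulation, indexed by lane */
};

/* Stand-in for programming the vendor PMA/PMD registers, one lane at a time. */
static void pmd_config_lanes(const struct pmd_lane_cfg *cfg)
{
	for (int lane = 0; lane < 2; lane++) {
		if (!(cfg->lane_mask & (1 << lane))) {
			printf("lane %d: not ours, left alone\n", lane);
			continue;
		}
		if (cfg->mod[lane] == PMD_MOD_OFF)
			printf("lane %d: disabled\n", lane);
		else
			printf("lane %d: %s\n", lane,
			       cfg->mod[lane] == PMD_MOD_PAM4 ? "PAM4" : "NRZ");
	}
}

int main(void)
{
	/* 50CR2-style link: this host uses both of its lanes. */
	struct pmd_lane_cfg cr2 = {
		.lane_mask = 0x3,
		.mod = { PMD_MOD_NRZ, PMD_MOD_NRZ },
	};
	/* 25CR-style link: only lane 0 is in use, lane 1 is disabled. */
	struct pmd_lane_cfg cr1 = {
		.lane_mask = 0x3,
		.mod = { PMD_MOD_NRZ, PMD_MOD_OFF },
	};

	pmd_config_lanes(&cr2);
	pmd_config_lanes(&cr1);
	return 0;
}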
> > > NC-SI, with Linux controlling the hardware, implies you need to be
> > > able to hand off control of the GPIOs, I2C, PCS to Linux. But with
> > > multi-host, it makes no sense for all 4 hosts to be trying to control
> > > the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more
> > > likely to me, one host gets put in change of everything below the
> > > queues to the MAC. The others just know there is link, nothing more.
> >
> > Things are a bit simpler than that. With the direct-attach we don't
> > need to take any action on the SFP. Essentially the I2C and GPIOs are
> > all shared. As such we can read the QSFP state, but cannot modify it
> > directly. We aren't taking any actions to write to the I2C other than
> > bank/page which is handled all as a part of the read call.
>
> That might work for direct-attach, but what about the general case? We
> need to ensure whatever we add supports the general case.
I agree, but at the same time I am just describing the limitations of
our hardware setup. There isn't really anything to control on the
QSFP; it is mostly just there to provide the media. There is no PHY
on it to load firmware for.
> The current SFP code expects a Linux I2C bus. Given how SFPs are
> broken, it does 16-byte reads at most. When it needs to read more
> than 16 bytes, i expect it will set the page once, read it back to
> ensure the SFP actually implements the page, and then do multiple I2C
> reads to read all the data it wants from that page. I don't see how
> this is going to work when the I2C bus is shared.
The general idea is that we have to cache the page and bank in the
driver and pass those as arguments to the firmware when we perform a
read. The firmware will take a lock on the I2C bus, set the page and
bank, perform the read, and then release the lock. With that, all 4
hosts can read the QSFP over I2C without causing side effects for
each other.
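Roughly, in (made-up) code it looks like the below. None of these
names are our actual firmware interface; the point is just that the
driver only caches the desired bank/page and every read hands them to
the firmware, which does the select + read as one locked transaction:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QSFP_BANKS	2
#define QSFP_PAGES	4
#define QSFP_PAGE_SZ	128

static pthread_mutex_t i2c_lock = PTHREAD_MUTEX_INITIALIZER;
static uint8_t qsfp_eeprom[QSFP_BANKS][QSFP_PAGES][QSFP_PAGE_SZ]; /* fake module memory */

/* "Firmware" side: bank/page select and the read are one locked transaction. */
static int fw_qsfp_read(uint8_t bank, uint8_t page, uint8_t offset,
			uint8_t *buf, size_t len)
{
	if (bank >= QSFP_BANKS || page >= QSFP_PAGES || offset + len > QSFP_PAGE_SZ)
		return -1;

	pthread_mutex_lock(&i2c_lock);
	/* Real firmware would write the bank/page select registers here. */
	memcpy(buf, &qsfp_eeprom[bank][page][offset], len);
	pthread_mutex_unlock(&i2c_lock);
	return 0;
}

/* Host driver side: cache the desired bank/page, pass them on every read. */
struct qsfp_host {
	uint8_t bank;
	uint8_t page;
};

static void host_set_page(struct qsfp_host *h, uint8_t bank, uint8_t page)
{
	/* No I2C traffic here -- just remember what to ask the firmware for. */
	h->bank = bank;
	h->page = page;
}

static int host_read(struct qsfp_host *h, uint8_t offset, uint8_t *buf, size_t len)
{
	return fw_qsfp_read(h->bank, h->page, offset, buf, len);
}

int main(void)
{
	struct qsfp_host host = { 0 };
	uint8_t id;

	/* e.g. read the first byte of bank 0 / page 0. */
	host_set_page(&host, 0, 0);
	if (host_read(&host, 0, &id, 1) == 0)
		printf("byte 0: 0x%02x\n", id);
	return 0;
}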
> > > This actually circles back to the discussion about fixed-link. The one
> > > host in control of all the lower hardware has the complete
> > > picture. The other 3 maybe just need a fixed link. They don't get to
> > > see what is going on below the MAC, and as a result there is no
> > > ethtool support to change anything, and so no conflicting
> > > configuration? And since they cannot control any of that, they cannot
> > > put the link down. So 3/4 of the problem is solved.
> >
> > Yeah, this is why I was headed down that path for a bit. However, our
> > links are independent, with the only shared bits being the PMD and
> > the SFP module.
>
> Yours might be, but what is the general case?
I will do some digging into the breakout cable path. That seems like
the most likely setup that would be similar and more general.