Message-ID: <CAKgT0UfW=mHjtvxNdqy1qB6VYGxKrabWfWNgF3snR07QpNjEhQ@mail.gmail.com>
Date: Tue, 22 Apr 2025 14:29:48 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org, linux@...linux.org.uk,
hkallweit1@...il.com, davem@...emloft.net, pabeni@...hat.com
Subject: Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@...n.ch> wrote:
>
> > > The whole concept of a multi-host NIC is new to me. So i at least need
> > > to get up to speed with it. I've no idea if Russell has come across it
> > > before, since it is not a SoC concept.
> > >
> > > I don't really want to agree to anything until i do have that concept
> > > understood. That is part of why i asked about a standard. It is a
> > > dense document answering a lot of questions. Without a standard, i
> > > need to ask a lot of questions.
> >
> > Don't hesitate to ask the questions, your last reply contains no
> > question marks :)
>
> O.K. Lets start with the basics. I assume the NIC has a PCIe connector
> something like a 4.0 x4? Each of the four hosts in the system
> contribute one PCIe lane. So from the host side it looks like a 4.0 x1
> NIC?
More like a 5.0 x16 split into four 5.0 x4 NICs.
> There are not 4 host MACs connected to a 5 port switch. Rather, each
> host gets its own subset of queues, DMA engines etc, for one shared
> MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> interface.
Yeah, that is the setup so far. Basically we are using one QSFP cable
and slicing it up. So instead of having a 100CR4 connection we might
have 2x50CR2 operating on the same cable, or 4x25CR.
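To make the slicing concrete, here is a rough sketch of the split
modes described above. All names and the struct layout are
illustrative, not from any real driver; the invariant is just that
every split has to account for the same four electrical lanes in the
cage:

```c
#include <assert.h>

/* Hypothetical description of how one QSFP cage's four electrical
 * lanes can be carved up between hosts. Names are illustrative. */
struct port_split {
	const char *name;	/* shorthand name, e.g. "2x50CR2" */
	int ports;		/* independent links sharing the cable */
	int lanes_per_port;	/* electrical lanes per link */
	int gbps_per_lane;	/* per-lane signalling rate */
};

static const struct port_split splits[] = {
	{ "100CR4",  1, 4, 25 },  /* one host owns the whole cable */
	{ "2x50CR2", 2, 2, 25 },  /* two hosts, two lanes each */
	{ "4x25CR",  4, 1, 25 },  /* four hosts, one lane each */
};

/* Total lane count must match the physical cage in every split. */
static int split_total_lanes(const struct port_split *s)
{
	return s->ports * s->lanes_per_port;
}
```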
> You must have a minimum of firmware on the NIC to get the MAC into a
> state the BMC can inject/receive frames, configure the PCS, gpios to
> the SFP, enough I2C to figure out what the module is, what quirks are
> needed etc.
The firmware isn't that smart. It isn't reading the QSFP itself to get
that info. It could, but it doesn't. It is essentially hands-off, as
there isn't any change needed for a direct attach cable. Basically it
is configuring the MAC, PCS, FEC, PMA, and PMD with pre-recorded
settings from the NIC's EEPROM.
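A minimal sketch of what "apply pre-recorded settings" could look
like, assuming a stored-defaults structure; the field names and the
EEPROM layout here are assumptions for illustration only:

```c
#include <assert.h>
#include <string.h>

/* Illustrative-only layout for the kind of pre-recorded link
 * settings the NIC firmware might apply at init; not a real
 * EEPROM format. */
enum modulation { MOD_NRZ, MOD_PAM4 };

struct link_defaults {
	int lanes;		/* 1, 2 or 4 */
	enum modulation mod;	/* NRZ or PAM4 */
	int fec;		/* e.g. RS-FEC on/off */
	int gbps_per_lane;
};

/* Firmware-side "apply" step: copy the stored defaults into the
 * live MAC/PCS/FEC/PMA/PMD configuration without ever touching
 * the module itself. */
static void apply_eeprom_defaults(const struct link_defaults *stored,
				  struct link_defaults *live)
{
	memcpy(live, stored, sizeof(*live));
}
```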
> NC-SI, with Linux controlling the hardware, implies you need to be
> able to hand off control of the GPIOs, I2C, PCS to Linux. But with
> multi-host, it makes no sense for all 4 hosts to be trying to control
> the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more
> likely to me, one host gets put in change of everything below the
> queues to the MAC. The others just know there is link, nothing more.
Things are a bit simpler than that. With direct-attach we don't need
to take any action on the SFP. Essentially the I2C and GPIOs are all
shared, so we can read the QSFP state but cannot modify it directly.
We aren't writing to the I2C at all other than the bank/page
selection, which is handled as part of the read call.
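A sketch of that "page select folded into the read" pattern, using a
test-double in place of the real firmware-mediated I2C accessors
(which are hypothetical stand-ins here). SFF-8636 keeps the
page-select register at byte 127, with bytes 0-127 as
page-independent lower memory:

```c
#include <assert.h>
#include <stdint.h>

#define QSFP_PAGE_SELECT	127

static uint8_t fake_eeprom[4][256];	/* [page][offset] test double */
static uint8_t cur_page;

static void i2c_write_byte(uint8_t reg, uint8_t val)
{
	if (reg == QSFP_PAGE_SELECT)
		cur_page = val;
}

static uint8_t i2c_read_byte(uint8_t reg)
{
	/* Bytes 0-127 are page-independent lower memory. */
	return fake_eeprom[reg < 128 ? 0 : cur_page][reg];
}

/* Read-only access from the host's point of view: the only write
 * ever issued is the page select, and it happens inside the read
 * helper itself. */
static uint8_t qsfp_read(uint8_t page, uint8_t reg)
{
	if (reg >= 128)
		i2c_write_byte(QSFP_PAGE_SELECT, page);
	return i2c_read_byte(reg);
}
```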
> This actually circles back to the discussion about fixed-link. The one
> host in control of all the lower hardware has the complete
> picture. The other 3 maybe just need a fixed link. They don't get to
> see what is going on below the MAC, and as a result there is no
> ethtool support to change anything, and so no conflicting
> configuration? And since they cannot control any of that, they cannot
> put the link down. So 3/4 of the problem is solved.
Yeah, this is why I was headed down that path for a bit. However, our
links are independent, with the only shared bits being the PMD and the
SFP module. We can essentially configure everything else differently
between the ports from there. So depending on what the cable supports
we can potentially run one lane or two, in either NRZ or PAM4 mode.
So for example, one of our standard test items is to use a QSFP-DD
loopback plug and cycle through all the different configurations on
all the different ports to make sure we don't have configuration
leaking over from one port to another, as the PMD is shared between
hosts 0 and 1, and hosts 2 and 3, if we have a 4 port setup. We don't
have to have all 4 MAC/PCS/PMA blocks configured the same. We can have
a different config between ports, although in most cases it will just
end up being the same.
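The leakage test above can be sketched roughly as follows. The
structures and the `program_port()` helper are invented stand-ins for
the real per-port registers; the point is just the shape of the walk:
program every configuration on every port and verify the neighbour
sharing the PMD is untouched:

```c
#include <assert.h>

#define NPORTS 4

struct port_cfg { int lanes; int pam4; };

static struct port_cfg hw[NPORTS];	/* stand-in for real registers */

static void program_port(int port, struct port_cfg cfg)
{
	hw[port] = cfg;	/* a buggy driver might clobber hw[port ^ 1] */
}

/* Hosts 0/1 share one PMD and hosts 2/3 the other, but each port
 * still keeps its own MAC/PCS/PMA configuration, so the PMD-sharing
 * neighbour (port ^ 1) must never change when we program a port. */
static int check_no_leakage(void)
{
	static const struct port_cfg cfgs[] = {
		{ 1, 0 }, { 1, 1 }, { 2, 0 }, { 2, 1 },
	};
	for (int p = 0; p < NPORTS; p++) {
		for (unsigned i = 0; i < sizeof(cfgs) / sizeof(cfgs[0]); i++) {
			struct port_cfg before = hw[p ^ 1];

			program_port(p, cfgs[i]);
			if (hw[p ^ 1].lanes != before.lanes ||
			    hw[p ^ 1].pam4 != before.pam4)
				return 0;	/* config leaked */
		}
	}
	return 1;
}
```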
> phylink is however not expecting that when phylink_start() is called,
> it might or might not have to drive the hardware depending on if it
> wins an election to control the hardware. And if it losses, it needs
> to ditch all its configuration for a PCS, SPF, etc and swap to a
> fixed-link. Do we want to teach phylink all this, or put all phylink
> stuff into open(), rather than spread across probe() and open(). Being
> in open(), you basically construct a different phylink configuration
> depending on if you win the election or not.
We are getting a bit off into the weeds here. There isn't any sort of
election. There is still firmware sitting on the shared bits, so the
PMD, the I2C to the QSFP, and the GPIOs from the QSFP are all
controlled via the firmware. To prevent any significant issues, for
now we treat the QSFP as read-only from the hosts, since we have to go
through the firmware to get access, and the PMD can only be configured
via a message to the FW asking for a specific bitrate/modulation and
number of lanes.
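A hedged sketch of the kind of request a host might send the firmware
for the shared PMD; the message layout, names, and validation rules
here are assumptions for illustration, not a real mailbox ABI:

```c
#include <assert.h>
#include <stdint.h>

enum pmd_modulation { PMD_NRZ = 0, PMD_PAM4 = 1 };

struct pmd_request {
	uint32_t gbps_per_lane;	/* e.g. 25 for NRZ, 50 for PAM4 */
	uint8_t  modulation;	/* enum pmd_modulation */
	uint8_t  lanes;		/* 1, 2 or 4 */
};

/* The firmware gets to reject requests the shared hardware (or the
 * attached cable) cannot satisfy; the hosts never poke the PMD
 * registers directly. */
static int fw_validate_pmd_request(const struct pmd_request *req)
{
	if (req->lanes != 1 && req->lanes != 2 && req->lanes != 4)
		return -1;
	if (req->modulation == PMD_NRZ && req->gbps_per_lane > 25)
		return -1;
	return 0;
}
```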
> Is one host in the position to control the complete media
> configuration? Could you split the QSFP into four, each host gets its
> own channel, and it gets to choose how to use that channel, different
> FEC schemes, bit rates?
So one thing to be aware of is that the QSFP can be electrically
separated, so it is one cable but with either 2 (QSFP+/QSFP28) or 4
(QSFP-DD) separate sets of lanes. The cable defines the limits of what
we can do in terms of modulation and number of lanes, but we don't
have to configure anything directly on it. That is handled through the
PCS/PMA/PMD side of things.