netdev - Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap

Message-ID: <CAKgT0Ufm1T59r4Zn48_8gkOi=g0oqH5fvP+Gtxu0Wn9D5jNdaw@mail.gmail.com>
Date: Thu, 24 Apr 2025 16:40:33 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org, linux@...linux.org.uk, 
	hkallweit1@...il.com, davem@...emloft.net, pabeni@...hat.com
Subject: Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap

On Thu, Apr 24, 2025 at 1:34 PM Andrew Lunn <andrew@...n.ch> wrote:
>
> Sorry for the delay, busy with $DAY_JOB

No problem. I have a number of my own issues I am dealing with here in
terms of code cleanup stuff anyway.

> > > > > There are not 4 host MACs connected to a 5 port switch. Rather, each
> > > > > host gets its own subset of queues, DMA engines etc, for one shared
> > > > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> > > > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> > > > > interface.
> > > >
> > > > Yeah, that is the setup so far. Basically we are using one QSFP cable
> > > > and slicing it up. So instead of having a 100CR4 connection we might
> > > > have 2x50CR2 operating on the same cable, or 4x25CR.
> > >
> > > But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs?
> >
> > Yes. Some confusion here may be that our hardware always has 4
> > MAC/PCS/PMA setups, one for each host. Depending on the NIC
> > configuration we may have either 4 hosts or 2 hosts present with 2
> > disabled.
>
> So with 2 hosts, each host has two netdevs? If you were to dedicate
> the whole card to one host, you would have 4 netdevs? It is upto
> whatever is above to perform load balancing over those?

Just to be clear, when I say "host" in this case I am referring to a
system running Linux, not "host" in the CMIS sense, which I think
refers to the NIC/FW.

Anyway we have 2 scenarios. Our main use case is to route one x4 to
each host. So in that case we only have one PCIe connection and it
will only show up as one netdev. In our manufacturing test case we
have a riser and use PCIe bifurcation to split up a x16 to 4 x4 and
all 4 endpoints can show up on the one host.

> If you always have 4 MAC/PCS, then the PCS is only ever used with a
> single lane? The MAC does not support 100000baseKR4 for example, but
> 250000baseKR1?

In our 2 host setup we are normally running a QSFP28 or QSFP+ cable
that has 4 lanes. So we effectively are cutting the cable in half to
provide 2 lanes to each host. This allows us to support either
50GbaseCR2 or 100GbaseCR2 as the upper limit for the cable, depending
on whether it is running NRZ or PAM4 modulation. In these setups things are
fairly rigid as we can only select to use 1 or 2 lanes, no selection
for modulation due to the nature of the cable spec.

In our 4 host setup we are configured to connect a QSFP-DD cable. With
that the cable has 8 lanes, with each host getting 2 of those and
seeing the same limitations as the 2 host setup mentioned earlier.
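
As a rough illustration of the lane arithmetic above, here is a small
sketch. The per-lane rates (25G for NRZ, 50G for PAM4) are an assumption
inferred from the 2-lane limits quoted above, and the function and enum
names are made up for the example.

```c
#include <assert.h>

/* Assumed per-lane rates in Gb/s: 25G for NRZ lanes, 50G for PAM4 lanes. */
enum lane_mod { LANE_NRZ = 25, LANE_PAM4 = 50 };

/* Lanes available to each host when the cable is split evenly. */
static int lanes_per_host(int cable_lanes, int hosts)
{
	assert(hosts > 0 && cable_lanes % hosts == 0);
	return cable_lanes / hosts;
}

/* Upper link speed per host in Gb/s for a lane count and modulation. */
static int host_speed_gbps(int lanes, enum lane_mod mod)
{
	return lanes * (int)mod;
}
```

With these assumptions a 4-lane cable split across 2 hosts and an 8-lane
cable split across 4 hosts both come out at 2 lanes per host, which gives
50G with NRZ and 100G with PAM4.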

In theory we could do something like you call out in your example, but
we haven't configured a board combination for that yet. Basically it
would require a specific board and EEPROM combination to route the
lanes so that we had one lane per host instead of the 2 in the
current configs.

> > The general idea is that we have to cache the page and bank in the
> > driver and pass those as arguments to the firmware when we perform a
> > read. Basically it will take a lock on the I2C, set the page and bank,
> > perform the read, and then release the lock. With that all 4 hosts can
> > read the I2C from the QSFP without causing any side effects.
>
> I assume your hardware team have not actually implemented I2C, they
> have licensed it. Hence there is probably already a driver for it in
> drivers/i2c/busses, maybe one of the i2c-designware-? However, you are
> not going to use it, you are going to reinvent the wheel so you can
> parse the transactions going over it, look for reads and writes to
> address 127? Humm, i suppose you could have a virtual I2C driver doing
> this stacked on top of the real I2C driver. Is this something other
> network drivers are going to need? Should it be somewhere in
> drivers/net/phy? The hard bit is how you do the mutex in an agnostic
> way. But it looks like hardware spinlocks would work:
> https://docs.kernel.org/locking/hwspinlock.html

Part of the issue would be who owns the I2C driver. Both the firmware
and the host would need access to it. Rather than having to do a
handoff for that it is easier to have the firmware maintain the driver
and just process the requests for us via mailbox IPC calls.

One other point of contention is that we don't have a central firmware
managing things. We have one instance of the firmware running per
host. The 4 firmware instances compete with each other for access to
the QSFP, so they maintain their own mutex to determine who has master
access to the I2C bus for the QSFP, and each has its own I2C device
connected to that bus.
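
To make the sequence concrete (take the lock, set page and bank, read,
release), here is a minimal userspace sketch. It is a toy model, not the
actual firmware interface: struct sim_qsfp, struct sim_read_req and
sim_fw_read() are hypothetical names, and the "lock" is just an owner
field standing in for the mutex the firmware instances share.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGES    4
#define SIM_PAGE_LEN 128

/* Toy stand-in for the shared QSFP: one select state, one bus owner. */
struct sim_qsfp {
	int owner;              /* -1 when the bus is free, else instance id */
	uint8_t bank, page;     /* current select registers on the module */
	uint8_t mem[SIM_PAGES][SIM_PAGE_LEN];
};

/* Hypothetical request: the driver passes its cached page and bank. */
struct sim_read_req {
	uint8_t bank, page, offset, len;
};

/* Returns 0 on success, -1 if the bus is busy, -2 on a bad request. */
static int sim_fw_read(struct sim_qsfp *q, int fw_id,
		       const struct sim_read_req *req, uint8_t *buf)
{
	assert(q && req && buf);

	if (req->page >= SIM_PAGES || req->offset + req->len > SIM_PAGE_LEN)
		return -2;
	if (q->owner != -1)
		return -1;      /* another instance holds the bus */

	q->owner = fw_id;       /* take the lock */
	q->bank = req->bank;    /* set bank and page while holding it */
	q->page = req->page;
	memcpy(buf, &q->mem[q->page][req->offset], req->len);
	q->owner = -1;          /* release the lock */
	return 0;
}
```

Because every request carries its own page and bank, a read from one
instance does not depend on whatever page another instance selected
last.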

> And actually, it is more complex than caching the page.

Yeah, that was generally my thought on it, and that is if I even need
to do that. From what I have seen most of the QSFP28/+ direct attach
cables we are working with are very simplistic. Seems like they only
had page 0. It wasn't until I started getting into the QSFP-DD stuff
for the 4 host NIC that I started running into the need for multi page
support. So for example the hwmon sensors don't do much for us as
direct attach cables don't really bother implementing them. A call to
"ethtool -m" on one of our systems usually yields 0.00 degrees C and
0.000 volts.

Also the ethtool API already had get_module_eeprom_by_page which is a
very good fit for our model since it allowed for atomic access based
on the page and bank number.

>   This specification defines functions in Pages 00h-02h. Pages 03-7Fh
>   are reserved for future use. Writing the value of a non-supported
>   Page shall not be accepted by the transceiver. The Page Select byte
>   shall revert to 0 and read / write operations shall be to the
>   unpaged A2h memory map.
>
> So i expect the SFP driver to do a write followed by a read to know if
> it needs to return EOPNOTSUPP to user space because the SFP does not
> implement the page.

I guess it could do EOPNOTSUPP too, we had used EADDRNOTAVAIL to
indicate that case. This is one of the reasons why our firmware API
requires the bank and page be passed in the message to perform a QSFP
I2C read. It is able to verify it on its end and if it isn't supported
it returns an error.
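
A rough sketch of that check, written as plain userspace C. struct
page_req mirrors the fields of the kernel's struct ethtool_module_eeprom
(the argument to get_module_eeprom_by_page); struct sim_module and
sim_read_by_page() are hypothetical, and the real validation happens in
our firmware rather than in code like this.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SIM_MAX_PAGES 4
#define SIM_PAGE_LEN  128

/* Userspace mirror of the fields in struct ethtool_module_eeprom. */
struct page_req {
	uint32_t offset;
	uint32_t length;
	uint8_t page;
	uint8_t bank;
	uint8_t i2c_address;
	uint8_t *data;
};

/* Hypothetical module model: num_pages implemented pages, bank 0 only. */
struct sim_module {
	uint8_t num_pages;
	uint8_t mem[SIM_MAX_PAGES][SIM_PAGE_LEN];
};

/* Returns bytes read, or a negative errno if the request is rejected. */
static int sim_read_by_page(const struct sim_module *m,
			    const struct page_req *r)
{
	assert(m && r && r->data && m->num_pages <= SIM_MAX_PAGES);

	/* Page or bank not implemented by this module. */
	if (r->page >= m->num_pages || r->bank != 0)
		return -EADDRNOTAVAIL;
	/* Request runs past the end of the page. */
	if (r->offset > SIM_PAGE_LEN || r->length > SIM_PAGE_LEN - r->offset)
		return -EINVAL;

	memcpy(r->data, &m->mem[r->page][r->offset], r->length);
	return (int)r->length;
}
```

Swapping -EADDRNOTAVAIL for -EOPNOTSUPP would be a one-line change in
this model if that is the preferred error for an unimplemented page.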