Message-ID: <CAKgT0Uf1R0BadAZe0ANMpS00AZB228e2-Am9LaZxzeSTCWS4aQ@mail.gmail.com>
Date: Sat, 5 Apr 2025 13:41:29 -0700
From: Alexander Duyck <alexander.duyck@...il.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: "Russell King (Oracle)" <linux@...linux.org.uk>, 
	Maxime Chevallier <maxime.chevallier@...tlin.com>, netdev@...r.kernel.org, 
	hkallweit1@...il.com, davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com
Subject: Re: [net PATCH 1/2] net: phy: Cleanup handling of recent changes to phy_lookup_setting

On Sat, Apr 5, 2025 at 7:51 AM Andrew Lunn <andrew@...n.ch> wrote:
>
> > > So for us, we have:
> > >
> > > MAC - PHY
> > > MAC - PCS - PHY
> > > MAC - PCS - SFP cage
> > > MAC - PCS - PHY - SFP cage
> >
> > Is this last one correct? I would have thought it would be MAC - PCS -
> > SFP cage - PHY. At least that is how I remember it being with some of
> > the igb setups I worked on back in the day.
>
> This PHY is acting as an MII converter. What comes out of the PCS
> cannot be directly connected to the SFP cage, it needs a
> translation. The Marvell 10G PHY can do this, you see this with some
> of the Marvell reference designs.
>
> There could also be a PHY inside the SFP cage, if the media is
> Base-T. Linux is not great at describing that situation, multiple PHYs
> for one link, but it is getting better at that, thanks to the work
> Bootlin is doing.
>
> >
> > > This is why I keep saying you are pushing the envelope. SoCs currently
> > > top out at 10GbaseX. There might be 4 lanes to implement that 10G, or
> > > 1 lane, but we don't care, they all get connected to a PHY, and BaseT
> > > comes out the other side.
> >
> > I know we are pushing the envelope. That was one of the complaints we
> > had when you insisted that we switch this over to phylink. If anything
> > 50G sounds like it will give 2500BaseX a run for its money in
> > terms of being even more confusing and complicated.
>
> Well, 2500BaseX itself is straightforward. It is the vendors that
> make it complex by having broken implementations.
>
> Does your 50G mode follow the standard?

From what I can tell the 50GbaseR portion of it follows the standard.
The LAUI stuff is another story. It looks like it mostly complies, but
I am having to blur some definitions: the IEEE version had no FEC,
while with ours we have the option of RS528 or BASER FEC, which more
closely matches up with 25GbaseR.
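
To give a concrete idea of the blurring, the mapping ends up looking
something like the below, reporting our FEC options through the FEC
bits ethtool already has (the xyz_* names are made up, this is just a
sketch of the idea, not the actual code):

/* Sketch: our LAUI-like modes can carry FEC that the IEEE LAUI
 * definition does not have, so report them using the existing
 * ethtool FEC bits.  All xyz_* names are placeholders.
 */
#include <linux/ethtool.h>

enum xyz_fec {
	XYZ_FEC_NONE,
	XYZ_FEC_RS528,	/* Clause 91 RS(528,514) */
	XYZ_FEC_BASER,	/* Clause 74, as on 25GbaseR */
};

static u32 xyz_fec_to_ethtool(enum xyz_fec fec)
{
	switch (fec) {
	case XYZ_FEC_RS528:
		return ETHTOOL_FEC_RS;
	case XYZ_FEC_BASER:
		return ETHTOOL_FEC_BASER;
	default:
		return ETHTOOL_FEC_OFF;
	}
}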

> SoC vendors tend to follow the standard, which is why there is so much
> code sharing possible. They often just purchase IP to implement the
> boring parts like the PCS, there is no magic sauce there, all the
> vendor differentiation is in the MAC, if they try to differentiate at
> all in networking.
>
> The current market is SoCs with 10G. Microchip does have a 25G link in
> its switches, which uses phylink. We might see more 25G, or we might
> see a jump to 40G.
>
> I know your register layout does not follow the standard, but I hope
> the registers themselves do. So I guess what will happen is when
> somebody else has a 40G PCS, maybe even the same licensed IP, they
> will write a translation layer on top of yours to make your registers
> standards compliant, and then reuse your driver. This assumes you are
> following the standard, plus/minus some integration quirks.
>
> If you have thrown the standard out the window, and nothing is going
> to be reusable then maybe you should hide it away in the MAC
> driver.

So the ugly bit for us is that there are no MII interfaces to the PCS
or PMA. It is all MMIO accesses to a register map, and a number of
signals were just routed to registers in another section of the part
for us to read from or write to.
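
Concretely, where a normal setup would go through an MDIO bus read,
for us it collapses into something like the below (names invented,
just to show the shape of it):

/* Sketch: the PCS/PMA "registers" are just offsets in one of our
 * memory BARs, so a clause-45 style access becomes a readl/writel.
 * All xyz_* names are placeholders.
 */
#include <linux/io.h>
#include <linux/types.h>

struct xyz_pcs {
	void __iomem *base;	/* mapped BAR region covering the PCS/PMA */
};

static u32 xyz_pcs_rd(struct xyz_pcs *pcs, u32 off)
{
	return readl(pcs->base + off);
}

static void xyz_pcs_wr(struct xyz_pcs *pcs, u32 off, u32 val)
{
	writel(val, pcs->base + off);
}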

> > If anything we most closely resemble the setup with just the SFP cage
> > and no PHY. So I suspect we will probably need that whole set in place
> > in order for things to function as expected.
>
> That is how we have seen new link modes added. Going from 2.5G to 5G
> to 10G is not that big, so the patchsets are reasonably small. But the
> jump from 10G to 40G is probably bigger.
>
> If you internally use fixed-link as a development crutch, that is not
> a problem. If however you want it in mainline, then we need to look at
> the big picture, does it fit with what fixed-link is meant to be?

It just impacts the order in which I do things. By going with a fixed
link I could add the phylink functionality to the driver as I went. I
can go the other way around; it just means I can't test the
functionality as I add it. Instead it will be a matter of adding all
the code and then suddenly it all just works. At this point I have it
mostly working, aside from the few items I have already pointed out,
so I can probably just re-order things to push the functionality
changes first and then enable the driver to use them, bypassing the
fixed-link step.

> What is also going to make things complex is the BMC. SoCs and
> switches don't have BMCs, Linux via phylink and all the other pieces
> of the puzzle are in complete control of driving the hardware. We
> don't have a good story for when Linux is only partially in control of
> the hardware, because the BMC is also controlling some of it.

Fortunately the BMC isn't much of an issue, as I think I figured out
the one problem I had on Thursday. One of the first things we did was
establish a lockout/tagout procedure for the link and TCAM
configuration. Essentially the FW/BMC is in control when the driver
isn't loaded. When we call open we send a message to the FW indicating
we are locking it out and taking ownership. From that point it
shouldn't modify anything unless we ask it to, or unless we fail to
send a heartbeat message for 2 minutes.
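
The driver side of that handshake is roughly the following (again all
xyz_* names are placeholders standing in for the actual mailbox
messages; this is just the shape of it):

/* Sketch: take ownership from the FW/BMC on open and keep a heartbeat
 * going well inside its 2 minute timeout.
 */
#include <linux/netdevice.h>
#include <linux/workqueue.h>

#define XYZ_HB_INTERVAL	(30 * HZ)	/* well under the 2 minute FW timeout */

struct xyz_dev {
	struct delayed_work hb_work;
	/* ... */
};

void xyz_fw_take_ownership(struct xyz_dev *xd);	/* hypothetical mailbox call */
void xyz_fw_send_heartbeat(struct xyz_dev *xd);	/* hypothetical mailbox call */

static void xyz_heartbeat_work(struct work_struct *work)
{
	struct xyz_dev *xd = container_of(to_delayed_work(work),
					  struct xyz_dev, hb_work);

	xyz_fw_send_heartbeat(xd);
	schedule_delayed_work(&xd->hb_work, XYZ_HB_INTERVAL);
}

static int xyz_open(struct net_device *netdev)
{
	struct xyz_dev *xd = netdev_priv(netdev);

	xyz_fw_take_ownership(xd);	/* FW/BMC is locked out from here on */
	schedule_delayed_work(&xd->hb_work, XYZ_HB_INTERVAL);
	return 0;
}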

If anything, we were the problem child, in that the code as it is
currently written defaults to taking down the link and re-configuring
everything on driver load. This was causing a bunch of heartburn
because it caused the BMC to lose link for a few seconds. However, as
of Thursday I realized we can essentially just use our pcs_get_state
call at the start of our configure routine to identify whether we
actually need to reconfigure things or the link is already up with
the configuration we want. With that change, the only thing that
causes any link issues is the initial phylink_link_down in
phylink_resume. That is much less significant, as it doesn't actually
trigger any link down events on the FW, and the time the link is down
is only a fraction of a second versus the several seconds it takes
for a PCS reset and for the PMA to complete link training.
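
The check itself ends up being something along these lines (made-up
names again, just a sketch of the idea, not the actual driver code):

/* Sketch: reuse the readback code behind our .pcs_get_state to decide
 * whether the configure path can be a no-op, so we don't bounce a
 * link the BMC is already using.  All xyz_* names are placeholders.
 */
#include <linux/phylink.h>

struct xyz_pcs;
void xyz_pcs_read_state(struct xyz_pcs *pcs,
			struct phylink_link_state *state);
int xyz_pcs_reset_and_train(struct xyz_pcs *pcs, phy_interface_t want);

static int xyz_pcs_configure(struct xyz_pcs *pcs, phy_interface_t want)
{
	struct phylink_link_state state = { .interface = want };

	xyz_pcs_read_state(pcs, &state);	/* same code backing .pcs_get_state */

	if (state.link && state.interface == want)
		return 0;	/* already up as requested, leave it alone */

	/* Full reconfigure: PCS reset plus PMA link training, which is
	 * what takes several seconds and drops the BMC's link.
	 */
	return xyz_pcs_reset_and_train(pcs, want);
}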
