lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251008111059.wxf3jgialy36qc6m@skbuf>
Date: Wed, 8 Oct 2025 14:10:59 +0300
From: Vladimir Oltean <vladimir.oltean@....com>
To: Alexander Wilhelm <alexander.wilhelm@...termo.com>
Cc: "Russell King (Oracle)" <linux@...linux.org.uk>,
	Andrew Lunn <andrew@...n.ch>,
	Heiner Kallweit <hkallweit1@...il.com>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: Aquantia PHY in OCSGMII mode?

On Wed, Oct 08, 2025 at 09:47:28AM +0200, Alexander Wilhelm wrote:
> On Tue, Oct 07, 2025 at 05:08:19PM +0300, Vladimir Oltean wrote:
> > Hi Alexander,
> [...]
> > Sorry for the delay. What you have found are undoubtebly two major bugs,
> > causing the Lynx PCS to operate in undefined behaviour territory.
> > Nonetheless, while your finding has helped me discover many previously
> > unknown facts about the hardware IP, I still cannot replicate exactly
> > your reported behaviour. In order to fully process things, I would like
> > to ask a few more clarification questions.
> 
> Sure.
> 
> > Is your U-Boot implementation based on NXP's dtsec_configure_serdes()?
> > https://github.com/u-boot/u-boot/blob/master/drivers/net/fm/eth.c#L57
> 
> Unfortunately, I am working with an older U-Boot version v2016.07. However,
> the bug I fixed was not part of the official U-Boot codebase, it was
> introduced by our team:
> 
>     value = PHY_SGMII_IF_MODE_SGMII;
>     value |= PHY_SGMII_IF_MODE_AN;
> 
> I added the missing `if` condition as follows:
> 
>     if (!sgmii_2500) {
>         value = PHY_SGMII_IF_MODE_SGMII;
>         value |= PHY_SGMII_IF_MODE_AN;
>     }
> 
> With the official U-Boot codebase I don't have a ping at none of the
> speeds:
> 
>     value = PHY_SGMII_IF_MODE_SGMII;
>     if (!sgmii_2500)
>         value |= PHY_SGMII_IF_MODE_AN;
> 
> > Why would U-Boot set IF_MODE_SGMII_EN | IF_MODE_USE_SGMII_AN only when
> > the AQR115 resolves only to 100M, but not in the other cases (which do
> > not have this problem)? Or does it do it irrespective of resolved media
> > side link speed? Simply put: what did the code that you fixed up look like?
> 
> In our implementation, the SGMII flags were always set in U-Boot,
> regardless of the negotiated link speed. My assumption is that the SGMII
> mode configuration results in a behavior where only a 100M link applies the
> 10x symbol replication, while 1G does not. For a 2.5G link, the behavior
> ends up being the same as 1G, since there is no actual SGMII mode for 2.5G.

Yes, this assumption seems to hold water thus far, but I have to
validate it by seeing the debugging print for 1G/2.5G, once we figure
out the debug printing aspect.

> > With the U-Boot fix reverted, could you please replicate the broken
> > setup with AQR115 linking at 100Mbps, and add the following function in
> > Linux drivers/pcs-lynx.c?
> > 
> > static void lynx_pcs_debug(struct mdio_device *pcs)
> > {
> > 	int bmsr = mdiodev_read(pcs, MII_BMSR);
> > 	int bmcr = mdiodev_read(pcs, MII_BMCR);
> > 	int adv = mdiodev_read(pcs, MII_ADVERTISE);
> > 	int lpa = mdiodev_read(pcs, MII_LPA);
> > 	int if_mode = mdiodev_read(pcs, IF_MODE);
> > 
> > 	dev_info(&pcs->dev, "BMSR 0x%x, BMCR 0x%x, ADV 0x%x, LPA 0x%x, IF_MODE 0x%x\n", bmsr, bmcr, adv, lpa, if_mode);
> > }
> > 
> > and call it from:
> > 
> > static void lynx_pcs_get_state(struct phylink_pcs *pcs, unsigned int neg_mode,
> > 			       struct phylink_link_state *state)
> > {
> > 	struct lynx_pcs *lynx = phylink_pcs_to_lynx(pcs);
> > 
> > 	lynx_pcs_debug(lynx->mdio); // <- here
> > 
> > 	switch (state->interface) {
> > 	...
> > 
> > With this, I would like to know:
> > (a) what is the IF_MODE register content outside of the IF_MODE_SGMII_EN
> >     and IF_MODE_USE_SGMII_AN bits.
> > (b) what is the SGMII code word advertised by the AQR115 in OCSGMII mode.
> > 
> > Then if you could replicate this test for 1Gbps medium link speed, it
> > would be great.
> 
> For now, I have reverted both the U-Boot and kernel fixes and added debug
> outputs for further analysis. Unfortunately the function
> `lynx_pcs_get_state` is never called in my kernel code. Therefore I put the
> debug function into `lynx_pcs_config`. Here is the output:
> 
>     mdio_bus 0x0000000ffe4e5000:00: BMSR 0x29, BMCR 0x1140, ADV 0x4001, LPA 0xdc01, IF_MODE 0x3
> 
> I hope it'll help to analyze the problem further.

Correct. lynx_pcs_get_state() is only called for MLO_AN_INBAND (managed = "in-band-status"),
which the Lynx PCS driver does not currently support for 2500base-x.

However, I don't fully trust the positioning of the debug print into lynx_pcs_config().
The BMCR, ADV and IF_MODE registers look plausible, as if lynx_pcs_config() did what it
was supposed to do, but LPA (link config code word coming from AQR115) looks strange.
Field 11:10 (COP_SPD) is 0b11, which is a reserved value, neither 1G nor 100M nor 10M.
Maybe this is the mythical "SGMII 2500" auto-negotiation? Anyway, I don't think there is
any standard for it, and even if there was, the Lynx PCS doesn't implement it.

I'm surprised your AQR115 would transmit in-band code words for OCSGMII. None of the Aquantia
PHYs I've tested on were able to do that, and I'm not sure what register controls that.
If we look at your previous debugging output of the global system configuration registers:
https://lore.kernel.org/lkml/aJH8n0zheqB8tWzb@FUE-ALEWI-WINX/
we see that for 100M line side, the PHY uses "SerDes mode 4 autoneg 0". I also tried modifying
aqr_gen2_config_inband() to set VEND1_GLOBAL_CFG_AUTONEG_ENA for OCSGMII, but it didn't appear
to change anything, so that's probably not the setting. I'll have to ask somebody at Marvell.

In any case, contrary to my previous beliefs and according to your finding plus my parallel
testing, the Lynx PCS actually supports in-band auto-negotiation at 2500 data rate - both
2500base-x auto-negotiation and SGMII auto-negotiation (to the extent that this is a thing
that actually makes sense - it doesn't).

With IF_MODE=3 (SGMII_EN | USE_SGMII_AN), the PCS will automatically reconfigure the data path
for the speed decoded in hardware from the LPA_SGMII_SPD_MASK bits. Apparently it does this
for the lane data rate of 2500 just the same as it does it for 1000, just that the
LPA_SGMII_SPD_MASK bits need to be 0b00 (gigabit) for traffic to pass. Otherwise, it tries to
perform symbol replication (as per your hw engineer's claim), and that didn't work in my testing
either(* details at the end).

I'm not saying that IF_MODE=3 is a valid configuration when the lane data rate is 2500.
It absolutely isn't, and your patch which changes IF_MODE to 0 seems ok. I'm just trying to
understand and then re-explain what the PCS does when configured in this mode, based on the
evidence.

Specifically when LPA_SGMII_SPD_MASK/COP_SPD is 0b11, it isn't documented how it would behave,
I don't have a protocol analyzer to count replicated symbols, and I'm unable to obtain a
functional data path to measure bandwidth with iperf3. Your hardware engineer's claim remains
the most trustworthy source of information we have.

Regarding lynx_pcs_get_state(): I actually was working on the patch attached, which I had in my
tree and I didn't realize it would impact your testing. I would kindly ask you to apply as well.
Applying it alone would be enough to fix the IF_MODE=3 problem, but fixing the problem is not
what we want, instead we want to see the MII_LPA register value at lynx_pcs_get_state() time,
and for multiple link speeds.

For that, please break the link again, by making the following changes on top:

1. Configure IF_MODE=3 (SGMII autoneg format) for 2500base-x:

diff --git a/drivers/net/pcs/pcs-lynx.c b/drivers/net/pcs/pcs-lynx.c
index a88cbe67cc9d..ea42b8d813f3 100644
--- a/drivers/net/pcs/pcs-lynx.c
+++ b/drivers/net/pcs/pcs-lynx.c
@@ -152,11 +152,10 @@ static int lynx_pcs_config_giga(struct mdio_device *pcs,
 		mdiodev_write(pcs, LINK_TIMER_HI, link_timer >> 16);
 	}

-	if (interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    interface == PHY_INTERFACE_MODE_2500BASEX) {
+	if (interface == PHY_INTERFACE_MODE_1000BASEX) {
 		if_mode = 0;
 	} else {
-		/* SGMII and QSGMII */
+		/* SGMII, QSGMII and (incorrectly) 2500base-x */
 		if_mode = IF_MODE_SGMII_EN;
 		if (neg_mode == PHYLINK_PCS_NEG_INBAND_ENABLED)
 			if_mode |= IF_MODE_USE_SGMII_AN;

2. Edit your MAC OF node in the device tree and add:

&mac {
	managed = "in-band-status";  // this
	phy-mode = "2500base-x";
};

This will reliably cause the same behaviour as before, but with no dependency on U-Boot.

Because aqr_gen2_inband_caps() (code added recently to net-next) returns 0 (unknown)
for 2500base-x, phylink doesn't know whether the PHY will send or not in-band code words.
It will have to trust the firmware description with managed = "in-band-status", which will
lead neg_mode to be PHYLINK_PCS_NEG_INBAND_ENABLED, which will cause IF_MODE_USE_SGMII_AN
to be set.

Note: this is something else we'll have to look at later too. What bits control "OCSGMII"
link codeword transmission in Aquantia PHYs?

Your earlier assumption about why 100M is broken but 1G / 2.5G are not
only holds water if, as a result of testing at these other link speeds,
you find the MII_LPA register to contain 0b10 (Gigabit) in bits 11:10
(COP_SPD / LPA_SGMII_SPD_MASK). Aka it worked, but it was purely accidental,
because phylink thought 2500base-x would not use autoneg, yet it did,
and by some miracle the SGMII format coincided and resulted in no change
in the link characteristics.

Regarding my patch vs yours, my thoughts on this topic are: the bug is
old, the PCS driver never worked if the registers were not as expected
(this is not a regression), and your patch is incomplete if MII_BMCR
also contains significant differences. I would recommend submitting
mine, as a new feature to net-next, when it reopens for patches for 6.19.
I've credited you with Co-developed-by due to the significance of your
findings. Thanks.

(*) When testing forced 10x SGMII symbol replication on a 3.125 Gbps
lane over a pair of optical SFP+ modules connected between two LS1028A Lynx PCS
blocks, I can see packets being transmitted, but on the receiver, the
RFRG mEMAC counter increases (rx_fragments). The documentation says for this:
"Incremented for each packet which is shorter than 64 octets received with
a wrong CRC. (Fragments are not delivered to the FIFO interface.)"
Without dedicated equipment, I'm unsure how to push this further.

View attachment "0001-net-pcs-lynx-accept-in-band-autoneg-for-2500base-x.patch" of type "text/x-diff" (4691 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ