lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8b5266ce-76c1-4618-a6c4-54e1df677490@intel.com>
Date: Mon, 20 May 2024 17:02:47 -0700
From: Jacob Keller <jacob.e.keller@...el.com>
To: <kernel.org-fo5k2w@...arbi.fr>, Jeff Daly <jeffd@...icom-usa.com>
CC: <anthony.l.nguyen@...el.com>, <intel-wired-lan@...ts.osuosl.org>,
	<jesse.brandeburg@...el.com>, <netdev@...r.kernel.org>,
	<regressions@...mhuis.info>
Subject: Re: Non-functional ixgbe driver between Intel X553 chipset and Cisco
 switch via kernel >=6.1 under Debian



On 5/9/2024 6:26 AM, kernel.org-fo5k2w@...arbi.fr wrote:
> Hi,
> 
>> No link detected, but it does detect this is a 10GBaseT cable.
>> Interesting it doesn't report FEC or autonegotiation. Hmm.
> 
> In fact, I personally find it strange that the "Supported link modes" is "10000baseT/Full". A DAC is not a SFP+ 8P8C (RJ45) module. Wouldn't it be more logical if the modes reported were the same as those obtained by an "ethtool eth2" on the Connectx-3 side? :
> 
> Settings for eth2:
> 	Supported ports: [ FIBRE ]
> 	Supported link modes:   10000baseKX4/Full
> 	                        1000baseX/Full
> 	                        10000baseCR/Full
> 	                        10000baseSR/Full
> 	Supported pause frame use: Symmetric Receive-only
> 	Supports auto-negotiation: No
> 	Supported FEC modes: Not reported
> 	Advertised link modes:  10000baseKX4/Full
> 	                        1000baseX/Full
> 	                        10000baseCR/Full
> 	                        10000baseSR/Full
> 	Advertised pause frame use: Symmetric
> 	Advertised auto-negotiation: No
> 	Advertised FEC modes: Not reported
> 	Speed: 10000Mb/s
> 	Duplex: Full
> 	Auto-negotiation: off
> 	Port: Direct Attach Copper
> 	PHYAD: 0
> 	Transceiver: internal
> 	Supports Wake-on: d
> 	Wake-on: d
>         Current message level: 0x00000014 (20)
>                                link ifdown
> 	Link detected: yes
> 
> 
> In other words, isn't the fact that the reported mode is "10000baseT/Full" a bug in itself?
>  

Possibly, though I am not familiar enough to know for sure.

>> Knowing the kernel is the important part, we don't have specific
>> versioning of drivers in the kernel anymore.
> 
> Ok. I take note of this information.
> 
>> The steps would require that you build the kernel manually. I can
>> outline the steps i would take here
>>
>> 1. get the kernel source from git.kernel.org. I place it in $HOME/git/linux
>> 2. switch to v5.10 with 'git switch --detach v5.10'
>> 2. copy the debian 5.10 config file to $HOME/git/linux/.config
>> 3. build kernel with 'make -j24' (adjust -j depending on how much CPU
>> you want to spend building the kernel)
>> 4. install with 'sudo make -j24 modules_install && sudo make install'
>> 5. reboot and select the v5.10 kernel, double check it works.
>> 6. in $HOME/git/linux run 'git bisect start' to initiate the bisect session.
>> 7. First, label the current v5.10 commit as good with 'git bisect good'
>> 8. Second, label the v6.1 commit as bad with 'git bisect bad v6.1'
>>
>> This will initiate a bisect session and will checkout the kernel
>> approximately halfway between v5.10 and v6.1. For each bisection point
>> it checks, run the following steps:
>>
>> 1. 'make olddefconfig' to update the configuration for this version
>> 2. 'make -j24' to rebuild with the current version
>> 3. 'sudo make -j24 modules_install && sudo make install' to install this
>> version.
>> 4. reboot into that version and check its behavior.
>> 5. If it works properly then run 'git bisect good'
>> 6. If it works incorrectly, then run 'git bisect bad'
>>
>> A new commit will be selected. It will pick one between the latest good
>> point and the closest bad point, essentially honing in towards the
>> incorrect behavior.
>>
>> If for any reason a commit can't be built or tested, you can use "git
>> bisect skip" and it will skip around a bit to find another point that
>> can be tried.
> 
> Thank you for your and Thorsten Leemhuis's advice. I don't know whether the following Bisect log will be of any help to you. However, I have determined precisely that the problem was introduced with version 6.1. If I boot into 6.0, it works perfectly. So there are fewer differences to search for the problem. Here's the feedback from Bisect, but I'm still dubious about the relevance of this log because the “git bisect bad v6.1” command returned "7614896350aa20764c5eca527262d9eb0a57da63 était à la fois good et bad"... I didn't really understand how it all worked... :
> 
> git bisect start
> # good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
> git bisect good 45eb8ae5370d5df1ee8236f45df3f29103ba6e12
> # bad: [830b3c68c1fb1e9176028d02ef86f3cf76aa2476] Linux 6.1
> git bisect bad 7614896350aa20764c5eca527262d9eb0a57da63
> 
> I should point out that I had to switch back to Debian 11 because 12 and 13 refuse to compile these old kernels... Anyway, I compiled the versions successively and came across the difference in operation between 6.0 and 6.1.
> 

>From this point, it would checkout a commit and then you would build and
verify if it works, and then run "git bisect good" if the test passed,
and "git bisect bad" if the test failed, and it would then pick a new
test candidate. It uses a bisection process (similar to the name) to
reduce the range by ~half each time to quickly hone in on the
appropriate bad commit.

>> I suspect those changes must have broken the Cisco switch link behavior.
>> I unfortunately do not know enough about this hardware or the SFI
>> configuration to understand why this causes it.
>>
>> If you don't want to try bisect, I would suggest trying to revert that
>> commit or simply replace the ixgbe_setup_sfi_x550a function with the one
>> from out-of-tree here. If you do that, you can rebuild just ixgbe with
>> "make M=drivers/net/ethernet/intel/ixgbe" and then insert the module
>> with "insmod drivers/net/ethernet/intel/ixgbe/ixgbe.ko".
>>
>> It seems likely that this change had unintended side effect which broke
>> the Cisco switch linking.
> 
> 
> If I do a "git revert 565736048bd5f9888990569993c6b6bfdf6dcb6d" to go back before the state of the suspected problem commit, compile kernel 6.1 and boot on it, it works perfectly.
> So it turns out that this is the source of the malfunction and was introduced with Linux 6.1.
> 

That is sufficient for me. I did some further digging here and it looks
like this was originally some workaround for a specific switch. Per the
commit message, this was submitted to our driver by Jeff Daly from
Silicom. I suspect that the fix happens to resolve issues on a
particular switch. Clearly breaks other switches and in a pretty bad way.

>From what I can understand, the Cisco switch you are using sees the AN37
and basically decides the port is confused and gives up and won't
attempt to link again without a power cycle.

>  
>> I've added Jeff Daly, in the hopes that he could provide more details on
>> the change.
>>
>> @Jeff, it seems likely that the change you made at 565736048bd5 ("ixgbe:
>> Manual AN-37 for troublesome link partners for X550 SFI") is breaking
>> some other switches. It would help if you could shed some light on this
>> change as otherwise we might need to revert it and once again break the
>> setup you fixed.
>>
>> Thanks,
>> Jake
> 
> Let me know if you need more information. I'll be happy to help!
> 

I believe the correct thing to do here is to revert this change. It
helps some cases, but clearly broke others. Jeff (or anyone from
silicom) has not responded with clarifications. We (Intel) do not have
this change in the related out-of-tree driver. Although this did get
tested by us, I suspect we simply did not test it with the correct range
of devices and switches.

I don't know enough about the standards to know which switch is at fault
or behaving incorrectly here, but I am inclined to revert this fix
because it breaks, and the original authors of the fix aren't commenting.

> Best regards.
> 
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁ Yohan Charbi
> ⢿⡄⠘⠷⠚⠋⠀ Cordialement
> ⠈⠳⣄⠀

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ