Message-ID:
 <MN0PR12MB59539CF99653EC1F937591AAB7B42@MN0PR12MB5953.namprd12.prod.outlook.com>
Date: Wed, 9 Apr 2025 11:14:28 +0000
From: "Pandey, Radhey Shyam" <radhey.shyam.pandey@....com>
To: Álvaro G. M. <alvaro.gamez@...ent.com>, Jakub Kicinski
	<kuba@...nel.org>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "Katakam, Harini"
	<harini.katakam@....com>, "Gupta, Suraj" <Suraj.Gupta2@....com>
Subject: RE: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on
 MicroBlaze: Packets only received after some buffer is full

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Álvaro G. M. <alvaro.gamez@...ent.com>
> Sent: Wednesday, April 9, 2025 4:31 PM
> To: Pandey, Radhey Shyam <radhey.shyam.pandey@....com>; Jakub Kicinski
> <kuba@...nel.org>
> Cc: netdev@...r.kernel.org; Katakam, Harini <harini.katakam@....com>; Gupta,
> Suraj <Suraj.Gupta2@....com>
> Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> Packets only received after some buffer is full
>
> On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> > [...]
> >  + Going through the details and will get back to you. Just to
> > confirm: there is no Vivado design update, and we are only updating the
> > Linux kernel to the latest?
> >
>
> Hi again,
>
> I've reconsidered the upgrading approach and I've first upgraded buildroot and kept
> the same kernel version (4.4.43). This has the effect of upgrading gcc from version
> 10 to version 13.
>
> With buildroot's compiled gcc-13 and keeping this same old kernel, the effect is that
> the system drops ARP requests. Compiling with older gcc-10, ARP requests are

When the system drops the ARP packet, is it dropped by the MAC hardware or by the
software layer? Reading the MAC stats and the DMA descriptors would tell us whether
the packet reaches the software layer.
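
As a rough first check from user space, the generic netdev counters in sysfs can be sampled before and after an ARP ping. This is only a sketch: the helper names below are hypothetical, the interface name (for example "eth0") is an assumption, and these are kernel software counters, not the MAC hardware registers or DMA descriptors themselves.

```c
#include <assert.h>
#include <stdio.h>

/* Read one unsigned decimal counter from a text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_counter_file(const char *path, unsigned long long *out)
{
	FILE *f;
	int ok;

	assert(path && out);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Read /sys/class/net/<ifname>/statistics/<counter>,
 * e.g. counter = "rx_packets" or "rx_dropped". */
static int read_netdev_counter(const char *ifname, const char *counter,
			       unsigned long long *out)
{
	char path[256];
	int n;

	assert(ifname && counter);
	n = snprintf(path, sizeof(path), "/sys/class/net/%s/statistics/%s",
		     ifname, counter);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return read_counter_file(path, out);
}
```

If rx_packets increases for each ARP request while no reply is sent, the frame is reaching the driver and the drop is happening further up in software.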

> replied to. Keeping old buildroot version but asking it to use gcc-11 will cause the
> same issue with kernel 4.4.43, so something must have happened in between those
> gcc versions.
>
> So this does not look like an axienet driver problem, which I first thought it was,
> because who would blame the compiler in the first instance?
>
> But then things started to get even stranger.
>
> What I did next was slowly upgrade buildroot, using the kernel version that
> buildroot considered "latest" at the point it was released. I reached a point in which
> the ARP requests were being dropped again. This happened on buildroot 2021.11,
> which still used gcc-10 as the default and kernel version 5.15.6. So some gcc bug
> that is getting triggered on gcc-11 in kernel 4.4.43 is also triggered on gcc-10 by
> kernel 5.15.6.
>
> Using gcc-10, I bisected the kernel and found that this commit was triggering
> whatever it is that is happening, around 5.11-rc2:
>
> commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD)
> Author: Menglong Dong <dong.menglong@....com.cn>
> Date:   Mon Jan 11 02:42:21 2021 -0800
>
>     net: core: use eth_type_vlan in __netif_receive_skb_core
>
>     Replace the check for ETH_P_8021Q and ETH_P_8021AD in
>     __netif_receive_skb_core with eth_type_vlan.
>
>     Signed-off-by: Menglong Dong <dong.menglong@....com.cn>
>     Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@....com.cn
>     Signed-off-by: Jakub Kicinski <kuba@...nel.org>
>
>
> I've been staring at the diff for hours because I can't understand what can be wrong
> about this:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index e4d77c8abe76..267c4a8daa55 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
>         skb_reset_mac_len(skb);
>     }
>
> -   if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> -       skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> +   if (eth_type_vlan(skb->protocol)) {
>         skb = skb_vlan_untag(skb);
>         if (unlikely(!skb))
>             goto out;
> @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
>              * find vlan device.
>              */
>             skb->pkt_type = PACKET_OTHERHOST;
> -       } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> -              skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> +       } else if (eth_type_vlan(skb->protocol)) {
>             /* Outer header is 802.1P with vlan 0, inner header is
>              * 802.1Q or 802.1AD and vlan_do_receive() above could
>              * not find vlan dev for vlan id 0.
>
>
>
> Given that eth_type_vlan is simply this:
>
> static inline bool eth_type_vlan(__be16 ethertype) {
>         switch (ethertype) {
>         case htons(ETH_P_8021Q):
>         case htons(ETH_P_8021AD):
>                 return true;
>         default:
>                 return false;
>         }
> }
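
At the C level the two forms should be interchangeable, which can be checked exhaustively in user space. The sketch below is not kernel code: CONST_HTONS, vlan_by_compare, vlan_by_switch and count_mismatches are hypothetical names, and the byte-order test assumes the GCC/Clang predefined __BYTE_ORDER__ macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100 /* 802.1Q VLAN tag ethertype */
#define ETH_P_8021AD 0x88A8 /* 802.1ad service VLAN tag ethertype */

/* Compile-time stand-in for the kernel's htons() on constants.
 * Assumes the GCC/Clang predefined __BYTE_ORDER__ macro. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define CONST_HTONS(x) ((uint16_t)((((x) & 0xffu) << 8) | (((x) >> 8) & 0xffu)))
#else
#define CONST_HTONS(x) ((uint16_t)(x))
#endif

/* Open-coded comparison, as in the code the commit removed. */
static bool vlan_by_compare(uint16_t proto)
{
	return proto == CONST_HTONS(ETH_P_8021Q) ||
	       proto == CONST_HTONS(ETH_P_8021AD);
}

/* switch-based form, mirroring eth_type_vlan(). */
static bool vlan_by_switch(uint16_t proto)
{
	switch (proto) {
	case CONST_HTONS(ETH_P_8021Q):
	case CONST_HTONS(ETH_P_8021AD):
		return true;
	default:
		return false;
	}
}

/* Count the 16-bit values on which the two forms disagree. */
static unsigned int count_mismatches(void)
{
	unsigned int mismatches = 0;

	for (uint32_t v = 0; v <= UINT16_MAX; v++) {
		if (vlan_by_compare((uint16_t)v) != vlan_by_switch((uint16_t)v))
			mismatches++;
	}
	/* Sanity check: the 802.1Q ethertype itself must be recognised. */
	assert(vlan_by_switch(CONST_HTONS(ETH_P_8021Q)));
	return mismatches;
}
```

A correct compiler should make count_mismatches() return 0. Building this with the suspect cross toolchain and running it on the target might help show whether the switch lowering itself is miscompiled, although a bug that depends on the surrounding kernel code would not necessarily reproduce in such a small test.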
>
> I've added a small printk to see these values right before the first time they are
> checked:
>
> printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d",
>        skb->protocol, cpu_to_be16(ETH_P_8021Q), cpu_to_be16(ETH_P_8021AD),
>        eth_type_vlan(skb->protocol));
>
> And each ARP ping delivers a packet reported as:
> skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, eth_type_vlan(skb->protocol) = 0
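
The printed numbers are raw __be16 values formatted with %d. Assuming a little-endian CPU (ETH_P_8021Q printing as 129, i.e. 0x0081, suggests that), swapping the bytes recovers the on-wire ethertypes: 1544 is 0x0806 (ARP), 129 is 0x8100 and 43144 is 0x88A8. So the values themselves look correct, and the frame is a plain ARP frame rather than a VLAN-tagged one. A minimal sketch of that decoding, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Convert a __be16 value that was printed with %d on a little-endian
 * CPU (an assumption here) back to the on-wire ethertype. */
static uint16_t printed_be16_to_ethertype(int printed)
{
	assert(printed >= 0 && printed <= UINT16_MAX);
	return swap16((uint16_t)printed);
}
```

For example, printed_be16_to_ethertype(1544) yields 0x0806.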
>
> To add insult to injury, adding this printk line solves the ARP deafness, so no matter
> whether I use eth_type_vlan function or manual comparison, now ARP packets
> aren't dropped.
>
> Removing this printk and adding one inside the if-clause that should not be taken
> shows nothing, so I can neither directly inspect the packets or the return value of
> the misbehaving code, nor indirectly prove that the wrong branch of the if is
> being taken. This reinforces the idea of a compiler bug, but I could very well
> be wrong.
>
> Adding this printk:
> diff --git i/net/core/dev.c w/net/core/dev.c
> index 267c4a8daa55..a3ae3bcb3a21 100644
> --- i/net/core/dev.c
> +++ w/net/core/dev.c
> @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
>                  * check again for vlan id to set OTHERHOST.
>                  */
>                 goto check_vlan_id;
> +       } else {
> +           printk(KERN_ALERT "(1) skb->protocol is not type vlan\n");
>         }
>         /* Note: we might in the future use prio bits
>          * and set skb->priority like in vlan_do_receive()
>
> is even weirder because it has the same effect: the message does not appear, but ARP
> requests are answered. If I remove this printk, ARP requests are dropped.
>
> I've generated assembly output and this is the difference between having that extra
> else with the printk and not having it.
>
> It doesn't even make much sense that execution would reach this region of
> code, because there's no VLAN involved at all here.
>
> And so here I am again, staring at all this without knowing how to proceed.
>
> I guess I will be trying different and more modern versions of gcc, even some
> precompiled toolchains and see what else may be going on.
>
> If anyone has any insight as to what is causing this or how to solve it, it'd be great
> if you could share it.
>
> Thanks!
>
> --
> Álvaro G. M.
