Message-ID: <573ae845a793527ddb410eee4f6f5f0111912ca6.camel@hazent.com>
Date: Wed, 09 Apr 2025 15:09:56 +0200
From: Álvaro "G. M." <alvaro.gamez@...ent.com>
To: "Pandey, Radhey Shyam" <radhey.shyam.pandey@....com>, Jakub Kicinski
	 <kuba@...nel.org>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "Katakam, Harini"
	 <harini.katakam@....com>, "Gupta, Suraj" <Suraj.Gupta2@....com>
Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on
 MicroBlaze: Packets only received after some buffer is full

On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> > -----Original Message-----
> > From: Álvaro G. M. <alvaro.gamez@...ent.com>
> > Sent: Wednesday, April 9, 2025 4:31 PM
> > To: Pandey, Radhey Shyam <radhey.shyam.pandey@....com>; Jakub Kicinski
> > <kuba@...nel.org>
> > Cc: netdev@...r.kernel.org; Katakam, Harini <harini.katakam@....com>; Gupta,
> > Suraj <Suraj.Gupta2@....com>
> > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> > Packets only received after some buffer is full
> > 
> > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> > > [...]
> > >  + Going through the details and will get back to you. Just to
> > > confirm: there is no Vivado design update, and we are only updating the Linux kernel to
> > > the latest?
> > > 
> > 
> > Hi again,
> > 
> > I've reconsidered the upgrading approach and I've first upgraded buildroot and kept
> > the same kernel version (4.4.43). This has the effect of upgrading gcc from version
> > 10 to version 13.
> > 
> > With buildroot's compiled gcc-13 and keeping this same old kernel, the effect is that
> > the system drops ARP requests. Compiling with older gcc-10, ARP requests are
> 
> When the system drops an ARP packet, is it dropped by the MAC hardware or by the software layer?
> Reading the MAC stats and DMA descriptors would help us know whether it reaches the software
> layer or not.

I'm not sure who is the one dropping the packets. I can only check with
ethtool -S eth0, and this is its output after a few dozen arpings:

# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 06:00:0A:BC:8C:01  
          inet addr:10.188.140.1  Bcast:10.188.143.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:164 errors:0 dropped:99 overruns:0 frame:0
          TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:11236 (10.9 KiB)  TX bytes:1844 (1.8 KiB)

# ethtool -S eth0
NIC statistics:
     Received bytes: 13950
     Transmitted bytes: 2016
     RX Good VLAN Tagged Frames: 0
     TX Good VLAN Tagged Frames: 0
     TX Good PFC Frames: 0
     RX Good PFC Frames: 0
     User Defined Counter 0: 0
     User Defined Counter 1: 0
     User Defined Counter 2: 0

# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:		4096
RX Mini:	0
RX Jumbo:	0
TX:		4096
Current hardware settings:
RX:		1024
RX Mini:	0
RX Jumbo:	0
TX:		128

# ethtool -d eth0
Offset		Values
------		------
0x0000:		00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00 
0x0010:		00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00 
0x0020:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x0030:		00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18 
0x0040:		00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00 
0x0050:		80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00 
0x0060:		00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc 
0x0070:		8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00 
0x0080:		03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80 
0x0090:		03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81 



Running tcpdump stops the ifconfig dropped counter from incrementing and shows
me the ARP requests (although the system still won't reply to them), but just setting
the interface to promiscuous mode does not.

If you can give me any pointers on how to gather more data about the DMA descriptors,
I'll try my best.

This is using the internal emaclite dma, because with dmaengine there is no
packet dropping, only heavy buffering, and kernel 6.13.8, because in the ~5.11 series,
which I'm also working with, axienet had no support for reading statistics from the core.

I assume the old dma code inside axienet is to be deprecated, and I would be
pretty glad to use dmaengine, but that has the buffering problem. So if you
want to focus efforts on solving that issue I'm completely open to whatever
you all deem more appropriate.

I can even add some ILA to the Vivado design and inspect whatever you
think could be useful.

Thanks

> 
> > replied to. Keeping old buildroot version but asking it to use gcc-11 will cause the
> > same issue with kernel 4.4.43, so something must have happened in between those
> > gcc versions.
> > 
> > So this does not look like an axienet driver problem, which I first thought it was,
> > because who would blame the compiler in the first instance?
> > 
> > But then things started to get even stranger.
> > 
> > What I did next was slowly upgrade buildroot, using the kernel version that
> > buildroot considered "latest" at the point it was released. I reached a point in which
> > the ARP requests were being dropped again. This happened on buildroot 2021.11,
> > which still used gcc-10 as the default and kernel version 5.15.6. So some gcc bug
> > that is getting triggered on gcc-11 in kernel 4.4.43 is also triggered on gcc-10 by
> > kernel 5.15.6.
> > 
> > Using gcc-10, I bisected the kernel and found that this commit was triggering
> > whatever it is that is happening, around 5.11-rc2:
> > 
> > commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD)
> > Author: Menglong Dong <dong.menglong@....com.cn>
> > Date:   Mon Jan 11 02:42:21 2021 -0800
> > 
> >     net: core: use eth_type_vlan in __netif_receive_skb_core
> > 
> >     Replace the check for ETH_P_8021Q and ETH_P_8021AD in
> >     __netif_receive_skb_core with eth_type_vlan.
> > 
> >     Signed-off-by: Menglong Dong <dong.menglong@....com.cn>
> >     Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@....com.cn
> >     Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> > 
> > 
> > I've been staring at the diff for hours because I can't understand what can be wrong
> > about this:
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index e4d77c8abe76..267c4a8daa55 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> >         skb_reset_mac_len(skb);
> >     }
> > 
> > -   if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> > -       skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> > +   if (eth_type_vlan(skb->protocol)) {
> >         skb = skb_vlan_untag(skb);
> >         if (unlikely(!skb))
> >             goto out;
> > @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> >              * find vlan device.
> >              */
> >             skb->pkt_type = PACKET_OTHERHOST;
> > -       } else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> > -              skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> > +       } else if (eth_type_vlan(skb->protocol)) {
> >             /* Outer header is 802.1P with vlan 0, inner header is
> >              * 802.1Q or 802.1AD and vlan_do_receive() above could
> >              * not find vlan dev for vlan id 0.
> > 
> > 
> > 
> > Given that eth_type_vlan is simply this:
> > 
> > static inline bool eth_type_vlan(__be16 ethertype) {
> >         switch (ethertype) {
> >         case htons(ETH_P_8021Q):
> >         case htons(ETH_P_8021AD):
> >                 return true;
> >         default:
> >                 return false;
> >         }
> > }
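
As a sanity check on the claim that the two forms are interchangeable, here is a minimal userspace sketch, not kernel code, that compares the old open-coded test with the switch-based one over every 16-bit value. `SWAB16`, `vlan_by_compare`, `vlan_by_switch` and `count_mismatches` are hypothetical names introduced only for this illustration, and `SWAB16` stands in for `htons()`/`cpu_to_be16()` on a little-endian CPU (an assumption, consistent with the printed constants later in this message).

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100	/* 802.1Q VLAN EtherType */
#define ETH_P_8021AD 0x88A8	/* 802.1AD (QinQ) EtherType */

/* Hypothetical stand-in for htons()/cpu_to_be16() on a little-endian CPU. */
#define SWAB16(x) ((uint16_t)((((x) & 0xffu) << 8) | (((x) >> 8) & 0xffu)))

/* The check as written before commit 324cefaf1c72. */
static bool vlan_by_compare(uint16_t proto)
{
	return proto == SWAB16(ETH_P_8021Q) ||
	       proto == SWAB16(ETH_P_8021AD);
}

/* The check as written in eth_type_vlan(). */
static bool vlan_by_switch(uint16_t proto)
{
	switch (proto) {
	case SWAB16(ETH_P_8021Q):
	case SWAB16(ETH_P_8021AD):
		return true;
	default:
		return false;
	}
}

/* Count the 16-bit values on which the two forms disagree. */
static unsigned int count_mismatches(void)
{
	unsigned int mismatches = 0;

	for (uint32_t v = 0; v <= 0xffff; v++)
		if (vlan_by_compare((uint16_t)v) != vlan_by_switch((uint16_t)v))
			mismatches++;
	return mismatches;
}
```

If `count_mismatches()` returns 0 with a given compiler and optimization level, the two source forms agree in userspace for that build. This says nothing about the code generated for `__netif_receive_skb_core` itself, which would have to be compared in the assembly output.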
> > 
> > I've added a small printk to see these values right before the first time they are
> > checked:
> > 
> > printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d",
> >        skb->protocol, cpu_to_be16(ETH_P_8021Q), cpu_to_be16(ETH_P_8021AD),
> >        eth_type_vlan(skb->protocol));
> > 
> > And each ARP ping delivers a packet reported as:
> > skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, eth_type_vlan(skb->protocol) = 0
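
The decimal values in that log line are easier to read once byte-swapped. The sketch below is a userspace illustration and not part of the driver. It assumes a little-endian CPU, which is what `cpu_to_be16(ETH_P_8021Q)` printing as 129 (0x0081) implies. `swab16`, `ethertype_name` and `decode_logged` are hypothetical helper names used only here.

```c
#include <stdint.h>
#include <stdio.h>

/* Swap the two bytes of a 16-bit value, as cpu_to_be16() does on a
 * little-endian CPU (assumed here). */
static uint16_t swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Name only the EtherTypes that appear in this thread. */
static const char *ethertype_name(uint16_t ethertype)
{
	switch (ethertype) {
	case 0x0806: return "ARP";
	case 0x8100: return "802.1Q VLAN";
	case 0x88A8: return "802.1AD VLAN";
	default:     return "other";
	}
}

/* Print a value as logged by the printk next to its wire-order meaning. */
static void decode_logged(unsigned int logged)
{
	uint16_t wire = swab16((uint16_t)logged);

	printf("%u -> 0x%04x (%s)\n", logged, (unsigned int)wire,
	       ethertype_name(wire));
}
```

Under that assumption, 1544 is 0x0608 in memory, which is EtherType 0x0806 (ARP), while 129 and 43144 are the swapped forms of 0x8100 and 0x88A8. The logged packet is therefore a plain ARP frame, and `eth_type_vlan()` returning 0 for it is the expected result.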
> > 
> > To add insult to injury, adding this printk line solves the ARP deafness, so no matter
> > whether I use eth_type_vlan function or manual comparison, now ARP packets
> > aren't dropped.
> > 
> > Removing this printk and adding one inside the if-clause that should not be
> > taken shows nothing, so I can neither directly inspect the packets or the return
> > value of the misbehaving code, nor indirectly prove that the wrong branch of
> > the if is being taken. This reinforces the idea of a compiler bug, but I could very well
> > be wrong.
> > 
> > Adding this printk:
> > diff --git i/net/core/dev.c w/net/core/dev.c
> > index 267c4a8daa55..a3ae3bcb3a21 100644
> > --- i/net/core/dev.c
> > +++ w/net/core/dev.c
> > @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> >                  * check again for vlan id to set OTHERHOST.
> >                  */
> >                 goto check_vlan_id;
> > +       } else {
> > +           printk(KERN_ALERT "(1) skb->protocol is not type vlan\n");
> >         }
> >         /* Note: we might in the future use prio bits
> >          * and set skb->priority like in vlan_do_receive()
> > 
> > is even weirder because it has the same effect: the message does not appear, but ARP
> > requests are answered. If I remove this printk, ARP requests are dropped.
> > 
> > I've generated assembly output and this is the difference between having that extra
> > else with the printk and not having it.
> > 
> > It doesn't even make much sense that execution would reach this region of
> > code, because there's no vlan involved at all here.
> > 
> > And so here I am again, staring at all this without knowing how to proceed.
> > 
> > I guess I will be trying different and more modern versions of gcc, even some
> > precompiled toolchains and see what else may be going on.
> > 
> > If anyone has any insight as to what is causing this or how to solve it, it'd be great
> > if you could share it.
> > 
> > Thanks!
> > 
> > --
> > Álvaro G. M.
