Message-ID: <40a0dfa8116ecccd5e39f2cd186e9f19e43fe7d0.camel@hazent.com>
Date: Thu, 10 Apr 2025 08:54:14 +0200
From: Álvaro "G. M." <alvaro.gamez@...ent.com>
To: "Gupta, Suraj" <Suraj.Gupta2@....com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "Katakam, Harini"
<harini.katakam@....com>, "Pandey, Radhey Shyam"
<radhey.shyam.pandey@....com>, Jakub Kicinski <kuba@...nel.org>
Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on
MicroBlaze: Packets only received after some buffer is full
On Thu, 2025-04-10 at 06:25 +0000, Gupta, Suraj wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> > -----Original Message-----
> > From: Álvaro G. M. <alvaro.gamez@...ent.com>
> > Sent: Wednesday, April 9, 2025 6:40 PM
> > To: Pandey, Radhey Shyam <radhey.shyam.pandey@....com>; Jakub Kicinski
> > <kuba@...nel.org>
> > Cc: netdev@...r.kernel.org; Katakam, Harini <harini.katakam@....com>; Gupta,
> > Suraj <Suraj.Gupta2@....com>
> > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> > Packets only received after some buffer is full
> >
> > Caution: This message originated from an External Source. Use proper caution
> > when opening attachments, clicking links, or responding.
> >
> >
> > On Wed, 2025-04-09 at 11:14 +0000, Pandey, Radhey Shyam wrote:
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > > -----Original Message-----
> > > > From: Álvaro G. M. <alvaro.gamez@...ent.com>
> > > > Sent: Wednesday, April 9, 2025 4:31 PM
> > > > To: Pandey, Radhey Shyam <radhey.shyam.pandey@....com>; Jakub
> > > > Kicinski <kuba@...nel.org>
> > > > Cc: netdev@...r.kernel.org; Katakam, Harini
> > > > <harini.katakam@....com>; Gupta, Suraj <Suraj.Gupta2@....com>
> > > > Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on MicroBlaze:
> > > > Packets only received after some buffer is full
> > > >
> > > > On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> > > > > [...]
> > > > > > + Going through the details and will get back to you. Just to
> > > > > > confirm: there is no Vivado design update, and we are only
> > > > > > updating the Linux kernel to latest?
> > > > >
> > > >
> > > > Hi again,
> > > >
> > > > I've reconsidered the upgrading approach and I've first upgraded
> > > > buildroot and kept the same kernel version (4.4.43). This has the
> > > > effect of upgrading gcc from version
> > > > 10 to version 13.
> > > >
> > > > With buildroot's compiled gcc-13 and keeping this same old kernel,
> > > > the effect is that the system drops ARP requests. Compiling with
> > > > older gcc-10, ARP requests are
> > >
> > > > When the system drops an ARP packet, is it dropped by the MAC hardware or
> > > > by the software layer? Reading the MAC stats and DMA descriptors would
> > > > help us know whether it reaches the software layer or not.
> >
> > > I'm not sure who is the one dropping packets. I can only check with ethtool -S
> > > eth0, and this is its output after a few dozen arpings:
> >
> > # ifconfig eth0
> > eth0 Link encap:Ethernet HWaddr 06:00:0A:BC:8C:01
> > inet addr:10.188.140.1 Bcast:10.188.143.255 Mask:255.255.248.0
> > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> > RX packets:164 errors:0 dropped:99 overruns:0 frame:0
> > TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:11236 (10.9 KiB) TX bytes:1844 (1.8 KiB)
> >
> > # ethtool -S eth0
> > NIC statistics:
> > Received bytes: 13950
> > Transmitted bytes: 2016
> > RX Good VLAN Tagged Frames: 0
> > TX Good VLAN Tagged Frames: 0
> > TX Good PFC Frames: 0
> > RX Good PFC Frames: 0
> > User Defined Counter 0: 0
> > User Defined Counter 1: 0
> > User Defined Counter 2: 0
> >
> > # ethtool -g eth0
> > Ring parameters for eth0:
> > Pre-set maximums:
> > RX: 4096
> > RX Mini: 0
> > RX Jumbo: 0
> > TX: 4096
> > Current hardware settings:
> > RX: 1024
> > RX Mini: 0
> > RX Jumbo: 0
> > TX: 128
> >
> > # ethtool -d eth0
> > Offset Values
> > ------ ------
> > 0x0000: 00 00 00 00 00 00 00 00 00 00 00 00 e4 01 00 00
> > 0x0010: 00 00 00 00 18 00 00 00 00 00 00 00 00 00 00 00
> > 0x0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 0x0030: 00 00 00 00 ff ff ff ff ff ff 00 18 00 00 00 18
> > 0x0040: 00 00 00 00 00 00 00 40 d0 07 00 00 50 00 00 00
> > 0x0050: 80 80 00 01 00 00 00 00 00 21 01 00 00 00 00 00
> > 0x0060: 00 00 00 00 00 00 00 00 00 00 00 00 06 00 0a bc
> > 0x0070: 8c 01 00 00 03 00 00 00 00 00 00 00 00 00 00 00
> > 0x0080: 03 70 18 21 0a 00 18 00 40 25 b3 80 40 25 b3 80
> > 0x0090: 03 50 01 00 08 00 01 00 40 38 12 81 00 38 12 81
> >
> >
> >
>
> As per the register dump, the packet is not dropped by the MAC. It's being dropped somewhere in the software layer.
> Since you started bisecting Linux commits, could you please try reverting the suspected commit and check whether that's actually the first bad commit?
>
I already kind of did; please read the whole message quoted below.

* To summarize:
Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315^ = 679500e385fc4d65c3fac5bfbe6ee55d65698f20 works fine.
Kernel commit 324cefaf1c723625e93f703d6e6d78e28996b315 drops packets.
But using commit 324cefaf1c723625e93f703d6e6d78e28996b315 and adding a printk
around the suspect lines solves the issue. Looks like a compiler bug.

* New information from yesterday's email:
Reverting commit 324cefaf1c723625e93f703d6e6d78e28996b315 on kernel 6.13.8
does not solve the issue. Neither does tinkering around with printks.
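
As a side note on the printk output quoted further down: those values are
consistent with a little-endian CPU, since 1544 = 0x0608 is ETH_P_ARP (0x0806)
byte-swapped, and 129 and 43144 are the swapped forms of 0x8100 and 0x88A8.
What follows is only a minimal user-space sketch of how the two forms of the
VLAN check could be compared outside the kernel. It is not the kernel code:
the names SWAB16, CONST_HTONS, vlan_check_switch, vlan_check_compare,
count_mismatches and self_check are made up for illustration, and the
little-endian byte swap is an assumption taken from the printed values.

```c
#include <assert.h>
#include <stdint.h>

/* EtherType values, as in the kernel's if_ether.h. */
#define ETH_P_ARP    0x0806
#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/* Unconditional 16-bit byte swap; a constant expression, so it is
 * usable in case labels. */
#define SWAB16(x) ((uint16_t)((((x) & 0x00ffu) << 8) | (((x) & 0xff00u) >> 8)))

/* ASSUMPTION: little-endian CPU, so host-to-network order is a swap. */
#define CONST_HTONS(x) SWAB16(x)

/* Same shape as the switch-based helper introduced by the commit. */
static int vlan_check_switch(uint16_t ethertype)
{
	switch (ethertype) {
	case CONST_HTONS(ETH_P_8021Q):
	case CONST_HTONS(ETH_P_8021AD):
		return 1;
	default:
		return 0;
	}
}

/* Same shape as the explicit comparison the commit replaced. */
static int vlan_check_compare(uint16_t ethertype)
{
	return ethertype == CONST_HTONS(ETH_P_8021Q) ||
	       ethertype == CONST_HTONS(ETH_P_8021AD);
}

/* Count the 16-bit values for which the two forms disagree. */
static unsigned int count_mismatches(void)
{
	unsigned int v, bad = 0;

	for (v = 0; v <= 0xffff; v++)
		if (vlan_check_switch((uint16_t)v) !=
		    vlan_check_compare((uint16_t)v))
			bad++;
	return bad;
}

/* Abort if the two forms ever disagree. */
static void self_check(void)
{
	assert(count_mismatches() == 0);
}
```

Calling self_check() from a small main() and building it with each toolchain
under test would abort on any disagreement. A clean run would only show that
this isolated function is compiled consistently; it would say nothing about
how the same code is inlined and optimized inside __netif_receive_skb_core.
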
> > Running tcpdump stops the ifconfig dropped counter from incrementing and
> > shows me the ARP requests (although the system won't reply to them), but just
> > setting the interface to promisc does not.
> >
> > If you can give me any indications on how to gather more data about DMA
> > descriptors I'll try my best.
> >
> > This is using the internal emaclite DMA, because with dmaengine packets are
> > not dropped but heavily buffered instead, and kernel 6.13.8, because in the
> > ~5.11 series, which I'm also working with, axienet didn't have support for
> > reading statistics from the core.
> >
> > I assume the old dma code inside axienet is to be deprecated, and I would be pretty
> > glad to use dmaengine, but that has the buffering problem. So if you want to focus
> > efforts on solving that issue I'm completely open to whatever you all deem more
> > appropriate.
> >
>
> We're not planning to make the DMAengine flow the default soon, as some significant work and optimizations are required there, which are in progress.
> But this buffering issue we didn't observe on our platforms last time we ran it with linux v6.12.
>
I just tried dmaengine on 6.12 and have the same buffering issue.
Did you try on Microblaze too or only on Zynq?
> > I can even add some ILA to the Vivado design and inspect whatever you think could
> > be useful
> >
> > Thanks
> >
> > >
> > > > replied to. Keeping old buildroot version but asking it to use
> > > > gcc-11 will cause the same issue with kernel 4.4.43, so something
> > > > must have happened in between those gcc versions.
> > > >
> > > > > So this does not look like an axienet driver problem, which I first
> > > > > thought it was, because who would blame the compiler in the first place?
> > > >
> > > > But then things started to get even stranger.
> > > >
> > > > > What I did next was to slowly upgrade buildroot, using the kernel
> > > > > version that buildroot considered "latest" at the point it was
> > > > > released. I reached a point in which the ARP requests were being
> > > > > dropped again. This happened on buildroot 2021.11, which still used
> > > > > gcc-10 as the default and kernel version 5.15.6. So some gcc bug
> > > > > that is getting triggered on gcc-11 in kernel 4.4.43 is also
> > > > > triggered on gcc-10 by kernel 5.15.6.
> > > >
> > > > Using gcc-10, I bisected the kernel and found that this commit was
> > > > triggering whatever it is that is happening, around 5.11-rc2:
> > > >
> > > > commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD)
> > > > Author: Menglong Dong <dong.menglong@....com.cn>
> > > > Date: Mon Jan 11 02:42:21 2021 -0800
> > > >
> > > > net: core: use eth_type_vlan in __netif_receive_skb_core
> > > >
> > > > Replace the check for ETH_P_8021Q and ETH_P_8021AD in
> > > > __netif_receive_skb_core with eth_type_vlan.
> > > >
> > > > Signed-off-by: Menglong Dong <dong.menglong@....com.cn>
> > > > > Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@....com.cn
> > > > Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> > > >
> > > >
> > > > I've been staring at the diff for hours because I can't understand
> > > > what can be wrong about this:
> > > >
> > > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > > index e4d77c8abe76..267c4a8daa55 100644
> > > > > --- a/net/core/dev.c
> > > > > +++ b/net/core/dev.c
> > > > > @@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> > > > >  		skb_reset_mac_len(skb);
> > > > >  	}
> > > > >
> > > > > -	if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> > > > > -	    skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> > > > > +	if (eth_type_vlan(skb->protocol)) {
> > > > >  		skb = skb_vlan_untag(skb);
> > > > >  		if (unlikely(!skb))
> > > > >  			goto out;
> > > > > @@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> > > > >  			 * find vlan device.
> > > > >  			 */
> > > > >  			skb->pkt_type = PACKET_OTHERHOST;
> > > > > -		} else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
> > > > > -			   skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
> > > > > +		} else if (eth_type_vlan(skb->protocol)) {
> > > > >  			/* Outer header is 802.1P with vlan 0, inner header is
> > > > >  			 * 802.1Q or 802.1AD and vlan_do_receive() above could
> > > > >  			 * not find vlan dev for vlan id 0.
> > > >
> > > >
> > > >
> > > > Given that eth_type_vlan is simply this:
> > > >
> > > > > static inline bool eth_type_vlan(__be16 ethertype)
> > > > > {
> > > > > 	switch (ethertype) {
> > > > > 	case htons(ETH_P_8021Q):
> > > > > 	case htons(ETH_P_8021AD):
> > > > > 		return true;
> > > > > 	default:
> > > > > 		return false;
> > > > > 	}
> > > > > }
> > > >
> > > > I've added a small printk to see these values right before the first
> > > > time they are
> > > > checked:
> > > >
> > > > > printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d "
> > > > >        "ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d",
> > > > >        skb->protocol, cpu_to_be16(ETH_P_8021Q),
> > > > >        cpu_to_be16(ETH_P_8021AD), eth_type_vlan(skb->protocol));
> > > >
> > > > > And each ARP ping delivers a packet reported as:
> > > > > skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, eth_type_vlan(skb->protocol) = 0
> > > >
> > > > To add insult to injury, adding this printk line solves the ARP
> > > > deafness, so no matter whether I use eth_type_vlan function or
> > > > manual comparison, now ARP packets aren't dropped.
> > > >
> > > > > Removing this printk and adding one inside the if-clause that should
> > > > > not be taken shows nothing, so I can neither directly inspect the
> > > > > packets or the return value of the misbehaving code, nor indirectly
> > > > > prove that the wrong branch of the if is being taken.
> > > > > This reinforces the idea of a compiler bug, but I could very well be wrong.
> > > >
> > > > Adding this printk:
> > > > > diff --git i/net/core/dev.c w/net/core/dev.c
> > > > > index 267c4a8daa55..a3ae3bcb3a21 100644
> > > > > --- i/net/core/dev.c
> > > > > +++ w/net/core/dev.c
> > > > > @@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
> > > > >  				 * check again for vlan id to set OTHERHOST.
> > > > >  				 */
> > > > >  				goto check_vlan_id;
> > > > > +		} else {
> > > > > +			printk(KERN_ALERT "(1) skb->protocol is not type vlan\n");
> > > > >  		}
> > > > >  		/* Note: we might in the future use prio bits
> > > > >  		 * and set skb->priority like in vlan_do_receive()
> > > >
> > > > > is even weirder because it has the same effect: the message does not
> > > > > appear, but ARP requests are answered. If I remove this printk, ARP
> > > > > requests are dropped.
> > > >
> > > > I've generated assembly output and this is the difference between
> > > > having that extra else with the printk and not having it.
> > > >
> > > > > It doesn't even make much sense that execution would reach this
> > > > > region of code, because there's no vlan involved at all here.
> > > >
> > > > And so here I am again, staring at all this without knowing how to proceed.
> > > >
> > > > I guess I will be trying different and more modern versions of gcc,
> > > > even some precompiled toolchains and see what else may be going on.
> > > >
> > > > > If anyone has any insight as to what is causing this or how to
> > > > > solve it, it'd be great if you could share it.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Álvaro G. M.