Message-ID: <ce56ed7345d2077a7647687f90cf42da57bf90c7.camel@hazent.com>
Date: Wed, 09 Apr 2025 13:00:42 +0200
From: Álvaro "G. M." <alvaro.gamez@...ent.com>
To: "Pandey, Radhey Shyam" <radhey.shyam.pandey@....com>, Jakub Kicinski
<kuba@...nel.org>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>, "Katakam, Harini"
<harini.katakam@....com>, "Gupta, Suraj" <Suraj.Gupta2@....com>
Subject: Re: Issue with AMD Xilinx AXI Ethernet (xilinx_axienet) on
MicroBlaze: Packets only received after some buffer is full
On Thu, 2025-04-03 at 05:54 +0000, Pandey, Radhey Shyam wrote:
> [...]
> + Going through the details and will get back to you . Just to confirm there is no
> vivado design update ? and we are only updating linux kernel to latest?
>
Hi again,
I've reconsidered the upgrade approach: I first upgraded buildroot while
keeping the same kernel version (4.4.43). This has the effect of upgrading
gcc from version 10 to version 13.
With the gcc-13 built by buildroot and this same old kernel, the system
drops ARP requests. Compiled with the older gcc-10, it replies to them.
Keeping the old buildroot version but asking it to use gcc-11 causes the
same issue with kernel 4.4.43, so something must have changed between
gcc-10 and gcc-11.
So this does not look like an axienet driver problem, which I first thought
it was, because who would blame the compiler in the first instance?
But then things started to get even stranger.
What I did next was to upgrade buildroot step by step, each time using the
kernel version that buildroot considered "latest" when that release came out.
I reached a point at which ARP requests were being dropped again. This
happened on buildroot 2021.11, which still used gcc-10 as the default, with
kernel version 5.15.6. So it seems that whatever gcc bug is triggered by
gcc-11 on kernel 4.4.43 is also triggered by gcc-10 on kernel 5.15.6.
Using gcc-10, I bisected the kernel and found that this commit was triggering
whatever it is that is happening, around 5.11-rc2:
commit 324cefaf1c723625e93f703d6e6d78e28996b315 (HEAD)
Author: Menglong Dong <dong.menglong@....com.cn>
Date:   Mon Jan 11 02:42:21 2021 -0800

    net: core: use eth_type_vlan in __netif_receive_skb_core

    Replace the check for ETH_P_8021Q and ETH_P_8021AD in
    __netif_receive_skb_core with eth_type_vlan.

    Signed-off-by: Menglong Dong <dong.menglong@....com.cn>
    Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@zte.com.cn
    Signed-off-by: Jakub Kicinski <kuba@...nel.org>
I've been staring at the diff for hours because I can't understand what
can be wrong about this:
diff --git a/net/core/dev.c b/net/core/dev.c
index e4d77c8abe76..267c4a8daa55 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5151,8 +5151,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 		skb_reset_mac_len(skb);
 	}
 
-	if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
-	    skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+	if (eth_type_vlan(skb->protocol)) {
 		skb = skb_vlan_untag(skb);
 		if (unlikely(!skb))
 			goto out;
@@ -5236,8 +5235,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 			 * find vlan device.
 			 */
 			skb->pkt_type = PACKET_OTHERHOST;
-		} else if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
-			   skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+		} else if (eth_type_vlan(skb->protocol)) {
 			/* Outer header is 802.1P with vlan 0, inner header is
 			 * 802.1Q or 802.1AD and vlan_do_receive() above could
 			 * not find vlan dev for vlan id 0.
Given that eth_type_vlan is simply this:
static inline bool eth_type_vlan(__be16 ethertype)
{
	switch (ethertype) {
	case htons(ETH_P_8021Q):
	case htons(ETH_P_8021AD):
		return true;
	default:
		return false;
	}
}
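For reference, the two forms of the check can be compared outside the
kernel. What follows is only a minimal user-space sketch, not kernel code:
vlan_check_switch, vlan_check_if, count_mismatches and the CONST_HTONS macro
are hypothetical names, __be16 is replaced by uint16_t, and the byte-order
test relies on the gcc/clang predefined __BYTE_ORDER__ macro. Building it
with each suspect gcc version and optimization level might help tell whether
the switch form on its own is miscompiled.

```c
#include <stdint.h>

#define ETH_P_8021Q  0x8100 /* 802.1Q VLAN EtherType */
#define ETH_P_8021AD 0x88A8 /* 802.1ad service VLAN EtherType */

/* Compile-time host-to-network conversion (hypothetical stand-in for the
 * kernel's constant htons()). Assumes little-endian when the compiler does
 * not report a big-endian target. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define CONST_HTONS(x) ((uint16_t)(x))
#else
#define CONST_HTONS(x) \
	((uint16_t)((((x) & 0x00ffu) << 8) | (((x) & 0xff00u) >> 8)))
#endif

/* Same shape as eth_type_vlan(): a switch over the two VLAN EtherTypes. */
int vlan_check_switch(uint16_t ethertype)
{
	switch (ethertype) {
	case CONST_HTONS(ETH_P_8021Q):
	case CONST_HTONS(ETH_P_8021AD):
		return 1;
	default:
		return 0;
	}
}

/* Same shape as the code before the commit: two explicit comparisons. */
int vlan_check_if(uint16_t ethertype)
{
	return ethertype == CONST_HTONS(ETH_P_8021Q) ||
	       ethertype == CONST_HTONS(ETH_P_8021AD);
}

/* Count the 16-bit values for which the two forms disagree. */
int count_mismatches(void)
{
	int mismatches = 0;

	for (uint32_t v = 0; v <= 0xffffu; v++) {
		if (vlan_check_switch((uint16_t)v) != vlan_check_if((uint16_t)v))
			mismatches++;
	}
	return mismatches;
}
```

A correct compiler should make count_mismatches() return 0. A clean result
here would not rule out a compiler problem, since the code surrounding the
check in __netif_receive_skb_core is very different from this sketch.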
I've added a small printk to see these values right before the
first time they are checked:
printk(KERN_ALERT "skb->protocol = %d, ETH_P_8021Q=%d ETH_P_8021AD=%d, eth_type_vlan(skb->protocol) = %d",
skb->protocol, cpu_to_be16(ETH_P_8021Q), cpu_to_be16(ETH_P_8021AD), eth_type_vlan(skb->protocol));
And each ARP ping delivers a packet reported as:
skb->protocol = 1544, ETH_P_8021Q=129 ETH_P_8021AD=43144, eth_type_vlan(skb->protocol) = 0
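The printed numbers look unfamiliar because they are network-order fields
printed as host integers. A minimal sketch for decoding them, assuming a
little-endian build (an inference from the values themselves, not something
stated in this mail); swap16 is a hypothetical helper name:

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value, regardless of host endianness. */
uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}
```

With this, swap16(1544) is 0x0806 (ETH_P_ARP), swap16(129) is 0x8100
(ETH_P_8021Q) and swap16(43144) is 0x88A8 (ETH_P_8021AD). So the packet is a
plain ARP frame, and a return value of 0 from eth_type_vlan is the expected
result.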
To add insult to injury, adding this printk line cures the ARP deafness:
whether I use the eth_type_vlan function or the manual comparison,
ARP packets are no longer dropped.
Removing this printk and adding one inside the if-clause that should not
be taken shows nothing, so I can neither directly inspect the packets
or the return value in the misbehaving code, nor indirectly prove that
the wrong branch of the if is being taken. This reinforces the idea of
a compiler bug, but I could very well be wrong.
Adding this printk:
diff --git i/net/core/dev.c w/net/core/dev.c
index 267c4a8daa55..a3ae3bcb3a21 100644
--- i/net/core/dev.c
+++ w/net/core/dev.c
@@ -5257,6 +5257,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
 				 * check again for vlan id to set OTHERHOST.
 				 */
 				goto check_vlan_id;
+		} else {
+			printk(KERN_ALERT "(1) skb->protocol is not type vlan\n");
 		}
 		/* Note: we might in the future use prio bits
 		 * and set skb->priority like in vlan_do_receive()
is even weirder because it has the same effect: the message does not appear,
but ARP requests are answered. If I remove this printk, ARP requests are dropped.
I've generated assembly output and this is the difference between having that
extra else with the printk and not having it.
It doesn't even make much sense that execution would reach this region
of code, because there's no vlan involved at all here.
And so here I am again, staring at all this without knowing how to proceed.
I guess I will try different, more modern versions of gcc, and even
some precompiled toolchains, to see what else may be going on.
If anyone has any insight into what is causing this or how to solve
it, it'd be great if you could share it.
Thanks!
--
Álvaro G. M.