Message-ID: <CAJEV1ihMuP6Oq+=ubd05DReBXuLwmZLYFwO=ha2C995wBuWeLA@mail.gmail.com>
Date: Thu, 8 Feb 2024 12:59:44 +0200
From: Pavel Vazharov <pavel@...e.net>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: Magnus Karlsson <magnus.karlsson@...il.com>, Toke Høiland-Jørgensen <toke@...nel.org>,
Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org
Subject: Re: Need of advice for XDP sockets on top of the interfaces behind a
Linux bonding device
On Wed, Feb 7, 2024 at 9:00 PM Maciej Fijalkowski
<maciej.fijalkowski@...el.com> wrote:
>
> On Wed, Feb 07, 2024 at 05:49:47PM +0200, Pavel Vazharov wrote:
> > On Mon, Feb 5, 2024 at 9:07 AM Magnus Karlsson
> > <magnus.karlsson@...il.com> wrote:
> > >
> > > On Tue, 30 Jan 2024 at 15:54, Toke Høiland-Jørgensen <toke@...nel.org> wrote:
> > > >
> > > > Pavel Vazharov <pavel@...e.net> writes:
> > > >
> > > > > On Tue, Jan 30, 2024 at 4:32 PM Toke Høiland-Jørgensen <toke@...nel.org> wrote:
> > > > >>
> > > > >> Pavel Vazharov <pavel@...e.net> writes:
> > > > >>
> > > > >> >> On Sat, Jan 27, 2024 at 7:08 AM Pavel Vazharov <pavel@...e.net> wrote:
> > > > >> >>>
> > > > >> >>> On Sat, Jan 27, 2024 at 6:39 AM Jakub Kicinski <kuba@...nel.org> wrote:
> > > > >> >>> >
> > > > >> >>> > On Sat, 27 Jan 2024 05:58:55 +0200 Pavel Vazharov wrote:
> > > > >> >>> > > > Well, it will be up to your application to ensure that it is not. The
> > > > >> >>> > > > XDP program will run before the stack sees the LACP management traffic,
> > > > >> >>> > > > so you will have to take some measure to ensure that any such management
> > > > >> >>> > > > traffic gets routed to the stack instead of to the DPDK application. My
> > > > >> >>> > > > immediate guess would be that this is the cause of those warnings?
> > > > >> >>> > >
> > > > >> >>> > > Thank you for the response.
> > > > >> >>> > > I already checked the XDP program.
> > > > >> >>> > > It redirects particular pools of IPv4 (TCP or UDP) traffic to the application.
> > > > >> >>> > > Everything else is passed to the Linux kernel.
> > > > >> >>> > > However, I'll check it again. Just to be sure.
> > > > >> >>> >
> > > > >> >>> > What device driver are you using, if you don't mind sharing?
> > > > >> >>> > The pass thru code path may be much less well tested in AF_XDP
> > > > >> >>> > drivers.
> > > > >> >>> These are the kernel version and the drivers for the 3 ports in the
> > > > >> >>> above bonding.
> > > > >> >>> ~# uname -a
> > > > >> >>> Linux 6.3.2 #1 SMP Wed May 17 08:17:50 UTC 2023 x86_64 GNU/Linux
> > > > >> >>> ~# lspci -v | grep -A 16 -e 1b:00.0 -e 3b:00.0 -e 5e:00.0
> > > > >> >>> 1b:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> > > > >> >>> SFI/SFP+ Network Connection (rev 01)
> > > > >> >>> ...
> > > > >> >>> Kernel driver in use: ixgbe
> > > > >> >>> --
> > > > >> >>> 3b:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> > > > >> >>> SFI/SFP+ Network Connection (rev 01)
> > > > >> >>> ...
> > > > >> >>> Kernel driver in use: ixgbe
> > > > >> >>> --
> > > > >> >>> 5e:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> > > > >> >>> SFI/SFP+ Network Connection (rev 01)
> > > > >> >>> ...
> > > > >> >>> Kernel driver in use: ixgbe
> > > > >> >>>
> > > > >> >>> I think they should be well supported, right?
> > > > >> >>> So far, it seems that the present usage scenario should work and the
> > > > >> >>> problem is somewhere in my code.
> > > > >> >>> I'll double check it again and try to simplify everything in order to
> > > > >> >>> pinpoint the problem.
> > > > >> > I've managed to pinpoint that forcing the copying of the packets
> > > > >> > between the kernel and the user space
> > > > > >> > (XDP_COPY) fixes the issue with the malformed LACPDUs and the
> > > > > >> > non-working bonding.
> > > > >>
> > > > >> (+Magnus)
> > > > >>
> > > > >> Right, okay, that seems to suggest a bug in the internal kernel copying
> > > > >> that happens on XDP_PASS in zero-copy mode. Which would be a driver bug;
> > > > >> any chance you could test with a different driver and see if the same
> > > > >> issue appears there?
> > > > >>
> > > > >> -Toke
> > > > > No, sorry.
> > > > > We have only servers with Intel 82599ES with ixgbe drivers.
> > > > > And one lab machine with Intel 82540EM with igb driver but we can't
> > > > > set up bonding there
> > > > > and the problem is not reproducible there.
> > > >
> > > > Right, okay. Another thing that may be of some use is to try to capture
> > > > the packets on the physical devices using tcpdump. That should (I think)
> > > > show you the LACDPU packets as they come in, before they hit the bonding
> > > > device, but after they are copied from the XDP frame. If it's a packet
> > > > corruption issue, that should be visible in the captured packet; you can
> > > > compare with an xdpdump capture to see if there are any differences...
> > >
> > > Pavel,
> > >
> > > Sounds like an issue with the driver in zero-copy mode as it works
> > > fine in copy mode. Maciej and I will take a look at it.
> > >
> > > > -Toke
> > > >
> >
> > First I want to apologize for not responding for such a long time.
> > I had different tasks the previous week and this week went back to this issue.
> > I had to modify the code of the af_xdp driver inside the DPDK so that it loads
> > the XDP program in a way which is compatible with the xdp-dispatcher.
> > Finally, I was able to run our application with the XDP sockets and the xdpdump
> > at the same time.
> >
> > Back to the issue.
> > I just want to say again that we are not binding the XDP sockets to
> > the bonding device.
> > We are binding the sockets to the queues of the physical interfaces
> > "below" the bonding device.
> > My further observation this time is that when the issue happens and
> > the remote device reports
> > the LACP error there is no incoming LACP traffic on the corresponding
> > local port,
> > as seen by xdpdump.
> > The tcpdump at the same time sees only outgoing LACP packets and
> > nothing incoming.
> > For example:
> > Remote device                                                Local Server
> > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/12 <---> eth0
> > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/13 <---> eth2
> > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/14 <---> eth4
> > And when the remote device reports "received an abnormal LACPDU"
> > for PortName=XGigabitEthernet0/0/14 I can see via xdpdump that there
> > is no incoming LACP traffic
>
> Hey Pavel,
>
> can you also look at /proc/interrupts at eth4 and what ethtool -S shows
> there?
I reproduced the problem but this time the interface with the weird
state was eth0.
It's different every time and sometimes even two of the interfaces are
in such a state.
Here is the requested info, collected while the interface was in this state:
~# ethtool -S eth0 > /tmp/stats0.txt ; sleep 10 ; ethtool -S eth0 > /tmp/stats1.txt ; diff /tmp/stats0.txt /tmp/stats1.txt
6c6
< rx_pkts_nic: 81426
---
> rx_pkts_nic: 81436
8c8
< rx_bytes_nic: 10286521
---
> rx_bytes_nic: 10287801
17c17
< multicast: 72216
---
> multicast: 72226
48c48
< rx_no_dma_resources: 1109
---
> rx_no_dma_resources: 1119
~# cat /proc/interrupts | grep eth0 > /tmp/interrupts0.txt ; sleep 10 ; cat /proc/interrupts | grep eth0 > /tmp/interrupts1.txt
interrupts0: 430 3098 64 108199 108199 108199 108199 108199 108199 108199 108201 63 64 1865 108199 61
interrupts1: 435 3103 69 117967 117967 117967 117967 117967 117967 117967 117969 68 69 1870 117967 66
So, it seems that packets are arriving on the interface but they don't
reach the XDP layer or anything above it.
Could the rx_no_dma_resources counter be a clue to the underlying issue?
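For anyone who wants to repeat this comparison without eyeballing the diff,
below is a minimal Python sketch that computes the per-counter deltas between
two snapshots. The helper names (parse_counters, counter_deltas, row_deltas)
are hypothetical and not part of any tool; the input values are copied from
the output above, and the sketch assumes the "name: value" line format that
ethtool -S prints.

```python
def parse_counters(text):
    """Parse 'name: value' lines (the shape of `ethtool -S` output) into a dict."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue  # not a 'name: value' line
        value = value.strip()
        if value.isdigit():  # skips headers such as 'NIC statistics:'
            counters[name.strip()] = int(value)
    return counters


def counter_deltas(before, after):
    """Return {name: after - before} for every counter whose value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


def row_deltas(before_row, after_row):
    """Element-wise difference between two equally long rows of counts."""
    return [b - a for a, b in zip(before_row, after_row)]


# Only the counters that changed between the two snapshots above.
stats0 = """rx_pkts_nic: 81426
rx_bytes_nic: 10286521
multicast: 72216
rx_no_dma_resources: 1109"""
stats1 = """rx_pkts_nic: 81436
rx_bytes_nic: 10287801
multicast: 72226
rx_no_dma_resources: 1119"""

deltas = counter_deltas(parse_counters(stats0), parse_counters(stats1))

# The sixteen values of the interrupts0/interrupts1 rows as posted above.
irq0 = [430, 3098, 64, 108199, 108199, 108199, 108199, 108199, 108199,
        108199, 108201, 63, 64, 1865, 108199, 61]
irq1 = [435, 3103, 69, 117967, 117967, 117967, 117967, 117967, 117967,
        117967, 117969, 68, 69, 1870, 117967, 66]
irq_deltas = row_deltas(irq0, irq1)

print(deltas)
print(irq_deltas)
```

With the numbers above, the delta over the 10 seconds is 10 for rx_pkts_nic,
multicast and rx_no_dma_resources, and 1280 for rx_bytes_nic. Each entry of
the interrupt rows grows by either 5 or 9768.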
>
> > on eth4 but there is incoming LACP traffic on eth0 and eth2.
> > At the same time, according to the dmesg the kernel sees all of the
> > interfaces as
> > "link status definitely up, 10000 Mbps full duplex".
> > The issue goes away if I stop the application even without removing
> > the XDP programs
> > from the interfaces - the running xdpdump starts showing the incoming
> > LACP traffic immediately.
> > The issue also goes away if I do "ip link set down eth4 && ip link set up eth4".
>
> and the setup is what when doing the link flap? XDP progs are loaded to
> each of the 3 interfaces of bond?
Yes, the same XDP program is loaded on application startup on each one
of the interfaces which are part of bond0 (eth0, eth2, eth4):
# xdp-loader status
CURRENT XDP PROGRAM STATUS:

Interface        Prio  Program name          Mode     ID   Tag               Chain actions
--------------------------------------------------------------------------------------
lo                     <No XDP program loaded!>
eth0                   xdp_dispatcher        native   1320 90f686eb86991928
 =>              50     x3sp_splitter_func            1329 3b185187f1855c4c  XDP_PASS
eth1                   <No XDP program loaded!>
eth2                   xdp_dispatcher        native   1334 90f686eb86991928
 =>              50     x3sp_splitter_func            1337 3b185187f1855c4c  XDP_PASS
eth3                   <No XDP program loaded!>
eth4                   xdp_dispatcher        native   1342 90f686eb86991928
 =>              50     x3sp_splitter_func            1345 3b185187f1855c4c  XDP_PASS
eth5                   <No XDP program loaded!>
eth6                   <No XDP program loaded!>
eth7                   <No XDP program loaded!>
bond0                  <No XDP program loaded!>
Each of these interfaces is set up to have 16 queues, i.e. the application,
through the DPDK machinery, opens 3x16 XSK sockets, each bound to the
corresponding queue of the corresponding interface.
~# ethtool -l eth0 # It's the same for the other 2 devices
Channel parameters for eth0:
Pre-set maximums:
RX: n/a
TX: n/a
Other: 1
Combined: 48
Current hardware settings:
RX: n/a
TX: n/a
Other: 1
Combined: 16
>
> > However, I'm not sure what happens with the bound XDP sockets in this case
> > because I haven't tested further.
>
> can you also try to bind xsk sockets before attaching XDP progs?
I looked into the DPDK code again.
The DPDK framework provides callback hooks such as eth_rx_queue_setup,
and each "driver" implements them as needed. Each Rx/Tx queue of the device is
set up separately. The af_xdp driver currently does the following for each Rx
queue:
1. configures the umem for the queue
2. loads the XDP program on the corresponding interface, if not already loaded
(i.e. this happens only once per interface, when its first queue is set up)
3. calls xsk_socket__create which, as far as I can see, also binds the
socket to the given queue internally
4. places the socket in the XSKS map of the XDP program via bpf_map_update_elem
So, it seems to me that the change needed will be a bit more involved.
I'm not sure whether it'll be possible to hardcode, just for the test, the
program loading and the placing of all XSK sockets in the map so that they
happen when the setup of the last "queue" of the given interface is done.
I need to think a bit more about this.
>
> >
> > It seems to me that something racy happens when the interfaces go down
> > and back up
> > (visible in the dmesg) when the XDP sockets are bound to their queues.
> > I mean, I'm not sure why the interfaces go down and up but setting
> > only the XDP programs
> > on the interfaces doesn't cause this behavior. So, I assume it's
> > caused by the binding of the XDP sockets.
>
> hmm i'm lost here, above you said you got no incoming traffic on eth4 even
> without xsk sockets being bound?
I probably phrased something badly.
The issue is not observed if I load the XDP program on all interfaces
(eth0, eth2, eth4) with the xdp-loader:
xdp-loader load --mode native <iface> <path-to-the-xdp-program>
It's probably not observed because there are no interface down/up actions
in that case.
I also modified the DPDK "driver" so that it does not remove the XDP program
on exit. Thus, when the application stops, only the XSK sockets are closed and
the program remains loaded on the interfaces. When I stop this version of the
application while xdpdump is running, the traffic appears in xdpdump
immediately.
Also, note that I trimmed the XDP program down so that it only contains the
XSK map (BPF_MAP_TYPE_XSKMAP) and the function just does "return XDP_PASS;".
I wanted to rule out any possibility of the XDP program doing something wrong.
So, from the above, it seems to me that the issue is somehow triggered by the
use of the XSK sockets.
>
> > It could be that the issue is not related to the XDP sockets but just
> > to the down/up actions of the interfaces.
> > On the other hand, I'm not sure why the issue is easily reproducible
> > when the zero copy mode is enabled
> > (4 out of 5 tests reproduced the issue).
> > However, when the zero copy is disabled this issue doesn't happen
> > (I tried 10 times in a row and it doesn't happen).
>
> any chance that you could rule the bond out of the picture for this issue?
I'll need to talk to the network support guys because they manage the network
devices and they'll need to change the LACP/Trunk setup of the above
"remote device".
I can't promise that they'll agree though.
> on my side i'll try to play with multiple xsk sockets within same netdev
> served by ixgbe and see if i observe something broken. I recently fixed
> i40e Tx disable timeout issue, so maybe ixgbe has something off in down/up
> actions as you state as well.