netdev - Re: Need of advice for XDP sockets on top of the interfaces behind a Linux bonding device

Message-ID: <CAJEV1ijnUrJXOuGW5xnuCvMTtaC1VKhOXQ0_4iJnqR5Vco4yLg@mail.gmail.com>
Date: Fri, 9 Feb 2024 11:03:51 +0200
From: Pavel Vazharov <pavel@...e.net>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: Magnus Karlsson <magnus.karlsson@...il.com>, Toke Høiland-Jørgensen <toke@...nel.org>, 
	Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org
Subject: Re: Need of advice for XDP sockets on top of the interfaces behind a
 Linux bonding device

On Thu, Feb 8, 2024 at 5:47 PM Pavel Vazharov <pavel@...e.net> wrote:
>
> On Thu, Feb 8, 2024 at 12:59 PM Pavel Vazharov <pavel@...e.net> wrote:
> >
> > On Wed, Feb 7, 2024 at 9:00 PM Maciej Fijalkowski
> > <maciej.fijalkowski@...el.com> wrote:
> > >
> > > On Wed, Feb 07, 2024 at 05:49:47PM +0200, Pavel Vazharov wrote:
> > > > On Mon, Feb 5, 2024 at 9:07 AM Magnus Karlsson
> > > > <magnus.karlsson@...il.com> wrote:
> > > > >
> > > > > On Tue, 30 Jan 2024 at 15:54, Toke Høiland-Jørgensen <toke@...nel.org> wrote:
> > > > > >
> > > > > > Pavel Vazharov <pavel@...e.net> writes:
> > > > > >
> > > > > > > On Tue, Jan 30, 2024 at 4:32 PM Toke Høiland-Jørgensen <toke@...nel.org> wrote:
> > > > > > >>
> > > > > > >> Pavel Vazharov <pavel@...e.net> writes:
> > > > > > >>
> > > > > > >> >> On Sat, Jan 27, 2024 at 7:08 AM Pavel Vazharov <pavel@...e.net> wrote:
> > > > > > >> >>>
> > > > > > >> >>> On Sat, Jan 27, 2024 at 6:39 AM Jakub Kicinski <kuba@...nel.org> wrote:
> > > > > > >> >>> >
> > > > > > >> >>> > On Sat, 27 Jan 2024 05:58:55 +0200 Pavel Vazharov wrote:
> > > > > > >> >>> > > > Well, it will be up to your application to ensure that it is not. The
> > > > > > >> >>> > > > XDP program will run before the stack sees the LACP management traffic,
> > > > > > >> >>> > > > so you will have to take some measure to ensure that any such management
> > > > > > >> >>> > > > traffic gets routed to the stack instead of to the DPDK application. My
> > > > > > >> >>> > > > immediate guess would be that this is the cause of those warnings?
> > > > > > >> >>> > >
> > > > > > >> >>> > > Thank you for the response.
> > > > > > >> >>> > > I already checked the XDP program.
> > > > > > >> >>> > > It redirects particular pools of IPv4 (TCP or UDP) traffic to the application.
> > > > > > >> >>> > > Everything else is passed to the Linux kernel.
> > > > > > >> >>> > > However, I'll check it again. Just to be sure.
> > > > > > >> >>> >
> > > > > > >> >>> > What device driver are you using, if you don't mind sharing?
> > > > > > >> >>> > The pass thru code path may be much less well tested in AF_XDP
> > > > > > >> >>> > drivers.
> > > > > > >> >>> These are the kernel version and the drivers for the 3 ports in the
> > > > > > >> >>> above bonding.
> > > > > > >> >>> ~# uname -a
> > > > > > >> >>> Linux 6.3.2 #1 SMP Wed May 17 08:17:50 UTC 2023 x86_64 GNU/Linux
> > > > > > >> >>> ~# lspci -v | grep -A 16 -e 1b:00.0 -e 3b:00.0 -e 5e:00.0
> > > > > > >> >>> 1b:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> > > > > > >> >>> SFI/SFP+ Network Connection (rev 01)
> > > > > > >> >>>        ...
> > > > > > >> >>>         Kernel driver in use: ixgbe
> > > > > > >> >>> --
> > > > > > >> >>> 3b:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> > > > > > >> >>> SFI/SFP+ Network Connection (rev 01)
> > > > > > >> >>>         ...
> > > > > > >> >>>         Kernel driver in use: ixgbe
> > > > > > >> >>> --
> > > > > > >> >>> 5e:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit
> > > > > > >> >>> SFI/SFP+ Network Connection (rev 01)
> > > > > > >> >>>         ...
> > > > > > >> >>>         Kernel driver in use: ixgbe
> > > > > > >> >>>
> > > > > > >> >>> I think they should be well supported, right?
> > > > > > >> >>> So far, it seems that the present usage scenario should work and the
> > > > > > >> >>> problem is somewhere in my code.
> > > > > > >> >>> I'll double check it again and try to simplify everything in order to
> > > > > > >> >>> pinpoint the problem.
> > > > > > >> > I've managed to pinpoint that forcing the copying of the packets
> > > > > > >> > between the kernel and the user space
> > > > > > >> > (XDP_COPY) fixes the issue with the malformed LACPDUs and the not
> > > > > > >> > working bonding.
> > > > > > >>
> > > > > > >> (+Magnus)
> > > > > > >>
> > > > > > >> Right, okay, that seems to suggest a bug in the internal kernel copying
> > > > > > >> that happens on XDP_PASS in zero-copy mode. Which would be a driver bug;
> > > > > > >> any chance you could test with a different driver and see if the same
> > > > > > >> issue appears there?
> > > > > > >>
> > > > > > >> -Toke
> > > > > > > No, sorry.
> > > > > > > We have only servers with Intel 82599ES with ixgbe drivers.
> > > > > > > And one lab machine with Intel 82540EM with igb driver but we can't
> > > > > > > set up bonding there
> > > > > > > and the problem is not reproducible there.
> > > > > >
> > > > > > Right, okay. Another thing that may be of some use is to try to capture
> > > > > > the packets on the physical devices using tcpdump. That should (I think)
> > > > > > show you the LACDPU packets as they come in, before they hit the bonding
> > > > > > device, but after they are copied from the XDP frame. If it's a packet
> > > > > > corruption issue, that should be visible in the captured packet; you can
> > > > > > compare with an xdpdump capture to see if there are any differences...
> > > > >
> > > > > Pavel,
> > > > >
> > > > > Sounds like an issue with the driver in zero-copy mode as it works
> > > > > fine in copy mode. Maciej and I will take a look at it.
> > > > >
> > > > > > -Toke
> > > > > >
> > > >
> > > > First I want to apologize for not responding for such a long time.
> > > > I had different tasks the previous week and this week went back to this issue.
> > > > I had to modify the code of the af_xdp driver inside the DPDK so that it loads
> > > > the XDP program in a way which is compatible with the xdp-dispatcher.
> > > > Finally, I was able to run our application with the XDP sockets and the xdpdump
> > > > at the same time.
> > > >
> > > > Back to the issue.
> > > > I just want to say again that we are not binding the XDP sockets to
> > > > the bonding device.
> > > > We are binding the sockets to the queues of the physical interfaces
> > > > "below" the bonding device.
> > > > My further observation this time is that when the issue happens and
> > > > the remote device reports the LACP error, there is no incoming LACP
> > > > traffic on the corresponding local port, as seen by xdpdump.
> > > > The tcpdump at the same time sees only outgoing LACP packets and
> > > > nothing incoming.
> > > > For example:
> > > > Remote device                                                Local Server
> > > > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/12 <---> eth0
> > > > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/13 <---> eth2
> > > > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/14 <---> eth4
> > > > And when the remote device reports "received an abnormal LACPDU"
> > > > for PortName=XGigabitEthernet0/0/14 I can see via xdpdump that there
> > > > is no incoming LACP traffic
> > >
> > > Hey Pavel,
> > >
> > > can you also look at /proc/interrupts at eth4 and what ethtool -S shows
> > > there?
> > I reproduced the problem but this time the interface with the weird
> > state was eth0.
> > It's different every time and sometimes even two of the interfaces are
> > in such a state.
> > Here is the requested info, captured while in this state:
> > ~# ethtool -S eth0 > /tmp/stats0.txt ; sleep 10 ; ethtool -S eth0 > /tmp/stats1.txt ; diff /tmp/stats0.txt /tmp/stats1.txt
> > 6c6
> > <      rx_pkts_nic: 81426
> > ---
> > >      rx_pkts_nic: 81436
> > 8c8
> > <      rx_bytes_nic: 10286521
> > ---
> > >      rx_bytes_nic: 10287801
> > 17c17
> > <      multicast: 72216
> > ---
> > >      multicast: 72226
> > 48c48
> > <      rx_no_dma_resources: 1109
> > ---
> > >      rx_no_dma_resources: 1119
> >
> > ~# cat /proc/interrupts | grep eth0 > /tmp/interrupts0.txt ; sleep 10 ; cat /proc/interrupts | grep eth0 > /tmp/interrupts1.txt
> > interrupts0: 430 3098 64 108199 108199 108199 108199 108199 108199 108199 108201 63 64 1865 108199 61
> > interrupts1: 435 3103 69 117967 117967 117967 117967 117967 117967 117967 117969 68 69 1870 117967 66
> >
> > So, it seems that packets are arriving on the interface but they don't
> > reach the XDP layer or anything deeper.
> > rx_no_dma_resources - this counter seems to give clues about a possible issue?
> >
> > >
> > > > on eth4 but there is incoming LACP traffic on eth0 and eth2.
> > > > At the same time, according to the dmesg the kernel sees all of the
> > > > interfaces as
> > > > "link status definitely up, 10000 Mbps full duplex".
> > > > The issue goes away if I stop the application, even without removing
> > > > the XDP programs from the interfaces - the running xdpdump starts
> > > > showing the incoming LACP traffic immediately.
> > > > The issue also goes away if I do "ip link set down eth4 && ip link set up eth4".
> > >
> > > and the setup is what when doing the link flap? XDP progs are loaded to
> > > each of the 3 interfaces of bond?
> > Yes, the same XDP program is loaded on application startup on each one
> > of the interfaces which are part of bond0 (eth0, eth2, eth4):
> > # xdp-loader status
> > CURRENT XDP PROGRAM STATUS:
> >
> > Interface        Prio  Program name      Mode     ID   Tag               Chain actions
> > --------------------------------------------------------------------------------------
> > lo                     <No XDP program loaded!>
> > eth0                   xdp_dispatcher    native   1320 90f686eb86991928
> >  =>              50     x3sp_splitter_func          1329 3b185187f1855c4c  XDP_PASS
> > eth1                   <No XDP program loaded!>
> > eth2                   xdp_dispatcher    native   1334 90f686eb86991928
> >  =>              50     x3sp_splitter_func          1337 3b185187f1855c4c  XDP_PASS
> > eth3                   <No XDP program loaded!>
> > eth4                   xdp_dispatcher    native   1342 90f686eb86991928
> >  =>              50     x3sp_splitter_func          1345 3b185187f1855c4c  XDP_PASS
> > eth5                   <No XDP program loaded!>
> > eth6                   <No XDP program loaded!>
> > eth7                   <No XDP program loaded!>
> > bond0                  <No XDP program loaded!>
> > Each of these interfaces is set up to have 16 queues, i.e. the application,
> > through the DPDK machinery, opens 3x16 XSK sockets, each bound to the
> > corresponding queue of the corresponding interface.
> > ~# ethtool -l eth0 # It's same for the other 2 devices
> > Channel parameters for eth0:
> > Pre-set maximums:
> > RX:             n/a
> > TX:             n/a
> > Other:          1
> > Combined:       48
> > Current hardware settings:
> > RX:             n/a
> > TX:             n/a
> > Other:          1
> > Combined:       16
> >
> > >
> > > > However, I'm not sure what happens with the bound XDP sockets in this case
> > > > because I haven't tested further.
> > >
> > > can you also try to bind xsk sockets before attaching XDP progs?
> > I looked into the DPDK code again.
> > The DPDK framework provides callback hooks like eth_rx_queue_setup
> > and each "driver" implements it as needed. Each Rx/Tx queue of the device is
> > set up separately. The af_xdp driver currently does this for each Rx
> > queue separately:
> > 1. configures the umem for the queue
> > 2. loads the XDP program on the corresponding interface, if not already loaded
> >    (i.e. this happens only once per interface when its first queue is set up).
> > 3. does xsk_socket__create which as far as I looked also internally binds the
> > socket to the given queue
> > 4. places the socket in the XSKS map of the XDP program via bpf_map_update_elem
> >
> > So, it seems to me that the change needed will be a bit more involved.
> > I'm not sure if it'll be possible to hardcode, just for the test, the
> > program loading and
> > the placing of all XSK sockets in the map to happen when the setup of the last
> > "queue" for the given interface is done. I need to think a bit more about this.
> I changed the code of the DPDK af_xdp "driver" to first create and bind
> all of the XSK sockets to the queues of the corresponding interface.
> Only after the last XSK socket is initialized does it now attach the
> XDP program to the interface and populate the XSK map with the created
> sockets.
> The issue was still there but it was somewhat harder to reproduce - it
> happened once in 5 starts of the application.
>
> >
> > >
> > > >
> > > > It seems to me that something racy happens when the interfaces go down
> > > > and back up
> > > > (visible in the dmesg) when the XDP sockets are bound to their queues.
> > > > I mean, I'm not sure why the interfaces go down and up but setting
> > > > only the XDP programs
> > > > on the interfaces doesn't cause this behavior. So, I assume it's
> > > > caused by the binding of the XDP sockets.
> > >
> > > hmm i'm lost here, above you said you got no incoming traffic on eth4 even
> > > without xsk sockets being bound?
> > Probably I've phrased something in a wrong way.
> > The issue is not observed if I load the XDP program on all interfaces
> > (eth0, eth2, eth4)
> > with the xdp-loader:
> > xdp-loader load --mode native <iface> <path-to-the-xdp-program>
> > It's not observed probably because there are no interface down/up actions.
> > I also modified the DPDK "driver" to not remove the XDP program on exit and thus
> > when the application stops only the XSK sockets are closed but the
> > program remains
> > loaded at the interfaces. When I stop this version of the application
> > while running the
> > xdpdump at the same time I see that the traffic immediately appears in
> > the xdpdump.
> > Also, note that I basically trimmed the XDP program to simply contain
> > the XSK map
> > (BPF_MAP_TYPE_XSKMAP) and the function just does "return XDP_PASS;".
> > I wanted to exclude every possibility for the XDP program to do something wrong.
> > So, from the above it seems to me that the issue is triggered somehow by the XSK
> > sockets usage.
> >
> > >
> > > > It could be that the issue is not related to the XDP sockets but just
> > > > to the down/up actions of the interfaces.
> > > > On the other hand, I'm not sure why the issue is easily reproducible
> > > > when the zero copy mode is enabled
> > > > (4 out of 5 tests reproduced the issue).
> > > > However, when the zero copy is disabled this issue doesn't happen
> > > > (I tried 10 times in a row and it doesn't happen).
> > >
> > > any chance that you could rule the bond out of the picture for this issue?
> > I'll need to talk to the network support guys because they manage the network
> > devices and they'll need to change the LACP/Trunk setup of the above
> > "remote device".
> > I can't promise that they'll agree though.
We changed the setup and I ran the tests with a single port, with no
bonding involved.
The port was configured with 16 queues (and 16 XSK sockets bound to them).
I tested with about 100 Mbps of traffic to avoid disrupting many users.
During the tests I watched the real-time traffic graph of the remote
device port connected to the server machine, where the application was
running in L3 forward mode:
- with zero copy enabled, the traffic to the server was about 100 Mbps
  but the traffic coming out of the server was about 50 Mbps (i.e. half
  of it).
- with zero copy disabled, the traffic in both directions was the same;
  the two graphs matched perfectly.
Nothing else was changed between the two tests, only the ZC option.
Are there any stats, or anything else, that I can check in this testing
scenario to reveal more info about the issue?
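
As a rough sketch of one way to watch such stats over time: the Python
snippet below compares two saved `ethtool -S` snapshots and reports only
the counters that changed. The function names parse_ethtool_stats and
changed_counters are made up for illustration, and the sample values are
the eth0 counters quoted earlier in this thread; it assumes the usual
"name: value" layout of `ethtool -S` output.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict.

    Lines without a numeric value (e.g. the "NIC statistics:" header)
    are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def changed_counters(before, after):
    """Return {name: delta} for counters present in both snapshots
    whose value differs between them."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


if __name__ == "__main__":
    # Sample values taken from the eth0 diff quoted earlier in the thread.
    snapshot0 = """NIC statistics:
         rx_pkts_nic: 81426
         rx_bytes_nic: 10286521
         multicast: 72216
         rx_no_dma_resources: 1109
    """
    snapshot1 = """NIC statistics:
         rx_pkts_nic: 81436
         rx_bytes_nic: 10287801
         multicast: 72226
         rx_no_dma_resources: 1119
    """
    deltas = changed_counters(parse_ethtool_stats(snapshot0),
                              parse_ethtool_stats(snapshot1))
    for name, delta in sorted(deltas.items()):
        print(f"{name}: +{delta}")
```

With the eth0 values above, rx_pkts_nic, multicast and
rx_no_dma_resources each advance by 10 over the 10 second interval.
Running the same comparison with zero copy enabled and then disabled
might show which counters differ between the two modes.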

> >
> > > on my side i'll try to play with multiple xsk sockets within same netdev
> > > served by ixgbe and see if i observe something broken. I recently fixed
> > > i40e Tx disable timeout issue, so maybe ixgbe has something off in down/up
> > > actions as you state as well.