netdev - Re: Need of advice for XDP sockets on top of the interfaces behind a Linux bonding device

Message-ID: <CAJEV1ij+fYUhXmscxk_tsgDppHFWZLuP_bc_gUhZPLMdi4qLQA@mail.gmail.com>
Date: Mon, 19 Feb 2024 15:45:24 +0200
From: Pavel Vazharov <pavel@...e.net>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: Magnus Karlsson <magnus.karlsson@...il.com>, Toke Høiland-Jørgensen <toke@...nel.org>, 
	Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org
Subject: Re: Need of advice for XDP sockets on top of the interfaces behind a
 Linux bonding device

On Fri, Feb 16, 2024 at 7:24 PM Maciej Fijalkowski
<maciej.fijalkowski@...el.com> wrote:
>
> > > > > >
> > > > > > Back to the issue.
> > > > > > I just want to say again that we are not binding the XDP sockets to
> > > > > > the bonding device.
> > > > > > We are binding the sockets to the queues of the physical interfaces
> > > > > > "below" the bonding device.
> > > > > > My further observation this time is that when the issue happens and
> > > > > > the remote device reports
> > > > > > the LACP error there is no incoming LACP traffic on the corresponding
> > > > > > local port,
> > > > > > as seen by the xdpdump.
> > > > > > The tcpdump at the same time sees only outgoing LACP packets and
> > > > > > nothing incoming.
> > > > > > For example:
> > > > > > Remote device                                              Local Server
> > > > > > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/12 <---> eth0
> > > > > > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/13 <---> eth2
> > > > > > TrunkName=Eth-Trunk20, PortName=XGigabitEthernet0/0/14 <---> eth4
> > > > > > And when the remote device reports "received an abnormal LACPDU"
> > > > > > for PortName=XGigabitEthernet0/0/14 I can see via xdpdump that there
> > > > > > is no incoming LACP traffic
> > > > >
> > > > > Hey Pavel,
> > > > >
> > > > > can you also look at /proc/interrupts at eth4 and what ethtool -S shows
> > > > > there?
> > > > I reproduced the problem but this time the interface with the weird
> > > > state was eth0.
> > > > It's different every time and sometimes even two of the interfaces are
> > > > in such a state.
> > > > Here is the requested info while being in this state:
> > > > ~# ethtool -S eth0 > /tmp/stats0.txt ; sleep 10 ; ethtool -S eth0 >
> > > > /tmp/stats1.txt ; diff /tmp/stats0.txt /tmp/stats1.txt
> > > > 6c6
> > > > <      rx_pkts_nic: 81426
> > > > ---
> > > > >      rx_pkts_nic: 81436
> > > > 8c8
> > > > <      rx_bytes_nic: 10286521
> > > > ---
> > > > >      rx_bytes_nic: 10287801
> > > > 17c17
> > > > <      multicast: 72216
> > > > ---
> > > > >      multicast: 72226
> > > > 48c48
> > > > <      rx_no_dma_resources: 1109
> > > > ---
> > > > >      rx_no_dma_resources: 1119
> > > >
> > > > ~# cat /proc/interrupts | grep eth0 > /tmp/interrupts0.txt ; sleep 10
> > > > ; cat /proc/interrupts | grep eth0 > /tmp/interrupts1.txt
> > > > interrupts0: 430 3098 64 108199 108199 108199 108199 108199 108199
> > > > 108199 108201 63 64 1865 108199  61
> > > > interrupts1: 435 3103 69 117967 117967  117967 117967 117967  117967
> > > > 117967 117969 68 69 1870  117967 66
> > > >
> > > > So, it seems that packets are arriving on the interface but they don't
> > > > reach the XDP layer or anything above it.
> > > > rx_no_dma_resources - this counter seems to give clues about a possible issue?
> > > >
> > > > >
> > > > > > on eth4 but there is incoming LACP traffic on eth0 and eth2.
> > > > > > At the same time, according to the dmesg the kernel sees all of the
> > > > > > interfaces as
> > > > > > "link status definitely up, 10000 Mbps full duplex".
> > > > > > The issue goes away if I stop the application even without removing
> > > > > > the XDP programs
> > > > > > from the interfaces - the running xdpdump starts showing the incoming
> > > > > > LACP traffic immediately.
> > > > > > The issue also goes away if I do "ip link set down eth4 && ip link set up eth4".
> > > > >
> > > > > and the setup is what when doing the link flap? XDP progs are loaded to
> > > > > each of the 3 interfaces of bond?
> > > > Yes, the same XDP program is loaded on application startup on each one
> > > > of the interfaces which are part of bond0 (eth0, eth2, eth4):
> > > > # xdp-loader status
> > > > CURRENT XDP PROGRAM STATUS:
> > > >
> > > > Interface        Prio  Program name      Mode     ID   Tag               Chain actions
> > > > --------------------------------------------------------------------------------------
> > > > lo                     <No XDP program loaded!>
> > > > eth0                   xdp_dispatcher    native   1320 90f686eb86991928
> > > >  =>              50     x3sp_splitter_func          1329 3b185187f1855c4c  XDP_PASS
> > > > eth1                   <No XDP program loaded!>
> > > > eth2                   xdp_dispatcher    native   1334 90f686eb86991928
> > > >  =>              50     x3sp_splitter_func          1337 3b185187f1855c4c  XDP_PASS
> > > > eth3                   <No XDP program loaded!>
> > > > eth4                   xdp_dispatcher    native   1342 90f686eb86991928
> > > >  =>              50     x3sp_splitter_func          1345 3b185187f1855c4c  XDP_PASS
> > > > eth5                   <No XDP program loaded!>
> > > > eth6                   <No XDP program loaded!>
> > > > eth7                   <No XDP program loaded!>
> > > > bond0                  <No XDP program loaded!>
> > > > Each of these interfaces is setup to have 16 queues i.e. the application,
> > > > through the DPDK machinery, opens 3x16 XSK sockets each bound to the
> > > > corresponding queue of the corresponding interface.
> > > > ~# ethtool -l eth0 # It's same for the other 2 devices
> > > > Channel parameters for eth0:
> > > > Pre-set maximums:
> > > > RX:             n/a
> > > > TX:             n/a
> > > > Other:          1
> > > > Combined:       48
> > > > Current hardware settings:
> > > > RX:             n/a
> > > > TX:             n/a
> > > > Other:          1
> > > > Combined:       16
> > > >
> > > > >
> > > > > > However, I'm not sure what happens with the bound XDP sockets in this case
> > > > > > because I haven't tested further.
> > > > >
> > > > > can you also try to bind xsk sockets before attaching XDP progs?
> > > > I looked into the DPDK code again.
> > > > The DPDK framework provides callback hooks like eth_rx_queue_setup
> > > > and each "driver" implements it as needed. Each Rx/Tx queue of the device is
> > > > set up separately. The af_xdp driver currently does this for each Rx
> > > > queue separately:
> > > > 1. configures the umem for the queue
> > > > 2. loads the XDP program on the corresponding interface, if not already loaded
> > > >    (i.e. this happens only once per interface when its first queue is set up).
> > > > 3. does xsk_socket__create which as far as I looked also internally binds the
> > > > socket to the given queue
> > > > 4. places the socket in the XSKS map of the XDP program via bpf_map_update_elem
> > > >
> > > > So, it seems to me that the change needed will be a bit more involved.
> > > > I'm not sure if it'll be possible to hardcode, just for the test, the
> > > > program loading and
> > > > the placing of all XSK sockets in the map to happen when the setup of the last
> > > > "queue" for the given interface is done. I need to think a bit more about this.
> > > Changed the code of the DPDK af_xdp "driver" to create and bind all of
> > > the XSK sockets
> > > to the queues of the corresponding interface and after that, after the
> > > initialization of the
> > > last XSK socket, I added the logic for the attachment of the XDP
> > > program to the interface
> > > and the population of the XSK map with the created sockets.
> > > The issue was still there but it was kind of harder to reproduce - it
> > > happened once for 5
> > > starts of the application.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > It seems to me that something racy happens when the interfaces go down
> > > > > > and back up
> > > > > > (visible in the dmesg) when the XDP sockets are bound to their queues.
> > > > > > I mean, I'm not sure why the interfaces go down and up but setting
> > > > > > only the XDP programs
> > > > > > on the interfaces doesn't cause this behavior. So, I assume it's
> > > > > > caused by the binding of the XDP sockets.
> > > > >
> > > > > hmm i'm lost here, above you said you got no incoming traffic on eth4 even
> > > > > without xsk sockets being bound?
> > > > Probably I've phrased something in a wrong way.
> > > > The issue is not observed if I load the XDP program on all interfaces
> > > > (eth0, eth2, eth4)
> > > > with the xdp-loader:
> > > > xdp-loader load --mode native <iface> <path-to-the-xdp-program>
> > > > It's not observed probably because there are no interface down/up actions.
> > > > I also modified the DPDK "driver" to not remove the XDP program on exit and thus
> > > > when the application stops only the XSK sockets are closed but the
> > > > program remains
> > > > loaded at the interfaces. When I stop this version of the application
> > > > while running the
> > > > xdpdump at the same time I see that the traffic immediately appears in
> > > > the xdpdump.
> > > > Also, note that I basically trimmed the XDP program to simply contain
> > > > the XSK map
> > > > (BPF_MAP_TYPE_XSKMAP) and the function just does "return XDP_PASS;".
> > > > I wanted to exclude every possibility for the XDP program to do something wrong.
> > > > So, from the above it seems to me that the issue is triggered somehow by the XSK
> > > > sockets usage.
> > > >
> > > > >
> > > > > > It could be that the issue is not related to the XDP sockets but just
> > > > > > to the down/up actions of the interfaces.
> > > > > > On the other hand, I'm not sure why the issue is easily reproducible
> > > > > > when the zero copy mode is enabled
> > > > > > (4 out of 5 tests reproduced the issue).
> > > > > > However, when the zero copy is disabled this issue doesn't happen
> > > > > > (I tried 10 times in a row and it doesn't happen).
> > > > >
> > > > > any chance that you could rule the bond out of the picture for this issue?
> > > > I'll need to talk to the network support guys because they manage the network
> > > > devices and they'll need to change the LACP/Trunk setup of the above
> > > > "remote device".
> > > > I can't promise that they'll agree though.
> > We changed the setup and I did the tests with a single port, no
> > bonding involved.
> > The port was configured with 16 queues (and 16 XSK sockets bound to them).
> > I tested with about 100 Mbps of traffic to not break lots of users.
> > During the tests I observed the traffic on the real time graph on the
> > remote device port
> > connected to the server machine where the application was running in
> > L3 forward mode:
> > - with zero copy enabled the traffic to the server was about 100 Mbps
> > but the traffic
> > coming out of the server was about 50 Mbps (i.e. half of it).
> > - with no zero copy the traffic in both directions was the same - the
> > two graphs matched perfectly
> > Nothing else was changed between the two tests, only the ZC option.
> > Can I check some stats or something else for this testing scenario
> > which could be
> > used to reveal more info about the issue?
>
> FWIW I don't see this on my side. My guess would be that some of the
> queues stalled on ZC due to buggy enable/disable ring pair routines that I
> am (fingers crossed :)) fixing, or trying to fix in previous email. You
> could try something as simple as:
>
> $ watch -n 1 "ethtool -S eth_ixgbe | grep rx | grep bytes"
>
> and verify each of the queues that are supposed to receive traffic. Do the
> same thing with tx, similarly.
>
> >
> > > >
Thank you for the help.

I tried the given patch on kernel 6.7.5.
The bonding issue that I described in the e-mails above seems to be fixed.
I can no longer reproduce the issue with the malformed LACP messages.

However, I tested again with traffic and the issue remains:
- when traffic is redirected to the machine and simply forwarded at L3
by our application, only about 1/2 - 2/3 of it exits the machine
- disabling only Zero Copy (and nothing else in the application)
fixes the issue
- another thing that I noticed is in the device stats: the Rx bytes
look OK and the counters of every queue increase over time (with
and without ZC)
ethtool -S eth4 | grep rx | grep bytes
     rx_bytes: 20061532582
     rx_bytes_nic: 27823942900
     rx_queue_0_bytes: 690230537
     rx_queue_1_bytes: 1051217950
     rx_queue_2_bytes: 1494877257
     rx_queue_3_bytes: 1989628734
     rx_queue_4_bytes: 894557655
     rx_queue_5_bytes: 1557310636
     rx_queue_6_bytes: 1459428265
     rx_queue_7_bytes: 1514067682
     rx_queue_8_bytes: 432567753
     rx_queue_9_bytes: 1251708768
     rx_queue_10_bytes: 1091840145
     rx_queue_11_bytes: 904127964
     rx_queue_12_bytes: 1241335871
     rx_queue_13_bytes: 2039939517
     rx_queue_14_bytes: 777819814
     rx_queue_15_bytes: 1670874034

- without ZC the Tx bytes also look OK
ethtool -S eth4 | grep tx | grep bytes
     tx_bytes: 24411467399
     tx_bytes_nic: 29600497994
     tx_queue_0_bytes: 1525672312
     tx_queue_1_bytes: 1527162996
     tx_queue_2_bytes: 1529701681
     tx_queue_3_bytes: 1526220338
     tx_queue_4_bytes: 1524403501
     tx_queue_5_bytes: 1523242084
     tx_queue_6_bytes: 1523543868
     tx_queue_7_bytes: 1525376190
     tx_queue_8_bytes: 1526844278
     tx_queue_9_bytes: 1523938842
     tx_queue_10_bytes: 1522663364
     tx_queue_11_bytes: 1527292259
     tx_queue_12_bytes: 1525206246
     tx_queue_13_bytes: 1526670255
     tx_queue_14_bytes: 1523266153
     tx_queue_15_bytes: 1530263032
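
A quick way to check that such an output "looks OK" is to compare the
aggregate counter with the sum of the per-queue counters. Below is a minimal
Python sketch of that check, fed with the non-ZC Tx numbers above. The helper
names (parse_stats, queue_byte_sum) are made up for illustration, and the
sketch assumes the counter names follow the tx_queue_<n>_bytes pattern seen in
this output; other drivers may name them differently.

```python
import re

# Tx counters copied from the non-ZC `ethtool -S eth4` output above.
SAMPLE = """\
     tx_bytes: 24411467399
     tx_bytes_nic: 29600497994
     tx_queue_0_bytes: 1525672312
     tx_queue_1_bytes: 1527162996
     tx_queue_2_bytes: 1529701681
     tx_queue_3_bytes: 1526220338
     tx_queue_4_bytes: 1524403501
     tx_queue_5_bytes: 1523242084
     tx_queue_6_bytes: 1523543868
     tx_queue_7_bytes: 1525376190
     tx_queue_8_bytes: 1526844278
     tx_queue_9_bytes: 1523938842
     tx_queue_10_bytes: 1522663364
     tx_queue_11_bytes: 1527292259
     tx_queue_12_bytes: 1525206246
     tx_queue_13_bytes: 1526670255
     tx_queue_14_bytes: 1523266153
     tx_queue_15_bytes: 1530263032
"""


def parse_stats(text):
    """Turn 'name: value' lines into a {name: int} dict; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\w.]+):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def queue_byte_sum(stats, direction):
    """Sum the <direction>_queue_<n>_bytes counters (direction is 'rx' or 'tx')."""
    name = re.compile(rf"^{direction}_queue_\d+_bytes$")
    return sum(value for key, value in stats.items() if name.match(key))


stats = parse_stats(SAMPLE)
queue_total = queue_byte_sum(stats, "tx")
print("tx_bytes:              ", stats["tx_bytes"])
print("sum of per-queue bytes:", queue_total)
print("difference:            ", stats["tx_bytes"] - queue_total)
```

For the numbers above the difference is 0, i.e. tx_bytes is exactly the sum of
the 16 per-queue counters. The same holds for the Rx output earlier in this
mail.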

- however, with ZC enabled the Tx byte stats don't look OK (some
queues appear to be doing nothing), again with exactly the same
application
The aggregate tx_bytes counter increases much more than the sum of the
per-queue byte counters.
ethtool -S eth4 | grep tx | grep bytes ; sleep 1 ; ethtool -S eth4 | grep tx | grep bytes
     tx_bytes: 256022649
     tx_bytes_nic: 34961074621
     tx_queue_0_bytes: 372
     tx_queue_1_bytes: 0
     tx_queue_2_bytes: 0
     tx_queue_3_bytes: 0
     tx_queue_4_bytes: 9920
     tx_queue_5_bytes: 0
     tx_queue_6_bytes: 0
     tx_queue_7_bytes: 0
     tx_queue_8_bytes: 0
     tx_queue_9_bytes: 1364
     tx_queue_10_bytes: 0
     tx_queue_11_bytes: 0
     tx_queue_12_bytes: 1116
     tx_queue_13_bytes: 0
     tx_queue_14_bytes: 0
     tx_queue_15_bytes: 0

     tx_bytes: 257830280
     tx_bytes_nic: 34962912861
     tx_queue_0_bytes: 372
     tx_queue_1_bytes: 0
     tx_queue_2_bytes: 0
     tx_queue_3_bytes: 0
     tx_queue_4_bytes: 10044
     tx_queue_5_bytes: 0
     tx_queue_6_bytes: 0
     tx_queue_7_bytes: 0
     tx_queue_8_bytes: 0
     tx_queue_9_bytes: 1364
     tx_queue_10_bytes: 0
     tx_queue_11_bytes: 0
     tx_queue_12_bytes: 1116
     tx_queue_13_bytes: 0
     tx_queue_14_bytes: 0
     tx_queue_15_bytes: 0
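
To put numbers on the mismatch, here is a minimal Python sketch that diffs the
two ZC snapshots above (taken about 1 second apart) and lists the queues whose
byte counter did not move. The helper names (parse_stats, byte_deltas) are made
up for illustration, and the tx_queue_<n>_bytes naming is assumed from the
output above.

```python
import re

# The two ZC snapshots from above; queues not listed in NONZERO_* reported 0.
NONZERO_BEFORE = {0: 372, 4: 9920, 9: 1364, 12: 1116}
NONZERO_AFTER = {0: 372, 4: 10044, 9: 1364, 12: 1116}


def snapshot(tx_bytes, tx_bytes_nic, nonzero):
    """Rebuild the `ethtool -S ... | grep tx | grep bytes` text for 16 queues."""
    lines = [f"     tx_bytes: {tx_bytes}", f"     tx_bytes_nic: {tx_bytes_nic}"]
    lines += [f"     tx_queue_{q}_bytes: {nonzero.get(q, 0)}" for q in range(16)]
    return "\n".join(lines)


BEFORE = snapshot(256022649, 34961074621, NONZERO_BEFORE)
AFTER = snapshot(257830280, 34962912861, NONZERO_AFTER)


def parse_stats(text):
    """Turn 'name: value' lines into a {name: int} dict; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\w.]+):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def byte_deltas(before, after):
    """Per-counter increase between two snapshots (counters present in both)."""
    return {name: after[name] - before[name] for name in before if name in after}


deltas = byte_deltas(parse_stats(BEFORE), parse_stats(AFTER))
queue_deltas = {
    name: delta
    for name, delta in deltas.items()
    if re.match(r"^tx_queue_\d+_bytes$", name)
}
idle_queues = [name for name, delta in queue_deltas.items() if delta == 0]

print("tx_bytes delta:         ", deltas["tx_bytes"])
print("sum of per-queue deltas:", sum(queue_deltas.values()))
print("queues with no change:  ", len(idle_queues), "of", len(queue_deltas))
```

For these two snapshots tx_bytes grows by 1807631 bytes while the per-queue
counters together grow by only 124 bytes (all of it on queue 4), and 15 of the
16 queue counters do not change at all.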