netdev - Possible race with xsk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y5pO+XL54ZlzZ7Qe@sbohrer-cf-dell>
Date:   Wed, 14 Dec 2022 16:32:25 -0600
From:   Shawn Bohrer <sbohrer@...udflare.com>
To:     netdev@...r.kernel.org, bpf@...r.kernel.org
Cc:     bjorn@...nel.org, magnus.karlsson@...el.com,
        kernel-team@...udflare.com
Subject: Possible race with xsk_flush

Hello,

I've been trying to track down a problem we've been seeing with
occasional corrupt packets using AF_XDP.  As you probably expect this
occurs because we end up with the same descriptor in ring multiple
times.  However, after lots of debugging and analyzing I think this is
actually caused by a kernel bug, though I still don't fully understand
what is happening.  This is currently being seen on kernel 5.15.77 and
I haven't done any testing to see if we see the same issue on newer
versions.

Inside my application I've written code to track and log as
descriptors are placed into the fill and tx queues, and when they are
pulled out of the rx and completion queues.  The transitions are
logged to /sys/kernel/debug/tracing/trace_marker.  Additionally I keep
my own VecDeque which mirrors the order descriptors are placed in the
fill queue and I verify that they come out of the rx queue in the same
order.  I do realize there are some cases where they might not come
out in the same order but I do not appear to be hitting that.

I then add several kprobes to track the kernel side with ftrace. Most
importantly are these probes:

This adds a probe on the call to xskq_prod_reserve_desc() this
actually creates two probes so you see two prints everytime it is hit:
perf probe -a '__xsk_rcv_zc:7 addr len xs xs->pool->fq'

This adds a probe on xsk_flush():
perf probe -a 'xsk_flush:0 xs'

My AF_XDP application is bound to two multi-queue veth interfaces
(called 'ingress' and 'egress' in my prints) in a networking
namespace.  I'm then generating traffic with iperf and hping3 through
these interfaces.  When I see an out-of-order descriptor in my
application I dump the state of my internal VecDeque to the
trace_marker.  Here is an example of what I'm seeing which looks like
a kernel bug:

 // On CPU 0 we've removed descriptor 0xff0900 from the fill queue,
 // copied a packet, but have not put it in the rx queue yet
 flowtrackd-9chS-142014  [000] d.Z1. 609766.698512: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x9b/0x250) addr=0xff0900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
 flowtrackd-9chS-142014  [000] d.Z1. 609766.698513: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0xff0900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
 // On CPU 2 we've removed descriptor 0x1000900 from the fill queue,
 // copied a packet, but have not put it in the rx queue yet
          iperf2-1217    [002] d.Z1. 609766.698523: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0x9b/0x250) addr=0x1000900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
          iperf2-1217    [002] d.Z1. 609766.698524: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0x1000900 len=0x42 xs=0xffff90fd32693c00 fq=0xffff90fd03d66380
 // On CPU 0 xsk_flush is called on the socket
 flowtrackd-9chS-142014  [000] d.Z1. 609766.698528: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff90fd32693c00
 // My application receives 0xff0900 on the ingress interface queue 1
 flowtrackd-9chS-142014  [000] ..... 609766.698540: tracing_mark_write: ingress q:1 0xff0900 FILL -> RX
 // On CPU 2 xsk_flush is called on the socket
          iperf2-1217    [002] d.Z1. 609766.698545: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff90fd32693c00
 // My application receives 0xf61900 this is unexpected.  We expected
 // to receive 0x1000900 which is what you saw in the previous
 // __xsk_rcv_zc print.  0xf61900 is in the fill queue but it is far
 // away from our current position in the ring and I've trimmed the
 // print slightly to show that.
 flowtrackd-9chS-142014  [000] ..... 609766.698617: tracing_mark_write: ingress q:1 0xf61900 FILL -> RX: expected 0x1000900 remaining: [fe4100, f9c100, f8a100, ..., f61900

>From reading the code I believe that the call to
xskq_prod_reserve_desc() inside __xsk_rcv_zc is the only place
descriptors are placed into the RX queue.  To me this means I should
see a print from my probe for the mystery 0xf61900 but we do not see
this print.  Instead we see the expected 0x1000900.  One theory I have
is that there could be a race where CPU 2 increments the cached_prod
pointer but has not yet updated the addr and len, CPU 0 calls
xsk_flush(), and now my application reads the old descriptor entry
from that location in the RX ring.  This would explain everything, but
the problem with this theory is that __xsk_rcv_zc() and xsk_flush()
are getting called from within napi_poll() and this appears to hold
the netpoll_poll_lock() for the whole time which I think should
prevent the race I just described.

A couple more notes:
* The ftrace print order and timestamps seem to indicate that the CPU
  2 napi_poll is running before the CPU 0 xsk_flush().  I don't know
  if these timestamps can be trusted but it does imply that maybe this
  can race as I described.  I've triggered this twice with xsk_flush
  probes and both show the order above.
* In the 3 times I've triggered this it has occurred right when the
  softirq processing switches CPUs
* I've also triggered this before I added the xsk_flush() probe and
  in that case saw the kernel side additionally fill in the next
  expected descriptor, which in the example above would be 0xfe4100.
  This seems to indicate that my tracking is all still sane.
* This is fairly reproducible, but I've got 10 test boxes running and
  I only get maybe bug a day.

Any thoughts on if the bug I described is actually possible,
alternative theories, or other things to test/try would be welcome.

Thanks,
Shawn Bohrer