Message-ID: <CAL+tcoAk3X2qM7gkeBw60hQ6VKd0Pv0jMtKaEB9uFw0DE=OY2A@mail.gmail.com>
Date: Fri, 4 Jul 2025 07:37:22 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: e.kubanski@...tner.samsung.com
Cc: Stanislav Fomichev <stfomichev@...il.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "bjorn@...nel.org" <bjorn@...nel.org>, 
	"magnus.karlsson@...el.com" <magnus.karlsson@...el.com>, 
	"maciej.fijalkowski@...el.com" <maciej.fijalkowski@...el.com>, 
	"jonathan.lemon@...il.com" <jonathan.lemon@...il.com>
Subject: Re: Re: [PATCH bpf v2] xsk: Fix out of order segment free in __xsk_generic_xmit()

On Mon, Jun 2, 2025 at 5:28 PM Eryk Kubanski
<e.kubanski@...tner.samsung.com> wrote:
>
> > I'm not sure I understand what's the issue here. If you're using the
> > same XSK from different CPUs, you should take care of the ordering
> > yourself on the userspace side?
>
> It's not a problem with the user-space Completion Queue READER side.
> I'm talking exclusively about the kernel-space Completion Queue WRITER side.
>
> This problem can occur when multiple sockets are bound to the same
> umem, device, and queue id. In this situation the Completion Queue is
> shared, which means it can be accessed by multiple threads on the
> kernel side. Every access is indeed protected by a spinlock, but the
> write sequence (acquire a write slot as writer, write to the slot,
> submit the slot to the reader) is not atomic as a whole, so it is
> possible to submit not-yet-sent packet descriptors back to user-space
> as TX completed.
>
> Until now, every write-back operation had two phases, each of which
> takes and releases the spinlock:
> 1) Acquire slot + write descriptor (increase cached_writer by N + write values)
> 2) Submit slot to the reader (increase writer by N)
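> 
> As a sketch (paraphrasing net/xdp/xsk.c; error handling and details
> elided), the two phases look roughly like this:
> 
> /* Phase 1: reserve a slot and write the descriptor (cached_prod++). */
> static int xsk_cq_reserve_addr_locked(struct xdp_sock *xs, u64 addr)
> {
>         unsigned long flags;
>         int ret;
> 
>         spin_lock_irqsave(&xs->pool->cq_lock, flags);
>         ret = xskq_prod_reserve_addr(xs->pool->cq, addr);
>         spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
>         return ret;
> }
> 
> /* Phase 2: later, from the skb destructor, publish n slots (prod += n).
>  * Nothing ties a given reservation to its own submission: the producer
>  * is simply advanced by n, publishing whatever the first n unsubmitted
>  * slots happen to contain at that moment. */
> static void xsk_cq_submit_locked(struct xdp_sock *xs, u32 n)
> {
>         unsigned long flags;
> 
>         spin_lock_irqsave(&xs->pool->cq_lock, flags);
>         xskq_prod_submit_n(xs->pool->cq, n);
>         spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> }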
>
> Slot submission was based solely on timing. Consider a situation where
> two different threads issue a syscall for two different AF_XDP sockets
> that are bound to the same umem, dev, and queue id (a bind sketch
> follows the setup diagram below).
>
> AF_XDP setup:
>
>                              kernel-space
>
>            Write   Read
>             +--+   +--+
>             |  |   |  |
>             |  |   |  |
>             |  |   |  |
>  Completion |  |   |  | Fill
>  Queue      |  |   |  | Queue
>             |  |   |  |
>             |  |   |  |
>             |  |   |  |
>             |  |   |  |
>             +--+   +--+
>             Read   Write
>                              user-space
>
>
>    +--------+         +--------+
>    | AF_XDP |         | AF_XDP |
>    +--------+         +--------+
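> 
> A minimal user-space sketch of such a setup (function name hypothetical,
> error handling elided): the second socket binds with XDP_SHARED_UMEM to
> the same ifindex/queue, so it gets its own RX/TX rings but shares the
> Fill and Completion rings of the umem owner.
> 
> #include <linux/if_xdp.h>
> #include <sys/socket.h>
> 
> int bind_shared_xsk(int ifindex, int queue_id, int umem_owner_fd)
> {
>         struct sockaddr_xdp sxdp = {
>                 .sxdp_family = AF_XDP,
>                 .sxdp_ifindex = ifindex,
>                 .sxdp_queue_id = queue_id,
>                 .sxdp_flags = XDP_SHARED_UMEM,
>                 .sxdp_shared_umem_fd = umem_owner_fd,
>         };
>         int fd = socket(AF_XDP, SOCK_RAW, 0);
> 
>         bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
>         return fd;
> }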
>
>
>
>
>
> Possible out-of-order scenario:
>
>
>                               writer         cached_writer1                      cached_writer2
>                                  |                 |                                   |
>                                  |                 |                                   |
>                                  |                 |                                   |
>                                  |                 |                                   |
>                   +--------------|--------|--------|--------|--------|--------|--------|----------------------------------------------+
>                   |              |        |        |        |        |        |        |                                              |
>  Completion Queue |              |        |        |        |        |        |        |                                              |
>                   |              |        |        |        |        |        |        |                                              |
>                   +--------------|--------|--------|--------|--------|--------|--------|----------------------------------------------+
>                                  |                 |                                   |
>                                  |                 |                                   |
>                                  |-----------------|                                   |
>                                   A) T1 syscall    |                                   |
>                                   writes 2         |                                   |
>                                   descriptors      |-----------------------------------|
>                                                     B) T2 syscall writes 4 descriptors
>

Hi ALL,

Since Maciej posted a related patch to fix this issue, it took me a
little while to trace back to this thread. So here we are.

>                  Notes:
>                  1) T1 and T2 AF_XDP sockets are two different sockets,
>                     __xsk_generic_xmit will obtain two different mutexes.
>                  2) T1 and T2 can be executed simultaneously; there is
>                     no critical section whatsoever between them.
>                  3) T1 and T2 take the Completion Queue lock for acquire + write;
>                     only the slot acquire + write step is under the lock.
>                  4) T1 and T2 completion (the skb destructor) doesn't
>                     have to happen in the same order as A) and B).
>                  5) What if T1 fails after T2 acquires slots?

What do you mean by 'fails'? Could you point out the exact
function you're referring to?

>                     cached_writer will be decreased by 2, and T2 will
>                     submit the failed descriptors of T1 (they shall be
>                     retransmitted in the next TX).
>                     T2's submission will move the writer by 4 slots,
>                     but 2 of those slots hold failed T1 values. The last
>                     two slots of T2 will be missing: a descriptor leak.

I wonder why the leak problem happens. IIUC, in
__xsk_generic_xmit() + copy mode, an xsk only tries to send
descriptors from its own tx ring to the driver, with virtio_net as an
example. As you said, there are two xsks running in parallel. Why
could T2 submit the descriptors that T1 put into the completion queue?
__dev_direct_xmit() is only passed the @skb that is built from the
addrs of the per-xsk tx ring.
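
For reference, the failure path I assume you mean looks roughly like
this (paraphrased from net/xdp/xsk.c, not verbatim):

        /* __xsk_generic_xmit(): the addr was already reserved in the
         * shared cq by xsk_cq_reserve_addr_locked() before this point. */
        err = __dev_direct_xmit(skb, xs->queue_id);
        if (err == NETDEV_TX_BUSY) {
                /* Tell user-space to retry the send */
                xskq_cons_cancel_n(xs->tx, xsk_get_num_desc(skb));
                xsk_consume_skb(skb); /* xsk_cq_cancel_locked(): cached_prod -= n */
                err = -EAGAIN;
                goto out;
        }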

Here are some diagrams of the process you described:
case 1)
// T1 writes 2 descs into the cq
[--1--][--2--][-null-][-null-][-null-][-null-][-null-]
              |
              cached_prod

// T1 fails because of NETDEV_TX_BUSY, and cq.cached_prod is decreased by 2
[-null-][-null-][-null-][-null-][-null-][-null-][-null-]
|
cached_prod

// T2 starts writing at the first unused descs
[--1--][--2--][--3--][--4--][-null-][-null-][-null-]
                            |
                            cached_prod

So why can T2 send out the descs belonging to T1? In
__xsk_generic_xmit(), xsk_cq_reserve_addr_locked() initialises the
addr of the acquired desc, so it overwrites the invalid one previously
owned by T1. And the addr comes from the per-xsk tx ring... I'm lost.
Could you please share the detailed/key functions to shed more light
on this? Thanks in advance.

I know you're not actually running on a (virtual) NIC, but I'd still
like to know whether this issue is possible with normal end-to-end
transmission. In the virtio_net driver, __dev_direct_xmit() returns
NETDEV_TX_BUSY only if BQL takes effect, so your case might not happen
there? The reason I ask is that I have a similar use case with
virtio_net and I'm trying to understand whether it can happen in the
future.

Thanks,
Jason


>                  6) What if T2 completes before T1? The writer will be
>                     moved by 4 slots, 2 of which were filled by T1.
>                     T2 will complete 2 of its own slots and 2 slots of
>                     T1, which is bad. T1 will then complete the last 2
>                     slots of T2, which is also bad.
>
> This out-of-order completion can effectively cause a user-space <->
> kernel-space data race. This patch solves that by only acquiring the
> cached_writer first and doing the completion (submission (write +
> increase writer)) afterwards. This is the only way to make it
> bulletproof against multithreaded access, failures, and out-of-order
> skb completions.
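> 
> Roughly, the idea is the following (a sketch of the scheme, not the
> literal patch; the helper name is made up and it assumes the skb's
> addrs were stashed somewhere at xmit time): reservation only bumps
> cached_prod, and each completion writes its own addresses at the live
> producer index and publishes them in one critical section.
> 
> static void xsk_cq_write_and_submit_locked(struct xdp_sock *xs,
>                                            const u64 *addrs, u32 n)
> {
>         struct xsk_queue *q = xs->pool->cq;
>         struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>         unsigned long flags;
>         u32 i, idx;
> 
>         spin_lock_irqsave(&xs->pool->cq_lock, flags);
>         idx = q->ring->producer;
>         for (i = 0; i < n; i++)
>                 ring->desc[(idx + i) & q->ring_mask] = addrs[i];
>         xskq_prod_submit_n(q, n); /* release-publishes the producer */
>         spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> }
> 
> This way a failed send (cached_prod rolled back) can never be published
> by another socket's completion, and out-of-order completions each
> publish their own addresses.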
>
> > This is definitely a no-go (sk_buff and skb_shared_info space is
> > precious).
>
> Okay, so where should I store it? Can you give me some advice?
>
> I left it there because that is where all the information related to
> skb destruction lives. Additionally, this is the only place in
> skb-related code that defines anything related to xsk: metadata, the
> number of descriptors. sk_buff itself doesn't. I need to hold this
> information somewhere, and sk_buff or skb_shared_info are the only
> places I can store it. It needs to be invariant across all skb
> fragments and be released after the skb completes.
>
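> For reference, this is roughly how the descriptor count is already
> carried today (paraphrased from net/xdp/xsk.c), packed into
> skb_shinfo(skb)->destructor_arg:
> 
> static u32 xsk_get_num_desc(const struct sk_buff *skb)
> {
>         return skb ? (long)skb_shinfo(skb)->destructor_arg : 0;
> }
> 
> static void xsk_set_destructor_arg(struct sk_buff *skb)
> {
>         long num = xsk_get_num_desc(xdp_sk(skb->sk)->skb) + 1;
> 
>         skb_shinfo(skb)->destructor_arg = (void *)num;
> }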
