Message-ID: <aGfKRK2tcnf9WzNp@boxer>
Date: Fri, 4 Jul 2025 14:34:12 +0200
From: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
To: Jason Xing <kerneljasonxing@...il.com>
CC: <e.kubanski@...tner.samsung.com>, Stanislav Fomichev
<stfomichev@...il.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"bjorn@...nel.org" <bjorn@...nel.org>, "magnus.karlsson@...el.com"
<magnus.karlsson@...el.com>, "jonathan.lemon@...il.com"
<jonathan.lemon@...il.com>
Subject: Re: Re: [PATCH bpf v2] xsk: Fix out of order segment free in
__xsk_generic_xmit()
On Fri, Jul 04, 2025 at 07:37:22AM +0800, Jason Xing wrote:
> On Mon, Jun 2, 2025 at 5:28 PM Eryk Kubanski
> <e.kubanski@...tner.samsung.com> wrote:
> >
> > > I'm not sure I understand what's the issue here. If you're using the
> > > same XSK from different CPUs, you should take care of the ordering
> > > yourself on the userspace side?
> >
> > It's not a problem on the user-space Completion Queue READER side.
> > I'm talking exclusively about the kernel-space Completion Queue WRITE side.
> >
> > This problem can occur when multiple sockets are bound to the same
> > umem, device, queue id. In this situation Completion Queue is shared.
> > This means it can be accessed by multiple threads on kernel-side.
> > Each access is indeed protected by a spinlock, but the write sequence
> > as a whole (acquire a write slot as writer, write to the slot, submit
> > the write slot to the reader) isn't atomic in any way, so it's possible
> > to submit not-yet-sent packet descriptors back to user-space as TX
> > completed.
> >
> > Up until now, all write-back operations had two phases (see the
> > sketch below), each phase taking the spinlock and releasing it:
> > 1) Acquire slot + Write descriptor (increase cached-writer by N + write values)
> > 2) Submit slot to the reader (increase writer by N)
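
For reference, the two phases described above map roughly onto the
producer helpers in net/xdp/xsk_queue.h. A simplified sketch (names and
details approximated from my reading of that file; irq flags and some
checks omitted, so treat it as an illustration, not the exact kernel code):

/* Phase 1: acquire a slot and write the descriptor, under cq_lock.
 * Only the private (cached) producer moves here; user space sees
 * nothing yet.
 */
static int cq_reserve_addr(struct xsk_queue *q, u64 addr)
{
        struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;

        if (xskq_prod_is_full(q))
                return -ENOSPC;

        ring->desc[q->cached_prod++ & q->ring_mask] = addr;
        return 0;
}

/* Phase 2: submit slots to the reader, under cq_lock, typically from
 * the skb destructor. This is what makes the entries visible to the
 * user-space consumer.
 */
static void cq_submit_n(struct xsk_queue *q, u32 n)
{
        /* pairs with the consumer's acquire load of ->producer */
        smp_store_release(&q->ring->producer, q->ring->producer + n);
}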
> >
> > Slot submission was based solely on timing. Let's consider a situation
> > where two different threads issue a syscall for two different AF_XDP
> > sockets that are bound to the same umem, dev, queue-id.
> >
> > AF_XDP setup:
> >
> >                   kernel-space
> >
> >            Write             Read
> >             +--+             +--+
> >             |  |             |  |
> >             |  |             |  |
> >             |  |             |  |
> >  Completion |  |             |  | Fill
> >  Queue      |  |             |  | Queue
> >             |  |             |  |
> >             |  |             |  |
> >             |  |             |  |
> >             |  |             |  |
> >             +--+             +--+
> >             Read             Write
> >                   user-space
> >
> >
> >         +--------+       +--------+
> >         | AF_XDP |       | AF_XDP |
> >         +--------+       +--------+
> >
> > Possible out-of-order scenario:
> >
> >
> >                    writer         cached_writer1                        cached_writer2
> >                       |                 |                                   |
> >                       |                 |                                   |
> >                       |                 |                                   |
> >                       |                 |                                   |
> >                   +---|--------|--------|--------|--------|--------|--------|-----+
> >                   |   |        |        |        |        |        |        |     |
> >  Completion Queue |   |        |        |        |        |        |        |     |
> >                   |   |        |        |        |        |        |        |     |
> >                   +---|--------|--------|--------|--------|--------|--------|-----+
> >                       |                 |                                   |
> >                       |                 |                                   |
> >                       |-----------------|                                   |
> >                        A) T1 syscall    |                                   |
> >                           writes 2      |                                   |
> >                           descriptors   |-----------------------------------|
> >                                          B) T2 syscall writes 4 descriptors
> >
>
> Hi ALL,
>
> Since Maciej posted a related patch to fix this issue, it took me a
> little while to trace back to this thread. So here we are.
>
> > Notes:
> > 1) T1 and T2 AF_XDP sockets are two different sockets,
> > __xsk_generic_xmit will obtain two different mutexes.
> > 2) T1 and T2 can be executed simultaneously, there is no
> > critical section whatsoever between them.
> > 3) T1 and T2 will obtain the Completion Queue lock for acquire + write;
> > only the slot acquire + write are under the lock.
> > 4) T1 and T2 completions (skb destructors)
> > don't need to happen in the same order as A) and B).
> > 5) What if T1 fails after T2 acquires slots?
>
> What do you mean by 'fails'? Could you point out the exact
> function you're referring to?
>
> > cached_writer will be decreased by 2, and T2 will
> > submit the failed descriptors of T1 (they shall be
> > retransmitted in the next TX).
> > Submission will move the writer by 4 slots, but
> > 2 of these slots hold the failed T1 values. The last two
> > slots of T2 will be missing: a descriptor leak.
>
> I wonder why the leak problem happens. IIUC, in the
> __xsk_generic_xmit() + copy mode, xsk only tries to send the
> descriptors from its own tx ring to the driver, take virtio_net as an
> example. As you said, there are two xsks running in parallel. Why
> could T2 send the descriptors that T1 puts into the completion queue?
> __dev_direct_xmit() only passes the @skb that is built based on the
> addr from the per-xsk tx ring.
I admit it is a non-trivial case.
Per my understanding, based on Eryk's example, if T1 failed xmit
and reduced the cached_prod, T2 in its skb destructor would release two T1
umem addresses and two T2 addrs instead of 4 T2 addrs.
Putting this aside though, we had *correct* behavior before xsk
multi-buffer support; we should not have let that change make it into the
kernel in the first place. Hence my motivation to restore it.
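
To see where that asymmetry comes from in code: completion is
positional, not address-based. A rough sketch of the destructor path
(modeled on xsk_destruct_skb() and xskq_prod_submit_n(); simplified,
and details may not match the tree exactly):

/* Rough sketch: the destructor publishes "n more entries" counting
 * from the ring's global producer, with no record of *which* addrs a
 * given skb reserved. If T1 rewound cached_prod in the meantime, T2's
 * submit publishes whatever currently sits in those first n slots.
 */
static void xsk_destruct_skb(struct sk_buff *skb)
{
        struct xdp_sock *xs = xdp_sk(skb->sk);
        u32 n = xsk_get_num_desc(skb);  /* descs reserved for this skb */
        unsigned long flags;

        spin_lock_irqsave(&xs->pool->cq_lock, flags);
        xskq_prod_submit_n(xs->pool->cq, n);    /* producer += n */
        spin_unlock_irqrestore(&xs->pool->cq_lock, flags);

        sock_wfree(skb);
}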
>
> Here are some maps related to the process you talked about:
> case 1)
> // T1 writes 2 descs in cq
> [--1--][--2--][-null-][-null-][-null-][-null-][-null-]
>               |
>               cached_prod
>
> // T1 fails because of NETDEV_TX_BUSY, and cq.cached_prod is decreased by 2.
> [-null-][-null-][-null-][-null-][-null-][-null-][-null-]
> |
> cached_prod
>
> // T2 starts to write at the first unused descs
> [--1--][--2--][--3--][--4--][-null-][-null-][-null-]
>                             |
>                             cached_prod
> So why can T2 send out the descs belonging to T1? In
> __xsk_generic_xmit(), xsk_cq_reserve_addr_locked() initialises the
> addr of the acquired desc, so it overwrites the invalid one previously
> owned by T1. The addr is from the per-xsk tx ring... I'm lost. Could you
> please share the detailed/key functions to shed more light on this?
> Thanks in advance.
Take another look at Eryk's example. The case he described was t1
producing a smaller number of addrs, followed by t2 with a bigger count.
Then, due to t1's failure, t2 ended up exposing addrs produced by t1.
Your example talks about immediate failure of t1 whereas Eryk talked
about:
1. t1 produces addrs to cq
2. t2 produces addrs to cq
3. t2 starts xmit
4. t1 fails for some reason down in __xsk_generic_xmit()
4a. t1 reduces cached_prod
5. t2 completes and updates the global state of cq's producer, exposing
addrs produced by t1 and missing part of the addrs produced by t2
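
Step 4a is the rewind that breaks things; schematically it is the
cancel path in net/xdp/xsk.c (again a simplified sketch, not verbatim
kernel code):

/* t1's error path: rewind the *shared* cached producer by the number
 * of slots t1 reserved, even though t2 has already reserved and
 * written the slots right after them. From this point on, slot
 * positions and addr ownership no longer agree.
 */
static void xsk_cq_cancel_locked(struct xdp_sock *xs, u32 n)
{
        unsigned long flags;

        spin_lock_irqsave(&xs->pool->cq_lock, flags);
        xskq_prod_cancel_n(xs->pool->cq, n);    /* cached_prod -= n */
        spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
}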
>
> I know you're not actually running on a (virtual) nic, but I still
> want to know the possibility of this issue with normal end-to-end
> transmission. In the virtio_net driver, __dev_direct_xmit() returns
> BUSY only if the BQL takes effect, so your case might not happen there?
> The reason why I asked is that I have a similar use case with
> virtio_net and I am trying to understand whether it can happen in the
> future.
>
> Thanks,
> Jason
>
>
> > 6) What if T2 completes before T1? The writer will be
> > moved by 4 slots, 2 of which are slots filled by T1.
> > T2 will complete its own 2 slots plus 2 slots of T1; that's bad.
> > T1 will then complete the last 2 slots of T2, also bad.
> >
> > This out-of-order completion can effectively cause a user-space <->
> > kernel-space data race. This patch solves that by only acquiring the
> > cached_writer first and doing the completion (submission (write +
> > increase writer)) afterwards, as sketched below. This is the only way
> > to make it bulletproof against multithreaded access, failures and
> > out-of-order skb completions.
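
One way to picture "write + submit at completion time" is a single
locked step in the destructor; the helper below is hypothetical, only
illustrating the shape of the proposed fix, not the actual patch:

/* Hypothetical illustration: the addr is written into the slot at the
 * *global* producer and published in the same locked step, so a slot
 * can never be exposed holding another socket's (or a failed send's)
 * addr. Reservation earlier only bumps cached_prod to claim space.
 */
static void cq_write_and_submit(struct xsk_queue *q, u64 addr)
{
        struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
        u32 idx = q->ring->producer;

        ring->desc[idx & q->ring_mask] = addr;          /* write */
        smp_store_release(&q->ring->producer, idx + 1); /* submit */
}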
> >
> > > This is definitely a no-go (sk_buff and skb_shared_info space is
> > > precious).
> >
> > Okay, so where should I store it? Can you give me some advice?
> >
> > I left it there because that is where all the information related to
> > skb destruction lives. Additionally, this is the only place in skb-related
> > code that defines anything related to xsk: metadata, number of descriptors.
> > The sk_buff itself doesn't. I need to hold this information somewhere, and
> > skbuff or skb_shared_info are the only places I can store it. It needs to be
> > invariant across all skb fragments and be released after the skb completes.
> >