Message-ID: <CAL+tcoCAH+LN0HtL3wR3+Xw_GggMJS2JZwrKryuWo3f21YMVkA@mail.gmail.com>
Date: Fri, 4 Jul 2025 23:29:08 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: e.kubanski@...tner.samsung.com, Stanislav Fomichev <stfomichev@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "bjorn@...nel.org" <bjorn@...nel.org>,
"magnus.karlsson@...el.com" <magnus.karlsson@...el.com>,
"jonathan.lemon@...il.com" <jonathan.lemon@...il.com>
Subject: Re: Re: [PATCH bpf v2] xsk: Fix out of order segment free in __xsk_generic_xmit()
On Fri, Jul 4, 2025 at 8:35 PM Maciej Fijalkowski
<maciej.fijalkowski@...el.com> wrote:
>
> On Fri, Jul 04, 2025 at 07:37:22AM +0800, Jason Xing wrote:
> > On Mon, Jun 2, 2025 at 5:28 PM Eryk Kubanski
> > <e.kubanski@...tner.samsung.com> wrote:
> > >
> > > > I'm not sure I understand what's the issue here. If you're using the
> > > > same XSK from different CPUs, you should take care of the ordering
> > > > yourself on the userspace side?
> > >
> > > It's not a problem with user-space Completion Queue READER side.
> > > I'm talking exclusively about the kernel-space Completion Queue WRITER side.
> > >
> > > This problem can occur when multiple sockets are bound to the same
> > > umem, device and queue id. In this situation the Completion Queue is
> > > shared, which means it can be accessed by multiple threads on the
> > > kernel side. Each access is indeed protected by a spinlock; however,
> > > the write sequence as a whole (acquire a write slot as writer, write
> > > to the slot, submit the write slot to the reader) isn't atomic, so it
> > > is possible to submit not-yet-sent packet descriptors back to
> > > user space as TX completed.
> > >
> > > Until now, every write-back operation had two phases, and each phase
> > > takes and releases the spinlock:
> > > 1) Acquire slot + write descriptor (increase cached-writer by N + write values)
> > > 2) Submit slot to the reader (increase writer by N)
> > >
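
For readers following along, a minimal user-space model of the two-phase
write-back sequence quoted above might look roughly like this. It is a
simplification for illustration only; the struct and function names are made
up, and this is not the actual net/xdp/xsk_queue.h code.

/*
 * Toy model of the shared completion queue producer. Both phases take the
 * same spinlock, but nothing ties one thread's reserve to its own submit.
 */
#include <pthread.h>
#include <stdint.h>

struct toy_cq {
	pthread_spinlock_t lock;
	uint32_t cached_writer;   /* kernel-private producer position */
	uint32_t writer;          /* producer position visible to user space */
	uint64_t ring[64];        /* completed umem addresses */
};

/* Phase 1: acquire a slot and write the descriptor address into it. */
void toy_cq_reserve(struct toy_cq *q, uint64_t addr)
{
	pthread_spin_lock(&q->lock);
	q->ring[q->cached_writer++ & 63] = addr;
	pthread_spin_unlock(&q->lock);
}

/* Phase 2 (later, e.g. from the skb destructor): publish n entries to the
 * reader by moving the global writer forward. */
void toy_cq_submit(struct toy_cq *q, uint32_t n)
{
	pthread_spin_lock(&q->lock);
	q->writer += n;
	pthread_spin_unlock(&q->lock);
}
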
> > > Slot submission was based solely on timing. Let's consider a situation
> > > where two different threads issue a syscall for two different AF_XDP
> > > sockets that are bound to the same umem, dev and queue-id.
> > >
> > > AF_XDP setup:
> > >
> > >               kernel-space
> > >
> > >            Write         Read
> > >            +--+          +--+
> > >            |  |          |  |
> > >            |  |          |  |
> > >            |  |          |  |
> > > Completion |  |          |  | Fill
> > > Queue      |  |          |  | Queue
> > >            |  |          |  |
> > >            |  |          |  |
> > >            |  |          |  |
> > >            |  |          |  |
> > >            +--+          +--+
> > >            Read          Write
> > >                user-space
> > >
> > >
> > >    +--------+          +--------+
> > >    | AF_XDP |          | AF_XDP |
> > >    +--------+          +--------+
> > >
> > >
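
As a concrete illustration of the setup drawn above, this is roughly how two
AF_XDP sockets can end up sharing one umem, and therefore one fill/completion
queue pair, on the same <dev, queue_id>: the second socket binds with
XDP_SHARED_UMEM and points at the first socket's fd. This is only a sketch
under stated assumptions: error handling and the ring mmap()s are omitted,
and the interface name ("eth0") and the sizes are placeholders.

#include <linux/if_xdp.h>
#include <net/if.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif
#ifndef SOL_XDP
#define SOL_XDP 283
#endif

#define NUM_FRAMES 4096
#define FRAME_SIZE 2048

int main(void)
{
	int fd1 = socket(AF_XDP, SOCK_RAW, 0);
	int fd2 = socket(AF_XDP, SOCK_RAW, 0);
	int ring_sz = 2048;

	/* Register one umem on the first socket; its fill and completion
	 * rings are the ones both sockets will end up using. */
	void *buf = mmap(NULL, (size_t)NUM_FRAMES * FRAME_SIZE,
			 PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct xdp_umem_reg reg = {
		.addr = (uint64_t)(uintptr_t)buf,
		.len = (uint64_t)NUM_FRAMES * FRAME_SIZE,
		.chunk_size = FRAME_SIZE,
	};
	setsockopt(fd1, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg));
	setsockopt(fd1, SOL_XDP, XDP_UMEM_FILL_RING, &ring_sz, sizeof(ring_sz));
	setsockopt(fd1, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_sz, sizeof(ring_sz));
	setsockopt(fd1, SOL_XDP, XDP_TX_RING, &ring_sz, sizeof(ring_sz));
	setsockopt(fd2, SOL_XDP, XDP_TX_RING, &ring_sz, sizeof(ring_sz));

	/* First socket binds normally to <ifindex, queue 0>. */
	struct sockaddr_xdp sxdp = {
		.sxdp_family = AF_XDP,
		.sxdp_ifindex = if_nametoindex("eth0"),
		.sxdp_queue_id = 0,
	};
	bind(fd1, (struct sockaddr *)&sxdp, sizeof(sxdp));

	/* Second socket binds to the same <dev, queue> with XDP_SHARED_UMEM,
	 * pointing at fd1; from then on both sockets complete TX into the
	 * same (shared) completion queue. */
	sxdp.sxdp_flags = XDP_SHARED_UMEM;
	sxdp.sxdp_shared_umem_fd = fd1;
	bind(fd2, (struct sockaddr *)&sxdp, sizeof(sxdp));

	close(fd2);
	close(fd1);
	return 0;
}
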
> > > Possible out-of-order scenario:
> > >
> > >
> > >                               writer        cached_writer1                      cached_writer2
> > >                                  |                 |                                   |
> > >                                  |                 |                                   |
> > >                                  |                 |                                   |
> > >                                  |                 |                                   |
> > >                   +--------------|--------|--------|--------|--------|--------|--------|-------------+
> > >                   |              |        |        |        |        |        |        |             |
> > > Completion Queue  |              |        |        |        |        |        |        |             |
> > >                   |              |        |        |        |        |        |        |             |
> > >                   +--------------|--------|--------|--------|--------|--------|--------|-------------+
> > >                                  |                 |                                   |
> > >                                  |                 |                                   |
> > >                                  |-----------------|                                   |
> > >                                  A) T1 syscall     |                                   |
> > >                                     writes 2       |                                   |
> > >                                     descriptors    |-----------------------------------|
> > >                                                    B) T2 syscall writes 4 descriptors
> > >
> >
> > Hi ALL,
> >
> > Since Maciej posted a related patch to fix this issue, it took me a
> > little while to trace back to this thread. So here we are.
> >
> > > Notes:
> > > 1) T1 and T2 AF_XDP sockets are two different sockets;
> > >    __xsk_generic_xmit will obtain two different mutexes.
> > > 2) T1 and T2 can be executed simultaneously; there is no
> > >    critical section whatsoever between them.
> > > 3) T1 and T2 will obtain the Completion Queue lock for acquire + write;
> > >    only the slot acquire + write are under the lock.
> > > 4) T1 and T2 completions (skb destructors)
> > >    don't need to happen in the same order as A) and B).
> > > 5) What if T1 fails after T2 acquires slots?
> >
> > What do you mean by 'fails'? Could you point out the exact
> > function you are referring to?
> >
> > >    cached_writer will be decreased by 2, and T2 will
> > >    submit the failed descriptors of T1 (they shall be
> > >    retransmitted in the next TX).
> > >    Submission of the writer will move the writer by 4 slots;
> > >    2 of these slots hold the failed T1 values. The last two
> > >    slots of T2 will be missing: a descriptor leak.
> >
> > I wonder why the leak problem happens. IIUC, in
> > __xsk_generic_xmit() with copy mode, an xsk only tries to send the
> > descriptors from its own tx ring to the driver (virtio_net, for
> > example). As you said, there are two xsks running in parallel. Why
> > could T2 send the descriptors that T1 puts into the completion queue?
> > __dev_direct_xmit() only passes the @skb that is built based on the
> > addr taken from each xsk's own tx ring.
>
> I admit it is a non-trivial case.
>
> Per my earlier understanding, based on Eryk's example, if T1 failed the
> xmit and reduced cached_prod, T2 in its skb destructor would release two
> T1 umem addresses and two T2 addrs instead of four T2 addrs.
>
> Putting that aside though, we had *correct* behavior before xsk
> multi-buffer support; we should not have let that change make it into the
> kernel in the first place. Hence my motivation to restore it.
>
> >
> > Here are some maps related to the process you talked about:
> > case 1)
> > // T1 writes 2 descs in cq
> > [--1--][--2--][-null-][-null-][-null-][-null-][-null-]
> >               |
> >               cached_prod
> >
> > // T1 fails because of NETDEV_TX_BUSY, and cq.cached_prod is decreased by 2.
> > [-null-][-null-][-null-][-null-][-null-][-null-][-null-]
> > |
> > cached_prod
> >
> > // T2 starts to write at the first unused descs
> > [--1--][--2--][--3--][--4--][-null-][-null-][-null-]
> >                             |
> >                             cached_prod
> > So why can T2 send out the descs belonging to T1? In
> > __xsk_generic_xmit(), xsk_cq_reserve_addr_locked() initialises the
> > addr of the acquired desc, so it overwrites the invalid one previously
> > owned by T1. The addr comes from each xsk's own tx ring... I'm lost.
> > Could you please share the detailed/key functions to shed more light
> > on this? Thanks in advance.
>
> Take another look at Eryk's example. The case he described was t1
> producing a smaller number of addrs, followed by t2 with a bigger count.
> Then, due to t1's failure, t2 ended up exposing addrs produced by t1.
>
> Your example talks about an immediate failure of t1, whereas Eryk talked
> about:
> 1. t1 produces addrs to cq
> 2. t2 produces addrs to cq
> 3. t2 starts xmit
> 4. t1 fails for some reason down in __xsk_generic_xmit()
> 4a. t1 reduces cached_prod
> 5. t2 completes, updates the global state of the cq's producer, exposing
>    addrs produced by t1 and missing part of the addrs produced by t2
Wow, thanks for sharing your understanding of this. It's very clear
and easy for me to understand.
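
Just to double-check my own understanding, below is a tiny self-contained
user-space re-enactment of that 5-step sequence. It is only a toy model
(a plain array plus cached_prod/prod counters with made-up helper names),
not the real xsk_queue code, but it shows how t2's submit ends up exposing
t1's stale addrs and dropping two of t2's own:

#include <stdint.h>
#include <stdio.h>

#define SIZE 8

static uint64_t ring[SIZE];
static uint32_t cached_prod, prod, cons;

static void reserve(uint64_t addr) { ring[cached_prod++ % SIZE] = addr; }
static void cancel(uint32_t n)     { cached_prod -= n; } /* failed xmit    */
static void submit(uint32_t n)     { prod += n; }        /* skb destructor */

int main(void)
{
	/* 1. t1 produces 2 addrs to the cq */
	reserve(0x1001); reserve(0x1002);
	/* 2. t2 produces 4 addrs to the cq */
	reserve(0x2001); reserve(0x2002); reserve(0x2003); reserve(0x2004);
	/* 3. t2 starts xmit */
	/* 4. t1 fails down in __xsk_generic_xmit() and (4a) reduces cached_prod */
	cancel(2);
	/* 5. t2 completes and publishes its 4 entries */
	submit(4);

	/* user space now consumes prod - cons = 4 "completed" addrs */
	while (cons < prod)
		printf("completed addr 0x%llx\n",
		       (unsigned long long)ring[cons++ % SIZE]);
	/* Prints 0x1001 0x1002 0x2001 0x2002: t1's two stale addrs are
	 * exposed as completed, and t2's 0x2003/0x2004 are never completed
	 * (the descriptor leak). */
	return 0;
}
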
Thanks,
Jason