Message-ID: <CAL+tcoCAH+LN0HtL3wR3+Xw_GggMJS2JZwrKryuWo3f21YMVkA@mail.gmail.com>
Date: Fri, 4 Jul 2025 23:29:08 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: e.kubanski@...tner.samsung.com, Stanislav Fomichev <stfomichev@...il.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "bjorn@...nel.org" <bjorn@...nel.org>,
"magnus.karlsson@...el.com" <magnus.karlsson@...el.com>,
"jonathan.lemon@...il.com" <jonathan.lemon@...il.com>
Subject: Re: Re: [PATCH bpf v2] xsk: Fix out of order segment free in __xsk_generic_xmit()
On Fri, Jul 4, 2025 at 8:35 PM Maciej Fijalkowski
<maciej.fijalkowski@...el.com> wrote:
>
> On Fri, Jul 04, 2025 at 07:37:22AM +0800, Jason Xing wrote:
> > On Mon, Jun 2, 2025 at 5:28 PM Eryk Kubanski
> > <e.kubanski@...tner.samsung.com> wrote:
> > >
> > > > I'm not sure I understand what's the issue here. If you're using the
> > > > same XSK from different CPUs, you should take care of the ordering
> > > > yourself on the userspace side?
> > >
> > > It's not a problem with user-space Completion Queue READER side.
> > > I'm talking exclusively about the kernel-space Completion Queue WRITER side.
> > >
> > > This problem can occur when multiple sockets are bound to the same
> > > umem, device and queue id. In this situation the Completion Queue is
> > > shared, which means it can be accessed by multiple threads on the
> > > kernel side. Each access is indeed protected by a spinlock; however,
> > > the write sequence as a whole (acquire a write slot as writer, write
> > > to the slot, submit the write slot to the reader) isn't atomic, so it
> > > is possible to submit not-yet-sent packet descriptors back to
> > > user space as TX completed.
> > >
> > > Until now, every write-back operation had two phases, and each phase
> > > takes and releases the spinlock:
> > > 1) Acquire slot + write descriptor (increase cached-writer by N + write values)
> > > 2) Submit slot to the reader (increase writer by N)
> > >
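
For readers following along, a minimal user-space model of the two-phase
write-back sequence quoted above might look roughly like this. It is a
simplification for illustration only; the struct and function names are made
up, and this is not the actual net/xdp/xsk_queue.h code.

/*
 * Toy model of the shared completion queue producer. Both phases take the
 * same spinlock, but nothing ties one thread's reserve to its own submit.
 */
#include <pthread.h>
#include <stdint.h>

struct toy_cq {
	pthread_spinlock_t lock;
	uint32_t cached_writer;   /* kernel-private producer position */
	uint32_t writer;          /* producer position visible to user space */
	uint64_t ring[64];        /* completed umem addresses */
};

/* Phase 1: acquire a slot and write the descriptor address into it. */
void toy_cq_reserve(struct toy_cq *q, uint64_t addr)
{
	pthread_spin_lock(&q->lock);
	q->ring[q->cached_writer++ & 63] = addr;
	pthread_spin_unlock(&q->lock);
}

/* Phase 2 (later, e.g. from the skb destructor): publish n entries to the
 * reader by moving the global writer forward. */
void toy_cq_submit(struct toy_cq *q, uint32_t n)
{
	pthread_spin_lock(&q->lock);
	q->writer += n;
	pthread_spin_unlock(&q->lock);
}
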
> > > Slot submission was based solely on timing. Let's consider a situation
> > > where two different threads issue a syscall for two different AF_XDP
> > > sockets that are bound to the same umem, dev and queue-id.
> > >
> > > AF_XDP setup:
> > >
> > >               kernel-space
> > >
> > >            Write         Read
> > >            +--+          +--+
> > >            |  |          |  |
> > >            |  |          |  |
> > >            |  |          |  |
> > > Completion |  |          |  | Fill
> > > Queue      |  |          |  | Queue
> > >            |  |          |  |
> > >            |  |          |  |
> > >            |  |          |  |
> > >            |  |          |  |
> > >            +--+          +--+
> > >            Read          Write
> > >                user-space
> > >
> > >
> > >    +--------+          +--------+
> > >    | AF_XDP |          | AF_XDP |
> > >    +--------+          +--------+
> > >
> > >
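
As a concrete illustration of the setup drawn above, this is roughly how two
AF_XDP sockets can end up sharing one umem, and therefore one fill/completion
queue pair, on the same <dev, queue_id>: the second socket binds with
XDP_SHARED_UMEM and points at the first socket's fd. This is only a sketch
under stated assumptions: error handling and the ring mmap()s are omitted,
and the interface name ("eth0") and the sizes are placeholders.

#include <linux/if_xdp.h>
#include <net/if.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_XDP
#define AF_XDP 44
#endif
#ifndef SOL_XDP
#define SOL_XDP 283
#endif

#define NUM_FRAMES 4096
#define FRAME_SIZE 2048

int main(void)
{
	int fd1 = socket(AF_XDP, SOCK_RAW, 0);
	int fd2 = socket(AF_XDP, SOCK_RAW, 0);
	int ring_sz = 2048;

	/* Register one umem on the first socket; its fill and completion
	 * rings are the ones both sockets will end up using. */
	void *buf = mmap(NULL, (size_t)NUM_FRAMES * FRAME_SIZE,
			 PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct xdp_umem_reg reg = {
		.addr = (uint64_t)(uintptr_t)buf,
		.len = (uint64_t)NUM_FRAMES * FRAME_SIZE,
		.chunk_size = FRAME_SIZE,
	};
	setsockopt(fd1, SOL_XDP, XDP_UMEM_REG, &reg, sizeof(reg));
	setsockopt(fd1, SOL_XDP, XDP_UMEM_FILL_RING, &ring_sz, sizeof(ring_sz));
	setsockopt(fd1, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_sz, sizeof(ring_sz));
	setsockopt(fd1, SOL_XDP, XDP_TX_RING, &ring_sz, sizeof(ring_sz));
	setsockopt(fd2, SOL_XDP, XDP_TX_RING, &ring_sz, sizeof(ring_sz));

	/* First socket binds normally to <ifindex, queue 0>. */
	struct sockaddr_xdp sxdp = {
		.sxdp_family = AF_XDP,
		.sxdp_ifindex = if_nametoindex("eth0"),
		.sxdp_queue_id = 0,
	};
	bind(fd1, (struct sockaddr *)&sxdp, sizeof(sxdp));

	/* Second socket binds to the same <dev, queue> with XDP_SHARED_UMEM,
	 * pointing at fd1; from then on both sockets complete TX into the
	 * same (shared) completion queue. */
	sxdp.sxdp_flags = XDP_SHARED_UMEM;
	sxdp.sxdp_shared_umem_fd = fd1;
	bind(fd2, (struct sockaddr *)&sxdp, sizeof(sxdp));

	close(fd2);
	close(fd1);
	return 0;
}
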
> > > Possible out-of-order scenario:
> > >
> > >
> > >                               writer        cached_writer1                      cached_writer2
> > >                                  |                 |                                   |
> > >                                  |                 |                                   |
> > >                                  |                 |                                   |
> > >                                  |                 |                                   |
> > >                   +--------------|--------|--------|--------|--------|--------|--------|-------------+
> > >                   |              |        |        |        |        |        |        |             |
> > > Completion Queue  |              |        |        |        |        |        |        |             |
> > >                   |              |        |        |        |        |        |        |             |
> > >                   +--------------|--------|--------|--------|--------|--------|--------|-------------+
> > >                                  |                 |                                   |
> > >                                  |                 |                                   |
> > >                                  |-----------------|                                   |
> > >                                  A) T1 syscall     |                                   |
> > >                                     writes 2       |                                   |
> > >                                     descriptors    |-----------------------------------|
> > >                                                    B) T2 syscall writes 4 descriptors
> > >
> >
> > Hi ALL,
> >
> > Since Maciej posted a related patch to fix this issue, it took me a
> > little while to trace back to this thread. So here we are.
> >
> > > Notes:
> > > 1) T1 and T2 AF_XDP sockets are two different sockets;
> > >    __xsk_generic_xmit will obtain two different mutexes.
> > > 2) T1 and T2 can be executed simultaneously; there is no
> > >    critical section whatsoever between them.
> > > 3) T1 and T2 will obtain the Completion Queue lock for acquire + write;
> > >    only the slot acquire + write are under the lock.
> > > 4) T1 and T2 completions (skb destructors)
> > >    don't need to happen in the same order as A) and B).
> > > 5) What if T1 fails after T2 acquires slots?
> >
> > What do you mean by 'fails'? Could you point out the exact
> > function you are referring to?
> >
> > >    cached_writer will be decreased by 2, and T2 will
> > >    submit the failed descriptors of T1 (they shall be
> > >    retransmitted in the next TX).
> > >    Submission of the writer will move the writer by 4 slots;
> > >    2 of these slots hold the failed T1 values. The last two
> > >    slots of T2 will be missing: a descriptor leak.
> >
> > I wonder why the leak problem happens. IIUC, in
> > __xsk_generic_xmit() with copy mode, an xsk only tries to send the
> > descriptors from its own tx ring to the driver (virtio_net, for
> > example). As you said, there are two xsks running in parallel. Why
> > could T2 send the descriptors that T1 puts into the completion queue?
> > __dev_direct_xmit() only passes the @skb that is built based on the
> > addr taken from each xsk's own tx ring.
>
> I admit it is a non-trivial case.
>
> Per my earlier understanding, based on Eryk's example, if T1 failed the
> xmit and reduced cached_prod, T2 in its skb destructor would release two
> T1 umem addresses and two T2 addrs instead of four T2 addrs.
>
> Putting that aside though, we had *correct* behavior before xsk
> multi-buffer support; we should not have let that change make it into the
> kernel in the first place. Hence my motivation to restore it.
>
> >
> > Here are some maps related to the process you talked about:
> > case 1)
> > // T1 writes 2 descs in cq
> > [--1--][--2--][-null-][-null-][-null-][-null-][-null-]
> >               |
> >               cached_prod
> >
> > // T1 fails because of NETDEV_TX_BUSY, and cq.cached_prod is decreased by 2.
> > [-null-][-null-][-null-][-null-][-null-][-null-][-null-]
> > |
> > cached_prod
> >
> > // T2 starts to write at the first unused descs
> > [--1--][--2--][--3--][--4--][-null-][-null-][-null-]
> >                             |
> >                             cached_prod
> > So why can T2 send out the descs belonging to T1? In
> > __xsk_generic_xmit(), xsk_cq_reserve_addr_locked() initialises the
> > addr of the acquired desc, so it overwrites the invalid one previously
> > owned by T1. The addr comes from each xsk's own tx ring... I'm lost.
> > Could you please share the detailed/key functions to shed more light
> > on this? Thanks in advance.
>
> Take another look at Eryk's example. The case he described was t1
> producing a smaller number of addrs, followed by t2 with a bigger count.
> Then, due to t1's failure, t2 ended up exposing addrs produced by t1.
>
> Your example talks about an immediate failure of t1, whereas Eryk talked
> about:
> 1. t1 produces addrs to cq
> 2. t2 produces addrs to cq
> 3. t2 starts xmit
> 4. t1 fails for some reason down in __xsk_generic_xmit()
> 4a. t1 reduces cached_prod
> 5. t2 completes, updates the global state of the cq's producer, exposing
>    addrs produced by t1 and missing part of the addrs produced by t2
Wow, thanks for sharing your understanding of this. It's very clear
and easy for me to understand.
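
Just to double-check my own understanding, below is a tiny self-contained
user-space re-enactment of that 5-step sequence. It is only a toy model
(a plain array plus cached_prod/prod counters with made-up helper names),
not the real xsk_queue code, but it shows how t2's submit ends up exposing
t1's stale addrs and dropping two of t2's own:

#include <stdint.h>
#include <stdio.h>

#define SIZE 8

static uint64_t ring[SIZE];
static uint32_t cached_prod, prod, cons;

static void reserve(uint64_t addr) { ring[cached_prod++ % SIZE] = addr; }
static void cancel(uint32_t n)     { cached_prod -= n; } /* failed xmit    */
static void submit(uint32_t n)     { prod += n; }        /* skb destructor */

int main(void)
{
	/* 1. t1 produces 2 addrs to the cq */
	reserve(0x1001); reserve(0x1002);
	/* 2. t2 produces 4 addrs to the cq */
	reserve(0x2001); reserve(0x2002); reserve(0x2003); reserve(0x2004);
	/* 3. t2 starts xmit */
	/* 4. t1 fails down in __xsk_generic_xmit() and (4a) reduces cached_prod */
	cancel(2);
	/* 5. t2 completes and publishes its 4 entries */
	submit(4);

	/* user space now consumes prod - cons = 4 "completed" addrs */
	while (cons < prod)
		printf("completed addr 0x%llx\n",
		       (unsigned long long)ring[cons++ % SIZE]);
	/* Prints 0x1001 0x1002 0x2001 0x2002: t1's two stale addrs are
	 * exposed as completed, and t2's 0x2003/0x2004 are never completed
	 * (the descriptor leak). */
	return 0;
}
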
Thanks,
Jason