Message-ID: <CAL+tcoAv+dTK-Z=HNGUJNohxRu_oWCPQ4L1BRQT9nvB4WZMd7Q@mail.gmail.com>
Date: Sat, 15 Nov 2025 07:46:40 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: Magnus Karlsson <magnus.karlsson@...il.com>, davem@...emloft.net, edumazet@...gle.com, 
	kuba@...nel.org, pabeni@...hat.com, bjorn@...nel.org, 
	magnus.karlsson@...el.com, jonathan.lemon@...il.com, sdf@...ichev.me, 
	ast@...nel.org, daniel@...earbox.net, hawk@...nel.org, 
	john.fastabend@...il.com, joe@...a.to, willemdebruijn.kernel@...il.com, 
	fmancera@...e.de, csmate@....hu, bpf@...r.kernel.org, netdev@...r.kernel.org, 
	Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH RFC net-next 2/2] xsk: introduce a cached cq to
 temporarily store descriptor addrs

On Fri, Nov 14, 2025 at 11:53 PM Maciej Fijalkowski
<maciej.fijalkowski@...el.com> wrote:
>
> On Tue, Nov 11, 2025 at 10:02:58PM +0800, Jason Xing wrote:
> > Hi Magnus,
> >
> > On Tue, Nov 11, 2025 at 9:44 PM Magnus Karlsson
> > <magnus.karlsson@...il.com> wrote:
> > >
> > > On Tue, 11 Nov 2025 at 14:06, Jason Xing <kerneljasonxing@...il.com> wrote:
> > > >
> > > > Hi Maciej,
> > > >
> > > > On Mon, Nov 3, 2025 at 11:00 PM Maciej Fijalkowski
> > > > <maciej.fijalkowski@...el.com> wrote:
> > > > >
> > > > > On Sat, Nov 01, 2025 at 07:59:36AM +0800, Jason Xing wrote:
> > > > > > On Fri, Oct 31, 2025 at 10:02 PM Maciej Fijalkowski
> > > > > > <maciej.fijalkowski@...el.com> wrote:
> > > > > > >
> > > > > > > On Fri, Oct 31, 2025 at 05:32:30PM +0800, Jason Xing wrote:
> > > > > > > > From: Jason Xing <kernelxing@...cent.com>
> > > > > > > >
> > > > > > > > Before commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> > > > > > > > production"), there was an issue[1] that caused descriptors to be
> > > > > > > > published incorrectly under a race condition. That commit fixes the
> > > > > > > > issue but adds more memory operations in the xmit hot path and
> > > > > > > > interrupt context, which can hurt performance.
> > > > > > > >
> > > > > > > > This patch proposes a new solution that fixes the problem without
> > > > > > > > allocating and deallocating memory. One of the key points is that I
> > > > > > > > borrowed the idea from the above commit of postponing the update of
> > > > > > > > ring->descs to xsk_destruct_skb() instead of doing it in
> > > > > > > > __xsk_generic_xmit().
> > > > > > > >
> > > > > > > > The core logic is as shown below:
> > > > > > > > 1. Allocate a new local queue; only its cached_prod member is used.
> > > > > > > > 2. Write the descriptors into the local queue in the xmit path, and
> > > > > > > >    record the cached_prod as @start_addr, which reflects the start
> > > > > > > >    position in this queue so that the skb can later find where its
> > > > > > > >    addrs were written during the destruction phase.
> > > > > > > > 3. Initialize the upper 24 bits of destructor_arg to store @start_addr
> > > > > > > >    in xsk_skb_init_misc().
> > > > > > > > 4. Initialize the lower 8 bits of destructor_arg to store how many
> > > > > > > >    descriptors the skb owns in xsk_update_num_desc().
> > > > > > > > 5. Write the desc addr(s), starting from @start_addr in the cached cq,
> > > > > > > >    one by one into the real cq in xsk_destruct_skb(), and in turn sync
> > > > > > > >    the global state of the cq.
> > > > > > > >
> > > > > > > > The format of destructor_arg is designed as:
> > > > > > > >  ------------------------ --------
> > > > > > > > |       start_addr       |  num   |
> > > > > > > >  ------------------------ --------
> > > > > > > > The upper 24 bits are enough to index the temporary descriptors, and
> > > > > > > > the lower 8 bits are enough to hold the number of descriptors that
> > > > > > > > one skb owns.
> > > > > > > >
> > > > > > > > [1]: https://lore.kernel.org/all/20250530095957.43248-1-e.kubanski@partner.samsung.com/
> > > > > > > >
> > > > > > > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > > > > > > ---
> > > > > > > > I posted the series as an RFC because I'd like to hear more opinions
> > > > > > > > on the current rough approach, so that the fix[2] can be avoided and
> > > > > > > > its performance impact mitigated. This patch might have bugs because
> > > > > > > > I decided to spend more time on it only after we reach an agreement.
> > > > > > > > Please review the overall concept. Thanks!
> > > > > > > >
> > > > > > > > Maciej, could you share the way you tested jumbo frames? I used
> > > > > > > > ./xdpsock -i enp2s0f1 -t -q 1 -S -s 9728, but xdpsock utilizes the
> > > > > > > > NIC at more than 90%, which means I cannot see the performance impact.
> > > > > >
> > > > > > Could you provide the command you used? Thanks :)
> > > > > >
> > > > > > > >
> > > > > > > > [2]:https://lore.kernel.org/all/20251030140355.4059-1-fmancera@suse.de/
> > > > > > > > ---
> > > > > > > >  include/net/xdp_sock.h      |   1 +
> > > > > > > >  include/net/xsk_buff_pool.h |   1 +
> > > > > > > >  net/xdp/xsk.c               | 104 ++++++++++++++++++++++++++++--------
> > > > > > > >  net/xdp/xsk_buff_pool.c     |   1 +
> > > > > > > >  4 files changed, 84 insertions(+), 23 deletions(-)
> > > > > > >
> > > > > > > (...)
> > > > > > >
> > > > > > > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > > > > > > index aa9788f20d0d..6e170107dec7 100644
> > > > > > > > --- a/net/xdp/xsk_buff_pool.c
> > > > > > > > +++ b/net/xdp/xsk_buff_pool.c
> > > > > > > > @@ -99,6 +99,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
> > > > > > > >
> > > > > > > >       pool->fq = xs->fq_tmp;
> > > > > > > >       pool->cq = xs->cq_tmp;
> > > > > > > > +     pool->cached_cq = xs->cached_cq;
> > > > > > >
> > > > > > > Jason,
> > > > > > >
> > > > > > > pool can be shared between multiple sockets that bind to same <netdev,qid>
> > > > > > > tuple. I believe here you're opening up for the very same issue Eryk
> > > > > > > initially reported.
> > > > > >
> > > > > > Actually it shouldn't happen, because the cached_cq is more of a
> > > > > > temporary array that helps the skb store its start position. The
> > > > > > cached_prod of cached_cq can only be increased, never decreased. In
> > > > > > the skb destruction phase, only the skbs that reach the end of their
> > > > > > life need to sync their descs from cached_cq to cq. For skbs that are
> > > > > > released before tx completion, we don't need to clear their records
> > > > > > in cached_cq at all, and cq remains untouched.
> > > > > >
> > > > > > To put it simply, the patch you proposed uses kmem_cache* helpers to
> > > > > > store the addrs and writes them into the cq at the end of the skb's
> > > > > > lifecycle, while the current patch stores them in pre-allocated
> > > > > > memory, avoiding the allocation and deallocation.
> > > > > >
> > > > > > Unless I'm missing something important, I'm still convinced this
> > > > > > temporary queue can solve the problem, since essentially it's a
> > > > > > better substitute for the kmem cache that retains high performance.
>
> Back after health issues!

Hi Maciej,

Hope you're fully recovered :)

>
> Jason, I am still not convinced about this solution.
>
> In shared pool setups, the temp cq will also be shared, which means that
> two parallel processes can produce addresses onto the temp cq and
> therefore expose an address to a socket it does not belong to. In order
> to make this work you would have to know the descriptor count of a given
> frame upfront and reserve it while processing the first descriptor.
>
> socket 0                        socket 1
> prod addr 0xAA
> prod addr 0xBB
>                                 prod addr 0xDD
> prod addr 0xCC
>                                 prod addr 0xEE
>
> socket 0 calls skb destructor with num desc == 3, placing 0xDD onto cq
> which has not been sent yet, therefore potentially corrupting it.

Thanks for spotting this case!

Yes, it can happen, so let's switch to per-xsk granularity? If each
xsk has its own temp queue, the problem disappears, and the good news
is that we don't need extra locks like pool->cq_lock to prevent
multiple parallel xsks from accessing the temp queue.

Hope you can agree with this method. It borrows your idea and then
only uses a _pre-allocated buffer_ to replace kmem_cache_alloc() in
the hot path. This solution steers us in a higher-performance
direction. IMHO, I'd rather not see any performance degradation
because of such issues.

Thanks,
Jason

>
> For now, I think we should move forward with Fernando's fix as there have
> been multiple reports already regarding broken state of xsk copy mode.
>
> > > > >
> > > > > I need a bit more time on this, probably I'll respond tomorrow.
> > > >
> > > > I'd like to know if you have any further comments on this. And should
> > > > I continue and post this as an official series?
> > >
> > > Hi Jason,
> > >
> > > Maciej has been out-of-office for a couple of days. He should
> > > hopefully be back later this week, so please wait for his comments.
> >
> > Thanks for letting me know. I will wait :)
> >
> > Thanks,
> > Jason
