Message-ID: <CAL+tcoBTuOnnhAUD9gwbt8VBf+m=c08c-+cOUyjuPLyx29xUWw@mail.gmail.com>
Date: Tue, 18 Nov 2025 08:01:52 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: Magnus Karlsson <magnus.karlsson@...il.com>, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com, bjorn@...nel.org,
magnus.karlsson@...el.com, jonathan.lemon@...il.com, sdf@...ichev.me,
ast@...nel.org, daniel@...earbox.net, hawk@...nel.org,
john.fastabend@...il.com, joe@...a.to, willemdebruijn.kernel@...il.com,
fmancera@...e.de, csmate@....hu, bpf@...r.kernel.org, netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH RFC net-next 2/2] xsk: introduce a cached cq to
temporarily store descriptor addrs
On Tue, Nov 18, 2025 at 12:05 AM Maciej Fijalkowski
<maciej.fijalkowski@...el.com> wrote:
>
> On Sat, Nov 15, 2025 at 07:46:40AM +0800, Jason Xing wrote:
> > On Fri, Nov 14, 2025 at 11:53 PM Maciej Fijalkowski
> > <maciej.fijalkowski@...el.com> wrote:
> > >
> > > On Tue, Nov 11, 2025 at 10:02:58PM +0800, Jason Xing wrote:
> > > > Hi Magnus,
> > > >
> > > > On Tue, Nov 11, 2025 at 9:44 PM Magnus Karlsson
> > > > <magnus.karlsson@...il.com> wrote:
> > > > >
> > > > > On Tue, 11 Nov 2025 at 14:06, Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > >
> > > > > > Hi Maciej,
> > > > > >
> > > > > > On Mon, Nov 3, 2025 at 11:00 PM Maciej Fijalkowski
> > > > > > <maciej.fijalkowski@...el.com> wrote:
> > > > > > >
> > > > > > > On Sat, Nov 01, 2025 at 07:59:36AM +0800, Jason Xing wrote:
> > > > > > > > On Fri, Oct 31, 2025 at 10:02 PM Maciej Fijalkowski
> > > > > > > > <maciej.fijalkowski@...el.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Oct 31, 2025 at 05:32:30PM +0800, Jason Xing wrote:
> > > > > > > > > > From: Jason Xing <kernelxing@...cent.com>
> > > > > > > > > >
> > > > > > > > > > Before commit 30f241fcf52a ("xsk: Fix immature cq descriptor
> > > > > > > > > > production"), there was an issue[1] that caused descriptors to be
> > > > > > > > > > published incorrectly in a race condition. The above commit fixes
> > > > > > > > > > the issue but adds more memory operations in the xmit hot path and
> > > > > > > > > > in interrupt context, which can hurt performance.
> > > > > > > > > >
> > > > > > > > > > This patch proposes a new solution that fixes the problem without
> > > > > > > > > > allocating and freeing memory in the hot path. One of the key
> > > > > > > > > > points is that it borrows the idea from the above commit of
> > > > > > > > > > postponing the update of ring->descs to xsk_destruct_skb() instead
> > > > > > > > > > of doing it in __xsk_generic_xmit().
> > > > > > > > > >
> > > > > > > > > > The core logic is as shown below:
> > > > > > > > > > 1. Allocate a new local queue. Only its cached_prod member is used.
> > > > > > > > > > 2. Write the descriptors into the local queue in the xmit path, and
> > > > > > > > > >    record the cached_prod as @start_addr, which reflects the start
> > > > > > > > > >    position in this queue, so that later the skb can easily find
> > > > > > > > > >    where its addrs were written in the destruction phase.
> > > > > > > > > > 3. Initialize the upper 24 bits of destructor_arg to store
> > > > > > > > > >    @start_addr in xsk_skb_init_misc().
> > > > > > > > > > 4. Initialize the lower 8 bits of destructor_arg to store how many
> > > > > > > > > >    descriptors the skb owns in xsk_update_num_desc().
> > > > > > > > > > 5. Write the desc addr(s), starting from @start_addr in the cached
> > > > > > > > > >    cq, one by one into the real cq in xsk_destruct_skb(), and in
> > > > > > > > > >    turn sync the global state of the cq.
> > > > > > > > > >
> > > > > > > > > > The format of destructor_arg is designed as:
> > > > > > > > > >  ------------------------ --------
> > > > > > > > > > |       start_addr       |  num   |
> > > > > > > > > >  ------------------------ --------
> > > > > > > > > > The upper 24 bits are enough to index the temporary descriptors,
> > > > > > > > > > and the lower 8 bits are enough to record the number of descriptors
> > > > > > > > > > that one skb owns (see the sketch below).
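> > > > > > > > > >
> > > > > > > > > > As a minimal sketch of the encoding (the helper names below are
> > > > > > > > > > illustrative, not necessarily the final patch code):
> > > > > > > > > >
> > > > > > > > > > #define XSK_DESC_NUM_BITS	8
> > > > > > > > > > #define XSK_DESC_NUM_MASK	((1UL << XSK_DESC_NUM_BITS) - 1)
> > > > > > > > > >
> > > > > > > > > > /* Pack the cached-cq start position and the descriptor count. */
> > > > > > > > > > static unsigned long xsk_pack_destructor_arg(u32 start_addr, u8 num)
> > > > > > > > > > {
> > > > > > > > > > 	return ((unsigned long)start_addr << XSK_DESC_NUM_BITS) | num;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > static u32 xsk_get_start_addr(unsigned long arg)
> > > > > > > > > > {
> > > > > > > > > > 	return arg >> XSK_DESC_NUM_BITS;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > static u8 xsk_get_num_desc(unsigned long arg)
> > > > > > > > > > {
> > > > > > > > > > 	return arg & XSK_DESC_NUM_MASK;
> > > > > > > > > > }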
> > > > > > > > > >
> > > > > > > > > > [1]: https://lore.kernel.org/all/20250530095957.43248-1-e.kubanski@partner.samsung.com/
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > > > > > > > > ---
> > > > > > > > > > I posted the series as an RFC because I'd like to hear more opinions
> > > > > > > > > > on the current rough approach, so that the fix[2] can be avoided and
> > > > > > > > > > its performance impact mitigated. This patch might have bugs because
> > > > > > > > > > I decided to spend more time on it only after we come to an
> > > > > > > > > > agreement. Please review the overall concept. Thanks!
> > > > > > > > > >
> > > > > > > > > > Maciej, could you share the way you tested jumbo frames? I used
> > > > > > > > > > ./xdpsock -i enp2s0f1 -t -q 1 -S -s 9728, but xdpsock utilizes more
> > > > > > > > > > than 90% of the NIC, which means I cannot see the performance impact.
> > > > > > > >
> > > > > > > > Could you provide the command you used? Thanks :)
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [2]:https://lore.kernel.org/all/20251030140355.4059-1-fmancera@suse.de/
> > > > > > > > > > ---
> > > > > > > > > > include/net/xdp_sock.h | 1 +
> > > > > > > > > > include/net/xsk_buff_pool.h | 1 +
> > > > > > > > > > net/xdp/xsk.c | 104 ++++++++++++++++++++++++++++--------
> > > > > > > > > > net/xdp/xsk_buff_pool.c | 1 +
> > > > > > > > > > 4 files changed, 84 insertions(+), 23 deletions(-)
> > > > > > > > >
> > > > > > > > > (...)
> > > > > > > > >
> > > > > > > > > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > > > > > > > > index aa9788f20d0d..6e170107dec7 100644
> > > > > > > > > > --- a/net/xdp/xsk_buff_pool.c
> > > > > > > > > > +++ b/net/xdp/xsk_buff_pool.c
> > > > > > > > > > @@ -99,6 +99,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
> > > > > > > > > >
> > > > > > > > > > pool->fq = xs->fq_tmp;
> > > > > > > > > > pool->cq = xs->cq_tmp;
> > > > > > > > > > + pool->cached_cq = xs->cached_cq;
> > > > > > > > >
> > > > > > > > > Jason,
> > > > > > > > >
> > > > > > > > > A pool can be shared between multiple sockets that bind to the same
> > > > > > > > > <netdev,qid> tuple. I believe here you're opening up the very same
> > > > > > > > > issue Eryk initially reported.
> > > > > > > >
> > > > > > > > Actually it shouldn't happen, because cached_cq is more of a
> > > > > > > > temporary array that helps the skb store its start position. The
> > > > > > > > cached_prod of cached_cq can only be increased, never decreased. In
> > > > > > > > the skb destruction phase, only those skbs that reach the end of
> > > > > > > > their life need to sync their descs from cached_cq to cq. For skbs
> > > > > > > > that are released before tx completion, we don't need to clear
> > > > > > > > their records in cached_cq at all, and the cq remains untouched.
> > > > > > > >
> > > > > > > > To put it simply, the patch you proposed uses kmem_cache_* helpers
> > > > > > > > to store the addr and writes the addr into the cq at the end of the
> > > > > > > > lifecycle, while the current patch stores it in pre-allocated
> > > > > > > > memory. So it avoids the allocation and deallocation.
> > > > > > > >
> > > > > > > > Unless I'm missing something important, I'm still convinced this
> > > > > > > > temporary queue can solve the problem, since essentially it's a
> > > > > > > > better substitute for the kmem cache that retains high performance.
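> > > > > > > >
> > > > > > > > For instance, the destruction path could look roughly like this (a
> > > > > > > > sketch only; xskq_get_cached_addr() is a hypothetical accessor for
> > > > > > > > the pre-allocated array, and multi-buffer/error handling is
> > > > > > > > omitted):
> > > > > > > >
> > > > > > > > static void xsk_destruct_skb(struct sk_buff *skb)
> > > > > > > > {
> > > > > > > > 	struct xsk_buff_pool *pool = xdp_sk(skb->sk)->pool;
> > > > > > > > 	unsigned long arg = (unsigned long)skb_shinfo(skb)->destructor_arg;
> > > > > > > > 	u32 start = xsk_get_start_addr(arg);
> > > > > > > > 	u8 i, num = xsk_get_num_desc(arg);
> > > > > > > > 	unsigned long flags;
> > > > > > > >
> > > > > > > > 	spin_lock_irqsave(&pool->cq_lock, flags);
> > > > > > > > 	/* Only now, at the end of the skb's lifecycle, copy its
> > > > > > > > 	 * addrs from the pre-allocated cached cq into the real cq.
> > > > > > > > 	 */
> > > > > > > > 	for (i = 0; i < num; i++)
> > > > > > > > 		xskq_prod_submit_addr(pool->cq,
> > > > > > > > 				      xskq_get_cached_addr(pool->cached_cq,
> > > > > > > > 							   start + i));
> > > > > > > > 	spin_unlock_irqrestore(&pool->cq_lock, flags);
> > > > > > > > 	sock_wfree(skb);
> > > > > > > > }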
> > >
> > > Back after health issues!
> >
> > Hi Maciej,
> >
> > Hope you're fully recovered:)
> >
> > >
> > > Jason, I am still not convinced about this solution.
> > >
> > > In shared pool setups, the temp cq will also be shared, which means
> > > that two parallel processes can produce addresses onto the temp cq and
> > > therefore expose an address to a socket that it does not belong to. In
> > > order to make this work you would have to know the descriptor count of
> > > a given frame upfront and reserve that range while processing the
> > > first descriptor.
> > >
> > > socket 0                 socket 1
> > > prod addr 0xAA
> > > prod addr 0xBB
> > >                          prod addr 0xDD
> > > prod addr 0xCC
> > >                          prod addr 0xEE
> > >
> > > socket 0 calls the skb destructor with num desc == 3, placing 0xDD,
> > > which has not been sent yet, onto the cq, therefore potentially
> > > corrupting it.
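> > >
> > > I.e. roughly something like this would be needed (hypothetical
> > > sketch, names are mine):
> > >
> > > struct xsk_temp_cq {
> > > 	atomic_t cached_prod;	/* only ever advances */
> > > 	u32 mask;
> > > 	u64 addrs[];
> > > };
> > >
> > > /* Reserve num contiguous slots up front so that a parallel producer
> > >  * cannot interleave its addresses into this frame's span.
> > >  */
> > > static u32 xsk_temp_cq_reserve(struct xsk_temp_cq *tcq, u32 num)
> > > {
> > > 	return (u32)atomic_fetch_add(num, &tcq->cached_prod);
> > > }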
> >
> > Thanks for spotting this case!
> >
> > Yes, it can happen, so let's switch to per-xsk granularity? If each
> > xsk has its own temp queue, then the problem disappears, and the good
> > news is that we don't need extra locks like pool->cq_lock to prevent
> > multiple parallel xsks from accessing the temp queue.
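> >
> > Something like this is what I have in mind (rough sketch; with the
> > temp queue moved into struct xdp_sock, the xmit-path writes need no
> > lock, and only the final publish to the shared pool->cq keeps taking
> > pool->cq_lock):
> >
> > static void xsk_cached_cq_write(struct xdp_sock *xs, u64 addr)
> > {
> > 	struct xsk_queue *q = xs->cached_cq;	/* private to this xsk */
> > 	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
> >
> > 	/* No lock needed: no other producer touches this queue. */
> > 	ring->desc[q->cached_prod++ & q->ring_mask] = addr;
> > }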
>
> Sure, when you're confident this is a working solution, you can post
> it. But from my POV we should go with Fernando's patch, and then you
> can send patches to bpf-next as improvements. There are people out
> there with broken xsk waiting for a fix.
Fine, I will officially post it against bpf-next. But I think at that
time I will have to revert both patches (yours and Fernando's)? Will
his patch be applied to the stable branch only, so that I can base
mine on bpf-next?
>
> >
> > Hope you can agree with this method. It borrows your idea and then
> > only uses a _pre-allocated buffer_ to replace kmem_cache_alloc() in
> > the hot path. This solution points us in a higher-performance
> > direction. IMHO, I'd rather not see any performance degradation
> > because of such issues.
>
> I have to disagree here, even though my work was around perf
> improvements in the past. Code has to be correct and we have to respect
> bug reports, so clarity and correctness come before performance. If we
> silently accept some breakage, then in the future nothing would stop
> syzbot from preparing a bug reproducer. Addressing that consumes
> developers' and maintainers' time.
No no no, I meant that we're all striving for the high-performance
direction under the condition that all the bugs are addressed. The
current series surely brings more complexity, but it can be good in
the long run. Of course, I know what you meant here :)
Thanks,
Jason