Message-ID: <CAL+tcoAiY1_OvVAJwWj-YwWY3_9QOWQ_Dwsn5V4vy+wnOQJJog@mail.gmail.com>
Date: Tue, 2 Sep 2025 21:38:37 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc: bpf@...r.kernel.org, ast@...nel.org, daniel@...earbox.net, 
	andrii@...nel.org, netdev@...r.kernel.org, magnus.karlsson@...el.com, 
	stfomichev@...il.com, Eryk Kubanski <e.kubanski@...tner.samsung.com>
Subject: Re: [PATCH v7 bpf] xsk: fix immature cq descriptor production

On Tue, Sep 2, 2025 at 6:56 PM Maciej Fijalkowski
<maciej.fijalkowski@...el.com> wrote:
>
> On Tue, Sep 02, 2025 at 08:02:30AM +0800, Jason Xing wrote:
> > On Tue, Sep 2, 2025 at 4:37 AM Maciej Fijalkowski
> > <maciej.fijalkowski@...el.com> wrote:
> > >
> > > On Tue, Sep 02, 2025 at 12:09:39AM +0800, Jason Xing wrote:
> > > > On Sat, Aug 30, 2025 at 2:10 AM Maciej Fijalkowski
> > > > <maciej.fijalkowski@...el.com> wrote:
> > > > >
> > > > > Eryk reported an issue that I have put under the Closes: tag, related
> > > > > to umem addrs being prematurely produced onto the pool's completion
> > > > > queue. Let us make the skb's destructor responsible for producing all
> > > > > addrs that a given skb used.
> > > > >
> > > > > The commit from the Fixes: tag introduced the buggy behavior; it was
> > > > > not broken from day 1, but rather when XSK multi-buffer support got
> > > > > introduced.
> > > > >
> > > > > In order to mitigate the performance impact as much as possible, mimic
> > > > > the linear and frag parts of the skb by storing the first address from
> > > > > the XSK descriptor at sk_buff::destructor_arg. For fragments, store
> > > > > them in a list at ::cb. The nodes that go onto the list are allocated
> > > > > via kmem_cache. xsk_destruct_skb() will consume the address stored at
> > > > > ::destructor_arg and optionally walk the list at ::cb, if the count of
> > > > > descriptors associated with this particular skb is bigger than 1.
> > > > >
> > > > > The previous approach, where the whole array for storing UMEM addresses
> > > > > from XSK descriptors was pre-allocated during first fragment
> > > > > processing, yielded too big a performance regression for 64b traffic.
> > > > > With the current approach the impact is much reduced in my tests, and
> > > > > for jumbo frames I observed traffic being slower by at most 9%.
> > > > >
> > > > > Magnus suggested special-casing this way of processing for
> > > > > XDP_SHARED_UMEM, so that we would identify it during bind and set
> > > > > different hooks for the 'backpressure mechanism' on the CQ and for the
> > > > > skb destructor, but given that the results looked promising on my side
> > > > > I decided to keep a single data path for XSK generic Tx. I suppose
> > > > > other auxiliary stuff such as the helpers introduced in this patch
> > > > > would have had to land as well in order to make that work, so we might
> > > > > have ended up with a noisier diff.
> > > > >
> > > > > Fixes: b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
> > > > > Reported-by: Eryk Kubanski <e.kubanski@...tner.samsung.com>
> > > > > Closes: https://lore.kernel.org/netdev/20250530103456.53564-1-e.kubanski@partner.samsung.com/
> > > > > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
> > > > > ---
> > > > >
> > > > > Jason, please test this v7 on your setup; I would appreciate it if you
> > > > > would report results from your testbed. Thanks!
> > > > >
> > > > > v1:
> > > > > https://lore.kernel.org/bpf/20250702101648.1942562-1-maciej.fijalkowski@intel.com/
> > > > > v2:
> > > > > https://lore.kernel.org/bpf/20250705135512.1963216-1-maciej.fijalkowski@intel.com/
> > > > > v3:
> > > > > https://lore.kernel.org/bpf/20250806154127.2161434-1-maciej.fijalkowski@intel.com/
> > > > > v4:
> > > > > https://lore.kernel.org/bpf/20250813171210.2205259-1-maciej.fijalkowski@intel.com/
> > > > > v5:
> > > > > https://lore.kernel.org/bpf/aKXBHGPxjpBDKOHq@boxer/T/
> > > > > v6:
> > > > > https://lore.kernel.org/bpf/20250820154416.2248012-1-maciej.fijalkowski@intel.com/
> > > > >
> > > > > v1->v2:
> > > > > * store addrs in an array carried via destructor_arg instead of having
> > > > >   them stored in skb headroom; cleaner and less hacky approach;
> > > > > v2->v3:
> > > > > * use kmem_cache for xsk_addrs allocation (Stan/Olek)
> > > > > * set err when xsk_addrs allocation fails (Dan)
> > > > > * change xsk_addrs layout to avoid holes
> > > > > * free xsk_addrs on error path
> > > > > * rebase
> > > > > v3->v4:
> > > > > * have kmem_cache as percpu vars
> > > > > * don't drop unnecessary braces (unrelated) (Stan)
> > > > > * use idx + i in xskq_prod_write_addr (Stan)
> > > > > * alloc kmem_cache on bind (Stan)
> > > > > * keep num_descs as first member in xsk_addrs (Magnus)
> > > > > * add ack from Magnus
> > > > > v4->v5:
> > > > > * have a single kmem_cache per xsk subsystem (Stan)
> > > > > v5->v6:
> > > > > * free skb in xsk_build_skb_zerocopy() when xsk_addrs allocation fails
> > > > >   (Stan)
> > > > > * unregister netdev notifier if creating kmem_cache fails (Stan)
> > > > > v6->v7:
> > > > > * don't include Acks from Magnus/Stan; let them review the new
> > > > >   approach:)
> > > > > * store first desc at sk_buff::destructor_arg and rest of frags in list
> > > > >   stored at sk_buff::cb
> > > > > * keep the kmem_cache but don't use it for allocating the whole array
> > > > >   in one shot; rather, alloc single nodes of the list
> > > > >
> > > > > ---
> > > > >  net/xdp/xsk.c       | 99 ++++++++++++++++++++++++++++++++++++++-------
> > > > >  net/xdp/xsk_queue.h | 12 ++++++
> > > > >  2 files changed, 97 insertions(+), 14 deletions(-)
> > > > >
>
> (...)
>
> > > > >  {
> > > > > -       long num = xsk_get_num_desc(xdp_sk(skb->sk)->skb) + 1;
> > > > > -
> > > > > -       skb_shinfo(skb)->destructor_arg = (void *)num;
> > > > > +       INIT_LIST_HEAD(&XSKCB(skb)->addrs_list);
> > > > > +       XSKCB(skb)->num_descs = 0;
> > > > > +       skb_shinfo(skb)->destructor_arg = (void *)addr;
> > > > >  }
> > > > >
> > > > >  static void xsk_consume_skb(struct sk_buff *skb)
> > > > >  {
> > > > >         struct xdp_sock *xs = xdp_sk(skb->sk);
> > > > > +       u32 num_descs = xsk_get_num_desc(skb);
> > > > > +       struct xsk_addr_node *pos, *tmp;
> > > > > +
> > > > > +       if (unlikely(num_descs > 1)) {
> > > >
> > > > I'm suspicious of the use of 'unlikely'. For an application turning on
> > > > the multi-buffer feature, if it tries to transmit large chunks of data,
> > > > this branch can become 'likely'. So it depends on how people use it.
> > >
> > > This pattern is used in the majority of XDP multi-buffer related code,
> > > for example:
> > > $ grep -irn "xdp_buff_has_frags" net/core/xdp.c
> > > 553:    if (likely(!xdp_buff_has_frags(xdp)))
> > > 641:    if (unlikely(xdp_buff_has_frags(xdp))) {
> > > 777:    if (unlikely(xdp_buff_has_frags(xdp)) &&
> > >
> > > Drivers however tend to be mixed around this.
> >
> > I see. And I have no strong opinion on this.
> >
> > >
> > > >
> > > > > +               list_for_each_entry_safe(pos, tmp, &XSKCB(skb)->addrs_list, addr_node) {
> > > >
> > > > It seems there is no need to use the xxx_safe() variant since the whole
> > > > process (from allocating the skb to freeing the skb) makes sure each
> > > > skb is processed atomically?
> > >
> > > We're deleting nodes from the linked list, so we need @tmp for further
> > > list traversal. I'm not following your statement about atomicity here.
> >
> > I mean this list is chained to each skb. It's not possible for one skb
> > to undergo the allocation operation and the free operation at the same
> > time, right? That means it's not possible for one list to see a delete
> > operation and an add operation at the same time. If so, xxx_safe()
> > seems unneeded.
>
> _safe() variants are meant to allow you to delete nodes while traversing
> the list.
> You wouldn't be able to traverse the list when nodes are deleted in the
> body of the loop, as the ->next pointer is poisoned by list_del(). The
> _safe() variant uses the additional 'tmp' parameter to allow you to do
> this.

Sure, this is exactly how _safe() works. My take is that we don't need
_safe() for safety here because it's not possible for one reader to be
traversing the entire addr list while another one is deleting nodes from
it. If that could happen, then _safe() would certainly make sense.
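
Just to illustrate what I mean (purely a sketch on top of this patch,
not a request to change it): since the skb is only ever handled by one
context, the nodes could also be popped from the head without the extra
@tmp cursor, e.g.:

	/* sketch: pop nodes one by one instead of using the _safe()
	 * cursor; relies on the same single-context assumption
	 */
	while (!list_empty(&XSKCB(skb)->addrs_list)) {
		struct xsk_addr_node *pos;

		pos = list_first_entry(&XSKCB(skb)->addrs_list,
				       struct xsk_addr_node, addr_node);
		list_del(&pos->addr_node);
		kmem_cache_free(xsk_tx_generic_cache, pos);
	}

Either way works, so no strong opinion here.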

>
> I feel like you misunderstood these macros as if they provided some kind
> of serialization mechanism.
>
> >
> > >
> > > >
> > > > > +                       list_del(&pos->addr_node);
> > > > > +                       kmem_cache_free(xsk_tx_generic_cache, pos);
> > > > > +               }
> > > > > +       }
> > > > >
> > > > >         skb->destructor = sock_wfree;
> > > > > -       xsk_cq_cancel_locked(xs->pool, xsk_get_num_desc(skb));
> > > > > +       xsk_cq_cancel_locked(xs->pool, num_descs);
> > > > >         /* Free skb without triggering the perf drop trace */
> > > > >         consume_skb(skb);
> > > > >         xs->skb = NULL;
> > > > > @@ -623,6 +668,8 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > >                         return ERR_PTR(err);
> > > > >
> > > > >                 skb_reserve(skb, hr);
> > > > > +
> > > > > +               xsk_set_destructor_arg(skb, desc->addr);
> > > > >         }
> > > > >
> > > > >         addr = desc->addr;
> > > > > @@ -694,6 +741,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >                         err = skb_store_bits(skb, 0, buffer, len);
> > > > >                         if (unlikely(err))
> > > > >                                 goto free_err;
> > > > > +
> > > > > +                       xsk_set_destructor_arg(skb, desc->addr);
> > > > >                 } else {
> > > > >                         int nr_frags = skb_shinfo(skb)->nr_frags;
> > > > >                         struct page *page;
> > > > > @@ -759,7 +808,19 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > > > >         skb->mark = READ_ONCE(xs->sk.sk_mark);
> > > > >         skb->destructor = xsk_destruct_skb;
> > > > >         xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
> > > > > -       xsk_set_destructor_arg(skb);
> > > > > +
> > > > > +       xsk_inc_num_desc(skb);
> > > > > +       if (unlikely(xsk_get_num_desc(skb) > 1)) {
> > > > > +               struct xsk_addr_node *xsk_addr;
> > > > > +
> > > > > +               xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
> > > > > +               if (!xsk_addr) {
> > > >
> > > > The num of descs needs to be decreased by one here? We miss the chance
> > > > to add this addr node to the list this time. Sorry, I'm not so sure
> > > > whether this error path is handled correctly.
> > >
> > > In theory yes, but xsk_consume_skb() will not crash without this by any
> > > means. If we had a case where we failed on the second frag, @num_descs
> > > would indeed be 2 but addrs_list would just be empty.
> >
> > I wasn't stating it very clearly. If the second frag fails on the above
> > step, next time in xsk_consume_skb() the same skb will try to revisit
>
> you meant xsk_build_skb() I assume?

Oh, yes.

>
> > the second frag/desc because of xsk_cq_cancel_locked(xs->pool, 1); in
> > xsk_build_skb(). Then it will try to re-allocate the page for the
> > second desc by calling alloc_page() in xsk_consume_skb()? IIUC, things
> > will get messy around this skb. Finally, the desc count of this skb
> > will be increased to 3, while it should actually be 2 if the skb only
> > needs to carry two frags/descs?
>
> You're correct here! Even though it would not hurt the later successful
> send case, other paths that use xsk_get_num_desc() would be broken - say
> you failed at one point with kmem_cache_zalloc() and then succeeded; you
> have your full skb that is passed to ndo_start_xmit() but it ends with
> NETDEV_TX_BUSY - then even xskq_cons_cancel_n() is fed a wrong value.
> And regarding alloc_page() - the skb would carry a doubled frag in
> skb_shared_info.
>
> This is rather unlikely to happen, but it needs to be addressed of course.
> There are two approaches: either we do the allocations upfront, or we free
> the whole skb when the kmem_cache allocation fails.
>
> I'll send a v8 with this fixed, but overall this path needs a refactor...
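
In case it's useful, roughly what I was picturing for the decrement
(just a sketch against this patch, reusing its names; the err/goto lines
are assumed to match the existing failure handling, and the "free the
whole skb" variant you mention is of course fine with me too):

	xsk_addr = kmem_cache_zalloc(xsk_tx_generic_cache, GFP_KERNEL);
	if (!xsk_addr) {
		/* sketch: undo the accounting for this descriptor so a
		 * later retry of the same desc does not see an inflated
		 * num_descs; the doubled-frag issue you point out would
		 * still need the upfront-allocation or free-whole-skb
		 * treatment
		 */
		XSKCB(skb)->num_descs--;
		err = -ENOMEM;
		goto free_err;
	}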

Looking forward to your updated version :)

Thanks,
Jason
