netdev - Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAL+tcoBvLHFJJuYawJc3wY2aOrn5CQ3s5+sbC2M24_QNLyBHsg@mail.gmail.com>
Date: Fri, 22 Aug 2025 09:13:21 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: "Karlsson, Magnus" <magnus.karlsson@...el.com>, Björn Töpel <bjorn@...nel.org>, 
	davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, 
	maciej.fijalkowski@...el.com, jonathan.lemon@...il.com, sdf@...ichev.me, 
	ast@...nel.org, daniel@...earbox.net, john.fastabend@...il.com, 
	horms@...nel.org, andrew+netdev@...n.ch, bpf@...r.kernel.org, 
	netdev@...r.kernel.org, Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next 2/2] xsk: support generic batch xmit in copy mode

On Fri, Aug 22, 2025 at 1:29 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>
> I need some help from Cc Magnus or Björn, to explain why you changes
> fails in xsk_destruct_skb().

Oh, I mean the reason for using socket level accounting is we need to
make sure of the safety in tx completion period. In xsk, that is,
xsk_destruct_skb needs to fetch the corresponding sk from skb and
manipulate its ring structure. Without accounting, the socket can be
released and destroyed before the driver calls skb->destructor(). Only
with the accounting protection, the socket is still alive because the
following code:
sock_wfree()
    -> if (refcount_sub_and_test(len, &sk->sk_wmem_alloc))
             __sk_free(sk);

It seems no way to rid the accounting feature for now without
refactoring the whole logic.

We can probably remove the sk_sndbuf limitation, but I still do more
investigation :)

Thanks,
Jason

>
>
> On 15/08/2025 08.44, Jason Xing wrote:
> > On Tue, Aug 12, 2025 at 10:30 PM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
> >>
> > ...
> >>
> >> But this also requires changing the SKB alloc function used by
> >> xsk_build_skb(). As a seperate patch, I recommend that you change the
> >> sock_alloc_send_skb() to instead use build_skb (or build_skb_around).
> >> I expect this will be a large performance improvement on it's own.
> >> Can I ask you to benchmark this change before the batch xmit change?
> >>
> >> Opinions needed from other maintainers please (I might be wrong!):
> >> I don't think the socket level accounting done in sock_alloc_send_skb()
> >> is correct/relevant for AF_XDP/XSK, because the "backpressure mechanism"
> >> code comment above.
> >
> > Here I'm bringing back the last test you expected to know :)
> >
> > I use alloc_skb() to replace sock_alloc_send_skb() and introduce other
> > minor changes, say, removing sock_wfree() from xsk_destruct_skb(). It
> > turns out to be a stable 5% performance improvement on i40e driver.
> > slight improvement on virtio_net. That's good news.
> >
> > Bad news is that the above logic has bugs like freeing skb in the napi
> > poll causes accessing skb->sk in xsk_destruct_skb() which triggers a
> > NULL pointer issue. How did I spot this one? I removed the BQL flow
> > control and started two xdpsock on different queues, then I saw a
> > panic[1]... To solve the problem like that, I'm afraid that we still
> > need to charge a certain length value into sk_wmem_alloc so that
> > sock_wfree(skb) can be the last one to free the socket finally.
> >
> > So this socket level accounting mechanism keeps its safety in the above case.
> >
> > IMHO, we can get rid of the limitation of sk_sndbuf but still use
> > skb_set_owner_w() that charges the len of skb. If we stick to removing
> > the whole accounting function, probably we have to adjust the position
> > of xsk_cq_submit_locked(), but I reckon for now it's not practical...
> >
> > Any thoughts on this?
> >
> > [1]
> >   997 [  133.528449] RIP: 0010:xsk_destruct_skb+0x6a/0x90
> >   998 [  133.528920] Code: 8b 6c 02 28 48 8b 43 18 4c 8b a0 68 03 00 00
> > 49 8d 9c 24 e8 00 00 00 48 89 df e8 f1 eb 06 00 48 89 c6 49 8b 84 24
> > 88 00 00 00 <48> 8b 50 10 03 2a 48      8b 40 10 48 89 df 89 28 5b 5d
> > 41 5c e9 6e ec
> >   999 [  133.530526] RSP: 0018:ffffae71c06a0d08 EFLAGS: 00010046
> > 1000 [  133.531005] RAX: 0000000000000000 RBX: ffff9f42c81c49e8 RCX:
> > 00000000000002e7
> > 1001 [  133.531631] RDX: 0000000000000001 RSI: 0000000000000286 RDI:
> > ffff9f42c81c49e8
> > 1002 [  133.532249] RBP: 0000000000000001 R08: 0000000000000008 R09:
> > 00000000000000001003 [  133.532867] R10: ffffffff978080c0 R11:
> > ffffae71c06a0ff8 R12: ffff9f42c81c4900
> > 1004 [  133.533491] R13: ffffae71c06a0d88 R14: ffff9f42e0f1f900 R15:
> > ffff9f42ce850d801005 [  133.534123] FS:  0000000000000000(0000)
> > GS:ffff9f5227655000(0000) knlGS:00000000000000001006 [  133.534831]
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > 1007 [  133.535366] CR2: 0000000000000010 CR3: 000000011c820000 CR4:
> > 00000000003506f0
> > 1008 [  133.536014] Call Trace:
> > 1009 [  133.536313]  <IRQ>
> > 1010 [  133.536583]  skb_release_head_state+0x20/0x90
> > 1011 [  133.537021]  napi_consume_skb+0x42/0x120
> > 1012 [  133.537429]  __free_old_xmit+0x76/0x170 [virtio_net]
> > 1013 [  133.537923]  free_old_xmit+0x53/0xc0 [virtio_net]
> > 1014 [  133.538395]  virtnet_poll+0xed/0x5d0 [virtio_net]
> > 1015 [  133.538867]  ? blake2s_compress+0x52/0xa0
> > 1016 [  133.539286]  __napi_poll+0x28/0x200
> > 1017 [  133.539668]  net_rx_action+0x319/0x400
> > 1018 [  133.540068]  ? sched_clock_cpu+0xb/0x190
> > 1019 [  133.540482]  ? __run_timers+0x1d1/0x260
> > 1020 [  133.540906]  ? __pfx_dl_task_timer+0x10/0x10
> > 1021 [  133.541349]  ? lock_timer_base+0x72/0x90
> > 1022 [  133.541767]  handle_softirqs+0xce/0x2e0
> > 1023 [  133.542178]  __irq_exit_rcu+0xc6/0xf0
> > 1024 [  133.542575]  common_interrupt+0x81/0xa0
> >
> > Thanks,
> > Jason
>