linux-kernel - Re: [PATCH net] xsk: correct tx_ring_empty

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ8uoz127WoEFbm1Gg7OquYJCLigg03N1XrKDcXejsUmVuQ3PA@mail.gmail.com>
Date: Tue, 1 Apr 2025 13:00:39 +0200
From: Magnus Karlsson <magnus.karlsson@...il.com>
To: Wang Liang <wangliang74@...wei.com>
Cc: Stanislav Fomichev <stfomichev@...il.com>, bjorn@...nel.org, magnus.karlsson@...el.com, 
	maciej.fijalkowski@...el.com, jonathan.lemon@...il.com, davem@...emloft.net, 
	edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, horms@...nel.org, 
	ast@...nel.org, daniel@...earbox.net, hawk@...nel.org, 
	john.fastabend@...il.com, yuehaibing@...wei.com, zhangchangzhong@...wei.com, 
	netdev@...r.kernel.org, bpf@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net] xsk: correct tx_ring_empty_descs count statistics

On Tue, 1 Apr 2025 at 11:33, Wang Liang <wangliang74@...wei.com> wrote:
>
>
> 在 2025/4/1 16:12, Magnus Karlsson 写道:
> > On Tue, 1 Apr 2025 at 09:44, Wang Liang <wangliang74@...wei.com> wrote:
> >>
> >> 在 2025/4/1 14:57, Magnus Karlsson 写道:
> >>> On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@...wei.com> wrote:
> >>>> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
> >>>>> On 03/31, Stanislav Fomichev wrote:
> >>>>>> On 03/29, Wang Liang wrote:
> >>>>>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
> >>>>>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
> >>>>>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
> >>>>>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
> >>>>>>>
> >>>>>>>      xsk_poll
> >>>>>>>        xsk_generic_xmit
> >>>>>>>          __xsk_generic_xmit
> >>>>>>>            xskq_cons_peek_desc
> >>>>>>>              xskq_cons_read_desc
> >>>>>>>                q->queue_empty_descs++;
> >>> Sorry, but I do not understand how to reproduce this error. So you
> >>> first issue a setsockopt with the XDP_TX_RING option and then you do
> >>> not "reserve tx ring". What does that last "not reserve tx ring" mean?
> >>> No mmap() of that ring, or something else? I guess you have bound the
> >>> socket with a bind()? Some pseudo code on how to reproduce this would
> >>> be helpful. Just want to understand so I can help. Thank you.
> >> Sorry, the last email is garbled, and send again.
> >>
> >> Ok. Some pseudo code like below:
> >>
> >>       fd = socket(AF_XDP, SOCK_RAW, 0);
> >>       setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
> >>
> >>       setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &fill_size,
> >> sizeof(fill_size));
> >>       setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &comp_size,
> >> sizeof(comp_size));
> >>       mmap(NULL, off.fr.desc + fill_size * sizeof(__u64), ...,
> >> XDP_UMEM_PGOFF_FILL_RING);
> >>       mmap(NULL, off.cr.desc + comp_size * sizeof(__u64), ...,
> >> XDP_UMEM_PGOFF_COMPLETION_RING);
> >>
> >>       setsockopt(fd, SOL_XDP, XDP_RX_RING, &rx_size, sizeof(rx_size));
> >>       setsockopt(fd, SOL_XDP, XDP_TX_RING, &tx_size, sizeof(tx_size));
> >>       mmap(NULL, off.rx.desc + rx_size * sizeof(struct xdp_desc), ...,
> >> XDP_PGOFF_RX_RING);
> >>       mmap(NULL, off.tx.desc + tx_size * sizeof(struct xdp_desc), ...,
> >> XDP_PGOFF_TX_RING);
> >>
> >>       bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
> >>       bpf_map_update_elem(xsk_map_fd, &queue_id, &fd, 0);
> >>
> >>       while(!global_exit) {
> >>           poll(fds, 1, -1);
> >>           handle_receive_packets(...);
> >>       }
> >>
> >> The xsk is created success, and xs->tx is initialized.
> >>
> >> The "not reserve tx ring" means user app do not update tx ring producer.
> >> Like:
> >>
> >>       xsk_ring_prod__reserve(tx, 1, &tx_idx);
> >>       xsk_ring_prod__tx_desc(tx, tx_idx)->addr = frame;
> >>       xsk_ring_prod__tx_desc(tx, tx_idx)->len = pkg_length;
> >>       xsk_ring_prod__submit(tx, 1);
> >>
> >> These functions (xsk_ring_prod__reserve, etc.) is provided by libxdp.
> >>
> >> The tx->producer is not updated, so the xs->tx->cached_cons and
> >> xs->tx->cached_prod are always zero.
> >>
> >> When receive packets and user app call poll(), xsk_generic_xmit() will be
> >> triggered by xsk_poll(), leading to this issue.
> > Thanks, that really helped. The problem here is that the kernel cannot
> > guess your intent. Since you created a socket with both Rx and Tx, it
> > thinks you will use it for both, so it should increase
> > queue_empty_descs in this case as you did not provide any Tx descs.
> > Your proposed patch will break this. Consider this Tx case with the
> > exact same init code as you have above but with this send loop:
> >
> > while(!global_exit) {
> >         maybe_send_packets(...);
> >         poll(fds, 1, -1);
> > }
> >
> > With your patch, the queue_empty_descs will never be increased in the
> > case when I do not submit any Tx descs, even though we would like it
> > to be so.
> >
> > So in my mind, you have a couple of options:
> >
> > * Create two sockets, one rx only and one tx only and use the
> > SHARED_UMEM mode to bind them to the same netdev and queue id. In your
> > loop above, you would use the Rx socket. This might have the drawback
> > that you need to call poll() twice if you are both sending and
> > receiving in the same loop. But the stats will be the way you want
> > them to be.
> >
> > * Introduce a new variable in user space that you increase every time
> > you do poll() in your loop above. When displaying the statistics, just
> > deduct this variable from the queue_empty_descs that the kernel
> > reports using the XDP_STATISTICS getsockopt().
> >
> > Hope this helps.
>
>
> Thank you for the advices.
>
>  From user view, queue_empty_descs increases when the app only receive
> packets and call poll(), it is some confusing.

But if the only thing you are doing in the app is receive packets, you
should create an Rx only socket.

> In your Tx case, if user app use sendto() to send packets, the
> queue_empty_descs will increase. This is reasonably.
>
> In linux manual, poll() waits for some event on a file descriptor. But
> in af_xdp, poll() has a side effect of send msg.

If I remember correctly, we did this so that zero-copy mode and copy
mode should have the same behavior from an app point of view. But I do
not remember why we implemented this behavior in zero-copy in the
first place. Maybe out of necessity since zero-copy Tx is driven by
the driver.

> Previous commit 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in
> AF_XDP rings") add need_wakeup flag. It mainly work in rx process. When
> the application and driver run on the same core, if the fill ring is
> empty, the driver can set the need_wakeup flag and return, so the
> application could be scheduled to produce entries of the fill ring.
>
> The commit df551058f7a3 ("xsk: Fix crash in poll when device does not
> support ndo_xsk_wakeup") add sendmsg function in xsk_poll(), considering
> some devices donot define ndo_xsk_wakeup.
>
> At present the value of need_wakeup & XDP_WAKEUP_TX is always true,
> except the mlx5 driver. If the mlx5 driver queued some packets for tx,
> it may clear XDP_WAKEUP_TX by xsk_clear_tx_need_wakeup(). So when user
> app use sendto() to send packets, __xsk_sendmsg() will return without
> calling xsk_generic_xmit().

Mellanox is doing it in the correct way. The Intel drivers had it like
this too in the beginning, until we discovered a race condition in the
HW in which the interrupt was not armed at the same time as the
need_wakeup flag was cleared. This would then lead to packets not
being sent and user-space not knowing about it. As to why all other
drivers copied this unfortunate fall-back, I do not know.

> So I have a bold suggestion, how about removing xsk_generic_xmit() in
> xsk_poll()?

That would indeed be bold :-)! But we cannot do this since it would
break existing applications that rely on this. I think you have to use
one of the workarounds I pitched in the previous mail. If you are
really just receiving, you should create an Rx only socket.

> >>>>>>> To avoid this count error, add check for tx descs before send msg in poll.
> >>>>>>>
> >>>>>>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
> >>>>>>> Signed-off-by: Wang Liang <wangliang74@...wei.com>
> >>>>>> Acked-by: Stanislav Fomichev <sdf@...ichev.me>
> >>>>> Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
> >>>>> cached prod/cons. How is it supposed to work when the actual tx
> >>>>> descriptor is posted? Is there anything besides xskq_cons_peek_desc from
> >>>>> __xsk_generic_xmit that refreshes cached_prod?
> >>>> Yes, you are right!
> >>>>
> >>>> How about using xskq_cons_nb_entries() to check free descriptors?
> >>>>
> >>>> Like this:
> >>>>
> >>>>
> >>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >>>> index e5d104ce7b82..babb7928d335 100644
> >>>> --- a/net/xdp/xsk.c
> >>>> +++ b/net/xdp/xsk.c
> >>>> @@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct
> >>>> socket *sock,
> >>>>            if (pool->cached_need_wakeup) {
> >>>>                    if (xs->zc)
> >>>>                            xsk_wakeup(xs, pool->cached_need_wakeup);
> >>>> -               else if (xs->tx)
> >>>> +               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
> >>>>                            /* Poll needs to drive Tx also in copy mode */
> >>>>                            xsk_generic_xmit(sk);
> >>>>            }
> >>>>
> >>>>