lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ8uoz2QXNN4so-EgR8sU8A86E_AeYx1w_b+BSVeCgzr1kaR+g@mail.gmail.com>
Date: Tue, 1 Apr 2025 10:12:14 +0200
From: Magnus Karlsson <magnus.karlsson@...il.com>
To: Wang Liang <wangliang74@...wei.com>
Cc: Stanislav Fomichev <stfomichev@...il.com>, bjorn@...nel.org, magnus.karlsson@...el.com, 
	maciej.fijalkowski@...el.com, jonathan.lemon@...il.com, davem@...emloft.net, 
	edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, horms@...nel.org, 
	ast@...nel.org, daniel@...earbox.net, hawk@...nel.org, 
	john.fastabend@...il.com, yuehaibing@...wei.com, zhangchangzhong@...wei.com, 
	netdev@...r.kernel.org, bpf@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net] xsk: correct tx_ring_empty_descs count statistics

On Tue, 1 Apr 2025 at 09:44, Wang Liang <wangliang74@...wei.com> wrote:
>
>
> 在 2025/4/1 14:57, Magnus Karlsson 写道:
> > On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@...wei.com> wrote:
> >>
> >> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
> >>> On 03/31, Stanislav Fomichev wrote:
> >>>> On 03/29, Wang Liang wrote:
> >>>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
> >>>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
> >>>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
> >>>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
> >>>>>
> >>>>>     xsk_poll
> >>>>>       xsk_generic_xmit
> >>>>>         __xsk_generic_xmit
> >>>>>           xskq_cons_peek_desc
> >>>>>             xskq_cons_read_desc
> >>>>>               q->queue_empty_descs++;
> > Sorry, but I do not understand how to reproduce this error. So you
> > first issue a setsockopt with the XDP_TX_RING option and then you do
> > not "reserve tx ring". What does that last "not reserve tx ring" mean?
> > No mmap() of that ring, or something else? I guess you have bound the
> > socket with a bind()? Some pseudo code on how to reproduce this would
> > be helpful. Just want to understand so I can help. Thank you.
> Sorry, the last email is garbled, and send again.
>
> Ok. Some pseudo code like below:
>
>      fd = socket(AF_XDP, SOCK_RAW, 0);
>      setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
>
>      setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &fill_size,
> sizeof(fill_size));
>      setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &comp_size,
> sizeof(comp_size));
>      mmap(NULL, off.fr.desc + fill_size * sizeof(__u64), ...,
> XDP_UMEM_PGOFF_FILL_RING);
>      mmap(NULL, off.cr.desc + comp_size * sizeof(__u64), ...,
> XDP_UMEM_PGOFF_COMPLETION_RING);
>
>      setsockopt(fd, SOL_XDP, XDP_RX_RING, &rx_size, sizeof(rx_size));
>      setsockopt(fd, SOL_XDP, XDP_TX_RING, &tx_size, sizeof(tx_size));
>      mmap(NULL, off.rx.desc + rx_size * sizeof(struct xdp_desc), ...,
> XDP_PGOFF_RX_RING);
>      mmap(NULL, off.tx.desc + tx_size * sizeof(struct xdp_desc), ...,
> XDP_PGOFF_TX_RING);
>
>      bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
>      bpf_map_update_elem(xsk_map_fd, &queue_id, &fd, 0);
>
>      while(!global_exit) {
>          poll(fds, 1, -1);
>          handle_receive_packets(...);
>      }
>
> The xsk is created success, and xs->tx is initialized.
>
> The "not reserve tx ring" means user app do not update tx ring producer.
> Like:
>
>      xsk_ring_prod__reserve(tx, 1, &tx_idx);
>      xsk_ring_prod__tx_desc(tx, tx_idx)->addr = frame;
>      xsk_ring_prod__tx_desc(tx, tx_idx)->len = pkg_length;
>      xsk_ring_prod__submit(tx, 1);
>
> These functions (xsk_ring_prod__reserve, etc.) is provided by libxdp.
>
> The tx->producer is not updated, so the xs->tx->cached_cons and
> xs->tx->cached_prod are always zero.
>
> When receive packets and user app call poll(), xsk_generic_xmit() will be
> triggered by xsk_poll(), leading to this issue.

Thanks, that really helped. The problem here is that the kernel cannot
guess your intent. Since you created a socket with both Rx and Tx, it
thinks you will use it for both, so it should increase
queue_empty_descs in this case as you did not provide any Tx descs.
Your proposed patch will break this. Consider this Tx case with the
exact same init code as you have above but with this send loop:

while(!global_exit) {
       maybe_send_packets(...);
       poll(fds, 1, -1);
}

With your patch, the queue_empty_descs will never be increased in the
case when I do not submit any Tx descs, even though we would like it
to be so.

So in my mind, you have a couple of options:

* Create two sockets, one rx only and one tx only and use the
SHARED_UMEM mode to bind them to the same netdev and queue id. In your
loop above, you would use the Rx socket. This might have the drawback
that you need to call poll() twice if you are both sending and
receiving in the same loop. But the stats will be the way you want
them to be.

* Introduce a new variable in user space that you increase every time
you do poll() in your loop above. When displaying the statistics, just
deduct this variable from the queue_empty_descs that the kernel
reports using the XDP_STATISTICS getsockopt().

Hope this helps.

> >>>>> To avoid this count error, add check for tx descs before send msg in poll.
> >>>>>
> >>>>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
> >>>>> Signed-off-by: Wang Liang <wangliang74@...wei.com>
> >>>> Acked-by: Stanislav Fomichev <sdf@...ichev.me>
> >>> Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
> >>> cached prod/cons. How is it supposed to work when the actual tx
> >>> descriptor is posted? Is there anything besides xskq_cons_peek_desc from
> >>> __xsk_generic_xmit that refreshes cached_prod?
> >>
> >> Yes, you are right!
> >>
> >> How about using xskq_cons_nb_entries() to check free descriptors?
> >>
> >> Like this:
> >>
> >>
> >> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >> index e5d104ce7b82..babb7928d335 100644
> >> --- a/net/xdp/xsk.c
> >> +++ b/net/xdp/xsk.c
> >> @@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct
> >> socket *sock,
> >>           if (pool->cached_need_wakeup) {
> >>                   if (xs->zc)
> >>                           xsk_wakeup(xs, pool->cached_need_wakeup);
> >> -               else if (xs->tx)
> >> +               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
> >>                           /* Poll needs to drive Tx also in copy mode */
> >>                           xsk_generic_xmit(sk);
> >>           }
> >>
> >>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ