lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ffc84696-f9c6-4869-9f1e-7faf45d99060@huawei.com>
Date: Tue, 1 Apr 2025 17:33:35 +0800
From: Wang Liang <wangliang74@...wei.com>
To: Magnus Karlsson <magnus.karlsson@...il.com>
CC: Stanislav Fomichev <stfomichev@...il.com>, <bjorn@...nel.org>,
	<magnus.karlsson@...el.com>, <maciej.fijalkowski@...el.com>,
	<jonathan.lemon@...il.com>, <davem@...emloft.net>, <edumazet@...gle.com>,
	<kuba@...nel.org>, <pabeni@...hat.com>, <horms@...nel.org>, <ast@...nel.org>,
	<daniel@...earbox.net>, <hawk@...nel.org>, <john.fastabend@...il.com>,
	<yuehaibing@...wei.com>, <zhangchangzhong@...wei.com>,
	<netdev@...r.kernel.org>, <bpf@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net] xsk: correct tx_ring_empty_descs count statistics


在 2025/4/1 16:12, Magnus Karlsson 写道:
> On Tue, 1 Apr 2025 at 09:44, Wang Liang <wangliang74@...wei.com> wrote:
>>
>> 在 2025/4/1 14:57, Magnus Karlsson 写道:
>>> On Tue, 1 Apr 2025 at 04:36, Wang Liang <wangliang74@...wei.com> wrote:
>>>> 在 2025/4/1 6:03, Stanislav Fomichev 写道:
>>>>> On 03/31, Stanislav Fomichev wrote:
>>>>>> On 03/29, Wang Liang wrote:
>>>>>>> The tx_ring_empty_descs count may be incorrect, when set the XDP_TX_RING
>>>>>>> option but do not reserve tx ring. Because xsk_poll() try to wakeup the
>>>>>>> driver by calling xsk_generic_xmit() for non-zero-copy mode. So the
>>>>>>> tx_ring_empty_descs count increases once the xsk_poll()is called:
>>>>>>>
>>>>>>>      xsk_poll
>>>>>>>        xsk_generic_xmit
>>>>>>>          __xsk_generic_xmit
>>>>>>>            xskq_cons_peek_desc
>>>>>>>              xskq_cons_read_desc
>>>>>>>                q->queue_empty_descs++;
>>> Sorry, but I do not understand how to reproduce this error. So you
>>> first issue a setsockopt with the XDP_TX_RING option and then you do
>>> not "reserve tx ring". What does that last "not reserve tx ring" mean?
>>> No mmap() of that ring, or something else? I guess you have bound the
>>> socket with a bind()? Some pseudo code on how to reproduce this would
>>> be helpful. Just want to understand so I can help. Thank you.
>> Sorry, the last email is garbled, and send again.
>>
>> Ok. Some pseudo code like below:
>>
>>       fd = socket(AF_XDP, SOCK_RAW, 0);
>>       setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
>>
>>       setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &fill_size,
>> sizeof(fill_size));
>>       setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &comp_size,
>> sizeof(comp_size));
>>       mmap(NULL, off.fr.desc + fill_size * sizeof(__u64), ...,
>> XDP_UMEM_PGOFF_FILL_RING);
>>       mmap(NULL, off.cr.desc + comp_size * sizeof(__u64), ...,
>> XDP_UMEM_PGOFF_COMPLETION_RING);
>>
>>       setsockopt(fd, SOL_XDP, XDP_RX_RING, &rx_size, sizeof(rx_size));
>>       setsockopt(fd, SOL_XDP, XDP_TX_RING, &tx_size, sizeof(tx_size));
>>       mmap(NULL, off.rx.desc + rx_size * sizeof(struct xdp_desc), ...,
>> XDP_PGOFF_RX_RING);
>>       mmap(NULL, off.tx.desc + tx_size * sizeof(struct xdp_desc), ...,
>> XDP_PGOFF_TX_RING);
>>
>>       bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
>>       bpf_map_update_elem(xsk_map_fd, &queue_id, &fd, 0);
>>
>>       while(!global_exit) {
>>           poll(fds, 1, -1);
>>           handle_receive_packets(...);
>>       }
>>
>> The xsk is created success, and xs->tx is initialized.
>>
>> The "not reserve tx ring" means user app do not update tx ring producer.
>> Like:
>>
>>       xsk_ring_prod__reserve(tx, 1, &tx_idx);
>>       xsk_ring_prod__tx_desc(tx, tx_idx)->addr = frame;
>>       xsk_ring_prod__tx_desc(tx, tx_idx)->len = pkg_length;
>>       xsk_ring_prod__submit(tx, 1);
>>
>> These functions (xsk_ring_prod__reserve, etc.) is provided by libxdp.
>>
>> The tx->producer is not updated, so the xs->tx->cached_cons and
>> xs->tx->cached_prod are always zero.
>>
>> When receive packets and user app call poll(), xsk_generic_xmit() will be
>> triggered by xsk_poll(), leading to this issue.
> Thanks, that really helped. The problem here is that the kernel cannot
> guess your intent. Since you created a socket with both Rx and Tx, it
> thinks you will use it for both, so it should increase
> queue_empty_descs in this case as you did not provide any Tx descs.
> Your proposed patch will break this. Consider this Tx case with the
> exact same init code as you have above but with this send loop:
>
> while(!global_exit) {
>         maybe_send_packets(...);
>         poll(fds, 1, -1);
> }
>
> With your patch, the queue_empty_descs will never be increased in the
> case when I do not submit any Tx descs, even though we would like it
> to be so.
>
> So in my mind, you have a couple of options:
>
> * Create two sockets, one rx only and one tx only and use the
> SHARED_UMEM mode to bind them to the same netdev and queue id. In your
> loop above, you would use the Rx socket. This might have the drawback
> that you need to call poll() twice if you are both sending and
> receiving in the same loop. But the stats will be the way you want
> them to be.
>
> * Introduce a new variable in user space that you increase every time
> you do poll() in your loop above. When displaying the statistics, just
> deduct this variable from the queue_empty_descs that the kernel
> reports using the XDP_STATISTICS getsockopt().
>
> Hope this helps.


Thank you for the advices.

 From user view, queue_empty_descs increases when the app only receive 
packets and call poll(), it is some confusing.

In your Tx case, if user app use sendto() to send packets, the 
queue_empty_descs will increase. This is reasonably.

In linux manual, poll() waits for some event on a file descriptor. But 
in af_xdp, poll() has a side effect of send msg.

Previous commit 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in 
AF_XDP rings") add need_wakeup flag. It mainly work in rx process. When 
the application and driver run on the same core, if the fill ring is 
empty, the driver can set the need_wakeup flag and return, so the 
application could be scheduled to produce entries of the fill ring.

The commit df551058f7a3 ("xsk: Fix crash in poll when device does not 
support ndo_xsk_wakeup") add sendmsg function in xsk_poll(), considering 
some devices donot define ndo_xsk_wakeup.

At present the value of need_wakeup & XDP_WAKEUP_TX is always true, 
except the mlx5 driver. If the mlx5 driver queued some packets for tx, 
it may clear XDP_WAKEUP_TX by xsk_clear_tx_need_wakeup(). So when user 
app use sendto() to send packets, __xsk_sendmsg() will return without 
calling xsk_generic_xmit().

So I have a bold suggestion, how about removing xsk_generic_xmit() in 
xsk_poll()?

>>>>>>> To avoid this count error, add check for tx descs before send msg in poll.
>>>>>>>
>>>>>>> Fixes: df551058f7a3 ("xsk: Fix crash in poll when device does not support ndo_xsk_wakeup")
>>>>>>> Signed-off-by: Wang Liang <wangliang74@...wei.com>
>>>>>> Acked-by: Stanislav Fomichev <sdf@...ichev.me>
>>>>> Hmm, wait, I stumbled upon xskq_has_descs again and it looks only at
>>>>> cached prod/cons. How is it supposed to work when the actual tx
>>>>> descriptor is posted? Is there anything besides xskq_cons_peek_desc from
>>>>> __xsk_generic_xmit that refreshes cached_prod?
>>>> Yes, you are right!
>>>>
>>>> How about using xskq_cons_nb_entries() to check free descriptors?
>>>>
>>>> Like this:
>>>>
>>>>
>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>> index e5d104ce7b82..babb7928d335 100644
>>>> --- a/net/xdp/xsk.c
>>>> +++ b/net/xdp/xsk.c
>>>> @@ -993,7 +993,7 @@ static __poll_t xsk_poll(struct file *file, struct
>>>> socket *sock,
>>>>            if (pool->cached_need_wakeup) {
>>>>                    if (xs->zc)
>>>>                            xsk_wakeup(xs, pool->cached_need_wakeup);
>>>> -               else if (xs->tx)
>>>> +               else if (xs->tx && xskq_cons_nb_entries(xs->tx, 1))
>>>>                            /* Poll needs to drive Tx also in copy mode */
>>>>                            xsk_generic_xmit(sk);
>>>>            }
>>>>
>>>>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ