Message-ID: <550af81b-6d62-4fc3-9df3-2d74989f4ca0@nvidia.com>
Date: Mon, 23 Dec 2024 08:17:08 +0000
From: Alex Lazar <alazar@...dia.com>
To: Joe Damato <jdamato@...tly.com>, "aleksander.lobakin@...el.com"
	<aleksander.lobakin@...el.com>, "almasrymina@...gle.com"
	<almasrymina@...gle.com>, "amritha.nambiar@...el.com"
	<amritha.nambiar@...el.com>, "bigeasy@...utronix.de" <bigeasy@...utronix.de>,
	"bjorn@...osinc.com" <bjorn@...osinc.com>, "corbet@....net" <corbet@....net>,
	Dan Jurgens <danielj@...dia.com>, "davem@...emloft.net"
	<davem@...emloft.net>, "donald.hunter@...il.com" <donald.hunter@...il.com>,
	"dsahern@...nel.org" <dsahern@...nel.org>, "edumazet@...gle.com"
	<edumazet@...gle.com>, "hawk@...nel.org" <hawk@...nel.org>,
	"jiri@...nulli.us" <jiri@...nulli.us>, "johannes.berg@...el.com"
	<johannes.berg@...el.com>, "kuba@...nel.org" <kuba@...nel.org>,
	"leitao@...ian.org" <leitao@...ian.org>, "leon@...nel.org" <leon@...nel.org>,
	"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
	"lorenzo@...nel.org" <lorenzo@...nel.org>, "michael.chan@...adcom.com"
	<michael.chan@...adcom.com>, "mkarsten@...terloo.ca" <mkarsten@...terloo.ca>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>, "pabeni@...hat.com"
	<pabeni@...hat.com>, Saeed Mahameed <saeedm@...dia.com>, "sdf@...ichev.me"
	<sdf@...ichev.me>, "skhawaja@...gle.com" <skhawaja@...gle.com>,
	"sridhar.samudrala@...el.com" <sridhar.samudrala@...el.com>, Tariq Toukan
	<tariqt@...dia.com>, "willemdebruijn.kernel@...il.com"
	<willemdebruijn.kernel@...il.com>, "xuanzhuo@...ux.alibaba.com"
	<xuanzhuo@...ux.alibaba.com>, Gal Pressman <gal@...dia.com>, Nimrod Oren
	<noren@...dia.com>, Dror Tennenbaum <drort@...dia.com>, Dragos Tatulea
	<dtatulea@...dia.com>
Subject: Re: [net-next v6 0/9] Add support for per-NAPI config via netlink



On 20/12/2024 19:40, Joe Damato wrote:
> On Wed, Dec 18, 2024 at 09:08:58AM -0800, Joe Damato wrote:
>> On Wed, Dec 18, 2024 at 11:22:33AM +0000, Alex Lazar wrote:
>>> Hi Joe and all,
>>>
>>> I am part of the NVIDIA Eth drivers team, and we are experiencing a problem
>>> that we bisected to this change: commit 86e25f40aa1e ("net: napi: Add napi_config")
>>>
>>> The issue occurs when sending packets from one machine to another.
>>> On the receiver side, we have XSK (XDPsock) that receives the packet and sends it
>>> back to the sender.
>>> At some point, one packet (packet A) gets "stuck," and if we send a new packet
>>> (packet B), it "pushes" the previous one. Packet A is then processed by the NAPI
>>> poll, and packet B gets stuck, and so on.
>>>
>>> Your change involves moving napi_hash_del() and napi_hash_add() from
>>> netif_napi_del() and netif_napi_add_weight() to napi_enable() and napi_disable().
>>> If I move them back to netif_napi_del() and netif_napi_add_weight(),
>>> the issue is resolved (I moved the entire if/else block, not just the napi_hash_del/add).
>>>
>>> This issue occurs with both the new and old APIs (netif_napi_add/_config).
>>> Moving the napi_hash_add() and napi_hash_del() functions resolves it for both.
>>> I am still debugging this, with no breakthrough so far.
>>>
>>> I would appreciate it if you could look into this.
>>> We can provide more details on request.
>>
>> I appreciate your report, but there is not a lot in your message to
>> help debug the issue.
>>
>> Can you please:
>>
>> 1.) Verify that the kernel tree you are testing on has commit
>> cecc1555a8c2 ("net: Make napi_hash_lock irq safe") included? If it
>> does not, can you pull in that commit and re-run your test and
>> report back if that fixes your problem?

I verified that the kernel tree includes commit cecc1555a8c2 ("net: Make 
napi_hash_lock irq safe"), but the issue still occurs.

>>
>> 2.) If (1) does not fix your problem, can you please reply with at
>> least the following information:
>>    - Specify what device this is happening on (in case I have access
>>      to one)

We are using two ConnectX-5 cards connected back-to-back.

>>    - Which driver is affected

The affected driver is the mlx5 driver.

>>    - Which upstream kernel SHA you are building your test kernel from

The upstream kernel SHA we are building is 9163b05eca1d ("Merge branch 
'add-support-for-so_priority-cmsg'").

>>    - The reproducer program(s) with clear instructions on how exactly
>>      to run it/them in order to reproduce the issue

Test setup/configuration:
On one side, we use a Python script with the scapy.all library to create 
UDP packets of size 1024 with destination port 19017, addressed to the 
MAC/IP of the other side.
On the other side, we define an n-tuple filter (ethtool --config-ntuple 
eth2 flow-type udp4 dst-port 19017 action 4), which steers that flow to 
RX queue 4, and run xdpsock on that queue (xdpsock -i eth2 -N -q 4 
--l2fwd -z -B).
In the test, we send a single packet each time, which is received and 
sent back to the sender.
As part of the validation, we check the statistics on the other side and 
notice a discrepancy between what xdpsock shows and what we see in the 
driver (ethtool -S eth2 | grep "tx_xsk_xmit").
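The following is a minimal standard-library sketch of the two halves of this check. It is not our actual test script. It builds the same kind of Ethernet/IPv4/UDP frame that the scapy script sends (roughly Ether(dst=...)/IP(dst=...)/UDP(dport=19017)/Raw(load=...) in scapy), assuming "size 1024" means a 1024-byte UDP payload. It also reads the tx_xsk_xmit counter out of `ethtool -S eth2` output so that value can be compared with the count xdpsock reports. The function names, the source port, and the example MAC/IP addresses are hypothetical.

```python
import socket
import struct


def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement checksum over 16-bit words (RFC 791 / RFC 1071)."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_udp_frame(src_mac: str, dst_mac: str, src_ip: str, dst_ip: str,
                    dport: int = 19017, sport: int = 12345,
                    payload_len: int = 1024) -> bytes:
    """Build Ethernet + IPv4 + UDP frame bytes (hypothetical helper).

    Assumption: 'size 1024' is the UDP payload length, not the frame length.
    """
    payload = b"\xab" * payload_len
    udp_len = 8 + len(payload)
    # UDP checksum 0 means "not computed", which IPv4 permits (RFC 768).
    udp = struct.pack("!HHHH", sport, dport, udp_len, 0) + payload
    # IPv4 header: version/IHL, TOS, total length, ID, flags/frag offset,
    # TTL, protocol, checksum (0 for now), source address, destination address.
    ip_hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + udp_len, 0, 0, 64,
                         socket.IPPROTO_UDP, 0,
                         socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    # Patch the real checksum into bytes 10..11 of the IPv4 header.
    ip_hdr = ip_hdr[:10] + struct.pack("!H", ipv4_checksum(ip_hdr)) + ip_hdr[12:]
    eth = (bytes.fromhex(dst_mac.replace(":", ""))
           + bytes.fromhex(src_mac.replace(":", ""))
           + struct.pack("!H", 0x0800))  # EtherType: IPv4
    return eth + ip_hdr + udp


def read_counter(ethtool_output: str, name: str) -> int:
    """Return one counter from `ethtool -S <iface>` text ('  name: value' lines)."""
    for line in ethtool_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == name:
            return int(value.strip())
    raise KeyError(name)


def counters_match(xdpsock_tx: int, ethtool_output: str) -> bool:
    """Compare the driver's tx_xsk_xmit with the TX count read from xdpsock."""
    return read_counter(ethtool_output, "tx_xsk_xmit") == xdpsock_tx
```

Sending the frame (for example with scapy's sendp() or a raw AF_PACKET socket) requires root and is left out here. When one packet is sent per test and there is no other traffic on queue 4, tx_xsk_xmit would be expected to grow by exactly one per packet echoed back. A "stuck" packet shows up as the driver counter lagging the xdpsock count.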

> 
> I didn't hear back on the above, but wanted to let you know that
> I'll be out of the office soon, so my responses/bandwidth for
> helping to debug this will be limited over the next week or two.

Hi Joe,

Thanks for the quick response.
My comments are inline. If you need more details or further clarification, 
please let me know.

