Date: Wed, 8 May 2024 23:16:21 +0000
From: Kamaljit Singh <Kamaljit.Singh1@....com>
To: Sagi Grimberg <sagi@...mberg.me>, Chaitanya Kulkarni
	<chaitanyak@...dia.com>
CC: "kbusch@...nel.org" <kbusch@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-nvme@...ts.infradead.org"
	<linux-nvme@...ts.infradead.org>
Subject: Re: WQ_UNBOUND workqueue warnings from multiple drivers

Sagi,

>Does this happen with a 90%-100% read workload?
Yes, we’ve now seen it with 100% reads as well. Here’s the Medusa command we used; I’ve removed the device list for brevity.
sudo /opt/medusa_labs/test_tools/bin/maim 20g -b8K -Q128 -Y1 -M30 --full-device -B3 -r -d900000 <device_list>
 
We saw the original issue with upstream kernel v6.6.21, but we’re now also seeing it with Ubuntu 24.04 (kernel 6.8.0-31-generic), where IOs are timing out and forcing connection drops.
 
 
>Question, are you working with a Linux controller?
No, with our ASIC (NVMe Fabrics bridge).
 
>What is the ctrl ioccsz?
ioccsz    : 4
 
 
Thanks,
Kamaljit
 
From: Sagi Grimberg <sagi@...mberg.me>
Date: Sunday, April 7, 2024 at 13:08
To: Kamaljit Singh <Kamaljit.Singh1@....com>, Chaitanya Kulkarni <chaitanyak@...dia.com>
Cc: kbusch@...nel.org <kbusch@...nel.org>, linux-kernel@...r.kernel.org <linux-kernel@...r.kernel.org>, linux-nvme@...ts.infradead.org <linux-nvme@...ts.infradead.org>
Subject: Re: WQ_UNBOUND workqueue warnings from multiple drivers


On 03/04/2024 2:50, Kamaljit Singh wrote:
> Sagi, Chaitanya,
>
> Sorry for the delay, found your replies in the junk folder :(
>
>>   Was the test you were running read-heavy?
> No, most of the failing fio tests were doing heavy writes. All were run with 8 controllers and 32 namespaces each; the io-specs are below.
>
> [1] bs=16k, iodepth=16, rwmixread=0, numjobs=16
> Failed in ~1 min
>
> Some others were:
> [2] bs=8k, iodepth=16, rwmixread=5, numjobs=16
> [3] bs=8k, iodepth=16, rwmixread=50, numjobs=16

Interesting, that is the opposite of what I would have suspected (I thought
that the workload would be read-only or read-mostly).

Does this happen with a 90%-100% read workload?

If we look at nvme_tcp_io_work(), it is essentially a loop doing
send() and recv() that checks on every iteration whether a 1ms
deadline has elapsed. The fact that this happens on a 100% write
workload leads me to conclude that the only way it can happen is
if sending a single 16K request to a controller on its own takes
more than 10ms, which is unexpected...
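
To make the timing argument concrete, here is a rough C sketch of the
loop shape I mean (a hand-written illustration, not the actual nvme-tcp
source; try_send(), try_recv() and io_wq are placeholder names):

/*
 * Rough illustration of the io_work loop shape, not the driver source.
 * try_send()/try_recv() and io_wq are placeholders.
 */
#include <linux/jiffies.h>
#include <linux/workqueue.h>

static struct workqueue_struct *io_wq;          /* placeholder workqueue */
static bool try_send(struct work_struct *w);    /* placeholder: push queued PDUs */
static bool try_recv(struct work_struct *w);    /* placeholder: reap completions */

static void io_work_sketch(struct work_struct *w)
{
        unsigned long deadline = jiffies + msecs_to_jiffies(1);

        do {
                bool pending = false;

                /* one send attempt and one recv attempt per iteration */
                pending |= try_send(w);
                pending |= try_recv(w);

                if (!pending)
                        return;         /* nothing left to do */

                /*
                 * The 1ms budget is only re-checked here, between
                 * iterations. If a single send of a 16K request stalls
                 * for >10ms, the work item hogs the CPU long enough to
                 * trip the workqueue warning before this test runs again.
                 */
        } while (!time_after(jiffies, deadline));

        /* budget exhausted with work still pending: requeue and yield */
        queue_work(io_wq, w);
}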

Question, are you working with a Linux controller? What is
the ctrl ioccsz?
