Message-ID: <6d3af8dd-30c3-48d4-9083-7f00ea21ff8c@nvidia.com>
Date: Thu, 21 Mar 2024 17:36:15 +0000
From: Chaitanya Kulkarni <chaitanyak@...dia.com>
To: Sagi Grimberg <sagi@...mberg.me>, Kamaljit Singh <Kamaljit.Singh1@....com>
CC: "kbusch@...nel.org" <kbusch@...nel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-nvme@...ts.infradead.org"
	<linux-nvme@...ts.infradead.org>
Subject: Re: WQ_UNBOUND workqueue warnings from multiple drivers

On 3/20/24 02:11, Sagi Grimberg wrote:
>
>
> On 19/03/2024 0:33, Kamaljit Singh wrote:
>> Hello,
>>
>> After switching from kernel v6.6.2 to v6.6.21 we're now seeing these
>> workqueue warnings. I found a discussion thread about the Intel drm
>> driver here
>> https://lore.kernel.org/lkml/ZO-BkaGuVCgdr3wc@slm.duckdns.org/T/
>>
>> and this related bug report
>> https://gitlab.freedesktop.org/drm/intel/-/issues/9245
>> but that drm fix isn't merged into v6.6.21. It appears that we may
>> need the same WQ_UNBOUND change in the nvme-tcp host driver, among
>> others.
>>
>> [Fri Mar 15 22:30:06 2024] workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
>> [Fri Mar 15 23:44:58 2024] workqueue: drain_vmap_area_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 09:55:27 2024] workqueue: drain_vmap_area_work hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 17:51:18 2024] workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
>> [Sat Mar 16 23:04:14 2024] workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
>> [Sun Mar 17 21:35:46 2024] perf: interrupt took too long (2707 > 2500), lowering kernel.perf_event_max_sample_rate to 73750
>> [Sun Mar 17 21:49:34 2024] workqueue: drain_vmap_area_work hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
>> ...
>> workqueue: drm_fb_helper_damage_work [drm_kms_helper] hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
>
> Hey Kamaljit,
>
> It's interesting that this happens, because nvme_tcp_io_work is bound
> to 1 jiffie. That said, in theory we do not stop receiving from a
> socket once we've started, so I guess this can happen in some extreme
> cases. Was the test you were running read-heavy?
>
> I was thinking that we may want to optionally move the recv path to
> softirq instead to get some latency improvements, although I don't
> know if that would improve the situation if we end up spending a lot
> of time in soft-irq...
>
>>     Thanks,
>> Kamaljit Singh
>
>
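
To Sagi's point, the bound he mentions is the time budget in
nvme_tcp_io_work(); from memory it is roughly the following (a
simplified sketch, not the literal driver code, with
nvme_tcp_try_send()/nvme_tcp_try_recv() standing in for the actual
send/recv passes):

static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;

		/*
		 * One send pass and one recv pass; 'pending' is set
		 * when the socket still has more work queued up.
		 */
		if (nvme_tcp_try_send(queue) > 0)
			pending = true;
		if (nvme_tcp_try_recv(queue) > 0)
			pending = true;

		if (!pending)
			return;
	} while (!time_after(jiffies, deadline));

	/* ~1 jiffy budget exhausted: requeue and give the CPU back */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

The deadline is only checked between passes, so a single recv pass that
keeps draining a busy socket can overshoot the budget, which would match
Sagi's point about not stopping once we've started receiving.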

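For reference, the WQ_UNBOUND switch that Kamaljit (and the warning
itself) points at would presumably be a one-liner at workqueue
allocation time, along these lines (untested, and assuming the
allocation in nvme_tcp_init_module() still looks like this):

-	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq", WQ_MEM_RECLAIM, 0);
+	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
+				      WQ_MEM_RECLAIM | WQ_UNBOUND, 0);

Whether that is the right trade-off for nvme-tcp is a separate question,
since it would relax the per-CPU placement the driver relies on today,
so we should measure before and after.
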
We need a regular test for this in blktests, as it doesn't look like we
caught this in regular testing ...

Kamaljit, can you please provide details of the tests you are running so
we can reproduce?

-ck

