Message-ID: <aSYuX47uH4zT-FKi@dread.disaster.area>
Date: Wed, 26 Nov 2025 09:31:59 +1100
From: Dave Chinner <david@...morbit.com>
To: Karim Manaouil <kmanaouil.dev@...il.com>
Cc: Carlos Maiolino <cem@...nel.org>, linux-xfs@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: Too many xfs-conv kworker threads

On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote:
> Hi folks,
> 
> I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel 6.15
> with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The setup can
> achieve 25GB/s and more than 2M IOPS. The CPU is a dual socket 24-cores
> AMD EPYC 9224.

What is the mkfs.xfs (or xfs_info) output for the filesystem on
this device?

> I am running thpchallenge-fio from mmtests (its purpose is described
> here [1]). It's a fio job that inefficiently writes a large number of 64K
> files. On a system with 128GiB of RAM, it could create up to 100K files.
> A typical fio config looks like this:
> 
> [global]
> direct=0
> ioengine=sync
> blocksize=4096
> invalidate=0
> fallocate=none
> create_on_open=1
> 
> [writer]
> nrfiles=785988

That's ~800k files, not 100k.

> filesize=65536
> readwrite=write
> numjobs=4
> filename_format=$jobnum/workfile.$filenum
> 
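For context, the dataset implied by that job file can be sized with some
quick arithmetic (a sketch using the nrfiles, filesize and numjobs values
quoted above; integer division truncates the fractional GiB):

```shell
# nrfiles=785988 files of filesize=65536 bytes per job, numjobs=4 jobs.
per_job_gib=$((785988 * 65536 / 1024 / 1024 / 1024))   # ~47 GiB per job
total_gib=$((per_job_gib * 4))                         # ~188 GiB total
echo "per job: ${per_job_gib} GiB, total: ${total_gib} GiB"
```

That is roughly 192 GiB of written data against 128 GiB of RAM, so
writeback will be running under memory pressure for most of the job.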
> I noticed that, at some point, top reports around 42650 sleeping tasks,
> example:
> 
> Tasks: 42651 total,   1 running, 42650 sleeping,   0 stopped,   0 zombie
> 
> This is a test machine from a fresh boot running vanilla Debian.
> 
> After checking, it turned out to be a massive list of xfs-conv
> kworkers. Something like this (truncated):
> 
>   58214 ?        I      0:00 [kworker/47:203-xfs-conv/md127]
>   58215 ?        I      0:00 [kworker/47:204-xfs-conv/md127]
>   58216 ?        I      0:00 [kworker/47:205-xfs-conv/md127]
>   58217 ?        I      0:00 [kworker/47:206-xfs-conv/md127]
>   58219 ?        I      0:00 [kworker/12:539-xfs-conv/md127]
>   58220 ?        I      0:00 [kworker/12:540-xfs-conv/md127]
>   58221 ?        I      0:00 [kworker/12:541-xfs-conv/md127]
>   58222 ?        I      0:00 [kworker/12:542-xfs-conv/md127]
>   58223 ?        I      0:00 [kworker/12:543-xfs-conv/md127]
>   58224 ?        I      0:00 [kworker/12:544-xfs-conv/md127]
>   58225 ?        I      0:00 [kworker/12:545-xfs-conv/md127]
>   58227 ?        I      0:00 [kworker/38:155-xfs-conv/md127]
>   58228 ?        I      0:00 [kworker/38:156-xfs-conv/md127]
>   58230 ?        I      0:00 [kworker/38:158-xfs-conv/md127]
>   58233 ?        I      0:00 [kworker/38:161-xfs-conv/md127]
>   58235 ?        I      0:00 [kworker/8:537-xfs-conv/md127]
>   58237 ?        I      0:00 [kworker/8:539-xfs-conv/md127]
>   58238 ?        I      0:00 [kworker/8:540-xfs-conv/md127]
>   58239 ?        I      0:00 [kworker/8:541-xfs-conv/md127]
>   58240 ?        I      0:00 [kworker/8:542-xfs-conv/md127]
>   58241 ?        I      0:00 [kworker/8:543-xfs-conv/md127]
> 
> It seems like the kernel is creating too many kworkers on each CPU.

Or there are tens of thousands of small random write IOs
in flight at any given time, and this workload is serialising on
unwritten extent conversion during IO completion processing. i.e.
so many kworker threads indicate that work processing is blocking,
so each new work item queued gets a new kworker thread created to
process it.
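The per-CPU pile-up is easy to quantify from a ps listing of the kworker
names in the format quoted above; a small awk sketch (the sample input
lines here are hypothetical, in the same "kworker/CPU:ID-pool" format):

```shell
# Count xfs-conv kworkers per CPU: "kworker/12:539-xfs-conv/md127" -> CPU 12.
# Splitting on '/' and ':' puts the CPU number in field 2.
count_per_cpu() {
    awk -F'[/:]' '/xfs-conv/ { n[$2]++ } END { for (c in n) print c, n[c] }'
}

printf '%s\n' \
    'kworker/12:539-xfs-conv/md127' \
    'kworker/12:540-xfs-conv/md127' \
    'kworker/47:203-xfs-conv/md127' | count_per_cpu
```

On a live system the same function can be fed from `ps -e -o comm= |
count_per_cpu` to see how many conversion workers each CPU has accumulated.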

I suspect that the filesystem is running out of journal space, and
then unwritten extent conversion transactions start lock-stepping
waiting for journal space. Hence the question about mkfs.xfs output.
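One way to check the journal size in question is the "log" section of the
xfs_info output; a sketch that converts blocks x block size into MiB (the
sample line below is hypothetical, not from the reporter's filesystem):

```shell
# Extract the log size in MiB from an xfs_info "log" line, e.g.
# "log      =internal log           bsize=4096   blocks=521728, version=2"
log_mib() {
    awk '/^log / {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^bsize=/)  { sub("bsize=", "", $i);  bsize = $i }
            if ($i ~ /^blocks=/) { sub("blocks=", "", $i); sub(",", "", $i); blocks = $i }
        }
        print bsize * blocks / 1048576
    }'
}

echo 'log      =internal log           bsize=4096   blocks=521728, version=2' | log_mib
```

Run as `xfs_info /mount/point | log_mib` (hypothetical mount point) on the
live filesystem to see how much journal space the workload has to play with.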

Also info about IO load (e.g. `iostat -dxm 5` output) whilst the test
is running would be useful, because even fast devices can end up
being really slow when something stupid is being done...

Kernel profiles across all CPUs whilst the workload is running and
in this state would be useful.

> I am not sure if this has any effect on performance, but potentially,
> there is some scheduling overhead?!

It probably does, but a trainsmash of stalled in-progress work like
this is typically a symptom of some other misbehaviour occurring.

FWIW, for a workload intended to produce "inefficient write IO",
this sort of behaviour definitely indicates that something
"inefficient" is occurring during write IO. So, in the end, there is
a definite possibility that there may not actually be anything that
can be "fixed" here....

-Dave.
-- 
Dave Chinner
david@...morbit.com
