Message-ID: <aSeAPOZpcGaONne9@dread.disaster.area>
Date: Thu, 27 Nov 2025 09:33:32 +1100
From: Dave Chinner <david@...morbit.com>
To: Karim Manaouil <kmanaouil.dev@...il.com>
Cc: Carlos Maiolino <cem@...nel.org>, linux-xfs@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: Too many xfs-conv kworker threads

On Wed, Nov 26, 2025 at 01:27:21PM +0000, Karim Manaouil wrote:
> 
> Hi Dave,
> 
> Thanks for looking at this.
> 
> On Wed, Nov 26, 2025 at 09:31:59AM +1100, Dave Chinner wrote:
> > On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote:
> > > Hi folks,
> > > 
> > > I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel 6.15
> > > with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The setup can
> > > achieve 25GB/s and more than 2M IOPS. The CPU is a dual socket 24-cores
> > > AMD EPYC 9224.
> > 
> > The mkfs.xfs (or xfs_info) output for the filesystem on this
> > device is?
> 
> Here is xfs_info
> 
> meta-data=/dev/md127             isize=512    agcount=48, agsize=20346496 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1

rmapbt is enabled. Important.

> This is the last 20-30s from iostat -dxm5 during the test. It's been
> consistently the same throughout the test at ~80-89% utilization.
> 
> Device              w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz      aqu-sz  %util
> md127           68713.80   1051.87     0.00   0.00    1.05    15.68       72.14  89.52
> md127           66888.40    943.12     0.00   0.00    0.92    14.44       61.68  88.08
> md127           68453.80    653.24     0.00   0.00    1.23     9.77       84.37  87.12
> md127           82154.80    604.90     0.00   0.00    1.64     7.54      134.87  86.88
> md127           70320.60    295.50     0.00   0.00    1.97     4.30      138.60  87.12
> md127           19574.60     84.99     0.00   0.00    2.27     4.45       44.48  24.96
                                                                 ^^^^

And the average write IO size is between 4-16kB, and it's reaching
hundreds of IOs in flight at the block layer at once. So, yeah, the
stress test is definitely resulting in inefficient IO patterns, as
intended.
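
(Showing my arithmetic for anyone reading along - wareq-sz is just
wMB/s scaled by w/s, so taking the first sample above:)

    avg write size = wMB/s * 1024 / w/s
                   = 1051.87 * 1024 / 68713.80
                  ~= 15.7 KiB    (matches wareq-sz = 15.68)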

As for the writeback IO rate, this is pretty typical for delayed
allocation - writeback is single threaded and can block. Best case
for delayed allocation is 100-120k allocations per second.  Every IO
in your workload requires allocation, and it's running at about
70-80k allocations a second.

So, yeah, that seems a bit low, but not unexpectedly low.
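
If you want to measure the allocation rate directly rather than
inferring it from the write IOPS, something like this should get you
close. It's only a sketch - the xfs_alloc_vextent* probe names are an
assumption, so check what your kernel actually exports first:

  # count extent allocations per second via kprobes (probe names are
  # a guess; see /sys/kernel/debug/tracing/available_filter_functions)
  bpftrace -e 'kprobe:xfs_alloc_vextent* { @allocs = count(); }
	interval:s:1 { print(@allocs); clear(@allocs); }'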

> In addition, I got the kernel profile with perf record -a -g.
> 
> Please find at the end of this email the output of (~500 lines of) perf report.
> 
> I have also generated the flamegraph here to make life easy.
> 
> https://limewire.com/d/b5lJ1#ZigjlrS9mg

The vast majority of IO completion work is updating the rmapbt
in xfs_rmap_convert(). There looks to be ~10x the CPU overhead in
updating the rmapbt (5%) vs the bmapbt (0.5%) during unwritten
extent conversion.
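
That's straight from the profile you posted; you should be able to
pull the same comparison out of your perf.data with something like:

  # compare CPU time spent in the rmapbt vs bmbt update paths
  perf report --stdio --no-children | grep -E 'xfs_rmap|xfs_bmbt'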

And I'd suggest that all the xfs-conv kworker threads are being
created because the rmapbt updates are contending on the AGF lock
to be able to perform the rmapbt update.

i.e. unwritten extent conversion bmbt updates are per-inode (no
global resources needed), whilst the rmapbt updates are per-AG.
Every file that is in the same AG will contend for the same AGF lock
to do rmap updates.

It will also contend with IO submission because it is doing
allocation and that requires holding the AGF locked.

IOWs, the contention point here is AGF locking for the rmapbt
updates during IO submission and IO completion.  If you turn off
rmapbt it will go somewhat faster, but it won't magically run at
device speed because writeback is single threaded.  I have some
ideas on how to reduce contention on the AGF for allocation and
rmapbt updates, but they are just ideas at this point.
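
(Note that rmapbt can only be selected at mkfs time - there's no way
to drop it from an existing filesystem - so testing without it means
rebuilding, e.g.:)

  # destroys the existing filesystem and all data on it
  mkfs.xfs -f -m rmapbt=0 /dev/md127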

> > > I am not sure if this has any effect on performance, but potentially,
> > > there is some scheduling overhead?!
> > 
> > It probably does, but a trainsmash of stalled in-progress work like
> > this is typically a symptom of some other misbehaviour occurring.
> > 
> > FWIW, for a workload intended to produce "inefficient write IO",
> > this sort of behaviour is definitely indicating something
> > "inefficient" is occurring during write IO. So, in the end, there is
> > a definite possibility that there may not actually be anything that
> > can be "fixed" here....
> 
> You're right, but having 45k kworker threads still looks questionable to me
> even with the inefficiency in mind.

The explosion of kworker threads is a result of scheduler behaviour.
It moves the writeback thread around because it is unbound and
frequently blocks, whilst other kernel tasks that are bound to a
specific CPU (like xfs-conv processing) take scheduling priority.
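
A rough way to watch the pile-up as it happens (kernel thread comm
names get truncated to 15 characters, so match on the kworker prefix
rather than the full xfs-conv name):

  # count live kworker threads once a second
  while sleep 1; do ps -e -o comm= | grep -c '^kworker'; done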

It's not ideal behaviour in this particular corner case, but for a
stress test that is intended to create "inefficient IO patterns",
this is exactly the sort of behaviour it should be exercising.
Remember, this is an artificial stress test....

-Dave.
-- 
Dave Chinner
david@...morbit.com
