Message-ID: <2nuegl4wtmu3lkprcomfeluii77ofrmkn4ukvbx2gesnqlsflk@yx466sbd7bni>
Date: Wed, 8 Oct 2025 18:35:29 +0200
From: Jan Kara <jack@...e.cz>
To: Matt Fleming <matt@...dmodwrite.com>
Cc: adilger.kernel@...ger.ca, kernel-team@...udflare.com,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
tytso@....edu, willy@...radead.org, Baokun Li <libaokun1@...wei.com>,
Jan Kara <jack@...e.cz>
Subject: Re: ext4 writeback performance issue in 6.12
Hi Matt!
Nice talking to you again :)
On Wed 08-10-25 16:07:05, Matt Fleming wrote:
> (Adding Baokun and Jan in case they have any ideas)
> On Mon, Oct 06, 2025 at 12:56:15 +0100, Matt Fleming wrote:
> > Hi,
> >
> > We're seeing writeback take a long time and triggering blocked task
> > warnings on some of our database nodes, e.g.
> >
> > INFO: task kworker/34:2:243325 blocked for more than 225 seconds.
> > Tainted: G O 6.12.41-cloudflare-2025.8.2 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:kworker/34:2 state:D stack:0 pid:243325 tgid:243325 ppid:2 task_flags:0x4208060 flags:0x00004000
> > Workqueue: cgroup_destroy css_free_rwork_fn
> > Call Trace:
> > <TASK>
> > __schedule+0x4fb/0xbf0
> > schedule+0x27/0xf0
> > wb_wait_for_completion+0x5d/0x90
> > ? __pfx_autoremove_wake_function+0x10/0x10
> > mem_cgroup_css_free+0x19/0xb0
> > css_free_rwork_fn+0x4e/0x430
> > process_one_work+0x17e/0x330
> > worker_thread+0x2ce/0x3f0
> > ? __pfx_worker_thread+0x10/0x10
> > kthread+0xd2/0x100
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork+0x34/0x50
> > ? __pfx_kthread+0x10/0x10
> > ret_from_fork_asm+0x1a/0x30
> > </TASK>
So this particular hang check warning will be silenced by [1]. That being
said, if the writeback is indeed taking longer than expected (depending on
the cgroup configuration etc.), these patches will obviously not fix it.
Based on what you write below, are you saying that most of these 225
seconds are spent in the filesystem allocating blocks? I'd expect we'd
spend most of the time waiting for IO to complete...
[1] https://lore.kernel.org/linux-fsdevel/20250930065637.1876707-1-sunjunchao@bytedance.com/
> > A large chunk of system time (4.43%) is being spent in the following
> > code path:
> >
> > ext4_get_group_info+9
> > ext4_mb_good_group+41
> > ext4_mb_find_good_group_avg_frag_lists+136
> > ext4_mb_regular_allocator+2748
> > ext4_mb_new_blocks+2373
> > ext4_ext_map_blocks+2149
> > ext4_map_blocks+294
> > ext4_do_writepages+2031
> > ext4_writepages+173
> > do_writepages+229
> > __writeback_single_inode+65
> > writeback_sb_inodes+544
> > __writeback_inodes_wb+76
> > wb_writeback+413
> > wb_workfn+196
> > process_one_work+382
> > worker_thread+718
> > kthread+210
> > ret_from_fork+52
> > ret_from_fork_asm+26
> >
> > That's the path through the CR_GOAL_LEN_FAST allocator.
> >
> > The primary reason for all these cycles looks to be that we're spending
> > a lot of time in ext4_mb_find_good_group_avg_frag_lists(). The fragment
> > lists seem quite big and the function fails to find a suitable group
> > pretty much every time it's called either because the frag list is empty
> > (orders 10-13) or the average size is < 1280 (order 9). I'm assuming it
> > falls back to a linear scan at that point.
> >
> > https://gist.github.com/mfleming/5b16ee4cf598e361faf54f795a98c0a8
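
Just to make sure we are looking at the same thing: my (simplified)
reading of the CR_GOAL_LEN_FAST path is roughly the below -- locking and
the mb stats accounting omitted, so don't take it as the literal mballoc
code:

	/*
	 * Start at the avg-fragment-size order matching the goal length
	 * and walk the lists upwards, taking the first group that
	 * ext4_mb_good_group() accepts.
	 */
	for (order = mb_avg_fragment_size_order(sb, ac->ac_g_ex.fe_len);
	     order < MB_NUM_ORDERS(sb); order++) {
		list_for_each_entry(grp, &sbi->s_mb_avg_fragment_size[order],
				    bb_avg_fragment_size_node) {
			if (ext4_mb_good_group(ac, grp->bb_group,
					       CR_GOAL_LEN_FAST))
				return grp;	/* suitable group found */
		}
	}
	/* nothing found -> fall back to the next, more expensive criteria */

So with the order 10-13 lists empty and the order 9 groups failing
ext4_mb_good_group(), every allocation would walk the whole order 9 list
for nothing, which would explain the cycles you see there.
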
> >
> > $ sudo cat /proc/fs/ext4/md127/mb_structs_summary
> > optimize_scan: 1
> > max_free_order_lists:
> > list_order_0_groups: 0
> > list_order_1_groups: 1
> > list_order_2_groups: 6
> > list_order_3_groups: 42
> > list_order_4_groups: 513
> > list_order_5_groups: 62
> > list_order_6_groups: 434
> > list_order_7_groups: 2602
> > list_order_8_groups: 10951
> > list_order_9_groups: 44883
> > list_order_10_groups: 152357
> > list_order_11_groups: 24899
> > list_order_12_groups: 30461
> > list_order_13_groups: 18756
> > avg_fragment_size_lists:
> > list_order_0_groups: 108
> > list_order_1_groups: 411
> > list_order_2_groups: 1640
> > list_order_3_groups: 5809
> > list_order_4_groups: 14909
> > list_order_5_groups: 31345
> > list_order_6_groups: 54132
> > list_order_7_groups: 90294
> > list_order_8_groups: 77322
> > list_order_9_groups: 10096
> > list_order_10_groups: 0
> > list_order_11_groups: 0
> > list_order_12_groups: 0
> > list_order_13_groups: 0
> >
> > These machines are striped and are using noatime:
> >
> > $ grep ext4 /proc/mounts
> > /dev/md127 /state ext4 rw,noatime,stripe=1280 0 0
> >
> > Is there some tunable or configuration option that I'm missing that
> > could help here to avoid wasting time in
> > ext4_mb_find_good_group_avg_frag_lists() when it's most likely going to
> > fail an order 9 allocation anyway?
So I'm somewhat confused here. How big is the allocation request? Above you
write that the average size in the order 9 bucket is < 1280, which is true
and makes me assume the allocation is for one stripe, i.e. 1280 blocks. But
here you write about an order 9 allocation.
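FWIW, if I read mb_avg_fragment_size_order() correctly, the list order
used for a goal of len blocks is roughly:

	order = fls(len) - 2;	/* fls(1280) = 11 -> order 9 */

so a one-stripe (1280 block) goal would start the scan at the order 9
avg-fragment-size list and walk up to order 13, which would be consistent
with the empty order 10-13 lists and the < 1280 averages you report.
Treat the exact formula with a grain of salt though, I'm going from
memory here.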
Anyway, stripe-aligned allocations don't always play well with the
mb_optimize_scan logic, so you can try mounting the filesystem with the
mb_optimize_scan=0 mount option.
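E.g. something like (the fstab route is the safe bet, I haven't checked
whether flipping the option on remount takes effect for an already
mounted fs):

	mount -o remount,mb_optimize_scan=0 /state

or add mb_optimize_scan=0 to the options for /dev/md127 in /etc/fstab
and remount / reboot.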
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR