linux-kernel - Re: ext4 writeback performance issue in 6.12

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aOesS6Feov9mrbJh@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Thu, 9 Oct 2025 18:06:27 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Matt Fleming <matt@...dmodwrite.com>
Cc: "Theodore Ts'o" <tytso@....edu>, Andreas Dilger <adilger.kernel@...ger.ca>,
        linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org,
        kernel-team@...udflare.com, linux-fsdevel@...r.kernel.org,
        Matthew Wilcox <willy@...radead.org>
Subject: Re: ext4 writeback performance issue in 6.12

On Mon, Oct 06, 2025 at 12:56:15PM +0100, Matt Fleming wrote:
> Hi,
> 
> We're seeing writeback take a long time and triggering blocked task
> warnings on some of our database nodes, e.g.
> 
>   INFO: task kworker/34:2:243325 blocked for more than 225 seconds.
>         Tainted: G           O       6.12.41-cloudflare-2025.8.2 #1
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   task:kworker/34:2    state:D stack:0     pid:243325 tgid:243325 ppid:2      task_flags:0x4208060 flags:0x00004000
>   Workqueue: cgroup_destroy css_free_rwork_fn
>   Call Trace:
>    <TASK>
>    __schedule+0x4fb/0xbf0
>    schedule+0x27/0xf0
>    wb_wait_for_completion+0x5d/0x90
>    ? __pfx_autoremove_wake_function+0x10/0x10
>    mem_cgroup_css_free+0x19/0xb0
>    css_free_rwork_fn+0x4e/0x430
>    process_one_work+0x17e/0x330
>    worker_thread+0x2ce/0x3f0
>    ? __pfx_worker_thread+0x10/0x10
>    kthread+0xd2/0x100
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork+0x34/0x50
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork_asm+0x1a/0x30
>    </TASK>
> 
> A large chunk of system time (4.43%) is being spent in the following
> code path:
> 
>    ext4_get_group_info+9
>    ext4_mb_good_group+41
>    ext4_mb_find_good_group_avg_frag_lists+136
>    ext4_mb_regular_allocator+2748
>    ext4_mb_new_blocks+2373
>    ext4_ext_map_blocks+2149
>    ext4_map_blocks+294
>    ext4_do_writepages+2031
>    ext4_writepages+173
>    do_writepages+229
>    __writeback_single_inode+65
>    writeback_sb_inodes+544
>    __writeback_inodes_wb+76
>    wb_writeback+413
>    wb_workfn+196
>    process_one_work+382
>    worker_thread+718
>    kthread+210
>    ret_from_fork+52
>    ret_from_fork_asm+26
> 
> That's the path through the CR_GOAL_LEN_FAST allocator.
> 
> The primary reason for all these cycles looks to be that we're spending
> a lot of time in ext4_mb_find_good_group_avg_frag_lists(). The fragment
> lists seem quite big and the function fails to find a suitable group
> pretty much every time it's called either because the frag list is empty
> (orders 10-13) or the average size is < 1280 (order 9). I'm assuming it
> falls back to a linear scan at that point.
> 
>   https://gist.github.com/mfleming/5b16ee4cf598e361faf54f795a98c0a8
> 
> $ sudo cat /proc/fs/ext4/md127/mb_structs_summary
> optimize_scan: 1
> max_free_order_lists:
> 	list_order_0_groups: 0
> 	list_order_1_groups: 1
> 	list_order_2_groups: 6
> 	list_order_3_groups: 42
> 	list_order_4_groups: 513
> 	list_order_5_groups: 62
> 	list_order_6_groups: 434
> 	list_order_7_groups: 2602
> 	list_order_8_groups: 10951
> 	list_order_9_groups: 44883
> 	list_order_10_groups: 152357
> 	list_order_11_groups: 24899
> 	list_order_12_groups: 30461
> 	list_order_13_groups: 18756
> avg_fragment_size_lists:
> 	list_order_0_groups: 108
> 	list_order_1_groups: 411
> 	list_order_2_groups: 1640
> 	list_order_3_groups: 5809
> 	list_order_4_groups: 14909
> 	list_order_5_groups: 31345
> 	list_order_6_groups: 54132
> 	list_order_7_groups: 90294
> 	list_order_8_groups: 77322
> 	list_order_9_groups: 10096
> 	list_order_10_groups: 0
> 	list_order_11_groups: 0
> 	list_order_12_groups: 0
> 	list_order_13_groups: 0
> 
> These machines are striped and are using noatime:

Hi Matt,

Thanks for the details, we have had issues in past where the allocator
gets stuck in a loop trying too hard to find blocks that are aligned to
the stripe size [1] but this particular issue was patched in an pre 6.12
kernel.

Coming to the above details, ext4_mb_find_good_group_avg_frag_list()
exits early if there are no groups of the needed so if we do have many
order 9+ allocations we shouldn't have been spending more time there.
The issue I think are the order 9 allocations, which allocator thinks it
can satisfy but it ends up not being able to find the space easily.
If ext4_mb_find_group_avg_frag_list() is indeed a bottleneck, there
are 2 places where it could be getting called from:

- ext4_mb_choose_next_group_goal_fast (criteria =
	EXT4_MB_CR_GOAL_LEN_FAST)
- ext4_mb_choose_next_group_best_avail (criteria =
	EXT4_MB_CR_BEST_AVAIL_LEN)

Will it be possible for you to use bpf to try to figure out which one of
the callers is actually the one bottlenecking (mihgt be tricky since
they will mostly get inlined) and a sample of values for ac_g_ex->fe_len
and ac_b_ex->fe_len if possible.

Also, can you share the ext4 mb stats by enabling it via:

 echo 1 > /sys/fs/ext4/vda2/mb_stats

And then once you are able to replicate it for a few mins: 

  cat /proc/fs/ext4/vda2/mb_stats

This will also give some idea on where the allocator is spending more
time.

Also, as Ted suggested, switching stripe off might also help here.

Regards,
Ojaswin
> 
> $ grep ext4 /proc/mounts
> /dev/md127 /state ext4 rw,noatime,stripe=1280 0 0
> 
> Is there some tunable or configuration option that I'm missing that
> could help here to avoid wasting time in
> ext4_mb_find_good_group_avg_frag_lists() when it's most likely going to
> fail an order 9 allocation anyway?
> 
> I'm happy to provide any more details that might help.
> 
> Thanks,
> Matt