[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20251009101748.529277-1-matt@readmodwrite.com>
Date: Thu, 9 Oct 2025 11:17:48 +0100
From: Matt Fleming <matt@...dmodwrite.com>
To: Jan Kara <jack@...e.cz>
Cc: adilger.kernel@...ger.ca,
kernel-team@...udflare.com,
libaokun1@...wei.com,
linux-ext4@...r.kernel.org,
linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org,
matt@...dmodwrite.com,
tytso@....edu,
willy@...radead.org
Subject: Re: ext4 writeback performance issue in 6.12
On Wed, Oct 08, 2025 at 06:35:29PM +0200, Jan Kara wrote:
> Hi Matt!
>
> Nice talking to you again :)
Same. It's been too long :)
> On Wed 08-10-25 16:07:05, Matt Fleming wrote:
>
[...]
> So this particular hang check warning will be silenced by [1]. That being
> said if the writeback is indeed taking longer than expected (depends on
> cgroup configuration etc.) these patches will obviously not fix it. Based
> on what you write below, are you saying that most of the time from these
> 225s is spent in the filesystem allocating blocks? I'd expect we'd spend
> most of the time waiting for IO to complete...
Yeah, you're right. Most of the time is spenting waiting for writeback
to complete.
> So I'm somewhat confused here. How big is the allocation request? Above you
> write that average size of order 9 bucket is < 1280 which is true and
> makes me assume the allocation is for 1 stripe which is 1280 blocks. But
> here you write about order 9 allocation.
Sorry, I muddled my words. The allocation request is for 1280 blocks.
> Anyway, stripe aligned allocations don't always play well with
> mb_optimize_scan logic, so you can try mounting the filesystem with
> mb_optimize_scan=0 mount option.
Thanks, but unfortunately running with mb_optimize_scan=0 gives us much
worse performance. It looks like it's taking a long time to write out
even 1 page to disk. The flusher thread has been running for 20+hours
now non-stop and it's blocking tasks waiting on writeback.
[Thu Oct 9 09:49:59 2025] INFO: task dockerd:45649 blocked for more than 70565 seconds.
mfleming@...e:~$ ps -p 50674 -o pid,etime,cputime,comm
PID ELAPSED TIME COMMAND
50674 20:18:25 20:14:15 kworker/u400:20+flush-9:127
A perf profile shows:
# Overhead Command Shared Object Symbol
# ........ ............... ................. ...................................
#
32.09% kworker/u400:20 [kernel.kallsyms] [k] ext4_get_group_info
|
|--11.91%--ext4_mb_prefetch
| ext4_mb_regular_allocator
| ext4_mb_new_blocks
| ext4_ext_map_blocks
| ext4_map_blocks
| ext4_do_writepages
| ext4_writepages
| do_writepages
| __writeback_single_inode
| writeback_sb_inodes
| __writeback_inodes_wb
| wb_writeback
| wb_workfn
| process_one_work
| worker_thread
| kthread
| ret_from_fork
| ret_from_fork_asm
|
|--7.23%--ext4_mb_regular_allocator
| ext4_mb_new_blocks
| ext4_ext_map_blocks
| ext4_map_blocks
| ext4_do_writepages
| ext4_writepages
| do_writepages
| __writeback_single_inode
| writeback_sb_inodes
| __writeback_inodes_wb
| wb_writeback
| wb_workfn
| process_one_work
| worker_thread
| kthread
| ret_from_fork
| ret_from_fork_asm
mfleming@...e:~$ sudo perf ftrace latency -b -p 50674 -T ext4_mb_regular_allocator -- sleep 10
# DURATION | COUNT | GRAPH |
0 - 1 us | 0 | |
1 - 2 us | 0 | |
2 - 4 us | 0 | |
4 - 8 us | 0 | |
8 - 16 us | 0 | |
16 - 32 us | 0 | |
32 - 64 us | 0 | |
64 - 128 us | 0 | |
128 - 256 us | 0 | |
256 - 512 us | 0 | |
512 - 1024 us | 0 | |
1 - 2 ms | 0 | |
2 - 4 ms | 0 | |
4 - 8 ms | 0 | |
8 - 16 ms | 0 | |
16 - 32 ms | 0 | |
32 - 64 ms | 0 | |
64 - 128 ms | 85 | ############################################# |
128 - 256 ms | 1 | |
256 - 512 ms | 0 | |
512 - 1024 ms | 0 | |
1 - ... s | 0 | |
mfleming@...e:~$ sudo perf ftrace latency -b -p 50674 -T ext4_mb_prefetch -- sleep 10
# DURATION | COUNT | GRAPH |
0 - 1 us | 130 | |
1 - 2 us | 1962306 | #################################### |
2 - 4 us | 497793 | ######### |
4 - 8 us | 4598 | |
8 - 16 us | 277 | |
16 - 32 us | 21 | |
32 - 64 us | 10 | |
64 - 128 us | 1 | |
128 - 256 us | 0 | |
256 - 512 us | 0 | |
512 - 1024 us | 0 | |
1 - 2 ms | 0 | |
2 - 4 ms | 0 | |
4 - 8 ms | 0 | |
8 - 16 ms | 0 | |
16 - 32 ms | 0 | |
32 - 64 ms | 0 | |
64 - 128 ms | 0 | |
128 - 256 ms | 0 | |
256 - 512 ms | 0 | |
512 - 1024 ms | 0 | |
1 - ... s | 0 | |
mfleming@...e:~$ sudo bpftrace -e 'fentry:vmlinux:writeback_sb_inodes / tid==50674/ { @in = args.work->nr_pages; @start=nsecs;} fexit:vmlinux:writeback_sb_inodes /tid == 50674/ { $delta = (nsecs - @start) / 1000000; printf("IN: work->nr_pages=%d, OUT: work->nr_pages=%d, wrote=%d page(s) in %dms\n", @in, args.work->nr_pages, @in - args.work->nr_pages, $delta);} END{clear(@in);} interval:s:5 { exit();}'
Attaching 4 probes...
IN: work->nr_pages=6095831, OUT: work->nr_pages=6095830, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095830, OUT: work->nr_pages=6095829, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095829, OUT: work->nr_pages=6095828, wrote=1 page(s) in 108ms
IN: work->nr_pages=6095828, OUT: work->nr_pages=6095827, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095827, OUT: work->nr_pages=6095826, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095826, OUT: work->nr_pages=6095825, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095825, OUT: work->nr_pages=6095824, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095824, OUT: work->nr_pages=6095823, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095823, OUT: work->nr_pages=6095822, wrote=1 page(s) in 107ms
IN: work->nr_pages=6095822, OUT: work->nr_pages=6095821, wrote=1 page(s) in 106ms
Thanks,
Matt
Powered by blists - more mailing lists