linux-ext4 - Re: jbd2: io throttle for metadata buffers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b85e49c6-588e-63c7-b153-a273183f810e@huaweicloud.com>
Date: Mon, 19 Aug 2024 20:49:31 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Haifeng Xu <haifeng.xu@...pee.com>
Cc: tytso@....edu, jack@...e.com, linux-ext4@...r.kernel.org,
 linux-kernel@...r.kernel.org, hanjinke.666@...edance.com, yukuai3@...wei.com
Subject: Re: jbd2: io throttle for metadata buffers

Hello, Haifeng.

On 2024/8/19 18:19, Haifeng Xu wrote:
> Hi, matsers!
> 
> 
> We encountered high load issuses in our production environment recently. And the kernel version is stable-5.15.39
> the filesystem is ext4(ordered).
> 
> 
> After digging into it, we found the problem is due to io.max
> 
> 
> thread 1:                                             
> 
> PID: 189529  TASK: ffff92ab51e5c080  CPU: 34  COMMAND: "mc"
>  #0 [ffffa638db807800] __schedule at ffffffff83b19898
>  #1 [ffffa638db807888] schedule at ffffffff83b19e9e
>  #2 [ffffa638db8078a8] io_schedule at ffffffff83b1a316
>  #3 [ffffa638db8078c0] bit_wait_io at ffffffff83b1a751
>  #4 [ffffa638db8078d8] __wait_on_bit at ffffffff83b1a373
>  #5 [ffffa638db807918] out_of_line_wait_on_bit at ffffffff83b1a46d
>  #6 [ffffa638db807970] __wait_on_buffer at ffffffff831b9c64
>  #7 [ffffa638db807988] jbd2_log_do_checkpoint at ffffffff832b556e
>  #8 [ffffa638db8079e8] __jbd2_log_wait_for_space at ffffffff832b55dc
>  #9 [ffffa638db807a30] add_transaction_credits at ffffffff832af369
> #10 [ffffa638db807a98] start_this_handle at ffffffff832af50f
> #11 [ffffa638db807b20] jbd2__journal_start at ffffffff832afe1f
> #12 [ffffa638db807b60] __ext4_journal_start_sb at ffffffff83241af3
> #13 [ffffa638db807ba8] __ext4_new_inode at ffffffff83253be6
> #14 [ffffa638db807c80] ext4_mkdir at ffffffff8327ec9e
> #15 [ffffa638db807d10] vfs_mkdir at ffffffff83182a92
> #16 [ffffa638db807d50] ovl_mkdir_real at ffffffffc0965c9f [overlay]
> #17 [ffffa638db807d80] ovl_create_real at ffffffffc0965e8b [overlay]
> #18 [ffffa638db807db8] ovl_create_or_link at ffffffffc09677cc [overlay]
> #19 [ffffa638db807e10] ovl_create_object at ffffffffc0967a48 [overlay]
> #20 [ffffa638db807e60] ovl_mkdir at ffffffffc0967ad3 [overlay]
> #21 [ffffa638db807e70] vfs_mkdir at ffffffff83182a92
> #22 [ffffa638db807eb0] do_mkdirat at ffffffff83184305
> #23 [ffffa638db807f08] __x64_sys_mkdirat at ffffffff831843df
> #24 [ffffa638db807f28] do_syscall_64 at ffffffff83b0bf1c
> #25 [ffffa638db807f50] entry_SYSCALL_64_after_hwframe at ffffffff83c0007c
> 
> other threads:
> 
> 
> PID: 21125  TASK: ffff929f5b9a0000  CPU: 44  COMMAND: "task_server"
>  #0 [ffffa638aff9b900] __schedule at ffffffff83b19898
>  #1 [ffffa638aff9b988] schedule at ffffffff83b19e9e
>  #2 [ffffa638aff9b9a8] schedule_preempt_disabled at ffffffff83b1a24e
>  #3 [ffffa638aff9b9b8] __mutex_lock at ffffffff83b1af28
>  #4 [ffffa638aff9ba38] __mutex_lock_slowpath at ffffffff83b1b1a3
>  #5 [ffffa638aff9ba48] mutex_lock at ffffffff83b1b1e2
>  #6 [ffffa638aff9ba60] mutex_lock_io at ffffffff83b1b210
>  #7 [ffffa638aff9ba80] __jbd2_log_wait_for_space at ffffffff832b563b
>  #8 [ffffa638aff9bac8] add_transaction_credits at ffffffff832af369
>  #9 [ffffa638aff9bb30] start_this_handle at ffffffff832af50f
> #10 [ffffa638aff9bbb8] jbd2__journal_start at ffffffff832afe1f
> #11 [ffffa638aff9bbf8] __ext4_journal_start_sb at ffffffff83241af3
> #12 [ffffa638aff9bc40] ext4_dirty_inode at ffffffff83266d0a
> #13 [ffffa638aff9bc60] __mark_inode_dirty at ffffffff831ab423
> #14 [ffffa638aff9bca0] generic_update_time at ffffffff8319169d
> #15 [ffffa638aff9bcb0] inode_update_time at ffffffff831916e5
> #16 [ffffa638aff9bcc0] file_update_time at ffffffff83191b01
> #17 [ffffa638aff9bd08] file_modified at ffffffff83191d47
> #18 [ffffa638aff9bd20] ext4_write_checks at ffffffff8324e6e4
> #19 [ffffa638aff9bd40] ext4_buffered_write_iter at ffffffff8324edfb
> #20 [ffffa638aff9bd78] ext4_file_write_iter at ffffffff8324f553
> #21 [ffffa638aff9bdf8] ext4_file_write_iter at ffffffff8324f505
> #22 [ffffa638aff9be00] new_sync_write at ffffffff8316dfca
> #23 [ffffa638aff9be90] vfs_write at ffffffff8316e975
> #24 [ffffa638aff9bec8] ksys_write at ffffffff83170a97
> #25 [ffffa638aff9bf08] __x64_sys_write at ffffffff83170b2a
> #26 [ffffa638aff9bf18] do_syscall_64 at ffffffff83b0bf1c
> #27 [ffffa638aff9bf38] asm_common_interrupt at ffffffff83c00cc8
> #28 [ffffa638aff9bf50] entry_SYSCALL_64_after_hwframe at ffffffff83c0007c
> 
> 
> The cgroup of thread1 has set io.max, so the j_checkpoint_mutex can't be released and many threads must wait for it.
> I have some questions about the throttle for the metadata buffers.
> 
> 1) writeback 
> 
> jbd2 converts the buffer head from jbddirty to buffer_dirty and trigger the write back in __jbd2_journal_temp_unlink_buffer().
> By default, the blkcg in bdi_writeback attached to block device inode is blkcg_root which has no io throttle rules. But there may be other
> threads which invoke sync_filesystem, such as umount overlayfs. This operation will write out all dirty data associated with the block
> device. In this case, the bdi_writeback attached to block device inode may changed due to Boyer-Moore majority vote algorithm.
> And the blkcg in bdi_writeback attached to block device inode is the group where the thread allocate the buffer head and dev page.
> 
> So the writeback process of metadata buffers can also be throttled, right?
> 
> 
> 2) checkpoint
> 
> If the free log space is not suffcient, we will do checkpoint to update log tail. During the process, if the buffer head hasn't been 
> written out by wirteback. we will lock the buffer head and submit bio in current context.
> 
> So the throttle rules may be different from writeback?
> 
> 
> 3）j_checkpoint_mutex
> If we can't make any progress in checkpoint due to io throttle, the j_checkpoint_mutex can'be release and block many others threads.
> 
> So can we cancel the throttle rules for metadata buffers and keep it in blkcg_root?
> 

It seems that iocost have already act as blkcg_root if bios have
REQ_META set(ext4's metadata bh should've set this flag), but
blk-thottle doesn't, Jinke had submitted a patch to improve this
case, maybe it could help, please take a look at this patch. Or
maybe we could add some similar logic in blk-throttle like iocost
does for REQ_META.

https://lore.kernel.org/linux-block/20230228085935.71465-1-hanjinke.666@bytedance.com/

Thanks,
Yi.