Message-ID: <97fc38e6-a226-5e22-efc2-4405beb6d75b@huaweicloud.com>
Date: Mon, 19 Aug 2024 21:38:06 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Zhang Yi <yi.zhang@...weicloud.com>, Haifeng Xu <haifeng.xu@...pee.com>
Cc: tytso@....edu, jack@...e.com, linux-ext4@...r.kernel.org,
linux-kernel@...r.kernel.org, hanjinke.666@...edance.com,
Tejun Heo <tj@...nel.org>, linux-block <linux-block@...r.kernel.org>,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: jbd2: io throttle for metadata buffers
+CC Tejun
+CC block
On 2024/08/19 20:49, Zhang Yi wrote:
> Hello, Haifeng.
>
> On 2024/8/19 18:19, Haifeng Xu wrote:
>> Hi, masters!
>>
>>
>> We recently encountered high-load issues in our production environment. The kernel version is stable-5.15.39
>> and the filesystem is ext4 (ordered mode).
>>
>>
>> After digging into it, we found that the problem is caused by io.max.
>>
>>
>> thread 1:
>>
>> PID: 189529 TASK: ffff92ab51e5c080 CPU: 34 COMMAND: "mc"
>> #0 [ffffa638db807800] __schedule at ffffffff83b19898
>> #1 [ffffa638db807888] schedule at ffffffff83b19e9e
>> #2 [ffffa638db8078a8] io_schedule at ffffffff83b1a316
>> #3 [ffffa638db8078c0] bit_wait_io at ffffffff83b1a751
>> #4 [ffffa638db8078d8] __wait_on_bit at ffffffff83b1a373
>> #5 [ffffa638db807918] out_of_line_wait_on_bit at ffffffff83b1a46d
>> #6 [ffffa638db807970] __wait_on_buffer at ffffffff831b9c64
>> #7 [ffffa638db807988] jbd2_log_do_checkpoint at ffffffff832b556e
>> #8 [ffffa638db8079e8] __jbd2_log_wait_for_space at ffffffff832b55dc
>> #9 [ffffa638db807a30] add_transaction_credits at ffffffff832af369
>> #10 [ffffa638db807a98] start_this_handle at ffffffff832af50f
>> #11 [ffffa638db807b20] jbd2__journal_start at ffffffff832afe1f
>> #12 [ffffa638db807b60] __ext4_journal_start_sb at ffffffff83241af3
>> #13 [ffffa638db807ba8] __ext4_new_inode at ffffffff83253be6
>> #14 [ffffa638db807c80] ext4_mkdir at ffffffff8327ec9e
>> #15 [ffffa638db807d10] vfs_mkdir at ffffffff83182a92
>> #16 [ffffa638db807d50] ovl_mkdir_real at ffffffffc0965c9f [overlay]
>> #17 [ffffa638db807d80] ovl_create_real at ffffffffc0965e8b [overlay]
>> #18 [ffffa638db807db8] ovl_create_or_link at ffffffffc09677cc [overlay]
>> #19 [ffffa638db807e10] ovl_create_object at ffffffffc0967a48 [overlay]
>> #20 [ffffa638db807e60] ovl_mkdir at ffffffffc0967ad3 [overlay]
>> #21 [ffffa638db807e70] vfs_mkdir at ffffffff83182a92
>> #22 [ffffa638db807eb0] do_mkdirat at ffffffff83184305
>> #23 [ffffa638db807f08] __x64_sys_mkdirat at ffffffff831843df
>> #24 [ffffa638db807f28] do_syscall_64 at ffffffff83b0bf1c
>> #25 [ffffa638db807f50] entry_SYSCALL_64_after_hwframe at ffffffff83c0007c
>>
>> other threads:
>>
>>
>> PID: 21125 TASK: ffff929f5b9a0000 CPU: 44 COMMAND: "task_server"
>> #0 [ffffa638aff9b900] __schedule at ffffffff83b19898
>> #1 [ffffa638aff9b988] schedule at ffffffff83b19e9e
>> #2 [ffffa638aff9b9a8] schedule_preempt_disabled at ffffffff83b1a24e
>> #3 [ffffa638aff9b9b8] __mutex_lock at ffffffff83b1af28
>> #4 [ffffa638aff9ba38] __mutex_lock_slowpath at ffffffff83b1b1a3
>> #5 [ffffa638aff9ba48] mutex_lock at ffffffff83b1b1e2
>> #6 [ffffa638aff9ba60] mutex_lock_io at ffffffff83b1b210
>> #7 [ffffa638aff9ba80] __jbd2_log_wait_for_space at ffffffff832b563b
>> #8 [ffffa638aff9bac8] add_transaction_credits at ffffffff832af369
>> #9 [ffffa638aff9bb30] start_this_handle at ffffffff832af50f
>> #10 [ffffa638aff9bbb8] jbd2__journal_start at ffffffff832afe1f
>> #11 [ffffa638aff9bbf8] __ext4_journal_start_sb at ffffffff83241af3
>> #12 [ffffa638aff9bc40] ext4_dirty_inode at ffffffff83266d0a
>> #13 [ffffa638aff9bc60] __mark_inode_dirty at ffffffff831ab423
>> #14 [ffffa638aff9bca0] generic_update_time at ffffffff8319169d
>> #15 [ffffa638aff9bcb0] inode_update_time at ffffffff831916e5
>> #16 [ffffa638aff9bcc0] file_update_time at ffffffff83191b01
>> #17 [ffffa638aff9bd08] file_modified at ffffffff83191d47
>> #18 [ffffa638aff9bd20] ext4_write_checks at ffffffff8324e6e4
>> #19 [ffffa638aff9bd40] ext4_buffered_write_iter at ffffffff8324edfb
>> #20 [ffffa638aff9bd78] ext4_file_write_iter at ffffffff8324f553
>> #21 [ffffa638aff9bdf8] ext4_file_write_iter at ffffffff8324f505
>> #22 [ffffa638aff9be00] new_sync_write at ffffffff8316dfca
>> #23 [ffffa638aff9be90] vfs_write at ffffffff8316e975
>> #24 [ffffa638aff9bec8] ksys_write at ffffffff83170a97
>> #25 [ffffa638aff9bf08] __x64_sys_write at ffffffff83170b2a
>> #26 [ffffa638aff9bf18] do_syscall_64 at ffffffff83b0bf1c
>> #27 [ffffa638aff9bf38] asm_common_interrupt at ffffffff83c00cc8
>> #28 [ffffa638aff9bf50] entry_SYSCALL_64_after_hwframe at ffffffff83c0007c
>>
>>
>> The cgroup of thread 1 has io.max set, so its checkpoint IO is throttled while it holds j_checkpoint_mutex. The mutex can't be released and many other threads must wait for it.
>> I have some questions about throttling of metadata buffers.
>>
>> 1) writeback
>>
>> jbd2 converts the buffer head from jbddirty to buffer_dirty and triggers writeback in __jbd2_journal_temp_unlink_buffer().
>> By default, the blkcg of the bdi_writeback attached to the block device inode is blkcg_root, which has no io throttle rules. But other
>> threads may invoke sync_filesystem(), for example when unmounting overlayfs. This operation writes out all dirty data associated with the block
>> device. In this case, the bdi_writeback attached to the block device inode may change due to the Boyer-Moore majority vote algorithm,
>> and the blkcg of the new bdi_writeback is the group of the thread that allocated the buffer head and the device page.
>>
>> So the writeback process of metadata buffers can also be throttled, right?
>>
>>
>> 2) checkpoint
>>
>> If the free log space is not sufficient, we do a checkpoint to update the log tail. During this process, if a buffer head hasn't been
>> written out by writeback, we lock the buffer head and submit the bio in the current context.
>>
>> So the throttle rules may be different from writeback?
>>
>>
>> 3) j_checkpoint_mutex
>> If we can't make any progress in the checkpoint due to io throttling, the j_checkpoint_mutex can't be released, which blocks many other threads.
>>
>> So can we cancel the throttle rules for metadata buffers and keep them in blkcg_root?
>>
>
> It seems that iocost already treats bios as issued by blkcg_root if they
> have REQ_META set (ext4's metadata bh should have this flag set), but
> blk-throttle doesn't. Jinke submitted a patch to improve this
> case; maybe it could help, so please take a look at it. Or
> maybe we could add some logic to blk-throttle similar to what iocost
> does for REQ_META.
>
> https://lore.kernel.org/linux-block/20230228085935.71465-1-hanjinke.666@bytedance.com/
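For readers following the REQ_META point above, here is a minimal userspace sketch of the kind of check involved. It mirrors the shape of the kernel helper bio_issue_as_root_blkg(), which the patch below calls, but the flag values and the `mock_bio` struct are stand-ins invented for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag bits (assumption); the real REQ_META/REQ_SWAP values
 * are defined in include/linux/blk_types.h. */
#define MOCK_REQ_META (1u << 0)
#define MOCK_REQ_SWAP (1u << 1)

/* Hypothetical, stripped-down bio: only the op flags matter here. */
struct mock_bio {
	unsigned int bi_opf;
};

/* True if the bio should be treated as if the root blkg issued it,
 * i.e. it carries metadata or swap IO that others may be waiting on. */
static bool mock_issue_as_root(const struct mock_bio *bio)
{
	return (bio->bi_opf & (MOCK_REQ_META | MOCK_REQ_SWAP)) != 0;
}
```

A bio with `MOCK_REQ_META` set returns true, and a plain data bio returns false. The question in this thread is what blk-throttle should do in the first case.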
Hi, Tejun
This patch can solve the priority inversion problem. However, I just
came up with a new idea:
For meta IO, just issue the IO directly, like iocost does, and then try to
pay the debt later. Fortunately, we already have 'carryover_bytes/ios', which
does this for the case where the limit changes, so it will be easy
to do the same for meta IO: just update 'carryover_bytes/ios' and dispatch
directly.
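To make the debt idea concrete, here is a toy userspace model of one throttled group in one direction. It is only a sketch under simplifying assumptions: `toy_tg`, `toy_bytes_allowed()` and `toy_submit()` are hypothetical names, time is counted in whole seconds, and only bytes are tracked (not ios). The one thing it tries to show is that a signed carryover lets meta IO go out immediately while later IO pays for it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a throttled group (assumption, not kernel code). */
struct toy_tg {
	long long bps_limit;       /* bytes allowed per second */
	long long bytes_disp;      /* bytes charged in the current slice */
	long long carryover_bytes; /* signed credit; negative means debt */
};

/* Budget available once elapsed_sec seconds of the slice have passed. */
static long long toy_bytes_allowed(const struct toy_tg *tg, long long elapsed_sec)
{
	return tg->bps_limit * elapsed_sec + tg->carryover_bytes;
}

/* Returns true if the IO is dispatched now, false if it would be queued. */
static bool toy_submit(struct toy_tg *tg, long long size, bool is_meta,
		       long long elapsed_sec)
{
	if (tg->bytes_disp + size <= toy_bytes_allowed(tg, elapsed_sec)) {
		/* within limits: charge and dispatch directly */
		tg->bytes_disp += size;
		return true;
	}
	if (is_meta) {
		/* over limit, but meta IO: dispatch anyway and record debt */
		tg->carryover_bytes -= size;
		return true;
	}
	/* over limit, ordinary IO: caller would queue it */
	return false;
}
```

With a limit of 100 bytes per second, a 50-byte meta IO submitted after the first second's budget is used up still goes out, but the budget at the two-second mark is 150 bytes instead of 200, so the group as a whole stays within its limit over time.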
BTW, this is another reason that we should add a new module in iocost to
replace blk-throtl.
Thanks,
Kuai
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index dc6140fa3de0..38ffe0f95682 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1595,6 +1595,32 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
spin_unlock_irq(&q->queue_lock);
}
+static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
+{
+ struct throtl_service_queue *sq = &tg->service_queue;
+
+ /* throtl is FIFO - if bios are already queued, should queue */
+ if (sq->nr_queued[rw])
+ return false;
+
+ /* within limits, let's charge and dispatch directly */
+ if (!tg_may_dispatch(tg, bio, NULL))
+ return false;
+
+ return true;
+}
+
+static void throtl_dispatch_bio_in_debt(struct throtl_grp *tg, struct bio *bio,
+ bool rw)
+{
+ unsigned int bio_size = throtl_bio_data_size(bio);
+
+ if (!bio_flagged(bio, BIO_BPS_THROTTLED))
+ tg->carryover_bytes[rw] -= bio_size;
+
+ tg->carryover_ios[rw]--;
+}
+
bool __blk_throtl_bio(struct bio *bio)
{
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
@@ -1611,34 +1637,28 @@ bool __blk_throtl_bio(struct bio *bio)
sq = &tg->service_queue;
while (true) {
- if (tg->last_low_overflow_time[rw] == 0)
- tg->last_low_overflow_time[rw] = jiffies;
- /* throtl is FIFO - if bios are already queued, should queue */
- if (sq->nr_queued[rw])
- break;
-
- /* if above limits, break to queue */
- if (!tg_may_dispatch(tg, bio, NULL)) {
- tg->last_low_overflow_time[rw] = jiffies;
+ if (tg_within_limit(tg, bio, rw)) {
+ /* within limits, let's charge and dispatch directly */
+ throtl_charge_bio(tg, bio);
+
+ /*
+ * We need to trim slice even when bios are not being queued
+ * otherwise it might happen that a bio is not queued for
+ * a long time and slice keeps on extending and trim is not
+ * called for a long time. Now if limits are reduced suddenly
+ * we take into account all the IO dispatched so far at new
+ * low rate and * newly queued IO gets a really long dispatch
+ * time.
+ *
+ * So keep on trimming slice even if bio is not queued.
+ */
+ throtl_trim_slice(tg, rw);
+ } else if (bio_issue_as_root_blkg(bio)) {
+ throtl_dispatch_bio_in_debt(tg, bio, rw);
+ } else {
break;
}
- /* within limits, let's charge and dispatch directly */
- throtl_charge_bio(tg, bio);
-
- /*
- * We need to trim slice even when bios are not being queued
- * otherwise it might happen that a bio is not queued for
- * a long time and slice keeps on extending and trim is not
- * called for a long time. Now if limits are reduced suddenly
- * we take into account all the IO dispatched so far at new
- * low rate and * newly queued IO gets a really long dispatch
- * time.
- *
- * So keep on trimming slice even if bio is not queued.
- */
- throtl_trim_slice(tg, rw);
-
/*
* @bio passed through this layer without being throttled.
* Climb up the ladder. If we're already at the top, it
>
> Thanks,
> Yi.
>