linux-kernel - Re: jbd2: io throttle for metadata buffers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <97fc38e6-a226-5e22-efc2-4405beb6d75b@huaweicloud.com>
Date: Mon, 19 Aug 2024 21:38:06 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Zhang Yi <yi.zhang@...weicloud.com>, Haifeng Xu <haifeng.xu@...pee.com>
Cc: tytso@....edu, jack@...e.com, linux-ext4@...r.kernel.org,
 linux-kernel@...r.kernel.org, hanjinke.666@...edance.com,
 Tejun Heo <tj@...nel.org>, linux-block <linux-block@...r.kernel.org>,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: jbd2: io throttle for metadata buffers

+CC Tejun
+CC block

在 2024/08/19 20:49, Zhang Yi 写道:
> Hello, Haifeng.
> 
> On 2024/8/19 18:19, Haifeng Xu wrote:
>> Hi, matsers!
>>
>>
>> We encountered high load issuses in our production environment recently. And the kernel version is stable-5.15.39
>> the filesystem is ext4(ordered).
>>
>>
>> After digging into it, we found the problem is due to io.max
>>
>>
>> thread 1:
>>
>> PID: 189529  TASK: ffff92ab51e5c080  CPU: 34  COMMAND: "mc"
>>   #0 [ffffa638db807800] __schedule at ffffffff83b19898
>>   #1 [ffffa638db807888] schedule at ffffffff83b19e9e
>>   #2 [ffffa638db8078a8] io_schedule at ffffffff83b1a316
>>   #3 [ffffa638db8078c0] bit_wait_io at ffffffff83b1a751
>>   #4 [ffffa638db8078d8] __wait_on_bit at ffffffff83b1a373
>>   #5 [ffffa638db807918] out_of_line_wait_on_bit at ffffffff83b1a46d
>>   #6 [ffffa638db807970] __wait_on_buffer at ffffffff831b9c64
>>   #7 [ffffa638db807988] jbd2_log_do_checkpoint at ffffffff832b556e
>>   #8 [ffffa638db8079e8] __jbd2_log_wait_for_space at ffffffff832b55dc
>>   #9 [ffffa638db807a30] add_transaction_credits at ffffffff832af369
>> #10 [ffffa638db807a98] start_this_handle at ffffffff832af50f
>> #11 [ffffa638db807b20] jbd2__journal_start at ffffffff832afe1f
>> #12 [ffffa638db807b60] __ext4_journal_start_sb at ffffffff83241af3
>> #13 [ffffa638db807ba8] __ext4_new_inode at ffffffff83253be6
>> #14 [ffffa638db807c80] ext4_mkdir at ffffffff8327ec9e
>> #15 [ffffa638db807d10] vfs_mkdir at ffffffff83182a92
>> #16 [ffffa638db807d50] ovl_mkdir_real at ffffffffc0965c9f [overlay]
>> #17 [ffffa638db807d80] ovl_create_real at ffffffffc0965e8b [overlay]
>> #18 [ffffa638db807db8] ovl_create_or_link at ffffffffc09677cc [overlay]
>> #19 [ffffa638db807e10] ovl_create_object at ffffffffc0967a48 [overlay]
>> #20 [ffffa638db807e60] ovl_mkdir at ffffffffc0967ad3 [overlay]
>> #21 [ffffa638db807e70] vfs_mkdir at ffffffff83182a92
>> #22 [ffffa638db807eb0] do_mkdirat at ffffffff83184305
>> #23 [ffffa638db807f08] __x64_sys_mkdirat at ffffffff831843df
>> #24 [ffffa638db807f28] do_syscall_64 at ffffffff83b0bf1c
>> #25 [ffffa638db807f50] entry_SYSCALL_64_after_hwframe at ffffffff83c0007c
>>
>> other threads:
>>
>>
>> PID: 21125  TASK: ffff929f5b9a0000  CPU: 44  COMMAND: "task_server"
>>   #0 [ffffa638aff9b900] __schedule at ffffffff83b19898
>>   #1 [ffffa638aff9b988] schedule at ffffffff83b19e9e
>>   #2 [ffffa638aff9b9a8] schedule_preempt_disabled at ffffffff83b1a24e
>>   #3 [ffffa638aff9b9b8] __mutex_lock at ffffffff83b1af28
>>   #4 [ffffa638aff9ba38] __mutex_lock_slowpath at ffffffff83b1b1a3
>>   #5 [ffffa638aff9ba48] mutex_lock at ffffffff83b1b1e2
>>   #6 [ffffa638aff9ba60] mutex_lock_io at ffffffff83b1b210
>>   #7 [ffffa638aff9ba80] __jbd2_log_wait_for_space at ffffffff832b563b
>>   #8 [ffffa638aff9bac8] add_transaction_credits at ffffffff832af369
>>   #9 [ffffa638aff9bb30] start_this_handle at ffffffff832af50f
>> #10 [ffffa638aff9bbb8] jbd2__journal_start at ffffffff832afe1f
>> #11 [ffffa638aff9bbf8] __ext4_journal_start_sb at ffffffff83241af3
>> #12 [ffffa638aff9bc40] ext4_dirty_inode at ffffffff83266d0a
>> #13 [ffffa638aff9bc60] __mark_inode_dirty at ffffffff831ab423
>> #14 [ffffa638aff9bca0] generic_update_time at ffffffff8319169d
>> #15 [ffffa638aff9bcb0] inode_update_time at ffffffff831916e5
>> #16 [ffffa638aff9bcc0] file_update_time at ffffffff83191b01
>> #17 [ffffa638aff9bd08] file_modified at ffffffff83191d47
>> #18 [ffffa638aff9bd20] ext4_write_checks at ffffffff8324e6e4
>> #19 [ffffa638aff9bd40] ext4_buffered_write_iter at ffffffff8324edfb
>> #20 [ffffa638aff9bd78] ext4_file_write_iter at ffffffff8324f553
>> #21 [ffffa638aff9bdf8] ext4_file_write_iter at ffffffff8324f505
>> #22 [ffffa638aff9be00] new_sync_write at ffffffff8316dfca
>> #23 [ffffa638aff9be90] vfs_write at ffffffff8316e975
>> #24 [ffffa638aff9bec8] ksys_write at ffffffff83170a97
>> #25 [ffffa638aff9bf08] __x64_sys_write at ffffffff83170b2a
>> #26 [ffffa638aff9bf18] do_syscall_64 at ffffffff83b0bf1c
>> #27 [ffffa638aff9bf38] asm_common_interrupt at ffffffff83c00cc8
>> #28 [ffffa638aff9bf50] entry_SYSCALL_64_after_hwframe at ffffffff83c0007c
>>
>>
>> The cgroup of thread1 has set io.max, so the j_checkpoint_mutex can't be released and many threads must wait for it.
>> I have some questions about the throttle for the metadata buffers.
>>
>> 1) writeback
>>
>> jbd2 converts the buffer head from jbddirty to buffer_dirty and trigger the write back in __jbd2_journal_temp_unlink_buffer().
>> By default, the blkcg in bdi_writeback attached to block device inode is blkcg_root which has no io throttle rules. But there may be other
>> threads which invoke sync_filesystem, such as umount overlayfs. This operation will write out all dirty data associated with the block
>> device. In this case, the bdi_writeback attached to block device inode may changed due to Boyer-Moore majority vote algorithm.
>> And the blkcg in bdi_writeback attached to block device inode is the group where the thread allocate the buffer head and dev page.
>>
>> So the writeback process of metadata buffers can also be throttled, right?
>>
>>
>> 2) checkpoint
>>
>> If the free log space is not suffcient, we will do checkpoint to update log tail. During the process, if the buffer head hasn't been
>> written out by wirteback. we will lock the buffer head and submit bio in current context.
>>
>> So the throttle rules may be different from writeback?
>>
>>
>> 3）j_checkpoint_mutex
>> If we can't make any progress in checkpoint due to io throttle, the j_checkpoint_mutex can'be release and block many others threads.
>>
>> So can we cancel the throttle rules for metadata buffers and keep it in blkcg_root?
>>
> 
> It seems that iocost have already act as blkcg_root if bios have
> REQ_META set(ext4's metadata bh should've set this flag), but
> blk-thottle doesn't, Jinke had submitted a patch to improve this
> case, maybe it could help, please take a look at this patch. Or
> maybe we could add some similar logic in blk-throttle like iocost
> does for REQ_META.
> 
> https://lore.kernel.org/linux-block/20230228085935.71465-1-hanjinke.666@bytedance.com/

Hi, Tejun

This patch can solve the priority inversion problem, however, I just
come up with a new idea:

For meta IO, just issue the IO directly like iocost, and then try to
pay debt. Fortunately, we already have 'carryover_bytes/ios' that
already do this for the case that limit changes, and it'll be easy
to do this for meta IO, just update 'carryover_bytes/ios' and dispatch
directly.

BTW, this is another reason that we should add a new module in iocost to
replace blk-throtl.

Thanks,
Kuai

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index dc6140fa3de0..38ffe0f95682 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1595,6 +1595,32 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
         spin_unlock_irq(&q->queue_lock);
  }

+static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, 
bool rw)
+{
+       struct throtl_service_queue *sq = &tg->service_queue;
+
+       /* throtl is FIFO - if bios are already queued, should queue */
+       if (sq->nr_queued[rw])
+               return false;
+
+       /* within limits, let's charge and dispatch directly */
+       if (!tg_may_dispatch(tg, bio, NULL))
+               return false;
+
+       return true;
+}
+
+static void throtl_dispatch_bio_in_debt(struct throtl_grp *tg, struct 
bio *bio,
+                                       bool rw)
+{
+       unsigned int bio_size = throtl_bio_data_size(bio);
+
+       if (!bio_flagged(bio, BIO_BPS_THROTTLED))
+               tg->carryover_bytes[rw] -= bio_size;
+
+       tg->carryover_ios[rw]--;
+}
+
  bool __blk_throtl_bio(struct bio *bio)
  {
         struct request_queue *q = bdev_get_queue(bio->bi_bdev);
@@ -1611,34 +1637,28 @@ bool __blk_throtl_bio(struct bio *bio)
         sq = &tg->service_queue;

         while (true) {
-               if (tg->last_low_overflow_time[rw] == 0)
-                       tg->last_low_overflow_time[rw] = jiffies;
-               /* throtl is FIFO - if bios are already queued, should 
queue */
-               if (sq->nr_queued[rw])
-                       break;
-
-               /* if above limits, break to queue */
-               if (!tg_may_dispatch(tg, bio, NULL)) {
-                       tg->last_low_overflow_time[rw] = jiffies;
+               if (tg_within_limit(tg, bio, rw)) {
+                       /* within limits, let's charge and dispatch 
directly */
+                       throtl_charge_bio(tg, bio);
+
+                       /*
+                        * We need to trim slice even when bios are not 
being queued
+                        * otherwise it might happen that a bio is not 
queued for
+                        * a long time and slice keeps on extending and 
trim is not
+                        * called for a long time. Now if limits are 
reduced suddenly
+                        * we take into account all the IO dispatched so 
far at new
+                        * low rate and * newly queued IO gets a really 
long dispatch
+                        * time.
+                        *
+                        * So keep on trimming slice even if bio is not 
queued.
+                        */
+                       throtl_trim_slice(tg, rw);
+               } else if (bio_issue_as_root_blkg(bio)) {
+                       throtl_dispatch_bio_in_debt(tg, bio, rw);
+               } else {
                         break;
                 }

-               /* within limits, let's charge and dispatch directly */
-               throtl_charge_bio(tg, bio);
-
-               /*
-                * We need to trim slice even when bios are not being queued
-                * otherwise it might happen that a bio is not queued for
-                * a long time and slice keeps on extending and trim is not
-                * called for a long time. Now if limits are reduced 
suddenly
-                * we take into account all the IO dispatched so far at new
-                * low rate and * newly queued IO gets a really long 
dispatch
-                * time.
-                *
-                * So keep on trimming slice even if bio is not queued.
-                */
-               throtl_trim_slice(tg, rw);
-
                 /*
                  * @bio passed through this layer without being throttled.
                  * Climb up the ladder.  If we're already at the top, it

> 
> Thanks,
> Yi.
> 
> .
>