linux-kernel - Re: [External] Re: [PATCH v5] blk-throtl: Introduce sync and async queues for blk-throtl

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <7a348aec-0e5c-ec6a-36cd-30a844d276ad@bytedance.com>
Date:   Thu, 12 Jan 2023 16:42:00 +0800
From:   hanjinke <hanjinke.666@...edance.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     josef@...icpanda.com, axboe@...nel.dk, cgroups@...r.kernel.org,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        yinxin.x@...edance.com
Subject: Re: [External] Re: [PATCH v5] blk-throtl: Introduce sync and async
 queues for blk-throtl



在 2023/1/12 上午1:05, Tejun Heo 写道:
> On Thu, Jan 12, 2023 at 12:20:30AM +0800, Jinke Han wrote:
>> From: Jinke Han <hanjinke.666@...edance.com>
>>
>> Now we don't distinguish sync write ios from normal buffer write ios
>> in blk-throtl. A bio with REQ_SYNC tagged always mean it will be wait
>> until write completion soon after it submit. So it's reasonable for sync
>> io to complete as soon as possible.
>>
>> In our test, fio writes a 100g file in sequential 4k blocksize in
>> a container with low bps limit configured (wbps=10M). More than 1200
>> ios were throttled in blk-throtl queue and the avarage throtle time
>> of each io is 140s. At the same time, the operation of saving a small
>> file by vim will be blocked amolst 140s. As a fsync will be send by vim,
>> the sync ios of fsync will be blocked by a huge amount of buffer write
>> ios ahead. This is also a priority inversion problem within one cgroup.
>> In the database scene, things got really bad with blk-throtle enabled
>> as fsync is called very often.
>>
>> This patch splits bio queue into sync and async queues for blk-throtl
>> and gives a huge priority to sync write ios. Sync queue only make sense
>> for write ios as we treat all read io as sync io. I think it's a nice
>> respond to the semantics of REQ_SYNC. Bios with REQ_META and REQ_PRIO
>> gains the same priority as they are important to fs. This may avoid
>> some potential priority inversion problems.
>>
>> With this patch, do the same test above, the duration of the fsync sent
>> by vim drops to several hundreds of milliseconds.
>>
>> Signed-off-by: Jinke Han <hanjinke.666@...edance.com>
>> Signed-off-by: Xin Yin <yinxin.x@...edance.com>
> 
> Acked-by: Tejun Heo <tj@...nel.org>
> 
> with some nits below:
> 
>> +/**
>> + * throtl_qnode_bio_peek - peek a bio from a qn
>> + * @qn: the qnode to peek from
>> + *
>> + * For read, always peek bio from the SYNC queue.
>> + *
>> + * For write, we always peek bio from next_to_disp. If it's NULL, a bio
>                      ^
>                      first
> 
>> + * will be popped from SYNC or ASYNC queue to fill it. The next_to_disp
>> + * is used to make sure that the peeked bio and the next popped bio are
>                                     ^
>                                     previously
> 
>> + * always the same even in case that the spinlock of queue was released
>> + * and re-holded.
>            ^
>            re-grabbed / re-acquired
>> + *
>> + * Without the next_to_disp, consider the following situation:
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^
>        maybe drop this part and move the latter part to the end of the
>        previous para?
> 
>> + * Assumed that there are only bios queued in ASYNC queue and the SYNC
>        ^
>        Assume
> 
>> + * queue is empty and all ASYNC bios are 1M in size and the bps limit is
>> + * 1M/s. The throtl_slice is 100ms. The current slice is [jiffies1,
>> + * jiffies1+100] and the bytes_disp[w] is 0.
>> + *
>> + * The disp_sync_cnt is 0 as it was set 0 after each dispatching of a
>> + * ASYNC bio. A ASYNC bio wil be peeked to check in tg_may_dispatch.
>> + * Obviously, it can't be dispatched in current slice and the wait time
>> + * is 900ms. The slice will be extended to [jiffies1, jiffies1+1000] in
>> + * tg_may_dispatch. The spinlock of the queue will be released after the
>> + * process of dispatch giving up. A 4k size SYNC bio was queued in and
>> + * the SYNC queue becomes no-empty. After 900ms, it's time to dispatch
>> + * the tg, the SYNC bio will be popped to dispatched as the disp_sync_cnt
>> + * is 0 and the SYNC queue is no-empty. The slice will be extended to
>        ^
>   Maybe combine the previous several sentences like:
> 
>   The queue lock is released and a 4k SYNC bio gets queued during the 900ms
>   wait.
> 
>> + * [jiffies1, jiffies1+1100] in tg_may_dispatch. Then the slice will be
>> + * trimed to [jiffies1+1000, jiffies1+1100] after the SYNC bio was
>> + * dispatched. Then the former 1M size ASYNC bio will be peeked to be
>> + * checked and still can't be dispatched because of overlimit within
>> + * the current slice. The same thing may happen DISPACH_SYNC_FACTOR times
>> + * if always there is a SYNC bio be queued in the SYNC queue when the
>> + * ASYNC bio is waiting. This means that in nearly 5s, we have dispathed
>> + * four 4k SYNC bios and one  1M ASYNC bio. It is hard to fill up the
>> + * bandwidth considering that the bps limit is 1M/s.
> 
> Simiarly I think the information can be conveyed in a more compact form.
> 
> Thanks.
> 
The comment will be further adjusted based on your suggestions and the
v6 with your Acked-by will be send.

Thanks
Jinke