Message-ID: <611f02a8-8430-16cf-46e5-e9417982b077@huaweicloud.com>
Date: Mon, 24 Feb 2025 20:03:32 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Ming Lei <ming.lei@...hat.com>, Yu Kuai <yukuai1@...weicloud.com>
Cc: tj@...nel.org, josef@...icpanda.com, axboe@...nel.dk,
 cgroups@...r.kernel.org, linux-block@...r.kernel.org,
 linux-kernel@...r.kernel.org, yi.zhang@...wei.com, yangerkun@...wei.com,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time

Hi,

On 2025/02/24 16:56, Ming Lei wrote:
> On Mon, Feb 24, 2025 at 03:03:18PM +0800, Yu Kuai wrote:
>> Hi, Ming!
>>
>> On 2025/02/24 11:28, Ming Lei wrote:
>>> throtl_trim_slice() returns immediately if throtl_slice_used()
>>> is true.
>>>
>>> And throtl_slice_used() checks jiffies in [start, end] via time_in_range(),
>>> so if `start <= jiffies <= end', it still returns false.
>>
>> Yes, I misread the code, thinking throtl_slice_used() would return
>> true if the slice is still in use. :(
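
(For reference, the helper in question looks roughly like the following,
paraphrased from block/blk-throttle.c, so the exact shape may differ:)

static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
{
	/* the slice is still in use while jiffies is within [start, end] */
	if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
		return false;

	return true;
}
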
>>
>>
>>>> BTW, throtl_trim_slice() looks problematic:
>>>>
>>>> -       if (bytes_trim <= 0 && io_trim <= 0)
>>>> +       if (bytes_trim <= 0 || io_trim <= 0 ||
>>>> +           tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
>>>>                   return;
>>> That is exactly what my patch is doing, just taking deviation and
>>> timeout into account; also the U64_MAX limit has to be excluded.
>> Yes, perhaps you can add some comments on the last two conditions of
>> your patch.
> 
> Yes, we need to add a comment on the check; how about the following wording?
> 
> ```
> 
> If the actual rate doesn't match the expected rate, do not trim the slice;
> otherwise the present rate control info is lost and we have no chance
> to compensate for it in the following period of this slice any more.

So, I just gave your patch a test, and the result is 1.3s while 1s is
expected. While debugging, a new idea came to mind. :)

How about keeping at least one slice out of consideration in
throtl_trim_slice()? With the following patch, the result is between
1.01-1.03s in my VM.

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8d149aff9fd0..5207c85098a5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -604,9 +604,12 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)

         time_elapsed = rounddown(jiffies - tg->slice_start[rw],
                                  tg->td->throtl_slice);
-       if (!time_elapsed)
+       /* don't trim the slice until at least 2 slices are used */
+       if (time_elapsed < tg->td->throtl_slice * 2)
                 return;

+       /* exclude the last slice, only trim the previous slices */
+       time_elapsed -= tg->td->throtl_slice;
         bytes_trim = calculate_bytes_allowed(tg_bps_limit(tg, rw),
                                              time_elapsed) +
                      tg->carryover_bytes[rw];
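
To make the intent concrete with made-up numbers (assuming throtl_slice is
100 jiffies): if jiffies - tg->slice_start[rw] is 530, rounddown() gives
time_elapsed = 500, so the current code trims budget for the full 500
jiffies including the slice that is still in progress, while with the
change above only 400 jiffies are trimmed and the most recent slice keeps
its accounting.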

> 
> Add one deviation threshold since the bio size is usually not divisible by
> the present rate, and bio dispatch is all-or-nothing.
> 
> Also limit the max slice size to avoid extending the window too much.
> 
> ```
> 
> 
>> I think maybe you're concerned about the case where IO is
>> throttled by IOs and bytes_disp should really erase the extra bytes that
>> do not reach bps_limit.
>>
>>>
>>> If you test the patch, it works just fine. My patch limits bytes_exp
>>> to <= 1.5 * disp, so throtl/001 can be completed in 1.5 sec, and if it is
>>> changed to 1.25 * disp, the test can be completed in 1.25 sec.
>>>
>>> With this fix, why do we have to play the complicated carryover
>>> trick?
>>>
>>
>> I understand your code now. carryover_bytes in this case is wrong. As
>> long as a new slice is not started, and trim slice doesn't erase extra
>> bytes by mistake, the throttle time should be accurate over time because
>> bytes_disp is accurate.
> 
> Yes.
> 
> Waiting a bit more or less when handling a single bio only affects the
> instantaneous rate, but the algorithm is adaptive, so it can adjust the
> following waits if the slice isn't over.
> 
>>
>> And the root cause really is throtl_trim_slice().
>>
>> However, from code review, I think throtl_start_new_slice() should be
>> handled as well, the same as throtl_trim_slice(): if the old bio is
>> dispatched with one extra jiffy of wait time, issuing a new bio in the
>> same jiffy will have the same problem. For example:
> 
> throtl_start_new_slice() is called when nothing is queued and the current
> slice is used, so probably it is fine.
> 
> throtl_start_new_slice_with_credit() is called when dispatching one
> bio, and it is still called when the current slice is used.
> 
>>
>> The caller does sync IO with an IO size of (bps_limit / (HZ / throtl_slice) + 1),
> 
> This will cause a wait for every single IO.
> 
> But once the IO is throttled because of the wait, throtl_start_new_slice()
> won't succeed any more; meanwhile throtl_trim_slice() will try to fix the
> rate control for the extra 1 jiffy.
> 
> So the in-tree code with the trim fix works just fine if consecutive IOs
> are coming.
> 
>> and the caller will issue a new IO when the old IO is done. Then in this
>> case, each IO will start a new slice and wait for throtl_slice + 1 jiffies.
>> I believe this can be fixed as well so that BIOs after the first one will
>> only wait for throtl_slice jiffies.
> 
> I guess the above case only exists in theory, when you submit the new IO
> immediately after the old IO is dispatched. Not sure this case is really
> meaningful because the submission period/latency is added/faked by the user.

Yes, this case is just theoretical; we don't need to care about it for now. :)
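
(Just for completeness, the numbers behind the example: assuming HZ=1000,
throtl_slice=100 jiffies and bps_limit=1MB/s, each slice allows about 100KB,
so a sync IO of 100KB plus one extra byte always has to wait throtl_slice + 1
jiffies, and because each such IO starts a new slice, the extra jiffy is
never amortized over later IOs.)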

Thanks,
Kuai

> 
> 
> Thanks,
> Ming
> 
> 
> .
> 

