[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d98316fa-3022-c7ca-d985-da10265592b3@huaweicloud.com>
Date: Wed, 3 Jan 2024 14:11:47 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Ming Lei <ming.lei@...hat.com>, Yu Kuai <yukuai1@...weicloud.com>
Cc: bvanassche@....org, hch@....de, axboe@...nel.dk,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next RFC] block: support to account io_ticks precisely
Hi, Ming!
在 2024/01/03 12:02, Ming Lei 写道:
> On Tue, Dec 05, 2023 at 05:37:43PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@...wei.com>
>>
>> Currently, io_ticks is accounted based on sampling, specifically
>> update_io_ticks() will always account io_ticks by 1 jiffies from
>> bdev_start_io_acct()/blk_account_io_start(), and the result can be
>> inaccurate, for example(HZ is 250):
>>
>> Test script:
>> fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
>>
>> Test result: util is about 90%, while the disk is really idle.
>>
>> In order to account io_ticks precisely, update_io_ticks() must know if
>> there are IO inflight already, and this requires overhead slightly,
>> hence precise io accounting is disabled by default, and user can enable
>> it through sysfs entry.
>
> Yeah, the trouble is from commit 5b18b5a73760 ("block: delete part_round_stats and
> switch to less precise counting"), and real reason is that IO inflight
> info is too expensive to maintain in fast path, and RH have got several customer
> complaint in this area too.
>
>>
>> Noted that for rq-based devcie, part_stat_local_inc/dec() and
>> part_in_flight() is used to track inflight instead of iterating tags,
>> which is not supposed to be used in fast path because 'tags->lock' is
>> grabbed in blk_mq_find_and_get_req().
>
> You can iterate over static requests via BT_TAG_ITER_STATIC_RQS, then
> tags->lock can be bypassed, but new helper is needed.
>
> But given it is only run once for each tick, I guess percpu counting
> might be fine too even in case of big machine.
>
>>
>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>> ---
>> Documentation/ABI/stable/sysfs-block | 8 ++++--
>> block/blk-core.c | 17 ++++++++----
>> block/blk-mq.c | 18 ++++++++++---
>> block/blk-sysfs.c | 40 ++++++++++++++++++++++++++--
>> block/blk.h | 4 ++-
>> block/genhd.c | 6 ++---
>> include/linux/blk-mq.h | 1 +
>> include/linux/blkdev.h | 3 +++
>> 8 files changed, 80 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index 1fe9a553c37b..e5fedecf7bdf 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -358,8 +358,12 @@ What: /sys/block/<disk>/queue/iostats
>> Date: January 2009
>> Contact: linux-block@...r.kernel.org
>> Description:
>> - [RW] This file is used to control (on/off) the iostats
>> - accounting of the disk.
>> + [RW] This file is used to control the iostats accounting of the
>> + disk. If this value is 0, iostats accounting is disabled; If
>> + this value is 1, iostats accounting is enabled, but io_ticks is
>> + accounted by sampling and the result is not accurate; If this
>> + value is 2, iostats accounting is enabled and io_ticks is
>> + accounted precisely, but there will be slightly overhead.
>
> IMO, this approach looks fine.
>
>>
>>
>> What: /sys/block/<disk>/queue/logical_block_size
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index fdf25b8d6e78..405883d606cd 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -935,14 +935,20 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
>> }
>> EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>>
>> -void update_io_ticks(struct block_device *part, unsigned long now, bool end)
>> +void update_io_ticks(struct block_device *part, unsigned long now, bool end,
>> + bool precise)
>> {
>> unsigned long stamp;
>> again:
>> stamp = READ_ONCE(part->bd_stamp);
>> - if (unlikely(time_after(now, stamp))) {
>> - if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
>> + if (unlikely(time_after(now, stamp)) &&
>> + likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
>> + if (precise) {
>> + if (end || part_in_flight(part))
>> + __part_stat_add(part, io_ticks, now - stamp);
>
> Strictly speaking, `end` isn't need any more, but it can be thought
> as one optimization, given part_in_flight() is supposed to be non-zero
> in case of account_done.
>
>> + } else {
>> __part_stat_add(part, io_ticks, end ? now - stamp : 1);
>> + }
>> }
>> if (part->bd_partno) {
>> part = bdev_whole(part);
>> @@ -954,7 +960,8 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
>> unsigned long start_time)
>> {
>> part_stat_lock();
>> - update_io_ticks(bdev, start_time, false);
>> + update_io_ticks(bdev, start_time, false,
>> + blk_queue_precise_io_stat(bdev->bd_queue));
>
> blk_queue_precise_io_stat() can be moved into update_io_ticks()
> directly, and it should be fine given it is just done once in each
> tick.
Thanks for reviewing this patch! I'll update your suggestion in v2.
Kuai
>
> Thanks,
> Ming
>
> .
>
Powered by blists - more mailing lists