[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1960aea3-3daa-d899-7daf-39dbcc1b7a9d@huaweicloud.com>
Date: Tue, 16 Jan 2024 17:59:42 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Yu Kuai <yukuai1@...weicloud.com>, Ming Lei <ming.lei@...hat.com>
Cc: hch@....de, bvanassche@....org, axboe@...nel.dk,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH for-6.8/block] block: support to account io_ticks
precisely
Hi, Ming
在 2024/01/15 19:54, Yu Kuai 写道:
> Hi,
>
> 在 2024/01/15 19:38, Ming Lei 写道:
>> On Tue, Jan 09, 2024 at 03:13:32PM +0800, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@...wei.com>
>>>
>>> Currently, io_ticks is accounted based on sampling, specifically
>>> update_io_ticks() will always account io_ticks by 1 jiffies from
>>> bdev_start_io_acct()/blk_account_io_start(), and the result can be
>>> inaccurate, for example(HZ is 250):
>>>
>>> Test script:
>>> fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test
>>> -thinktime=4ms
>>>
>>> Test result: util is about 90%, while the disk is really idle.
>>
>> Just be curious, what is result with this patch? 0%?
I ran some tests on null_blk and found out that this patch actually
doesn't have any performance overhead, and I finially figure out that
it's true even in theroy.
Notice that there is a cmpxchg() in update_ticks():
update_io_ticks:
stamp = part->bd_stamp;
if (time_after(now, stamp))
if (try_cmpxchg())
__part_stat_add()
Hence only one task can pass cmpxchg in 1 jiffies, and part_stat_add()
be called at most once in 1 jiffies. Which means with this patch,
part_in_flight() will also be called at most once in 1 jiffies(not per
IO). And part_in_flight() once per jiffies really doesn't affect IO
performance at all.
Befor this cmpxchg():
part_round_stats:
if (part->stamp != now)
stats |= 1;
part_in_flight()
-> there can be lots of task here in 1 jiffies.
part_round_stats_single()
__part_stat_add()
part->stamp = now;
By the way, this cmpxchg() is added by commit 5b18b5a73760 ("block:
delete part_round_stats and switch to less precise counting"), hence
actually there is no need to switch to less precise counting in the
first place.
So I think I can just remove the switch and switch to precise io
accounting by default in the next version.
Please let me know what you think!
Thanks,
Kuai
>
> No, it's not 0%, this actually depends on how many IO really start from
> one jiffies and complete at the next jiffies. Given that the probability
> is related to IO latency, so the result should be relatively
> accurate(Around 10% in my environment). I think we can live with that
> unless we improve time precision from jiffies to ns.
>>
>>>
>>> In order to account io_ticks precisely, update_io_ticks() must know if
>>> there are IO inflight already, and this requires overhead slightly,
>>> hence precise io accounting is disabled by default, and user can enable
>>> it through sysfs entry.
>>>
>>> Noted that for rq-based devcie, part_stat_local_inc/dec() and
>>> part_in_flight() is used to track inflight instead of iterating tags,
>>> which is not supposed to be used in fast path because 'tags->lock' is
>>> grabbed in blk_mq_find_and_get_req().
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>>> ---
>>> Changes from RFC v1:
>>> - remove the new parameter for update_io_ticks();
>>> - simplify update_io_ticks();
>>> - use swith in queue_iostats_store();
>>> - add missing part_stat_local_dec() in blk_account_io_merge_request();
>>> Changes from RFC v2:
>>> - fix that precise is ignored for the first io in update_io_ticks();
>>>
>>> Documentation/ABI/stable/sysfs-block | 8 ++++--
>>> block/blk-core.c | 10 +++++--
>>> block/blk-merge.c | 3 ++
>>> block/blk-mq-debugfs.c | 2 ++
>>> block/blk-mq.c | 11 +++++++-
>>> block/blk-sysfs.c | 42 ++++++++++++++++++++++++++--
>>> block/blk.h | 1 +
>>> block/genhd.c | 2 +-
>>> include/linux/blk-mq.h | 1 +
>>> include/linux/blkdev.h | 3 ++
>>> 10 files changed, 74 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/ABI/stable/sysfs-block
>>> b/Documentation/ABI/stable/sysfs-block
>>> index 1fe9a553c37b..79027bf2661a 100644
>>> --- a/Documentation/ABI/stable/sysfs-block
>>> +++ b/Documentation/ABI/stable/sysfs-block
>>> @@ -358,8 +358,12 @@ What: /sys/block/<disk>/queue/iostats
>>> Date: January 2009
>>> Contact: linux-block@...r.kernel.org
>>> Description:
>>> - [RW] This file is used to control (on/off) the iostats
>>> - accounting of the disk.
>>> + [RW] This file is used to control the iostats accounting of the
>>> + disk. If this value is 0, iostats accounting is disabled; If
>>> + this value is 1, iostats accounting is enabled, but io_ticks is
>>> + accounted by sampling and the result is not accurate; If this
>>> + value is 2, iostats accounting is enabled and io_ticks is
>>> + accounted precisely, but there will be slightly more overhead.
>>> What: /sys/block/<disk>/queue/logical_block_size
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 9520ccab3050..c70dc311e3b7 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -954,11 +954,15 @@ EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>>> void update_io_ticks(struct block_device *part, unsigned long now,
>>> bool end)
>>> {
>>> unsigned long stamp;
>>> + bool precise = blk_queue_precise_io_stat(part->bd_queue);
>>> again:
>>> stamp = READ_ONCE(part->bd_stamp);
>>> - if (unlikely(time_after(now, stamp))) {
>>> - if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
>>> - __part_stat_add(part, io_ticks, end ? now - stamp : 1);
>>> + if (unlikely(time_after(now, stamp)) &&
>>> + likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
>>> + if (end || (precise && part_in_flight(part)))
>>> + __part_stat_add(part, io_ticks, now - stamp);
>>> + else if (!precise)
>>> + __part_stat_add(part, io_ticks, 1);
>>
>> It should be better or readable to move 'bool precise' into the above
>> branch,
>> given we only need to read the flag once in each tick.
>>
>> Otherwise, this patch looks fine.
>
> Thanks for your advice, will change that in next version.
>
> Kuai
>>
>> Thanks,
>> Ming
>>
>> .
>>
>
> .
>
Powered by blists - more mailing lists