Message-ID: <98e42b31-1a67-0144-19d3-a4b66c5308fd@huaweicloud.com>
Date: Mon, 19 Feb 2024 17:03:52 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Yu Kuai <yukuai1@...weicloud.com>, Ming Lei <ming.lei@...hat.com>
Cc: hch@....de, bvanassche@....org, axboe@...nel.dk,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH for-6.8/block] block: support to account io_ticks
precisely
Hi, Ming
On 2024/01/16 17:59, Yu Kuai wrote:
> Hi, Ming
>
> On 2024/01/15 19:54, Yu Kuai wrote:
>> Hi,
>>
>> On 2024/01/15 19:38, Ming Lei wrote:
>>> On Tue, Jan 09, 2024 at 03:13:32PM +0800, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@...wei.com>
>>>>
>>>> Currently, io_ticks is accounted based on sampling: update_io_ticks()
>>>> always accounts io_ticks by 1 jiffy from
>>>> bdev_start_io_acct()/blk_account_io_start(), and the result can be
>>>> inaccurate, for example (HZ is 250):
>>>>
>>>> Test script:
>>>> fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test
>>>> -thinktime=4ms
>>>>
>>>> Test result: util is about 90%, while the disk is really idle.
>>>
>>> Just curious, what is the result with this patch? 0%?
>
> I ran some tests on null_blk and found that this patch actually
> doesn't have any performance overhead, and I finally figured out that
> this holds even in theory.
>
> Notice that there is a cmpxchg() in update_io_ticks():
>
> update_io_ticks:
> stamp = part->bd_stamp;
> if (time_after(now, stamp))
> if (try_cmpxchg())
> __part_stat_add()
>
> Hence only one task can pass the cmpxchg() per jiffy, so part_stat_add()
> is called at most once per jiffy. This means that with this patch,
> part_in_flight() will also be called at most once per jiffy (not per
> IO), and calling part_in_flight() once per jiffy really doesn't affect
> IO performance at all.
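The once-per-jiffy gate described above can be modeled with a small
user-space sketch (plain C with stdatomic, hypothetical names, not the
kernel's actual code): only the task that wins the cmpxchg on the stamp
in a given jiffy accounts io_ticks, so the accounting work runs at most
once per jiffy no matter how many tasks race here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for part->bd_stamp and the io_ticks stat. */
static atomic_ulong bd_stamp;
static unsigned long io_ticks;

/* time_after(a, b): true if a is after b (wrap-around handled as the
 * kernel macro does). */
static bool time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Model of update_io_ticks(): returns true if this caller won the
 * cmpxchg race and accounted the elapsed ticks. */
bool model_update_io_ticks(unsigned long now)
{
	unsigned long stamp = atomic_load(&bd_stamp);

	if (time_after(now, stamp) &&
	    atomic_compare_exchange_strong(&bd_stamp, &stamp, now)) {
		io_ticks += now - stamp;  /* __part_stat_add() stand-in */
		return true;
	}
	return false;
}

unsigned long model_io_ticks(void)
{
	return io_ticks;
}
```

Repeated calls within the same jiffy fail the time_after() check (or the
cmpxchg, under contention), so only the first caller per jiffy does any
stat work.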
>
> Before this cmpxchg() was introduced:
>
> part_round_stats:
> if (part->stamp != now)
> stats |= 1;
>
> part_in_flight()
> -> lots of tasks can reach here within 1 jiffy.
> part_round_stats_single()
> __part_stat_add()
> part->stamp = now;
>
> By the way, this cmpxchg() was added by commit 5b18b5a73760 ("block:
> delete part_round_stats and switch to less precise counting"), hence
> there was actually no need to switch to less precise counting in the
> first place.
>
> So I think I can just drop the sysfs switch and enable precise io
> accounting by default in the next version.
>
> Please let me know what you think!
Are you still interested in this? The overhead of switching to precise
io accounting is:
- a per-cpu add/dec for each IO on rq-based devices;
- a per-cpu sum once per jiffy (1-4 ms).
Hence in theory the overhead is quite small, and I already tested with
null_blk: there is no change under high IO pressure.
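That overhead can be modeled with a small user-space sketch (plain C,
hypothetical names and a fixed CPU count, not the kernel's per-cpu API):
each IO does one per-cpu increment at start and one per-cpu decrement at
completion, and at most once per jiffy the counters are summed across
CPUs.

```c
#define NR_CPUS 4

/* Per-CPU inflight counters, one slot per CPU. */
static long in_flight[NR_CPUS];

/* part_stat_local_inc() stand-in: per-IO cost is one local add. */
void model_io_start(int cpu)
{
	in_flight[cpu]++;
}

/* part_stat_local_dec() stand-in: per-IO cost is one local sub. */
void model_io_done(int cpu)
{
	in_flight[cpu]--;
}

/* part_in_flight() stand-in: the only cross-CPU work, done at most once
 * per jiffy. An IO may complete on a different CPU than it started on,
 * so individual counters can go negative while the sum stays correct;
 * like the kernel helper, clamp the total at zero. */
long model_in_flight(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += in_flight[cpu];
	return sum < 0 ? 0 : sum;
}
```

The per-IO cost is a single cache-local add/dec; the O(NR_CPUS) sum is
amortized over a whole jiffy of IOs, which is why the overhead does not
show up even under high IO pressure.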
Thanks,
Kuai
>
> Thanks,
> Kuai
>
>>
>> No, it's not 0%; this actually depends on how many IOs start in one
>> jiffy and complete in the next jiffy. Given that the probability is
>> related to IO latency, the result should be relatively accurate
>> (around 10% in my environment). I think we can live with that unless
>> we improve the time precision from jiffies to ns.
>>>
>>>>
>>>> In order to account io_ticks precisely, update_io_ticks() must know if
>>>> there are IOs inflight already, and this incurs slight overhead, hence
>>>> precise io accounting is disabled by default; users can enable it
>>>> through a sysfs entry.
>>>>
>>>> Note that for rq-based devices, part_stat_local_inc/dec() and
>>>> part_in_flight() are used to track inflight IOs instead of iterating
>>>> tags, which is not supposed to be done in the fast path because
>>>> 'tags->lock' is grabbed in blk_mq_find_and_get_req().
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>>>> ---
>>>> Changes from RFC v1:
>>>> - remove the new parameter for update_io_ticks();
>>>> - simplify update_io_ticks();
>>>> - use switch in queue_iostats_store();
>>>> - add missing part_stat_local_dec() in
>>>> blk_account_io_merge_request();
>>>> Changes from RFC v2:
>>>> - fix that precise is ignored for the first io in update_io_ticks();
>>>>
>>>> Documentation/ABI/stable/sysfs-block | 8 ++++--
>>>> block/blk-core.c | 10 +++++--
>>>> block/blk-merge.c | 3 ++
>>>> block/blk-mq-debugfs.c | 2 ++
>>>> block/blk-mq.c | 11 +++++++-
>>>> block/blk-sysfs.c | 42 ++++++++++++++++++++++++++--
>>>> block/blk.h | 1 +
>>>> block/genhd.c | 2 +-
>>>> include/linux/blk-mq.h | 1 +
>>>> include/linux/blkdev.h | 3 ++
>>>> 10 files changed, 74 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>>>> index 1fe9a553c37b..79027bf2661a 100644
>>>> --- a/Documentation/ABI/stable/sysfs-block
>>>> +++ b/Documentation/ABI/stable/sysfs-block
>>>> @@ -358,8 +358,12 @@ What: /sys/block/<disk>/queue/iostats
>>>> Date: January 2009
>>>> Contact: linux-block@...r.kernel.org
>>>> Description:
>>>> - [RW] This file is used to control (on/off) the iostats
>>>> - accounting of the disk.
>>>> + [RW] This file is used to control the iostats accounting of the
>>>> + disk. If this value is 0, iostats accounting is disabled; If
>>>> + this value is 1, iostats accounting is enabled, but io_ticks is
>>>> + accounted by sampling and the result is not accurate; If this
>>>> + value is 2, iostats accounting is enabled and io_ticks is
>>>> + accounted precisely, but there will be slightly more overhead.
>>>> What: /sys/block/<disk>/queue/logical_block_size
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 9520ccab3050..c70dc311e3b7 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -954,11 +954,15 @@ EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>>>> void update_io_ticks(struct block_device *part, unsigned long now, bool end)
>>>> {
>>>> unsigned long stamp;
>>>> + bool precise = blk_queue_precise_io_stat(part->bd_queue);
>>>> again:
>>>> stamp = READ_ONCE(part->bd_stamp);
>>>> - if (unlikely(time_after(now, stamp))) {
>>>> - if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
>>>> - __part_stat_add(part, io_ticks, end ? now - stamp : 1);
>>>> + if (unlikely(time_after(now, stamp)) &&
>>>> + likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
>>>> + if (end || (precise && part_in_flight(part)))
>>>> + __part_stat_add(part, io_ticks, now - stamp);
>>>> + else if (!precise)
>>>> + __part_stat_add(part, io_ticks, 1);
>>>
>>> It would be better and more readable to move 'bool precise' into the
>>> above branch, given we only need to read the flag once per tick.
>>>
>>> Otherwise, this patch looks fine.
>>
>> Thanks for your advice, will change that in next version.
>>
>> Kuai
>>>
>>> Thanks,
>>> Ming