linux-kernel - Re: [PATCH -next RFC] block: support to account io

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d98316fa-3022-c7ca-d985-da10265592b3@huaweicloud.com>
Date: Wed, 3 Jan 2024 14:11:47 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Ming Lei <ming.lei@...hat.com>, Yu Kuai <yukuai1@...weicloud.com>
Cc: bvanassche@....org, hch@....de, axboe@...nel.dk,
 linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 yi.zhang@...wei.com, yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next RFC] block: support to account io_ticks precisely

Hi, Ming!

在 2024/01/03 12:02, Ming Lei 写道:
> On Tue, Dec 05, 2023 at 05:37:43PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@...wei.com>
>>
>> Currently, io_ticks is accounted based on sampling, specifically
>> update_io_ticks() will always account io_ticks by 1 jiffies from
>> bdev_start_io_acct()/blk_account_io_start(), and the result can be
>> inaccurate, for example(HZ is 250):
>>
>> Test script:
>> fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
>>
>> Test result: util is about 90%, while the disk is really idle.
>>
>> In order to account io_ticks precisely, update_io_ticks() must know if
>> there are IO inflight already, and this requires overhead slightly,
>> hence precise io accounting is disabled by default, and user can enable
>> it through sysfs entry.
> 
> Yeah, the trouble is from commit 5b18b5a73760 ("block: delete part_round_stats and
> switch to less precise counting"), and real reason is that IO inflight
> info is too expensive to maintain in fast path, and RH have got several customer
> complaint in this area too.
> 
>>
>> Noted that for rq-based devcie, part_stat_local_inc/dec() and
>> part_in_flight() is used to track inflight instead of iterating tags,
>> which is not supposed to be used in fast path because 'tags->lock' is
>> grabbed in blk_mq_find_and_get_req().
> 
> You can iterate over static requests via BT_TAG_ITER_STATIC_RQS, then
> tags->lock can be bypassed, but new helper is needed.
> 
> But given it is only run once for each tick, I guess percpu counting
> might be fine too even in case of big machine.
> 
>>
>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>> ---
>>   Documentation/ABI/stable/sysfs-block |  8 ++++--
>>   block/blk-core.c                     | 17 ++++++++----
>>   block/blk-mq.c                       | 18 ++++++++++---
>>   block/blk-sysfs.c                    | 40 ++++++++++++++++++++++++++--
>>   block/blk.h                          |  4 ++-
>>   block/genhd.c                        |  6 ++---
>>   include/linux/blk-mq.h               |  1 +
>>   include/linux/blkdev.h               |  3 +++
>>   8 files changed, 80 insertions(+), 17 deletions(-)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index 1fe9a553c37b..e5fedecf7bdf 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -358,8 +358,12 @@ What:		/sys/block/<disk>/queue/iostats
>>   Date:		January 2009
>>   Contact:	linux-block@...r.kernel.org
>>   Description:
>> -		[RW] This file is used to control (on/off) the iostats
>> -		accounting of the disk.
>> +		[RW] This file is used to control the iostats accounting of the
>> +		disk. If this value is 0, iostats accounting is disabled; If
>> +		this value is 1, iostats accounting is enabled, but io_ticks is
>> +		accounted by sampling and the result is not accurate; If this
>> +		value is 2, iostats accounting is enabled and io_ticks is
>> +		accounted precisely, but there will be slightly overhead.
> 
> IMO, this approach looks fine.
> 
>>   
>>   
>>   What:		/sys/block/<disk>/queue/logical_block_size
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index fdf25b8d6e78..405883d606cd 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -935,14 +935,20 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
>>   }
>>   EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>>   
>> -void update_io_ticks(struct block_device *part, unsigned long now, bool end)
>> +void update_io_ticks(struct block_device *part, unsigned long now, bool end,
>> +		     bool precise)
>>   {
>>   	unsigned long stamp;
>>   again:
>>   	stamp = READ_ONCE(part->bd_stamp);
>> -	if (unlikely(time_after(now, stamp))) {
>> -		if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
>> +	if (unlikely(time_after(now, stamp)) &&
>> +	    likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
>> +		if (precise) {
>> +			if (end || part_in_flight(part))
>> +				__part_stat_add(part, io_ticks, now - stamp);
> 
> Strictly speaking, `end` isn't need any more, but it can be thought
> as one optimization, given part_in_flight() is supposed to be non-zero
> in case of account_done.
> 
>> +		} else {
>>   			__part_stat_add(part, io_ticks, end ? now - stamp : 1);
>> +		}
>>   	}
>>   	if (part->bd_partno) {
>>   		part = bdev_whole(part);
>> @@ -954,7 +960,8 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
>>   				 unsigned long start_time)
>>   {
>>   	part_stat_lock();
>> -	update_io_ticks(bdev, start_time, false);
>> +	update_io_ticks(bdev, start_time, false,
>> +			blk_queue_precise_io_stat(bdev->bd_queue));
> 
> blk_queue_precise_io_stat() can be moved into update_io_ticks()
> directly, and it should be fine given it is just done once in each
> tick.

Thanks for reviewing this patch! I'll update your suggestion in v2.

Kuai

> 
> Thanks,
> Ming
> 
> .
>