Date: Wed, 3 Jan 2024 12:02:43 +0800
From: Ming Lei <ming.lei@...hat.com>
To: Yu Kuai <yukuai1@...weicloud.com>
Cc: bvanassche@....org, hch@....de, axboe@...nel.dk,
	linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
	yukuai3@...wei.com, yi.zhang@...wei.com, yangerkun@...wei.com,
	ming.lei@...hat.com
Subject: Re: [PATCH -next RFC] block: support to account io_ticks precisely

On Tue, Dec 05, 2023 at 05:37:43PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@...wei.com>
> 
> Currently, io_ticks is accounted based on sampling; specifically,
> update_io_ticks() will always account io_ticks by 1 jiffy from
> bdev_start_io_acct()/blk_account_io_start(), and the result can be
> inaccurate, for example (HZ is 250):
> 
> Test script:
> fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
> 
> Test result: util is about 90%, while the disk is actually idle.
> 
> In order to account io_ticks precisely, update_io_ticks() must know
> whether there is IO inflight already, and this adds a slight overhead;
> hence precise io accounting is disabled by default, and users can
> enable it through a sysfs entry.

Yeah, the trouble comes from commit 5b18b5a73760 ("block: delete part_round_stats and
switch to less precise counting"), and the real reason is that the IO inflight
info is too expensive to maintain in the fast path; RH has received several
customer complaints in this area too.
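
Just to make the inaccuracy concrete (rough numbers, not measured): with
HZ=250 one jiffy is 4ms, and with thinktime=4ms each IO cycle takes a bit
more than 4ms of wall clock, so most submissions land in a new jiffy and
the sampled path adds a full jiffy (4ms) of io_ticks per cycle. io_ticks
then grows by roughly 4ms for every ~4.5ms of elapsed time, which iostat
reports as ~90% util even though the device is only busy for the short
submission/completion window.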

> 
> Note that for rq-based devices, part_stat_local_inc/dec() and
> part_in_flight() are used to track inflight IO instead of iterating tags,
> which is not supposed to be done in the fast path because 'tags->lock' is
> grabbed in blk_mq_find_and_get_req().

You can iterate over the static requests via BT_TAG_ITER_STATIC_RQS, so
that tags->lock can be bypassed, but a new helper is needed.
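
Something along these lines, maybe (an untested sketch for blk-mq-tag.c
with made-up names; it only walks the normal tag bitmap and ignores
shared tagsets and per-partition filtering):

	struct inflight_check {
		struct blk_mq_tags *tags;
		bool found;
	};

	/*
	 * Look at static_rqs[] directly, so neither tags->lock nor a
	 * request reference is needed; a racy yes/no answer is good
	 * enough for once-per-tick accounting.
	 */
	static bool blk_mq_static_rq_inflight(struct sbitmap *sb,
					      unsigned int bitnr, void *data)
	{
		struct inflight_check *ic = data;
		struct blk_mq_tags *tags = ic->tags;
		/* bitmap_tags covers the non-reserved tag range */
		struct request *rq =
			tags->static_rqs[bitnr + tags->nr_reserved_tags];

		if (rq && blk_mq_request_started(rq)) {
			ic->found = true;
			return false;	/* stop iterating */
		}
		return true;
	}

	static bool blk_mq_tags_has_inflight(struct blk_mq_tags *tags)
	{
		struct inflight_check ic = { .tags = tags };

		sbitmap_for_each_set(&tags->bitmap_tags.sb,
				     blk_mq_static_rq_inflight, &ic);
		return ic.found;
	}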

But given it is only run once per tick, I guess percpu counting might
be fine too, even on a big machine.
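
Roughly the following shape, I mean (again just a sketch with made-up
names, one counter per disk, to show why the fast path stays cheap):

	static DEFINE_PER_CPU(long, disk_inflight);

	/* submission/completion fast path: one local update, no lock,
	 * no cross-CPU traffic */
	static inline void inflight_inc(void)
	{
		this_cpu_inc(disk_inflight);
	}

	static inline void inflight_dec(void)
	{
		this_cpu_dec(disk_inflight);
	}

	/*
	 * The cross-CPU sum is the only costly part, but
	 * update_io_ticks() reaches it at most once per jiffy.
	 */
	static long inflight_read(void)
	{
		long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += per_cpu(disk_inflight, cpu);

		return sum > 0 ? sum : 0;
	}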

> 
> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
> ---
>  Documentation/ABI/stable/sysfs-block |  8 ++++--
>  block/blk-core.c                     | 17 ++++++++----
>  block/blk-mq.c                       | 18 ++++++++++---
>  block/blk-sysfs.c                    | 40 ++++++++++++++++++++++++++--
>  block/blk.h                          |  4 ++-
>  block/genhd.c                        |  6 ++---
>  include/linux/blk-mq.h               |  1 +
>  include/linux/blkdev.h               |  3 +++
>  8 files changed, 80 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 1fe9a553c37b..e5fedecf7bdf 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -358,8 +358,12 @@ What:		/sys/block/<disk>/queue/iostats
>  Date:		January 2009
>  Contact:	linux-block@...r.kernel.org
>  Description:
> -		[RW] This file is used to control (on/off) the iostats
> -		accounting of the disk.
> +		[RW] This file is used to control the iostats accounting of the
> +		disk. If this value is 0, iostats accounting is disabled; if
> +		this value is 1, iostats accounting is enabled, but io_ticks is
> +		accounted by sampling and the result is not accurate; if this
> +		value is 2, iostats accounting is enabled and io_ticks is
> +		accounted precisely, but there will be a slight overhead.

IMO, this approach looks fine.
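
(That is, a user would opt in with something like
"echo 2 > /sys/block/<disk>/queue/iostats", and presumably the default
stays at 1 so nobody pays the extra cost unless they ask for it.)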

>  
>  
>  What:		/sys/block/<disk>/queue/logical_block_size
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fdf25b8d6e78..405883d606cd 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -935,14 +935,20 @@ int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob,
>  }
>  EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>  
> -void update_io_ticks(struct block_device *part, unsigned long now, bool end)
> +void update_io_ticks(struct block_device *part, unsigned long now, bool end,
> +		     bool precise)
>  {
>  	unsigned long stamp;
>  again:
>  	stamp = READ_ONCE(part->bd_stamp);
> -	if (unlikely(time_after(now, stamp))) {
> -		if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
> +	if (unlikely(time_after(now, stamp)) &&
> +	    likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
> +		if (precise) {
> +			if (end || part_in_flight(part))
> +				__part_stat_add(part, io_ticks, now - stamp);

Strictly speaking, `end` isn't needed any more, but it can be thought of
as an optimization, given that part_in_flight() is supposed to be non-zero
in the account_done case.

> +		} else {
>  			__part_stat_add(part, io_ticks, end ? now - stamp : 1);
> +		}
>  	}
>  	if (part->bd_partno) {
>  		part = bdev_whole(part);
> @@ -954,7 +960,8 @@ unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op,
>  				 unsigned long start_time)
>  {
>  	part_stat_lock();
> -	update_io_ticks(bdev, start_time, false);
> +	update_io_ticks(bdev, start_time, false,
> +			blk_queue_precise_io_stat(bdev->bd_queue));

blk_queue_precise_io_stat() can be moved into update_io_ticks()
directly, and that should be fine given it is only called once in each
tick.
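
i.e. something like the below (untested, on top of this patch), so all
callers can keep the existing three-argument signature:

	void update_io_ticks(struct block_device *part, unsigned long now, bool end)
	{
		bool precise = blk_queue_precise_io_stat(part->bd_queue);
		unsigned long stamp;
	again:
		stamp = READ_ONCE(part->bd_stamp);
		if (unlikely(time_after(now, stamp)) &&
		    likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
			if (precise) {
				if (end || part_in_flight(part))
					__part_stat_add(part, io_ticks, now - stamp);
			} else {
				__part_stat_add(part, io_ticks, end ? now - stamp : 1);
			}
		}
		/* rest unchanged: propagate to the whole device for partitions */
		if (part->bd_partno) {
			part = bdev_whole(part);
			goto again;
		}
	}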

Thanks,
Ming

