Message-ID: <1960aea3-3daa-d899-7daf-39dbcc1b7a9d@huaweicloud.com>
Date: Tue, 16 Jan 2024 17:59:42 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Yu Kuai <yukuai1@...weicloud.com>, Ming Lei <ming.lei@...hat.com>
Cc: hch@....de, bvanassche@....org, axboe@...nel.dk,
 linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 yi.zhang@...wei.com, yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH for-6.8/block] block: support to account io_ticks
 precisely

Hi, Ming

On 2024/01/15 19:54, Yu Kuai wrote:
> Hi,
> 
> On 2024/01/15 19:38, Ming Lei wrote:
>> On Tue, Jan 09, 2024 at 03:13:32PM +0800, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@...wei.com>
>>>
>>> Currently, io_ticks is accounted based on sampling, specifically
>>> update_io_ticks() will always account io_ticks by 1 jiffies from
>>> bdev_start_io_acct()/blk_account_io_start(), and the result can be
>>> inaccurate, for example(HZ is 250):
>>>
>>> Test script:
>>> fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms
>>>
>>> Test result: util is about 90%, while the disk is really idle.
>>
>> Just curious, what is the result with this patch? 0%?

I ran some tests on null_blk and found that this patch actually doesn't
have any performance overhead, and I finally figured out that this holds
even in theory.

Notice that there is a cmpxchg() in update_io_ticks():

update_io_ticks:
  stamp = READ_ONCE(part->bd_stamp);
  if (time_after(now, stamp))
   if (try_cmpxchg(&part->bd_stamp, &stamp, now))
    __part_stat_add(part, io_ticks, ...)

Hence only one task can pass the cmpxchg() within one jiffy, so
__part_stat_add() is called at most once per jiffy. This means that with
this patch, part_in_flight() will also be called at most once per jiffy
(not once per IO), and calling part_in_flight() once per jiffy really
doesn't affect IO performance at all.
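
To make this concrete, below is a small userspace analogue of that gate.
It is just a toy with C11 atomics and pthreads, the names are mine and it
is not kernel code, but it shows the point: no matter how many IOs are
issued within one tick, only the caller that wins the compare-and-swap
touches io_ticks.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long stamp;     /* analogue of part->bd_stamp */
static _Atomic unsigned long io_ticks;  /* analogue of the io_ticks stat */

static void update_ticks(unsigned long now)
{
        unsigned long old = atomic_load(&stamp);

        /* Plain '>' instead of time_after(), good enough for a toy.
         * Only the thread that wins the CAS does the accounting; every
         * other caller in the same tick sees a fresh stamp and returns. */
        if (now > old && atomic_compare_exchange_strong(&stamp, &old, now))
                atomic_fetch_add(&io_ticks, 1);
}

static void *worker(void *arg)
{
        /* Simulate many IOs all starting within the same tick. */
        for (int i = 0; i < 100000; i++)
                update_ticks(1);
        return NULL;
}

int main(void)
{
        pthread_t t[8];

        for (int i = 0; i < 8; i++)
                pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 8; i++)
                pthread_join(t[i], NULL);

        /* Prints io_ticks = 1: the accounting ran once for the whole
         * tick, even with 8 threads * 100000 calls. */
        printf("io_ticks = %lu\n", atomic_load(&io_ticks));
        return 0;
}

Eight threads hammering the same tick still end up with io_ticks = 1,
which is why doing part_in_flight() inside that branch costs at most one
call per jiffy.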

Before this cmpxchg() was introduced, the code looked like this:

part_round_stats:
  if (part->stamp != now)
   stats |= 1;

  part_in_flight()
-> lots of tasks can reach this point within one jiffy.
  part_round_stats_single()
   __part_stat_add()
  part->stamp = now;

By the way, this cmpxchg() was added by commit 5b18b5a73760 ("block:
delete part_round_stats and switch to less precise counting"), so there
was actually no need to switch to less precise counting in the first
place.

So I think I can just remove the sysfs switch and enable precise io
accounting by default in the next version.
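
For reference, with the patch as posted the mode is selected through the
existing iostats knob, using the values from the ABI document quoted
below (sda is just an example disk); making precise accounting the
default would simply mean getting the "2" behaviour without this step:

  # 0 - iostats disabled
  # 1 - iostats enabled, io_ticks accounted by sampling (old behaviour)
  # 2 - iostats enabled, io_ticks accounted precisely
  echo 2 > /sys/block/sda/queue/iostats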

Please let me know what you think!

Thanks,
Kuai

> 
> No, it's not 0%; it actually depends on how many IOs start in one
> jiffy and complete in the next jiffy. Given that the probability is
> related to IO latency, the result should be relatively accurate (around
> 10% in my environment). I think we can live with that unless we improve
> time precision from jiffies to ns.
>>
>>>
>>> In order to account io_ticks precisely, update_io_ticks() must know
>>> whether there is IO already inflight, which adds a little overhead;
>>> hence precise io accounting is disabled by default, and the user can
>>> enable it through a sysfs entry.
>>>
>>> Note that for rq-based devices, part_stat_local_inc/dec() and
>>> part_in_flight() are used to track inflight IO instead of iterating
>>> tags, since iterating tags is not supposed to be done in the fast path
>>> because 'tags->lock' is grabbed in blk_mq_find_and_get_req().
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>>> ---
>>> Changes from RFC v1:
>>>   - remove the new parameter for update_io_ticks();
>>>   - simplify update_io_ticks();
>>>   - use switch in queue_iostats_store();
>>>   - add missing part_stat_local_dec() in blk_account_io_merge_request();
>>> Changes from RFC v2:
>>>   - fix that precise is ignored for the first io in update_io_ticks();
>>>
>>>   Documentation/ABI/stable/sysfs-block |  8 ++++--
>>>   block/blk-core.c                     | 10 +++++--
>>>   block/blk-merge.c                    |  3 ++
>>>   block/blk-mq-debugfs.c               |  2 ++
>>>   block/blk-mq.c                       | 11 +++++++-
>>>   block/blk-sysfs.c                    | 42 ++++++++++++++++++++++++++--
>>>   block/blk.h                          |  1 +
>>>   block/genhd.c                        |  2 +-
>>>   include/linux/blk-mq.h               |  1 +
>>>   include/linux/blkdev.h               |  3 ++
>>>   10 files changed, 74 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>>> index 1fe9a553c37b..79027bf2661a 100644
>>> --- a/Documentation/ABI/stable/sysfs-block
>>> +++ b/Documentation/ABI/stable/sysfs-block
>>> @@ -358,8 +358,12 @@ What:        /sys/block/<disk>/queue/iostats
>>>   Date:        January 2009
>>>   Contact:    linux-block@...r.kernel.org
>>>   Description:
>>> -        [RW] This file is used to control (on/off) the iostats
>>> -        accounting of the disk.
>>> +        [RW] This file is used to control the iostats accounting of the
>>> +        disk. If this value is 0, iostats accounting is disabled; If
>>> +        this value is 1, iostats accounting is enabled, but io_ticks is
>>> +        accounted by sampling and the result is not accurate; If this
>>> +        value is 2, iostats accounting is enabled and io_ticks is
>>> +        accounted precisely, but there will be slightly more overhead.
>>>   What:        /sys/block/<disk>/queue/logical_block_size
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 9520ccab3050..c70dc311e3b7 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -954,11 +954,15 @@ EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>>>   void update_io_ticks(struct block_device *part, unsigned long now, bool end)
>>>   {
>>>       unsigned long stamp;
>>> +    bool precise = blk_queue_precise_io_stat(part->bd_queue);
>>>   again:
>>>       stamp = READ_ONCE(part->bd_stamp);
>>> -    if (unlikely(time_after(now, stamp))) {
>>> -        if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
>>> -            __part_stat_add(part, io_ticks, end ? now - stamp : 1);
>>> +    if (unlikely(time_after(now, stamp)) &&
>>> +        likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
>>> +        if (end || (precise && part_in_flight(part)))
>>> +            __part_stat_add(part, io_ticks, now - stamp);
>>> +        else if (!precise)
>>> +            __part_stat_add(part, io_ticks, 1);
>>
>> It would be better and more readable to move 'bool precise' into the
>> above branch, given that we only need to read the flag once in each
>> tick.
>>
>> Otherwise, this patch looks fine.
> 
> Thanks for your advice, will change that in next version.
> 
> Kuai
>>
>> Thanks,
>> Ming
>>
>> .
>>
> 
> .
> 

