lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <98e42b31-1a67-0144-19d3-a4b66c5308fd@huaweicloud.com>
Date: Mon, 19 Feb 2024 17:03:52 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Yu Kuai <yukuai1@...weicloud.com>, Ming Lei <ming.lei@...hat.com>
Cc: hch@....de, bvanassche@....org, axboe@...nel.dk,
 linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 yi.zhang@...wei.com, yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH for-6.8/block] block: support to account io_ticks
 precisely

Hi, Ming

在 2024/01/16 17:59, Yu Kuai 写道:
> Hi, Ming
> 
> 在 2024/01/15 19:54, Yu Kuai 写道:
>> Hi,
>>
>> 在 2024/01/15 19:38, Ming Lei 写道:
>>> On Tue, Jan 09, 2024 at 03:13:32PM +0800, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@...wei.com>
>>>>
>>>> Currently, io_ticks is accounted based on sampling, specifically
>>>> update_io_ticks() will always account io_ticks by 1 jiffies from
>>>> bdev_start_io_acct()/blk_account_io_start(), and the result can be
>>>> inaccurate, for example(HZ is 250):
>>>>
>>>> Test script:
>>>> fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test 
>>>> -thinktime=4ms
>>>>
>>>> Test result: util is about 90%, while the disk is really idle.
>>>
>>> Just be curious, what is result with this patch? 0%?
> 
> I ran some tests on null_blk and found out that this patch actually
> doesn't have any performance overhead, and I finially figure out that
> it's true even in theroy.
> 
> Notice that there is a cmpxchg() in update_ticks():
> 
> update_io_ticks:
>   stamp = part->bd_stamp;
>   if (time_after(now, stamp))
>    if (try_cmpxchg())
>     __part_stat_add()
> 
> Hence only one task can pass cmpxchg in 1 jiffies, and part_stat_add() 
> be called at most once in 1 jiffies. Which means with this patch,
> part_in_flight() will also be called at most once in 1 jiffies(not per
> IO). And part_in_flight() once per jiffies really doesn't affect IO
> performance at all.
> 
> Befor this cmpxchg():
> 
> part_round_stats:
>   if (part->stamp != now)
>    stats |= 1;
> 
>   part_in_flight()
> -> there can be lots of task here in 1 jiffies.
>   part_round_stats_single()
>    __part_stat_add()
>   part->stamp = now;
> 
> By the way, this cmpxchg() is added by commit 5b18b5a73760 ("block:
> delete part_round_stats and switch to less precise counting"), hence
> actually there is no need to switch to less precise counting in the
> first place.
> 
> So I think I can just remove the switch and switch to precise io
> accounting by default in the next version.
> 
> Please let me know what you think!

Are you still interested? And the overhead to switch to precise io
accounting is:

  - per cpu add/dec for each IO for rq-based device;
  - per cpu sum for each jiffies(1-4 ms).

Hence in theory, the overhead is quite small, and I already tested with
null-blk, there is no change with high IO pressure.

Thanks,
Kuai

> 
> Thanks,
> Kuai
> 
>>
>> No, it's not 0%, this actually depends on how many IO really start from
>> one jiffies and complete at the next jiffies. Given that the probability
>> is related to IO latency, so the result should be relatively
>> accurate(Around 10% in my environment). I think we can live with that
>> unless we improve time precision from jiffies to ns.
>>>
>>>>
>>>> In order to account io_ticks precisely, update_io_ticks() must know if
>>>> there are IO inflight already, and this requires overhead slightly,
>>>> hence precise io accounting is disabled by default, and user can enable
>>>> it through sysfs entry.
>>>>
>>>> Noted that for rq-based devcie, part_stat_local_inc/dec() and
>>>> part_in_flight() is used to track inflight instead of iterating tags,
>>>> which is not supposed to be used in fast path because 'tags->lock' is
>>>> grabbed in blk_mq_find_and_get_req().
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>>>> ---
>>>> Changes from RFC v1:
>>>>   - remove the new parameter for update_io_ticks();
>>>>   - simplify update_io_ticks();
>>>>   - use swith in queue_iostats_store();
>>>>   - add missing part_stat_local_dec() in 
>>>> blk_account_io_merge_request();
>>>> Changes from RFC v2:
>>>>   - fix that precise is ignored for the first io in update_io_ticks();
>>>>
>>>>   Documentation/ABI/stable/sysfs-block |  8 ++++--
>>>>   block/blk-core.c                     | 10 +++++--
>>>>   block/blk-merge.c                    |  3 ++
>>>>   block/blk-mq-debugfs.c               |  2 ++
>>>>   block/blk-mq.c                       | 11 +++++++-
>>>>   block/blk-sysfs.c                    | 42 
>>>> ++++++++++++++++++++++++++--
>>>>   block/blk.h                          |  1 +
>>>>   block/genhd.c                        |  2 +-
>>>>   include/linux/blk-mq.h               |  1 +
>>>>   include/linux/blkdev.h               |  3 ++
>>>>   10 files changed, 74 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/Documentation/ABI/stable/sysfs-block 
>>>> b/Documentation/ABI/stable/sysfs-block
>>>> index 1fe9a553c37b..79027bf2661a 100644
>>>> --- a/Documentation/ABI/stable/sysfs-block
>>>> +++ b/Documentation/ABI/stable/sysfs-block
>>>> @@ -358,8 +358,12 @@ What:        /sys/block/<disk>/queue/iostats
>>>>   Date:        January 2009
>>>>   Contact:    linux-block@...r.kernel.org
>>>>   Description:
>>>> -        [RW] This file is used to control (on/off) the iostats
>>>> -        accounting of the disk.
>>>> +        [RW] This file is used to control the iostats accounting of 
>>>> the
>>>> +        disk. If this value is 0, iostats accounting is disabled; If
>>>> +        this value is 1, iostats accounting is enabled, but 
>>>> io_ticks is
>>>> +        accounted by sampling and the result is not accurate; If this
>>>> +        value is 2, iostats accounting is enabled and io_ticks is
>>>> +        accounted precisely, but there will be slightly more overhead.
>>>>   What:        /sys/block/<disk>/queue/logical_block_size
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 9520ccab3050..c70dc311e3b7 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -954,11 +954,15 @@ EXPORT_SYMBOL_GPL(iocb_bio_iopoll);
>>>>   void update_io_ticks(struct block_device *part, unsigned long now, 
>>>> bool end)
>>>>   {
>>>>       unsigned long stamp;
>>>> +    bool precise = blk_queue_precise_io_stat(part->bd_queue);
>>>>   again:
>>>>       stamp = READ_ONCE(part->bd_stamp);
>>>> -    if (unlikely(time_after(now, stamp))) {
>>>> -        if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)))
>>>> -            __part_stat_add(part, io_ticks, end ? now - stamp : 1);
>>>> +    if (unlikely(time_after(now, stamp)) &&
>>>> +        likely(try_cmpxchg(&part->bd_stamp, &stamp, now))) {
>>>> +        if (end || (precise && part_in_flight(part)))
>>>> +            __part_stat_add(part, io_ticks, now - stamp);
>>>> +        else if (!precise)
>>>> +            __part_stat_add(part, io_ticks, 1);
>>>
>>> It should be better or readable to move 'bool precise' into the above 
>>> branch,
>>> given we only need to read the flag once in each tick.
>>>
>>> Otherwise, this patch looks fine.
>>
>> Thanks for your advice, will change that in next version.
>>
>> Kuai
>>>
>>> Thanks,
>>> Ming
>>>
>>> .
>>>
>>
>> .
>>
> 
> .
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ