Message-ID: <20200325100736.GA3083079@kroah.com>
Date: Wed, 25 Mar 2020 11:07:36 +0100
From: Greg KH <gregkh@...uxfoundation.org>
To: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@...ia.com>
Cc: "axboe@...nel.dk" <axboe@...nel.dk>,
"osandov@...com" <osandov@...com>,
"stable@...r.kernel.org" <stable@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-block@...r.kernel.org" <linux-block@...r.kernel.org>
Subject: Re: 4.19 LTS high /proc/diskstats io_ticks
On Wed, Mar 25, 2020 at 10:02:41AM +0000, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:
> Hi,
>
> Tools like sar and iostat are reporting abnormally high %util with 4.19.y
> running in a VM (the disk is almost idle):
>
> $ sar -dp
> Linux 4.19.107-1.x86_64 03/25/20 _x86_64_ (6 CPU)
>
> 00:00:00 DEV tps ... %util
> 00:10:00 vda 0.55 ... 98.07
> ...
> 10:00:00 vda 0.44 ... 99.74
> Average: vda 0.48 ... 98.98
>
> The numbers look reasonable for the partition:
>
> # iostat -x -p ALL 1 1
> Linux 4.19.107-1.x86_64 03/25/20 _x86_64_ (6 CPU)
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 10.51 0.00 8.58 0.05 0.11 80.75
>
> Device r/s ... %util
> vda 0.02 ... 98.25
> vda1 0.01 ... 0.09
>
>
> Lots of io_ticks in /proc/diskstats:
>
> # cat /proc/uptime
> 45787.03 229321.29
>
> # grep vda /proc/diskstats
> 253 0 vda 760 0 38498 731 28165 43212 1462928 157514 0 44690263 44812032 0 0 0 0
> 253 1 vda1 350 0 19074 293 26169 43212 1462912 154931 0 41560 150998 0 0 0 0
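>
> (If I read the fields right, column 13 above is io_ticks, in milliseconds:
> 44690263 ms is about 44690 s of "time spent doing I/O" against an uptime of
> about 45787 s, i.e. roughly 98%, which matches the %util that sar and
> iostat report above.)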
>
>
> Other people are apparently seeing this too with 4.19:
> https://kudzia.eu/b/2019/09/iostat-x-1-reporting-100-utilization-of-nearly-idle-nvme-drives/
>
>
> I also see this only in 4.19.y, and bisected it to the commit below (based
> on the Fixes tag, it should have been taken to 4.14 too...):
>
> commit 6131837b1de66116459ef4413e26fdbc70d066dc
> Author: Omar Sandoval <osandov@...com>
> Date: Thu Apr 26 00:21:58 2018 -0700
>
> blk-mq: count allocated but not started requests in iostats inflight
>
> In the legacy block case, we increment the counter right after we
> allocate the request, not when the driver handles it. In both the legacy
> and blk-mq cases, part_inc_in_flight() is called from
> blk_account_io_start() right after we've allocated the request. blk-mq
> only considers requests started requests as inflight, but this is
> inconsistent with the legacy definition and the intention in the code.
> This removes the started condition and instead counts all allocated
> requests.
>
> Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
> Signed-off-by: Omar Sandoval <osandov@...com>
> Signed-off-by: Jens Axboe <axboe@...nel.dk>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c3621453ad87..5450cbc61f8d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -95,18 +95,15 @@ static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
>  {
>  	struct mq_inflight *mi = priv;
>  
> -	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
> -		/*
> -		 * index[0] counts the specific partition that was asked
> -		 * for. index[1] counts the ones that are active on the
> -		 * whole device, so increment that if mi->part is indeed
> -		 * a partition, and not a whole device.
> -		 */
> -		if (rq->part == mi->part)
> -			mi->inflight[0]++;
> -		if (mi->part->partno)
> -			mi->inflight[1]++;
> -	}
> +	/*
> +	 * index[0] counts the specific partition that was asked for. index[1]
> +	 * counts the ones that are active on the whole device, so increment
> +	 * that if mi->part is indeed a partition, and not a whole device.
> +	 */
> +	if (rq->part == mi->part)
> +		mi->inflight[0]++;
> +	if (mi->part->partno)
> +		mi->inflight[1]++;
>  }
>  
>  void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
>
>
>
> If I get it right: when the disk is idle and some request is allocated,
> part_round_stats() with this commit now adds all the ticks between the
> previous I/O and the current time (now - part->stamp) to io_ticks.
>
> Before the commit, part_round_stats() would only update part->stamp when
> called right after request allocation.
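>
> Roughly, as a simplified userspace sketch of that accounting (just the
> idea, not the actual kernel code):
>
> 	#include <stdio.h>
>
> 	/*
> 	 * Model of the pre-5.0 io_ticks rounding: if anything counts as
> 	 * inflight, the whole gap since the last stamp is charged to
> 	 * io_ticks; otherwise only the stamp moves forward.
> 	 */
> 	struct part_stats {
> 		unsigned long stamp;	/* last rounding time ("jiffies") */
> 		unsigned long io_ticks;	/* accumulated "time doing I/O" */
> 	};
>
> 	static void round_stats(struct part_stats *p, unsigned long now,
> 				unsigned int inflight)
> 	{
> 		if (inflight)
> 			p->io_ticks += now - p->stamp;
> 		p->stamp = now;
> 	}
>
> 	int main(void)
> 	{
> 		struct part_stats p = { .stamp = 0, .io_ticks = 0 };
>
> 		/*
> 		 * Disk idle for 10000 ticks, then one request is allocated.
> 		 * With the commit above it already counts as inflight, so
> 		 * the whole idle gap lands in io_ticks.
> 		 */
> 		round_stats(&p, 10000, 1);
> 		printf("io_ticks = %lu\n", p.io_ticks);	/* prints 10000 */
> 		return 0;
> 	}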

So this is "false" reporting? There's really no load?

> Any thoughts on how best to fix this in 4.19?
>
> I see the io_ticks accounting has been reworked in 5.0; do we need to
> backport those changes to 4.19, or are there any ill effects if this
> commit is simply reverted in 4.19?

Do you see this issue in 5.4? What's keeping you from moving to 5.4.y?

And if this isn't a real issue, is that a problem too?

Since you can test this, if you have a set of patches backported that could
resolve it, can you send them to us?

thanks,

greg k-h