Message-ID: <26e7faef-7223-3ef8-d09c-e382223ce4fa@gmail.com>
Date: Fri, 5 Jul 2019 14:37:44 +0100
From: Alan Jenkins <alan.christopher.jenkins@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
Doug Smythies <dsmythies@...us.net>, linux-pm@...r.kernel.org
Subject: Re: iowait v.s. idle accounting is "inconsistent" - iowait is too low
On 05/07/2019 12:38, Peter Zijlstra wrote:
> On Fri, Jul 05, 2019 at 12:25:46PM +0100, Alan Jenkins wrote:
>> Hi, scheduler experts!
>>
>> My cpu "iowait" time appears to be reported incorrectly. Do you know why
>> this could happen?
> Because iowait is a magic random number that has no sane meaning.
> Personally I'd prefer to just delete the whole thing, except ABI :/
>
> Also see the comment near nr_iowait():
>
> /*
> * IO-wait accounting, and how it's mostly bollocks (on SMP).
> *
> * The idea behind IO-wait accounting is to account the idle time that we could
> * have spent running if it were not for IO. That is, if we were to improve the
> * storage performance, we'd have a proportional reduction in IO-wait time.
> *
> * This all works nicely on UP, where, when a task blocks on IO, we account
> * idle time as IO-wait, because if the storage were faster, it could've been
> * running and we'd not be idle.
> *
> * This has been extended to SMP, by doing the same for each CPU. This however
> * is broken.
> *
> * Imagine for instance the case where two tasks block on one CPU, only the one
> * CPU will have IO-wait accounted, while the other has regular idle. Even
> * though, if the storage were faster, both could've run at the same time,
> * utilising both CPUs.
> *
> * This means, that when looking globally, the current IO-wait accounting on
> * SMP is a lower bound, by reason of under accounting.
> *
> * Worse, since the numbers are provided per CPU, they are sometimes
> * interpreted per CPU, and that is nonsensical. A blocked task isn't strictly
> * associated with any one particular CPU, it can wake to another CPU than it
> * blocked on. This means the per CPU IO-wait number is meaningless.
> *
> * Task CPU affinities can make all that even more 'interesting'.
> */
Thanks. I read those as different problems from mine, but I take your
point that there is not much demand for (or point in) "fixing" my issue.
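For reference, the per-CPU iowait that comment describes is the fifth value
on each "cpuN" line of /proc/stat (cumulative, in USER_HZ ticks). A minimal
sketch (my own illustration, not kernel code) that dumps the per-CPU idle
and iowait counters, so the under-accounting described above can be watched
directly:

/* Sketch: print per-CPU idle and iowait ticks from /proc/stat.
 * Fields after "cpuN": user nice system idle iowait irq softirq steal ...
 * Values are cumulative USER_HZ ticks; sample twice and diff for rates. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/stat", "r");
    char line[512];

    if (!f) {
        perror("/proc/stat");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        unsigned long long user, nice, sys, idle, iowait;
        int cpu;

        /* skip the aggregate "cpu " line and non-cpu lines */
        if (strncmp(line, "cpu", 3) || !isdigit((unsigned char)line[3]))
            continue;
        if (sscanf(line, "cpu%d %llu %llu %llu %llu %llu",
                   &cpu, &user, &nice, &sys, &idle, &iowait) == 6)
            printf("cpu%-3d idle=%llu iowait=%llu\n", cpu, idle, iowait);
    }
    fclose(f);
    return 0;
}

Sampling it twice, a second apart, shows the same idle/iowait split that the
quoted top output below shows for the pinned CPU.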
> (2) Compare running "dd" with "taskset -c 1":
>
> %Cpu1 : 0.3 us, 3.0 sy, 0.0 ni, 83.7 id, 12.6 wa, 0.0 hi, 0.3 si, 0.0 st
^ non-zero idle time for Cpu1, despite the pinned IO hog.
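(A rough C equivalent of that pinned IO hog, in case anyone wants to
reproduce it without dd; the device path is only a placeholder:)

/* Sketch: pin ourselves to CPU 1 and loop on synchronous O_DIRECT reads,
 * roughly what "taskset -c 1 dd iflag=direct ..." does. Runs until killed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sda";
    cpu_set_t set;
    void *buf;
    int fd;

    CPU_ZERO(&set);
    CPU_SET(1, &set);                        /* pin to CPU 1, like taskset -c 1 */
    if (sched_setaffinity(0, sizeof(set), &set)) {
        perror("sched_setaffinity");
        return 1;
    }

    fd = open(dev, O_RDONLY | O_DIRECT);     /* O_DIRECT: reads hit the device */
    if (fd < 0) {
        perror(dev);
        return 1;
    }
    if (posix_memalign(&buf, 4096, 1 << 20)) /* O_DIRECT wants aligned buffers */
        return 1;

    for (;;) {                               /* watch top/atop while this runs */
        if (read(fd, buf, 1 << 20) <= 0)
            lseek(fd, 0, SEEK_SET);          /* rewind at EOF and keep going */
    }
}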
The block layer recently decided it was acceptable to break "disk busy%"
reporting for slow devices (mechanical HDDs), in order to reduce overhead
for fast devices. This means the summary view in "atop" now lacks any
reliable indicator of a saturated disk.
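For context, that "disk busy%" figure is derived from the io_ticks counter,
the tenth value in /sys/block/<dev>/stat (milliseconds the device had I/O in
flight), which I understand is the counter the change above turned into an
approximation. A rough sketch of the arithmetic behind the percentage (the
device path and the 1-second interval are just my own choices):

/* Sketch: derive "disk busy%" from io_ticks, the 10th field of
 * /sys/block/<dev>/stat (milliseconds the device had I/O in flight). */
#include <stdio.h>
#include <unistd.h>

static unsigned long long io_ticks_ms(const char *path)
{
    unsigned long long ms = 0;
    FILE *f = fopen(path, "r");

    if (f) {
        /* skip the first nine fields, take the tenth */
        fscanf(f, "%*s %*s %*s %*s %*s %*s %*s %*s %*s %llu", &ms);
        fclose(f);
    }
    return ms;
}

int main(void)
{
    const char *path = "/sys/block/sda/stat";  /* placeholder device */
    unsigned long long before = io_ticks_ms(path);

    sleep(1);
    /* ms of busy time per 1000 ms elapsed, expressed as a percentage */
    printf("busy%%: %.1f\n", (io_ticks_ms(path) - before) / 10.0);
    return 0;
}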
I suppose I need to look at "iotop" instead.
The new /proc/pressure/io seems to have caveats related to the same iowait
issues... it looks even harder to interpret for this case, and it does not
seem to behave the way I expected.[1]
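For anyone else poking at it: /proc/pressure/io prints a "some" line and a
"full" line, each with avg10/avg60/avg300 percentages plus a cumulative
total in microseconds. A small sketch that just dumps them (my own
illustration, nothing from the kernel side):

/* Sketch: dump the "some" and "full" lines of /proc/pressure/io.
 * Format: <some|full> avg10=X avg60=Y avg300=Z total=N  (N in microseconds).
 * Needs a kernel with CONFIG_PSI (4.20+). */
#include <stdio.h>

int main(void)
{
    char kind[8];
    double avg10, avg60, avg300, total;
    FILE *f = fopen("/proc/pressure/io", "r");

    if (!f) {
        perror("/proc/pressure/io");
        return 1;
    }
    while (fscanf(f, "%7s avg10=%lf avg60=%lf avg300=%lf total=%lf",
                  kind, &avg10, &avg60, &avg300, &total) == 5)
        printf("%-4s avg10=%.2f%% avg60=%.2f%% avg300=%.2f%% total=%.1fs\n",
               kind, avg10, avg60, avg300, total / 1e6);
    fclose(f);
    return 0;
}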
Regards
Alan
[1]
https://unix.stackexchange.com/questions/527342/why-does-the-new-linux-pressure-stall-information-for-io-not-show-as-100/