linux-kernel - Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAK1hOcO3tmzK7JnuNGYCeFBk=nEoo5ke0Tr58zi52ccv0CzEvQ@mail.gmail.com>
Date:	Thu, 3 Apr 2014 11:51:32 +0200
From:	Denys Vlasenko <vda.linux@...glemail.com>
To:	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Fernando Luis Vazquez Cao <fernando_b1@....ntt.co.jp>,
	Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Arjan van de Ven <arjan@...ux.intel.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
Subject: Re: [PATCH 1/2] nohz: use seqlock to avoid race on idle time stats v2

On Thu, Apr 3, 2014 at 9:02 AM, Hidetoshi Seto
<seto.hidetoshi@...fujitsu.com> wrote:
>>> [PROBLEM 2]: broken iowait accounting.
>>>
>>> As historical nature, cpu's idle time was accounted as either
>>> idle or iowait depending on the presence of tasks blocked by
>>> I/O. No one complain about it for a long time. However:
>>>
>>>   > Still trying to wrap my head around it, but conceptually
>>>   > get_cpu_iowait_time_us() doesn't make any kind of sense.
>>>   > iowait isn't per cpu since effectively tasks that aren't
>>>   > running aren't assigned a cpu (as Oleg already pointed out).
>>>   -- Peter Zijlstra
>>>
>>> Now some kernel folks realized that accounting iowait as per-cpu
>>> does not make sense in SMP world. When we were in traditional
>>> UP era, cpu is considered to be waiting I/O if it is idle while
>>> nr_iowait > 0. But in these days with SMP systems, tasks easily
>>> migrate from a cpu where they issued an I/O to another cpu where
>>> they are queued after I/O completion.
>>
>> However, if we would put ourselves into admin's seat, iowait
>> immediately starts to make sense: for admin, the system state
>> where a lot of CPU time is genuinely idle is qualitatively different
>> form the state where a lot of CPU time is "idle" because
>> we are I/O bound.
>>
>> Admins probably wouldn't insist that iowait accounting must be
>> very accurate. I would hazard to guess that admins would settle
>> for the following rules:
>>
>> * (idle + iowait) should accurately represent amount of time
>> CPUs were idle.
>> * both idle and iowait should never go backwards
>> * when system is truly idle, only idle should increase
>> * when system is truly I/O bound on all CPUs, only iowait should increase
>> * when the situation is in between of the above two cases,
>> both iowait and idle counters should grow. It's ok if they
>> represent idle/IO-bound ratio only approximately
>
> Yep. Admins are at the mercy of iowait value, though they know it
> is not accurate.
>
> Assume there are task X,Y,Z (X issues io, Y sleeps moderately,
> and Z has low priority):
>
> Case 1:
>   cpu A: <--run X--><--iowait--><--run X--><--iowait--><--run X ...
>   cpu B: <---run Y--><--run Z--><--run Y--><--run Z--><--run Y ...
>   io:               <-- io X -->           <-- io X -->
>
> Case 2:
>   cpu A: <--run X--><--run Z---><--run X--><--run Z---><--run X ...
>   cpu B: <---run Y---><--idle--><---run Y---><--idle--><--run Y ...
>   io:               <-- io X -->           <-- io X -->
>
> So case 1 tend to be iowait while case 2 is idle, despite
> almost same workloads. Then what should admins do...?

This happens with current code too, right?
No regression then.

>>> Back to NO_HZ mechanism. Totally terrible thing here is that
>>> observer need to handle duration "delta" without knowing that
>>> nr_iowait of sleeping cpu can be changed easily by migration
>>> even if cpu is sleeping.
>>
>> How about the following: when CPU enters idle, it remembers
>> in struct tick_sched->idle_active whether it was "truly" idle
>> or I/O bound: something like
>>
>> ts->idle_active = nr_iowait_cpu(smp_processor_id()) ? 2 : 1;
>>
>> Then, when we exit idle, we account entire idle period as
>> "true" idle or as iowait depending on ts->idle_active value,
>> regardless of what happened to I/O bound task (whether
>> it migrated or not).
>
> It will not be acceptable. CPU can sleep significantly long
> time after all I/O bound tasks are migrated. e.g.:
>
> cpu A: <-run X-><-- iowait ---... (few days) ...--><-run Z ..
> cpu B: <----run Y------><-run X->..
> io:             <-io X->

Does task migrate from an *idle* CPU? If yes, why?
Since its CPU is idle (i.e. immediately available
for it to be scheduled on),
I would imagine normally IO-blocked task stays
on its CPU's rq if it is idle.

> I agree on removing get_cpu_{idle,iowait}_time_us() (or marking
> it as obsolete) with some conditions.

Er?
My proposal does not eliminate or change
get_cpu_{idle,iowait}_time_us() API.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/