Message-ID: <dca42bb1-5480-cec0-0bc8-b5ac3c208177@linux.alibaba.com>
Date:   Tue, 24 Jul 2018 21:28:48 +0800
From:   Xunlei Pang <xlpang@...ux.alibaba.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>, tglx@...utronix.de,
        frederic@...nel.org, lcapitulino@...hat.com,
        torvalds@...ux-foundation.org, linux-kernel@...r.kernel.org,
        hpa@...or.com, tj@...nel.org, linux-tip-commits@...r.kernel.org
Subject: Re: [tip:sched/core] sched/cputime: Ensure accurate utime and stime
 ratio in cputime_adjust()

On 7/23/18 5:21 PM, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 12:08:36PM +0800, Xunlei Pang wrote:
>> The trace data corresponds to the last sample period:
>> trace entry 1:
>>              cat-20755 [022] d...  1370.106496: cputime_adjust: task
>> tick-based utime 362560000000 stime 2551000000, scheduler rtime 333060702626
>>              cat-20755 [022] d...  1370.106497: cputime_adjust: result:
>> old utime 330729718142 stime 2306983867, new utime 330733635372 stime
>> 2327067254
>>
>> trace entry 2:
>>              cat-20773 [005] d...  1371.109825: cputime_adjust: task
>> tick-based utime 362567000000 stime 3547000000, scheduler rtime 334063718912
>>              cat-20773 [005] d...  1371.109826: cputime_adjust: result:
>> old utime 330733635372 stime 2327067254, new utime 330827229702 stime
>> 3236489210
>>
>> 1) expected behaviour
>> Let's compare the last two trace entries (all the data below is in ns):
>> task tick-based utime: 362560000000->362567000000 increased 7000000
>> task tick-based stime: 2551000000  ->3547000000   increased 996000000
>> scheduler rtime:       333060702626->334063718912 increased 1003016286
>>
>> The application actually runs at almost 100% sys at the moment; we can
>> use the increase in the task's tick-based utime and stime to double
>> check:
>> 996000000/(7000000+996000000) > 99%sys
>>
>> 2) the current cputime_adjust()'s inaccurate result
>> But with the current cputime_adjust(), we get the following adjusted
>> utime and stime increases in this sample period:
>> adjusted utime: 330733635372->330827229702 increased 93594330
>> adjusted stime: 2327067254  ->3236489210   increased 909421956
>>
>> so 909421956/(93594330+909421956)=91%sys as the shell script shows above.
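
Just to make those two ratios easy to re-check, here is a trivial
standalone snippet (illustrative only, not from the patch) that
recomputes them from the deltas quoted in 1) and 2):

/* Re-derive the sys shares from the tick-based and adjusted deltas. */
#include <stdio.h>

int main(void)
{
        double tick_ut = 7000000.0,  tick_st = 996000000.0;  /* from 1) */
        double adj_ut  = 93594330.0, adj_st  = 909421956.0;  /* from 2) */

        printf("tick-based %%sys: %.1f\n",
               100.0 * tick_st / (tick_ut + tick_st));  /* ~99.3 */
        printf("adjusted   %%sys: %.1f\n",
               100.0 * adj_st / (adj_ut + adj_st));     /* ~90.7 */
        return 0;
}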
>>
>> 3) root cause
>> The root cause of the issue is that the current cputime_adjust() always
>> passes the whole cumulative times to scale_stime() to split the whole
>> utime and stime. In this patch, we instead pass the deltas accumulated
>> within the user's sample period (as computed in 1) above) to
>> scale_stime(), and add the results to the previously saved adjusted
>> utime and stime, thereby guaranteeing accurate usr and sys increases
>> within the user's sample period.
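
To make 3) concrete, below is a rough userspace sketch of that
delta-based idea; the struct, its fields and the simplified
scale_stime() stand-in are mine for illustration, not the actual
kernel code or the patch itself:

/*
 * Illustrative only: split each period's rtime delta by that period's
 * tick-based utime/stime deltas, then accumulate the result into the
 * values reported last time.
 */
#include <stdio.h>

typedef unsigned long long u64;

struct prev_sketch {                    /* hypothetical bookkeeping */
        u64 raw_utime, raw_stime;       /* last tick-based samples  */
        u64 rtime;                      /* last scheduler rtime     */
        u64 adj_utime, adj_stime;       /* last reported values     */
};

/* simplified stand-in for scale_stime(): stime's share of rtime */
static u64 scale_stime_sketch(u64 stime, u64 rtime, u64 total)
{
        return total ? stime * rtime / total : 0;
}

static void cputime_adjust_sketch(struct prev_sketch *p, u64 utime,
                                  u64 stime, u64 rtime)
{
        u64 d_ut = utime - p->raw_utime;
        u64 d_st = stime - p->raw_stime;
        u64 d_rt = rtime - p->rtime;
        /* split only this period's rtime delta by this period's ticks */
        u64 d_st_adj = scale_stime_sketch(d_st, d_rt, d_ut + d_st);

        p->adj_stime += d_st_adj;
        p->adj_utime += d_rt - d_st_adj;
        p->raw_utime = utime;
        p->raw_stime = stime;
        p->rtime = rtime;
}

int main(void)
{
        /* seed with trace entry 1, then feed trace entry 2 */
        struct prev_sketch p = {
                .raw_utime = 362560000000ULL, .raw_stime = 2551000000ULL,
                .rtime     = 333060702626ULL,
                .adj_utime = 330733635372ULL, .adj_stime = 2327067254ULL,
        };

        cputime_adjust_sketch(&p, 362567000000ULL, 3547000000ULL,
                              334063718912ULL);
        /* the stime increase now reflects ~99% sys for this period */
        printf("adjusted utime %llu stime %llu\n", p.adj_utime, p.adj_stime);
        return 0;
}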
> 
> But why is this a problem?
> 
> Since it's sample based there's really nothing much you can guarantee.
> What if your test program runs in userspace for 50% of the time but is
> constructed to always be in kernel space when the tick happens?
> 
> Then you would 'expect' it to be 50% user and 50% sys, but you're also
> not getting that.
> 
> This stuff cannot be perfect, and the current code provides 'sensible'
> numbers over the long run for most programs. Why muck with it?
> 

Basically I am OK with the current implementation, except for one
scenario we've run into: when the kernel went wrong for some reason and
suddenly spent 100% sys for several seconds (even triggering a
softlockup), the statistics monitor didn't reflect that fact, which
confused people. For example, with our per-cgroup top we once saw
"20% usr, 80% sys" displayed while the kernel was in fact in some busy
loop (100% sys) at that moment, and in such a case the tick-based
samples are of course all sys.
