Message-ID: <20190719134727.GV3463@hirez.programming.kicks-ass.net>
Date:   Fri, 19 Jul 2019 15:47:27 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Oleg Nesterov <oleg@...hat.com>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Andrew Fox <afox@...hat.com>,
        Stephen Johnston <sjohnsto@...hat.com>,
        linux-kernel@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Stanislaw Gruszka <sgruszka@...hat.com>
Subject: Re: [PATCH] sched/cputime: make scale_stime() more precise

On Fri, Jul 19, 2019 at 01:03:49PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 18, 2019 at 03:18:34PM +0200, Oleg Nesterov wrote:
> > People report that utime and stime from /proc/<pid>/stat become very wrong
> > when the numbers are big enough. In particular, the monitored application
> > can run all the time in user-space but only stime grows.
> > 
> > This is because scale_stime() is very inaccurate. It tries to minimize the
> > relative error, but the absolute error can be huge.
> > 
> > Andrew wrote the test-case:
> > 
> > 	int main(int argc, char **argv)
> > 	{
> > 	    struct task_cputime c;
> > 	    struct prev_cputime p;
> > 	    u64 st, pst, cst;
> > 	    u64 ut, put, cut;
> > 	    u64 x;
> > 	    int i = -1; // one step not printed
> > 
> > 	    if (argc != 2)
> > 	    {
> > 		printf("usage: %s <start_in_seconds>\n", argv[0]);
> > 		return 1;
> > 	    }
> > 	    x = strtoull(argv[1], NULL, 0) * SEC;
> > 	    printf("start=%llu\n", (unsigned long long)x);
> > 
> > 	    p.stime = 0;
> > 	    p.utime = 0;
> > 
> > 	    while (i++ < NSTEPS)
> > 	    {
> > 		x += STEP;
> > 		c.stime = x;
> > 		c.utime = x;
> > 		c.sum_exec_runtime = x + x;
> > 		pst = cputime_to_clock_t(p.stime);
> > 		put = cputime_to_clock_t(p.utime);
> > 		cputime_adjust(&c, &p, &ut, &st);
> > 		cst = cputime_to_clock_t(st);
> > 		cut = cputime_to_clock_t(ut);
> > 		if (i)
> > 		    printf("ut(diff)/st(diff): %20lld (%4lld)  %20lld (%4lld)\n",
> > 			cut, cut - put, cst, cst - pst);
> > 	    }
> > 	}
> > 
> > For example,
> > 
> > 	$ ./stime 300000
> > 	start=300000000000000
> > 	ut(diff)/st(diff):            299994875 (   0)             300009124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300011124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300013124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300015124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300017124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300019124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300021124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300023124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300025124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300027124 (2000)
> > 	ut(diff)/st(diff):            299994875 (   0)             300029124 (2000)
> > 	ut(diff)/st(diff):            299996875 (2000)             300029124 (   0)
> > 	ut(diff)/st(diff):            299998875 (2000)             300029124 (   0)
> > 	ut(diff)/st(diff):            300000875 (2000)             300029124 (   0)
> > 	ut(diff)/st(diff):            300002875 (2000)             300029124 (   0)
> > 	ut(diff)/st(diff):            300004875 (2000)             300029124 (   0)
> > 	ut(diff)/st(diff):            300006875 (2000)             300029124 (   0)
> > 	ut(diff)/st(diff):            300008875 (2000)             300029124 (   0)
> > 	ut(diff)/st(diff):            300010875 (2000)             300029124 (   0)
> > 	ut(diff)/st(diff):            300012055 (1180)             300029944 ( 820)
> > 	ut(diff)/st(diff):            300012055 (   0)             300031944 (2000)
> > 	ut(diff)/st(diff):            300012055 (   0)             300033944 (2000)
> > 	ut(diff)/st(diff):            300012055 (   0)             300035944 (2000)
> > 	ut(diff)/st(diff):            300012055 (   0)             300037944 (2000)
> > 
> > shows the problem even when sum_exec_runtime is not that big: 300000 secs.
> > 
> > The new implementation of scale_stime() does the additional div64_u64_rem()
> > in a loop, but see the comment: as long as it is used by cputime_adjust(),
> > this can happen only once.
> 
> That only shows something after long, long staring :/ There are no words on
> what the output actually means or what would've been expected.
> 
> Also, your example is incomplete; the below is a test for scale_stime();
> from this we can see that the division results in too large a number,
> but, important for our use-case in cputime_adjust(), it is a step
> function (due to loss in precision) and for every plateau we shift
> runtime into the wrong bucket.

But I'm still confused, since in the long run, it should still end up
with a proportionally divided user/system, irrespective of some short
term wobblies.

So please, better articulate the problem.
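
For readers who want to poke at the step-function behaviour described above,
here is a minimal user-space sketch. It is not the test program from the
thread and not a copy of the kernel function: scale_exact() and scale_lossy()
are hypothetical names, the lossy loop is only loosely modeled on the idea of
dropping low bits until a 32x32->64 multiply and a 64/32 divide suffice, and
the exact version assumes a compiler with unsigned __int128 (GCC or Clang on a
64-bit target).

```c
#include <stdint.h>

/*
 * Exact reference: stime * rtime / total with a 128-bit intermediate.
 * Assumes total != 0 and that the quotient fits in 64 bits (true when
 * stime <= total, as in the cputime use-case).
 */
static uint64_t scale_exact(uint64_t stime, uint64_t rtime, uint64_t total)
{
    return (uint64_t)(((unsigned __int128)stime * rtime) / total);
}

/*
 * Lossy variant (illustrative only): shift bits away until all three
 * operands fit in 32 bits, then multiply and divide.
 * Assumes total != 0 and rtime of similar magnitude to total, so that
 * total is never shifted all the way down to 0.
 */
static uint64_t scale_lossy(uint64_t stime, uint64_t rtime, uint64_t total)
{
    for (;;) {
        /* Keep rtime as the larger of the two factors. */
        if (stime > rtime) {
            uint64_t tmp = rtime;
            rtime = stime;
            stime = tmp;
        }

        /* total must fit in 32 bits for the final divide. */
        if (total >> 32)
            goto drop_precision;

        /* Both factors fit in 32 bits: done. */
        if (!(rtime >> 32))
            break;

        /* No room left to grow stime: low bits have to go. */
        if (stime >> 31)
            goto drop_precision;

        /* Rebalance: grow stime, shrink rtime (product roughly kept). */
        stime <<= 1;
        rtime >>= 1;
        continue;

drop_precision:
        /* Halve rtime and total together; the low bits are lost here. */
        rtime >>= 1;
        total >>= 1;
    }

    return (uint64_t)(uint32_t)stime * (uint64_t)(uint32_t)rtime
           / (uint32_t)total;
}
```

One way to use it: sweep stime, utime and rtime upward in fixed steps from a
large starting value, as the test-case quoted above does, and print
scale_lossy() next to scale_exact() for each step. Any plateaus in the lossy
column would be the loss of precision under discussion; whether they average
out in the long run is exactly the open question.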