Message-ID: <20190722200034.GJ6698@worktop.programming.kicks-ass.net>
Date:   Mon, 22 Jul 2019 22:00:34 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Stanislaw Gruszka <sgruszka@...hat.com>
Cc:     Oleg Nesterov <oleg@...hat.com>, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Andrew Fox <afox@...hat.com>,
        Stephen Johnston <sjohnsto@...hat.com>,
        linux-kernel@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH] sched/cputime: make scale_stime() more precise

On Mon, Jul 22, 2019 at 12:52:41PM +0200, Stanislaw Gruszka wrote:
> On Fri, Jul 19, 2019 at 01:03:49PM +0200, Peter Zijlstra wrote:
> > > shows the problem even when sum_exec_runtime is not that big: 300000 secs.
> > > 
> > > The new implementation of scale_stime() does the additional div64_u64_rem()
> > > in a loop but see the comment, as long as it is used by cputime_adjust() this
> > > can happen only once.
> > 
> > That only shows something after long long staring :/ There's no words on
> > what the output actually means or what would've been expected.
> > 
> > Also, your example is incomplete; the below is a test for scale_stime();
> > from this we can see that the division results in too large a number,
> > but, important for our use-case in cputime_adjust(), it is a step
> > function (due to loss in precision) and for every plateau we shift
> > runtime into the wrong bucket.
> > 
> > Your proposed function works; but is atrocious, esp. on 32bit. That
> > said, before we 'fixed' it, it had similar horrible divisions in, see
> > commit 55eaa7c1f511 ("sched: Avoid cputime scaling overflow").
> > 
> > Included below is also an x86_64 implementation in 2 instructions.
> > 
> > I'm still trying to see if there's anything saner we can do...
> 
> I was always a proponent of removing scaling and exporting raw values
> and sum_exec_runtime. But that has an obvious drawback: it reintroduces
> the 'top hiding' issue.

I think (but didn't grep) that we actually export sum_exec_runtime in
/proc/ *somewhere*.
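
For illustration only: as far as I know, one such place is the first field
of /proc/<pid>/schedstat (time spent on the CPU, in nanoseconds), which is
only present when the kernel is built with scheduler info support. A minimal
userspace sketch of reading it follows; both helper names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse the first whitespace-separated field of an
 * already opened schedstat-style stream. Returns 0 on success, -1 if no
 * unsigned integer could be read.
 */
static int parse_schedstat_runtime(FILE *f, uint64_t *runtime_ns)
{
    return fscanf(f, "%" SCNu64, runtime_ns) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: open a path such as "/proc/self/schedstat" and
 * return its first field. Returns -1 if the file is missing (for example
 * on a kernel without schedstat support) or cannot be parsed.
 */
static int read_schedstat_runtime(const char *path, uint64_t *runtime_ns)
{
    FILE *f = fopen(path, "r");
    int ret;

    if (!f)
        return -1;
    ret = parse_schedstat_runtime(f, runtime_ns);
    fclose(f);
    return ret;
}
```

Calling read_schedstat_runtime("/proc/self/schedstat", &ns) on a suitable
kernel should give the caller's accumulated runtime; treat a -1 return as
"not available" rather than as zero runtime.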

> But maybe we can export raw values in a separate file, i.e.
> /proc/[pid]/raw_cpu_times ? So applications that require more precise
> cputime values for very long-living processes can use this file.

There are no raw cpu_times, there are system and user samples, and
samples are, by their very nature, an approximation. We just happen to
track the samples in TICK_NSEC resolution these days, but they're still
ticks (except on s390 and maybe other archs, which do time accounting in
the syscall path).

But I think you'll find x86 people are quite opposed to doing TSC reads
in syscall entry and exit :-)
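
To make the scaling under discussion concrete: the idea is to split the
precise runtime in proportion to the sampled system and user ticks, i.e.
stime * rtime / (stime + utime). The sketch below is not the kernel's
scale_stime() and not the patch from this thread; it is a hypothetical
userspace helper that assumes a 64-bit GCC or Clang target providing
unsigned __int128, so the intermediate product keeps all of its bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (the name is an assumption): scale the sampled
 * system time against the precise runtime using a 128-bit intermediate.
 *
 * stime, utime: sampled system and user time
 * rtime:        precise runtime, in the style of sum_exec_runtime
 *
 * Because stime <= stime + utime, the quotient never exceeds rtime and
 * therefore always fits in 64 bits.
 */
static uint64_t scale_stime_exact(uint64_t stime, uint64_t utime,
                                  uint64_t rtime)
{
    uint64_t total = stime + utime;

    /* The caller must supply at least one sample and a non-wrapping sum. */
    assert(total != 0 && total >= stime);

    unsigned __int128 prod = (unsigned __int128)stime * rtime;

    return (uint64_t)(prod / total);
}
```

With stime = 3, utime = 1 and rtime = 1000 this gives 750. The full-width
product avoids the plateaus described earlier in the thread, at the cost of
a 128-by-64-bit division, which is the part that is painful on 32-bit.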