Message-ID: <YKTZag/E8AaOtVT0@hirez.programming.kicks-ass.net>
Date:   Wed, 19 May 2021 11:24:58 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     "hasegawa-hitomi@...itsu.com" <hasegawa-hitomi@...itsu.com>
Cc:     "'mingo@...nel.org'" <mingo@...nel.org>,
        "'fweisbec@...il.com'" <fweisbec@...il.com>,
        "'tglx@...utronix.de'" <tglx@...utronix.de>,
        "'juri.lelli@...hat.com'" <juri.lelli@...hat.com>,
        "'vincent.guittot@...aro.org'" <vincent.guittot@...aro.org>,
        "'dietmar.eggemann@....com'" <dietmar.eggemann@....com>,
        "'rostedt@...dmis.org'" <rostedt@...dmis.org>,
        "'bsegall@...gle.com'" <bsegall@...gle.com>,
        "'mgorman@...e.de'" <mgorman@...e.de>,
        "'bristot@...hat.com'" <bristot@...hat.com>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is
 executed on a tickless CPU.

On Wed, May 19, 2021 at 06:30:36AM +0000, hasegawa-hitomi@...itsu.com wrote:
> Hi Ingo, Peter, Juri, and Vincent.
> 
> 
> > Your email is malformed.
> 
> I'm sorry, I sent it in the wrong format. I have corrected it and am resending.
> Thank you, Peter, for pointing this out.
> 
> 
> I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU,
> the utime and stime I get are less than the actual time, unlike when I run
> getrusage(RUSAGE_SELF) on a single thread.
> This problem seems to be caused by the fact that se.sum_exec_runtime is not
> updated just before getting the information from 'current'.
> In the current implementation, task_cputime_adjusted() calls task_cputime() to
> get the 'current' utime and stime, then calls cputime_adjust() to adjust the
> sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless
> CPU, sum_exec_runtime is not updated periodically, so there seems to be a
> discrepancy with the actual time.
> Therefore, I think getrusage() should update se.sum_exec_runtime just
> before getting the information from 'current' (as the cases other than
> RUSAGE_THREAD already do). I'm thinking of the following improvement.
> 
> @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
>         if (who == RUSAGE_THREAD) {
> +               task_sched_runtime(current);
>                 task_cputime_adjusted(current, &utime, &stime);
> 
> Is there any possible problem with this?
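
For reference, the adjustment described in the quoted report (utime and stime rescaled so that their sum equals sum_exec_runtime) can be pictured with a simplified userspace sketch. This is not the kernel's cputime_adjust(): the name scale_cputime() and the struct are hypothetical, the monotonicity handling of the real function is left out, and unsigned __int128 (a GCC/Clang extension) is assumed for the intermediate product.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct scaled_cputime {
	uint64_t utime;
	uint64_t stime;
};

/*
 * Split rtime (standing in for sum_exec_runtime) between user and system
 * time in the same ratio as the sampled utime/stime values.
 */
static struct scaled_cputime scale_cputime(uint64_t utime, uint64_t stime,
					   uint64_t rtime)
{
	struct scaled_cputime out;
	uint64_t total = utime + stime;

	if (stime == 0) {
		/* No system time sampled: attribute everything to user time. */
		out.utime = rtime;
		out.stime = 0;
		return out;
	}

	/* 128-bit intermediate so stime * rtime cannot overflow. */
	out.stime = (uint64_t)(((unsigned __int128)stime * rtime) / total);
	out.utime = rtime - out.stime;
	return out;
}
```

Because the result always sums to rtime, a stale rtime caps the reported total: with utime=30, stime=10 and rtime=80 the sketch yields 60 and 20, but if rtime was last updated at 40 it yields 30 and 10, however long the task has really run.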

Would be superfluous for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
architectures at the very least.

It also doesn't help any of the other callers, like for example procfs.

Something like the below ought to work and fix all variants I think. But
it does make the call significantly more expensive.

thread_group_cputime() already does something like this, but it is
susceptible to a variant of this very same issue: it doesn't make the
call unconditionally, nor on all tasks, so if current isn't part of the
thread group and/or another task is on a nohz_full CPU, things will go
wobbly again.

There's a note about syscall performance there, so clearly someone seems
to care about that aspect of things, but it does suck for nohz_full.

Frederic, didn't we have remote ticks that should help with this stuff?

And mostly I think the trade-off here is that if you run on nohz_full,
you're not expected to go do syscalls anyway (because they're sodding
expensive) and hence the accuracy of this sort of thing is mostly
irrelevant.

So it might be the use-case is just fundamentally bonkers and we
shouldn't really bother fixing this.

Anyway?

---
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481d5098..620871c8e4f8 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 {
 	struct task_cputime cputime = {
-		.sum_exec_runtime = p->se.sum_exec_runtime,
+		.sum_exec_runtime = task_sched_runtime(p),
 	};
 
 	task_cputime(p, &cputime.utime, &cputime.stime);
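
A rough way to look for the reported gap from user space is to spin for a while and then compare getrusage(RUSAGE_THREAD) against another per-thread CPU-time source. The sketch below is only an illustration: the helper names are hypothetical, and using CLOCK_THREAD_CPUTIME_ID as the comparison point is an assumption, not something stated in this thread.

```c
#define _GNU_SOURCE		/* RUSAGE_THREAD is a Linux extension */
#include <stdint.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1		/* value used by the Linux UAPI headers */
#endif

/* Spin in user space for roughly `ms` milliseconds of wall-clock time. */
static void burn_wall_ms(long ms)
{
	struct timespec start, now;
	volatile uint64_t sink = 0;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		for (int i = 0; i < 100000; i++)
			sink += (uint64_t)i;
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000L +
		 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

/* utime + stime of the calling thread in nanoseconds, or -1 on error. */
static int64_t thread_cpu_ns_rusage(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_THREAD, &ru) != 0)
		return -1;
	return ((int64_t)ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000000LL +
	       ((int64_t)ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) * 1000LL;
}

/* CPU time of the calling thread from the POSIX thread clock, or -1 on error. */
static int64_t thread_cpu_ns_clock(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
		return -1;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Calling burn_wall_ms() and then printing both values from a thread pinned to a nohz_full CPU (for example with taskset), once on an unpatched and once on a patched kernel, would show whether the two sources diverge; on a CPU with a regular tick they would be expected to stay close.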