Message-ID: <20130416104007.GA620@redhat.com>
Date: Tue, 16 Apr 2013 12:40:07 +0200
From: Stanislaw Gruszka <sgruszka@...hat.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...nel.org>,
"H. Peter Anvin" <hpa@...or.com>,
Frédéric Weisbecker <fweisbec@...il.com>,
Steven Rostedt <rostedt@...dmis.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
linux-tip-commits@...r.kernel.org
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Sat, Apr 13, 2013 at 11:44:54AM -0700, Linus Torvalds wrote:
> PS. This is the "Make sure 'total' fits in 32 bits first" version. Not
> really tested, but it's just changing the order of operations a bit.
> 
>         /* We know one of the values has a bit set in the high 32 bits */
>         for (;;) {
>                 /* Make sure "rtime" is the bigger of stime/rtime */
>                 if (stime > rtime) {
>                         u64 tmp = rtime; rtime = stime; stime = tmp;
>                 }
> 
>                 /* Make sure 'total' fits in 32 bits */
>                 if (total >> 32)
>                         goto drop_precision;
> 
>                 /* Does rtime (and thus stime) fit in 32 bits? */
>                 if (!(rtime >> 32))
>                         break;
> 
>                 /* Can we just balance rtime/stime rather than dropping bits? */
>                 if (stime >> 31)
>                         goto drop_precision;
> 
>                 /* We can grow stime and shrink rtime and try to make them both fit */
>                 stime <<= 1;
>                 rtime >>= 1;
>                 continue;
> 
>         drop_precision:
>                 /* We drop from rtime, it has more bits than stime */
>                 rtime >>= 1;
>                 total >>= 1;
>         }

This version also passes the 0.1% relative error check in my tests. Decreasing
the error threshold to 0.02% made it fail. I didn't check other values or
measure how often the 0.02% check fails with each version; I assume this one
is better :-)
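
For reference, a user-space sketch of this kind of check. The loop is the one
quoted above; the fast path, the final 32x32->64 multiply/divide and the test
values are illustrative reconstructions, not the exact harness I used (needs
gcc/clang for unsigned __int128):

#include <stdio.h>
#include <stdint.h>

static uint64_t scale_stime(uint64_t stime, uint64_t rtime, uint64_t total)
{
        /* Fast path: everything already fits in 32 bits, no overflow possible */
        if (!((stime | rtime | total) >> 32))
                return stime * rtime / total;

        for (;;) {
                /* Make sure "rtime" is the bigger of stime/rtime */
                if (stime > rtime) {
                        uint64_t tmp = rtime; rtime = stime; stime = tmp;
                }
                /* Make sure 'total' fits in 32 bits */
                if (total >> 32)
                        goto drop_precision;
                /* Does rtime (and thus stime) fit in 32 bits? */
                if (!(rtime >> 32))
                        break;
                /* Can we just balance rtime/stime rather than dropping bits? */
                if (stime >> 31)
                        goto drop_precision;
                /* Grow stime, shrink rtime, try to make both fit */
                stime <<= 1;
                rtime >>= 1;
                continue;
drop_precision:
                /* Drop from rtime, it has more bits than stime */
                rtime >>= 1;
                total >>= 1;
        }
        /* Both factors now fit in 32 bits, so the product fits in 64 bits */
        return stime * rtime / total;
}

int main(void)
{
        /* 128 threads, ~1 year of execution, stime = 0.5 * total, HZ=1000 */
        uint64_t total = 128ULL * 365 * 24 * 3600 * 1000;
        uint64_t stime = total / 2;
        uint64_t rtime = total + total / 7;     /* scheduler clock a bit ahead */

        uint64_t approx = scale_stime(stime, rtime, total);
        uint64_t exact  = (uint64_t)((unsigned __int128)stime * rtime / total);
        double err = 100.0 * ((double)approx - (double)exact) / (double)exact;

        printf("approx %llu, exact %llu, relative error %.4f%%\n",
               (unsigned long long)approx, (unsigned long long)exact, err);
        return 0;
}
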
So regarding relative error this algorithm is fine: there is no multiplication
overflow that would make the scaled numbers bogus. Then I looked at this
algorithm in the context in which it is used ...

Raw stime/rtime/total values increase at jiffies resolution, and so does
scaled_stime as long as we do not drop precision. For bigger numbers, where we
do drop precision, scaled_stime grows in chunks. How big a chunk is depends on
how big the numbers are overall and on the stime/total ratio. For example,
with stime = 0.5*total and 128 threads, after 1 year of CPU execution the
chunk will be 1024 jiffies.
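
A back-of-the-envelope check of that estimate (a sketch assuming HZ=1000 and
approximating the chunk as 2^(number of halvings needed to make the value fit
in 32 bits); the exact shift count depends on the stime/rtime balancing in the
loop):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* ~128 CPU-years expressed in jiffies at HZ=1000 */
        uint64_t rtime = 128ULL * 365 * 24 * 3600 * 1000;
        int shift = 0;

        while (rtime >> 32) {   /* halvings until the value fits in 32 bits */
                rtime >>= 1;
                shift++;
        }
        printf("%d bits dropped -> scaled stime advances in ~%llu jiffy chunks\n",
               shift, 1ULL << shift);  /* prints: 10 bits -> ~1024 jiffies */
        return 0;
}
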
We use the scaled stime value this way:

        if (total)
                stime = scale_stime(stime, rtime, total);
        else
                stime = rtime;

        /*
         * If the tick based count grows faster than the scheduler one,
         * the result of the scaling may go backward.
         * Let's enforce monotonicity.
         */
        prev->stime = max(prev->stime, stime);
        prev->utime = max(prev->utime, rtime - prev->stime);

        *ut = prev->utime;
        *st = prev->stime;

Since rtime increases but scaled stime does not, the stime growth gets
accounted as prev->utime. Then, after a chunk of jiffies, stime jumps and gets
accounted in prev->stime as well. As a result we account more CPU time than
the process actually used. This error accumulates depending on how frequently
cputime_adjust() is called for the process.
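
To illustrate, here is a toy model of that effect (not the kernel code: it
simply assumes the scaled stime only advances in fixed 1024-jiffy steps while
rtime advances every jiffy, and applies the monotonicity rules from the
excerpt above):

#include <stdio.h>
#include <stdint.h>

#define CHUNK 1024ULL   /* assumed granularity of the scaled stime */

static uint64_t max64(uint64_t a, uint64_t b) { return a > b ? a : b; }

int main(void)
{
        uint64_t prev_stime = 0, prev_utime = 0;

        for (uint64_t rtime = 1; rtime <= 2 * CHUNK; rtime++) {
                /* real stime is rtime/2, but the scaled value moves in chunks */
                uint64_t stime = (rtime / 2 / CHUNK) * CHUNK;

                /* monotonicity rules from cputime_adjust() */
                prev_stime = max64(prev_stime, stime);
                prev_utime = max64(prev_utime, rtime - prev_stime);

                if (prev_stime + prev_utime > rtime)
                        printf("rtime %llu: exported %llu (%llu too much)\n",
                               (unsigned long long)rtime,
                               (unsigned long long)(prev_stime + prev_utime),
                               (unsigned long long)(prev_stime + prev_utime - rtime));
        }
        return 0;
}

In this toy model the exported stime + utime overshoots rtime by roughly one
chunk right after the scaled value jumps.
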
As a solution we could simply stop accounting when prev->stime + prev->utime
is already bigger than rtime. For example:

         rtime = nsecs_to_cputime(curr->sum_exec_runtime);
+        /*
+         * Update userspace visible utime/stime values only if actual execution
+         * time is bigger than already exported. Note that it can happen that we
+         * provided bigger values due to scaling inaccuracy on big numbers.
+         */
+        if (prev->stime + prev->utime >= rtime)
+                goto out;
+
         if (total)
                 stime = scale_stime(stime, rtime, total);
         else
@@ -573,6 +581,7 @@ static void cputime_adjust(struct task_cputime *curr,
         prev->stime = max(prev->stime, stime);
         prev->utime = max(prev->utime, rtime - prev->stime);
+out:
         *ut = prev->utime;
         *st = prev->stime;
 }
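
Putting the excerpt above and this patch together, the adjusted function
would look roughly like this (a condensed user-space sketch, not the real
kernel code: the types, the max()/nsecs_to_cputime() stand-ins and the prev
structure are simplified, and scale_stime() is assumed to be the routine
discussed above):

#include <stdint.h>

typedef uint64_t u64;

/* Simplified stand-ins for the kernel helpers, just to make the sketch compile */
#define max(a, b)             ((a) > (b) ? (a) : (b))
#define nsecs_to_cputime(ns)  ((ns) / 1000000ULL)     /* ns -> jiffies at HZ=1000 */

u64 scale_stime(u64 stime, u64 rtime, u64 total);     /* as sketched earlier */

struct prev_cputime_sketch {
        u64 utime;
        u64 stime;
};

void cputime_adjust_sketch(u64 stime, u64 total, u64 sum_exec_runtime,
                           struct prev_cputime_sketch *prev,
                           u64 *ut, u64 *st)
{
        u64 rtime = nsecs_to_cputime(sum_exec_runtime);

        /*
         * Update userspace visible utime/stime values only if actual
         * execution time is bigger than already exported. It can happen
         * that we provided bigger values due to scaling inaccuracy on
         * big numbers.
         */
        if (prev->stime + prev->utime >= rtime)
                goto out;

        if (total)
                stime = scale_stime(stime, rtime, total);
        else
                stime = rtime;

        /* Enforce monotonicity, as in the current code */
        prev->stime = max(prev->stime, stime);
        prev->utime = max(prev->utime, rtime - prev->stime);
out:
        *ut = prev->utime;
        *st = prev->stime;
}
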
This should help avoid erroneously accounting more CPU time than the process
actually used. The disadvantage is that userspace will see CPU time increase
in chunks, but I think that is better than seeing values much bigger than the
correct ones (and for 99.9% of use cases it does not matter).

Stanislaw