Message-ID: <4B05F835.10401@jp.fujitsu.com>
Date: Fri, 20 Nov 2009 11:00:21 +0900
From: Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
To: Stanislaw Gruszka <sgruszka@...hat.com>
CC: Peter Zijlstra <peterz@...radead.org>,
Spencer Candland <spencer@...ehost.com>,
Américo Wang <xiyou.wangcong@...il.com>,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
Oleg Nesterov <oleg@...hat.com>,
Balbir Singh <balbir@...ibm.com>
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2
Stanislaw Gruszka wrote:
> On Tue, Nov 17, 2009 at 02:24:48PM +0100, Peter Zijlstra wrote:
>>> Seems the issue reported then was exactly the same as the one you
>>> report now. It looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529
>>> just made the probability of the bug smaller, so you did not notice
>>> it until now.
>>>
>>> Could you please test this patch, if it solve all utime decrease
>>> problems for you:
>>>
>>> http://patchwork.kernel.org/patch/59795/
>>>
>>> If you confirm it works, I think we should apply it. Otherwise
>>> we need to propagate task_{u,s}time everywhere, which is not
>>> (my) preferred solution.
>> That patch will create another issue, it will allow a process to hide
>> from top by arranging to never run when the tick hits.
>
Yes. Nowadays there are many threads running on high-speed hardware,
so such a process can show up anywhere, more easily than before.
E.g. assume that there are 2 tasks:

Task A: interrupted by the timer a few times
  (utime, stime, se.sum_exec_runtime) = (50, 50, 1000000000)
  => total runtime is 1 sec, but utime + stime is only 100 ms

Task B: interrupted by the timer many times
  (utime, stime, se.sum_exec_runtime) = (50, 50, 10000000)
  => total runtime is 10 ms, but utime + stime is 100 ms

You can see that task_[su]time() works well for both of these tasks.
> What about that?
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1f8d028..9db1cbc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
> 	}
> 	utime = (cputime_t)temp;
> 
> -	p->prev_utime = max(p->prev_utime, utime);
> +	p->prev_utime = max(p->prev_utime, max(p->utime, utime));
> 	return p->prev_utime;
> }
I think this makes things worse.
without this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 5 ms (= accurate)
with this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 50 ms (= not accurate)
Note that task_stime() calculates prev_stime using this prev_utime:
without this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 5 ms (= accurate)
with this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 0 ms (= not accurate)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ce17760..8be5b75 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
> 	struct task_cputime cputime;
> 	cputime_t cutime, cstime;
> 
> -	thread_group_cputime(current, &cputime);
> 	spin_lock_irq(&current->sighand->siglock);
> +	thread_group_cputime(current, &cputime);
> 	cutime = current->signal->cutime;
> 	cstime = current->signal->cstime;
> 	spin_unlock_irq(&current->sighand->siglock);
>
> It's on top of Hidetoshi's patch and fixes the utime decrease problem
> on my system.
What about the stime decrease problem, which can be caused by the same
logic?

According to my labeling, there are 2 unresolved problems: [1]
"thread_group_cputime() vs exit" and [2] "use of task_s/utime()".
Still, I believe the real fix for this problem is the combination of
the above fix for do_sys_times() (for problem [1]) and (I know it is
not preferred, but for [2]) the following:
>> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
>> index 5c9dc22..e065b8a 100644
>> --- a/kernel/posix-cpu-timers.c
>> +++ b/kernel/posix-cpu-timers.c
>> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> 
>> 	t = tsk;
>> 	do {
>> -		times->utime = cputime_add(times->utime, t->utime);
>> -		times->stime = cputime_add(times->stime, t->stime);
>> +		times->utime = cputime_add(times->utime, task_utime(t));
>> +		times->stime = cputime_add(times->stime, task_stime(t));
>> 		times->sum_exec_runtime += t->se.sum_exec_runtime;
>> 
>> 		t = next_thread(t);
Think about this diff, assuming task C is in the same group as tasks A
and B. sys_times() on C while A and B are alive returns:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (50,50) + (50,50) +...) + in_signal(exited)
If A exits, the total increases:
(utime, stime)
= task_[su]time(C) + ([su]time(B)+...) + in_signal(exited)+task_[su]time(A)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(500,500)
Otherwise, if B exits, it decreases:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+...) + in_signal(exited)+task_[su]time(B)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(5,5)
With this fix, sys_times() returns:
(utime, stime)
= task_[su]time(C) + (task_[su]time(A)+task_[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (500,500) + (5,5) +...) + in_signal(exited)
> Are we not doing something nasty here?
>
> 	cputime_t utime = p->utime, total = utime + p->stime;
> 	u64 temp;
> 
> 	/*
> 	 * Use CFS's precise accounting:
> 	 */
> 	temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);
> 
> 	if (total) {
> 		temp *= utime;
> 		do_div(temp, total);
> 	}
> 	utime = (cputime_t)temp;
Not here, but doing a do_div() for each thread could be called nasty.
I mean:
  __task_[su]time(sum(A, B, ...))
would be better than:
  sum(task_[su]time(A) + task_[su]time(B) + ...)
However, that would bring another issue, because:
  __task_[su]time(sum(A, B, ...))
might not be equal to:
  __task_[su]time(sum(B, ...)) + task_[su]time(A)
Thanks,
H.Seto
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/