Message-ID: <4B05F835.10401@jp.fujitsu.com>
Date: Fri, 20 Nov 2009 11:00:21 +0900
From: Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
To: Stanislaw Gruszka <sgruszka@...hat.com>
CC: Peter Zijlstra <peterz@...radead.org>,
Spencer Candland <spencer@...ehost.com>,
Américo Wang <xiyou.wangcong@...il.com>,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
Oleg Nesterov <oleg@...hat.com>,
Balbir Singh <balbir@...ibm.com>
Subject: Re: [PATCH] fix granularity of task_u/stime(), v2
Stanislaw Gruszka wrote:
> On Tue, Nov 17, 2009 at 02:24:48PM +0100, Peter Zijlstra wrote:
>>> Seems the issue reported then was exactly the same as the one you
>>> report now. It looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529
>>> just made the probability of the bug smaller, so you did not notice
>>> it until now.
>>>
>>> Could you please test this patch, if it solve all utime decrease
>>> problems for you:
>>>
>>> http://patchwork.kernel.org/patch/59795/
>>>
>>> If you confirm it works, I think we should apply it. Otherwise
>>> we need to propagate task_{u,s}time everywhere, which is not
>>> (my) preferred solution.
>> That patch will create another issue, it will allow a process to hide
>> from top by arranging to never run when the tick hits.
>
Yes. Nowadays there are many threads running on high-speed hardware,
so such a process can show up anywhere, more easily than before.
E.g. assume that there are 2 tasks:

Task A: interrupted by the timer a few times
  (utime, stime, se.sum_exec_runtime) = (50, 50, 1000000000)
  => total runtime is 1 sec, but utime + stime is only 100 ms

Task B: interrupted by the timer many times
  (utime, stime, se.sum_exec_runtime) = (50, 50, 10000000)
  => total runtime is 10 ms, but utime + stime is 100 ms

You can see that task_[su]time() works well for both of these tasks.
> What about that?
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1f8d028..9db1cbc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
> 	}
> 	utime = (cputime_t)temp;
> 
> -	p->prev_utime = max(p->prev_utime, utime);
> +	p->prev_utime = max(p->prev_utime, max(p->utime, utime));
> 	return p->prev_utime;
> }
I think this makes things worse.
without this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 5 ms (= accurate)
with this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 50 ms (= not accurate)
Note that task_stime() calculates prev_stime using this prev_utime:
without this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 5 ms (= accurate)
with this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 0 ms (= not accurate)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ce17760..8be5b75 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
> 	struct task_cputime cputime;
> 	cputime_t cutime, cstime;
> 
> -	thread_group_cputime(current, &cputime);
> 	spin_lock_irq(&current->sighand->siglock);
> +	thread_group_cputime(current, &cputime);
> 	cutime = current->signal->cutime;
> 	cstime = current->signal->cstime;
> 	spin_unlock_irq(&current->sighand->siglock);
>
> It's on top of Hidetoshi's patch and fixes the utime decrease problem
> on my system.
What about the stime decrease problem, which can be caused by the same
logic?

According to my labeling, there are 2 unresolved problems: [1]
"thread_group_cputime() vs exit" and [2] "use of task_s/utime()".
Still, I believe the real fix for this problem is the combination of
the above fix for do_sys_times() (for problem [1]) and (I know it is
not preferred, but for [2]) the following:
>> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
>> index 5c9dc22..e065b8a 100644
>> --- a/kernel/posix-cpu-timers.c
>> +++ b/kernel/posix-cpu-timers.c
>> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> 
>> 	t = tsk;
>> 	do {
>> -		times->utime = cputime_add(times->utime, t->utime);
>> -		times->stime = cputime_add(times->stime, t->stime);
>> +		times->utime = cputime_add(times->utime, task_utime(t));
>> +		times->stime = cputime_add(times->stime, task_stime(t));
>> 		times->sum_exec_runtime += t->se.sum_exec_runtime;
>> 
>> 		t = next_thread(t);
Think about this diff, assuming task C is in the same group as tasks A
and B. sys_times() on C while A and B are alive returns:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (50,50) + (50,50) +...) + in_signal(exited)
If A exits, the total increases:
(utime, stime)
= task_[su]time(C) + ([su]time(B)+...) + in_signal(exited)+task_[su]time(A)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(500,500)
Otherwise, if B exits, it decreases:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+...) + in_signal(exited)+task_[su]time(B)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(5,5)
With this fix, sys_times() returns:
(utime, stime)
= task_[su]time(C) + (task_[su]time(A)+task_[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (500,500) + (5,5) +...) + in_signal(exited)
> Are we not doing something nasty here?
>
> 	cputime_t utime = p->utime, total = utime + p->stime;
> 	u64 temp;
> 
> 	/*
> 	 * Use CFS's precise accounting:
> 	 */
> 	temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);
> 
> 	if (total) {
> 		temp *= utime;
> 		do_div(temp, total);
> 	}
> 	utime = (cputime_t)temp;
Not here, but doing a do_div() for each thread could be called nasty.
I mean:
  __task_[su]time(sum(A, B, ...))
would be better than:
  sum(task_[su]time(A) + task_[su]time(B) + ...)
However, that would bring another issue, because:
  __task_[su]time(sum(A, B, ...))
might not be equal to:
  __task_[su]time(sum(B, ...)) + task_[su]time(A)
Thanks,
H.Seto
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/