linux-kernel - Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1285330688.2179.305.camel@holzheu-laptop>
Date:	Fri, 24 Sep 2010 14:18:08 +0200
From:	Michael Holzheu <holzheu@...ux.vnet.ibm.com>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Shailabh Nagar <nagar1234@...ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Venkatesh Pallipadi <venki@...gle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Suresh Siddha <suresh.b.siddha@...el.com>,
	John stultz <johnstul@...ibm.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...e.hu>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	linux-s390@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting

Hello Oleg,

On Thu, 2010-09-23 at 19:10 +0200, Oleg Nesterov wrote:
> Sorry, I didn't look at other patches, but this one looks strange
> to me...
> 
> On 09/23, Michael Holzheu wrote:
> >
> > Currently there are code pathes (e.g. for kthreads) where the consumed
> > CPU time is not accounted to the parents cumulative counters.
> 
> Could you explain more?

I think one place was "khelper" (kmod.c). It is created with
kernel_thread() and it exits without having accounted the times with
sys_wait() to the parent's ctimes (see second kernel_thread() invocation
below in kmod.c):

        if (wait == UMH_WAIT_PROC)
                pid = kernel_thread(wait_for_helper, sub_info,
                                    CLONE_FS | CLONE_FILES | SIGCHLD);
        else
                pid = kernel_thread(____call_usermodehelper, sub_info,
                                    CLONE_VFORK | SIGCHLD);

<snip>

> >  void release_task(struct task_struct * p)
> >  {
> >  	struct task_struct *leader;
> >  	int zap_leader;
> > +
> > +	if (!p->exit_accounting_done)
> > +		account_to_parent(p);
> >  repeat:
> >  	tracehook_prepare_release_task(p);
> >  	/* don't need to get the RCU readlock here - the process is dead and
> > @@ -1279,6 +1313,7 @@
> >  			psig->cmaxrss = maxrss;
> >  		task_io_accounting_add(&psig->ioac, &p->ioac);
> >  		task_io_accounting_add(&psig->ioac, &sig->ioac);
> > +		p->exit_accounting_done = 1;
> 
> Can't understand.
> 
> Suppose that a thread T exits and reaps itself (calls release_task).
> Now we call account_to_parent() which accounts T->signal->XXX + T->XXX.
> After that T calls __exit_signal and does T->signal->XXX += T->XXX.
> 
> If another thread exits it does the same and we account the already
> exited thread T again?
> 
> When the last thread exits, wait_task_zombie() accounts T->signal
> once again.
> 
> IOW, this looks like the over-accounting to me, no?

I think you are right and this patch is not correct here.

I had the wrong idea, because I thought that for every exited thread the
parent will do a sys_wait() and the thread's CPU times will be added to
the parent's ctime. This happens only for the thread group leader,
correct? Other threads just exit and add their times to the signal
struct of the process in __exit_signal(). When a sys_wait() is done for
a dead thread group leader, all the times that have been accumulated in
the signal struct are added to the ctime fields of the waiting parent.

I wanted to use the cumulative times (cutime, cstime, csttime) for ptop
in order to show all consumed CPU time in the last interval.

E.g. for the case that between ptop snapshot 1 and 2 a task "X" forked
and also exited, ptop can't see this task, because it is neither in
snapshot 1 nor in snapshot 2. But ptop can still show X's consumed CPU
time in the interval by subtracting the cumulative times of X's parent.

If a task "Y" is in snapshot 1, but not in snapshot 2, we search for the
parent of "Y" and calculate it's ctimes for the last interval as follows
(user time in this example):

parent->cuser_diff =
  parent->snap2->cutime - Y->snap1->utime -                 
  Y->snap1->cutime - parent->snap1->cutime

The result is the CPU time that Y consumed in the last interval.

Example (ptop):

                          VVVVV  VVVV VVVV
   pid     user  sys  ste cuser  csys cste  total Elap+ Name
   (#)      (%)  (%)  (%)   (%)   (%)  (%)    (%)  (hm) (str)
   8374   75.48 0.41 1.34  0.00  0.00 0.00  77.24  0:01 loop
>> 10093   0.17 0.30 0.00 25.90 38.19 0.52  65.07  0:00 make <<

ptop shows cuser, csys and cste for the last interval. In this example
it is the time that dead children of "make" consumed in the last
interval.

Ok, the problem is that I did not consider exiting threads that are no
thread group leaders. When they exit the ctime of the parent is not
updated. Instead the time is accumulated in the signal struct.

To fix this we could also add the signal_struct times (e.g. tguser,
tgsys and tgste) to taskstats. When a task "Z" exits (is in snapshot 1,
but not in snapshot 2), we first check, if the thread group leader is
still in snapshot 2. If this is the case, we do the following
calculation:

tgleader->tguser_diff =
  tgleader->snap2->sig->utime - Z->snap1->utime -
  tgleader->snap1->sig->utime

If add CPU time diffs, cummulated parent diffs and thread group CPU time
diffs, we should get again 100% of the consumed CPU time in the last
ptop interval.

Does this make sense?

Michael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/