linux-kernel - Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting

Message-ID: <1285692630.1866.74.camel@holzheu-laptop>
Date:	Tue, 28 Sep 2010 18:50:30 +0200
From:	Michael Holzheu <holzheu@...ux.vnet.ibm.com>
To:	balbir@...ux.vnet.ibm.com
Cc:	Shailabh Nagar <nagar1234@...ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Venkatesh Pallipadi <venki@...gle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Suresh Siddha <suresh.b.siddha@...el.com>,
	John stultz <johnstul@...ibm.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Oleg Nesterov <oleg@...hat.com>, Ingo Molnar <mingo@...e.hu>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	linux-s390@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC][PATCH 09/10] taskstats: Fix exit CPU time accounting

Hello Balbir,

On Tue, 2010-09-28 at 13:51 +0530, Balbir Singh wrote:
> * Michael Holzheu <holzheu@...ux.vnet.ibm.com> [2010-09-23 16:02:21]:
> 
> > Subject: [PATCH] taskstats: Fix exit CPU time accounting
> > 
> > From: Michael Holzheu <holzheu@...ux.vnet.ibm.com>
> > 
> > Currently there are code paths (e.g. for kthreads) where the consumed
> > CPU time is not accounted to the parent's cumulative counters.
> > Now CPU time is accounted to the parent, if the exit accounting has not
> > been done correctly.
> >
> 
> Does this impact accounting of the init process? Why do we care about
> accounting the time to the parent? In the case of tgid, all threads
> data makes sense. What is the benefit or gap we are trying to address
> in terms of lost data or accountability?

We care about the cumulative times because we wanted to write
a top command that can get 100% of all consumed CPU time in an
interval without using exit events.

I tried to write the idea down. Hopefully it is clear enough...

HOWTO calculate 100% consumed CPU time between two taskstats snapshots
======================================================================

The following describes how to account for 100% of the CPU time consumed
between two taskstats snapshots without using exit events. For simplicity,
"CPU time" is used as a synonym for "user time", "system time" and "steal time".

In order to show the CPU time consumed in an interval, a top tool has to:

* Collect snapshot 1 of all running tasks
* Wait for one interval
* Collect snapshot 2 of all running tasks

A snapshot contains the following data for each task:

 * time-task:    CPU time that has been consumed by task itself:
                 task->(u/s/st-time)
 * time-child:   CPU time that has been consumed by dead children of task:
                 task->signal->(cu/cs/cst-time)
 * time-thread:  CPU time that has been consumed by dead threads of the
                 task's thread group (thread group leader only):
                 task->signal->(u/s/st-time)

All consumed CPU time in the interval can be calculated as follows:
 
  For all tasks that are in snapshot 1 AND in snapshot 2:

    (time-task[2] - time-task[1]) +
    (time-child[2] - time-child[1]) +
    (time-thread[2] - time-thread[1] {for thread group leader})

  minus

  For all tasks that are in snapshot 1 but NOT in snapshot 2 (tasks that have
  exited):

    time-task[1] +
    time-child[1] +
    time-thread[1] (if the thread group has exited)

    We have to subtract those CPU times because the complete CPU time of an
    exited task is added to the cumulative counters of its parent or thread
    group leader. Subtracting the snapshot 1 values leaves only the CPU time
    that the exited tasks consumed in the last interval.
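
The calculation above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not code from the patch
series: the Task record, its field names and consumed_cpu_time() are
hypothetical, a snapshot is assumed to be a dict that maps a PID to a Task,
and "the thread group has exited" is approximated by "the tgid no longer
appears in snapshot 2". Tasks that exist only in snapshot 2 and PID reuse
are not covered by the formula and are ignored here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical per-task snapshot record (names are assumptions)."""
    pid: int
    tgid: int
    time_task: float    # CPU time consumed by the task itself
    time_child: float   # CPU time consumed by dead children
    time_thread: float  # CPU time of dead threads (thread group leader)


def consumed_cpu_time(snap1, snap2):
    """Return the CPU time consumed between two snapshots (pid -> Task)."""
    total = 0.0
    for pid, old in snap1.items():
        new = snap2.get(pid)
        if new is not None:
            # Task is in snapshot 1 AND in snapshot 2: add the differences.
            total += new.time_task - old.time_task
            total += new.time_child - old.time_child
            if old.pid == old.tgid:
                # Dead-thread time is only counted for the group leader.
                total += new.time_thread - old.time_thread
        else:
            # Task has exited: subtract what it had already consumed
            # before snapshot 1.
            total -= old.time_task + old.time_child
            if old.tgid not in snap2:
                # Assumption: the whole thread group is gone.
                total -= old.time_thread
    return total
```

For example, if a child had consumed 5 seconds at snapshot 1, runs for 3 more
seconds and exits, its parent's time-child grows by 8 seconds. Subtracting
the 5 seconds from snapshot 1 leaves the 3 seconds that were consumed in the
interval.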

To provide a consistent view, the top tool could show the following fields:
 * user:  task utime per interval
 * sys:   task stime per interval
 * ste:   task sttime per interval
 * cuser: utime of exited children per interval
 * csys:  stime of exited children per interval
 * cste:  sttime of exited children per interval
 * tuser: utime of exited threads per interval (only for thread group leader)
 * tsys:  stime of exited threads per interval (only for thread group leader)
 * tste:  sttime of exited threads per interval (only for thread group leader)
 * total: Sum of all above fields

If the top command notices that a PID disappeared between snapshot 1
and snapshot 2, it has to do the following:

If the task is not the thread group leader (pid != tgid):
  Find its thread group leader and subtract the CPU times from snapshot 1
  of the dead task from the thread group leader's time-thread interval
  difference.
else
  Find its parent and subtract the CPU times from snapshot 1 of the dead child
  from the parent's time-child interval difference.
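
A minimal sketch of this correction step, again in Python and again only as
an illustration: Task, its fields (including ppid) and interval_deltas() are
hypothetical names. It assumes that "the CPU times from snapshot 1" means
time-task for a dead thread, and time-task + time-child + time-thread for a
dead process. If the thread group leader or the parent is itself missing
from snapshot 2, the sketch skips the correction; the text above does not
say how that case should be handled.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical per-task snapshot record (names are assumptions)."""
    pid: int
    tgid: int
    ppid: int
    time_task: float    # CPU time consumed by the task itself
    time_child: float   # CPU time consumed by dead children
    time_thread: float  # CPU time of dead threads (thread group leader)


def interval_deltas(snap1, snap2):
    """Per-task interval differences, corrected for disappeared PIDs."""
    deltas = {}
    for pid, new in snap2.items():
        old = snap1.get(pid)
        if old is None:
            continue  # tasks that are new in snapshot 2 are not handled
        deltas[pid] = {
            "task": new.time_task - old.time_task,
            "child": new.time_child - old.time_child,
            "thread": new.time_thread - old.time_thread,
        }
    for pid, dead in snap1.items():
        if pid in snap2:
            continue
        if dead.pid != dead.tgid:
            # Dead thread: correct the leader's time-thread difference.
            target, field, amount = dead.tgid, "thread", dead.time_task
        else:
            # Dead process: correct the parent's time-child difference.
            target, field = dead.ppid, "child"
            amount = dead.time_task + dead.time_child + dead.time_thread
        if target in deltas:
            deltas[target][field] -= amount
    return deltas
```

The corrected "child" and "thread" values correspond to the cuser/csys/cste
and tuser/tsys/tste columns, with one value per time type instead of the
single combined CPU time used here.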

Example output:
---------------
pid     user   sys  ste  cuser  csys cste tuser tsys tste total  Name
(#)      (%)   (%)  (%)    (%)   (%)  (%)   (%)  (%)  (%)   (%)  (str)
17944   0.10  0.01 0.00  54.29 14.36 0.22  0.00 0.00 0.00 68.98  make
18006   0.10  0.01 0.00  55.79 12.23 0.12  0.00 0.00 0.00 68.26  make
18041  48.18  1.51 0.29   0.00  0.00 0.00  0.00 0.00 0.00 49.98  cc1
...

The sum of all "total" CPU counters on a system that is 100% busy should
be exactly the number of CPUs multiplied by the interval time. A good test
case for this is to start a loop program for each CPU and then, in parallel,
start a kernel build with "-j 5".
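
Such a test can be checked with a small helper like the following Python
sketch. The function name and its parameters are hypothetical; it assumes
that the per-task totals are given in the same time unit as the interval
(not as percentages).

```python
def cpu_time_coverage(totals, num_cpus, interval):
    """Fraction of the available CPU time that the totals account for.

    totals:   per-task "total" CPU times for one interval
    num_cpus: number of CPUs in the system
    interval: interval length, in the same time unit as the totals
    """
    available = num_cpus * interval
    if available <= 0:
        raise ValueError("num_cpus and interval must be positive")
    return sum(totals) / available
```

On a system that is 100% busy the result should be close to 1.0. A value
that stays clearly below 1.0 would indicate CPU time that got lost
somewhere in the accounting.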

OPEN ISSUE:

A current problem with the Linux kernel is that CPU time can disappear
if a child dies whose parent ignores SIGCHLD.

Michael
