Message-ID: <20130902130744.GB2378@somewhere>
Date: Mon, 2 Sep 2013 15:07:45 +0200
From: Frederic Weisbecker <fweisbec@...il.com>
To: Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
Stanislaw Gruszka <sgruszka@...hat.com>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Borislav Petkov <bp@...en8.de>, linux-kernel@...r.kernel.org
Subject: Re: [sched next] overflowed cpu time for kernel threads in
/proc/PID/stat
On Mon, Sep 02, 2013 at 03:28:45PM +0300, Sergey Senozhatsky wrote:
> On (08/31/13 01:04), Frederic Weisbecker wrote:
> > > In cputime_adjust(), `stime' is greater than `rtime', so `utime = rtime - stime'
> > > sets a wrong value.
> >
> > But stime should never exceed rtime, given the calculation done by scale_stime(),
> > which roughly amounts to:
> >
> > stime = (stime / (stime + utime)) * rtime
> >
> > So this shouldn't happen.
> >
> > I couldn't manage to reproduce it, though. Can you still trigger it with the latest -next?
> >
>
> Hope this helps.
> I've added a silly check to make sure that `stime <= rtime':
>
> @@ -579,6 +582,10 @@ static void cputime_adjust(struct task_cputime *curr,
> if (total) {
> stime = scale_stime((__force u64)stime,
> (__force u64)rtime, (__force u64)total);
> + if (stime > rtime) {
> + printk(KERN_ERR "Ooops: stime:%llu rtime:%llu\n", stime, rtime);
> + WARN_ON(1);
> + }
> utime = rtime - stime;
> } else {
> stime = rtime;
>
>
> and got the following results:
>
>
> [ 1295.311056] Ooops: stime:4622750929 rtime:4622750928
> [ 1295.311063] ------------[ cut here ]------------
> [ 1295.311072] WARNING: CPU: 1 PID: 2167 at kernel/sched/cputime.c:584 cputime_adjust+0x132/0x140()
> [ 1295.311123] CPU: 1 PID: 2167 Comm: top Tainted: G C 3.11.0-rc7-next-20130902-dbg-dirty #2
> [ 1295.311126] Hardware name: Acer Aspire 5741G /Aspire 5741G , BIOS V1.20 02/08/2011
> [ 1295.311128] 0000000000000009 ffff880099937c50 ffffffff8162fc85 0000000000000000
> [ 1295.311134] ffff880099937c88 ffffffff8104ee2d ffff8801530bb158 ffff880099937df0
> [ 1295.311139] 00000001138994d0 00000001138994d1 ffff880153118ff0 ffff880099937c98
> [ 1295.311145] Call Trace:
> [ 1295.311155] [<ffffffff8162fc85>] dump_stack+0x4e/0x82
> [ 1295.311162] [<ffffffff8104ee2d>] warn_slowpath_common+0x7d/0xa0
> [ 1295.311166] [<ffffffff8104ef0a>] warn_slowpath_null+0x1a/0x20
> [ 1295.311170] [<ffffffff8108db72>] cputime_adjust+0x132/0x140
> [ 1295.311175] [<ffffffff8108e941>] thread_group_cputime_adjusted+0x41/0x50
> [ 1295.311182] [<ffffffff811f3ac4>] do_task_stat+0x9c4/0xb50
> [ 1295.311188] [<ffffffff811f4904>] proc_tgid_stat+0x14/0x20
> [ 1295.311192] [<ffffffff811edac4>] proc_single_show+0x54/0xa0
> [ 1295.311198] [<ffffffff811a6d24>] seq_read+0x164/0x3d0
> [ 1295.311204] [<ffffffff811805e1>] vfs_read+0xa1/0x180
> [ 1295.311208] [<ffffffff8118119c>] SyS_read+0x4c/0xa0
> [ 1295.311213] [<ffffffff8163ad06>] tracesys+0xd4/0xd9
> [ 1295.311216] ---[ end trace bab1e899ff45eea2 ]---
>
> [ 1301.384396] Ooops: stime:4654072951 rtime:4654072950
> [ 1301.384401] ------------[ cut here ]------------
> [ 1301.384409] WARNING: CPU: 0 PID: 2167 at kernel/sched/cputime.c:584 cputime_adjust+0x132/0x140()
> [ 1301.384450] CPU: 0 PID: 2167 Comm: top Tainted: G WC 3.11.0-rc7-next-20130902-dbg-dirty #2
> [ 1301.384452] Hardware name: Acer Aspire 5741G /Aspire 5741G , BIOS V1.20 02/08/2011
> [ 1301.384454] 0000000000000009 ffff880099937c50 ffffffff8162fc85 0000000000000000
> [ 1301.384458] ffff880099937c88 ffffffff8104ee2d ffff8801530bb158 ffff880099937df0
> [ 1301.384462] 0000000115678476 0000000115678477 ffff880153118ff0 ffff880099937c98
> [ 1301.384466] Call Trace:
> [ 1301.384475] [<ffffffff8162fc85>] dump_stack+0x4e/0x82
> [ 1301.384481] [<ffffffff8104ee2d>] warn_slowpath_common+0x7d/0xa0
> [ 1301.384484] [<ffffffff8104ef0a>] warn_slowpath_null+0x1a/0x20
> [ 1301.384487] [<ffffffff8108db72>] cputime_adjust+0x132/0x140
> [ 1301.384491] [<ffffffff8108e941>] thread_group_cputime_adjusted+0x41/0x50
> [ 1301.384497] [<ffffffff811f3ac4>] do_task_stat+0x9c4/0xb50
> [ 1301.384501] [<ffffffff811f4904>] proc_tgid_stat+0x14/0x20
> [ 1301.384504] [<ffffffff811edac4>] proc_single_show+0x54/0xa0
> [ 1301.384509] [<ffffffff811a6d24>] seq_read+0x164/0x3d0
> [ 1301.384514] [<ffffffff811805e1>] vfs_read+0xa1/0x180
> [ 1301.384517] [<ffffffff8118119c>] SyS_read+0x4c/0xa0
> [ 1301.384521] [<ffffffff8163ad06>] tracesys+0xd4/0xd9
> [ 1301.384523] ---[ end trace bab1e899ff45eea3 ]---
>
> and so on.
>
>
> $ dmesg | grep Ooops
> [ 1295.311056] Ooops: stime:4622750929 rtime:4622750928
> [ 1301.384396] Ooops: stime:4654072951 rtime:4654072950
> [ 1307.438935] Ooops: stime:4687858107 rtime:4687858106
> [ 1313.493462] Ooops: stime:4724166945 rtime:4724166944
> [ 1316.521740] Ooops: stime:4741142233 rtime:4741142232
> [ 1325.605625] Ooops: stime:4793906690 rtime:4793906688
> [ 1331.648611] Ooops: stime:4308413629 rtime:4308413628
> [ 1337.708468] Ooops: stime:4510156747 rtime:4510156746
> [ 1340.744586] Ooops: stime:4880584390 rtime:4880584389
> [ 1343.773111] Ooops: stime:4442005879 rtime:4442005878
> [ 1343.773362] Ooops: stime:4558310943 rtime:4558310942
> [ 1349.815797] Ooops: stime:4927379646 rtime:4927379645
> [ 1352.836936] Ooops: stime:4942782836 rtime:4942782834
> [ 1358.885232] Ooops: stime:4979031940 rtime:4979031938
> [ 1364.977902] Ooops: stime:5019633914 rtime:5019633912
> [ 1364.978283] Ooops: stime:4601969101 rtime:4601969100
> [ 1364.978623] Ooops: stime:4826393815 rtime:4826393814
> [ 1371.034265] Ooops: stime:4894272175 rtime:4894272174
> [ 1374.077150] Ooops: stime:4688831829 rtime:4688831828
> [ 1374.077519] Ooops: stime:4916319315 rtime:4916319314
> [ 1377.110416] Ooops: stime:5084174048 rtime:5084174047
> [ 1377.111043] Ooops: stime:4949250198 rtime:4949250196
> [ 1380.137382] Ooops: stime:5096963048 rtime:5096963046
> [ 1380.137739] Ooops: stime:4975014880 rtime:4975014879
> [ 1383.161441] Ooops: stime:5002466396 rtime:5002466394
> [ 1389.212166] Ooops: stime:5140400038 rtime:5140400036
> [ 1392.239303] Ooops: stime:5082051028 rtime:5082051027
> [ 1395.270391] Ooops: stime:5170118636 rtime:5170118634
> [ 1398.303593] Ooops: stime:5184355318 rtime:5184355316
Thanks a lot, Sergey, for testing this further!
Interesting results: after scaling, stime always exceeds rtime by one or two units.
Stanislaw wrote the scaling code with Linus, so he has a better idea of the math
details here.
Stanislaw, any idea?
>
>
> -ss
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/