linux-kernel - Re: [PATCH] time,virt: resync steal time when guest & host lose sync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1471313483.32433.33.camel@redhat.com>
Date:	Mon, 15 Aug 2016 22:11:23 -0400
From:	Rik van Riel <riel@...hat.com>
To:	Wanpeng Li <kernellwp@...il.com>
Cc:	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Wanpeng Li <wanpeng.li@...mail.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Radim Krcmar <rkrcmar@...hat.com>,
	Mike Galbraith <efault@....de>
Subject: Re: [PATCH] time,virt: resync steal time when guest & host lose sync

On Tue, 2016-08-16 at 09:31 +0800, Wanpeng Li wrote:
> 2016-08-15 23:00 GMT+08:00 Rik van Riel <riel@...hat.com>:
> > On Mon, 2016-08-15 at 16:53 +0800, Wanpeng Li wrote:
> > > 2016-08-12 23:58 GMT+08:00 Rik van Riel <riel@...hat.com>:
> > > [...]
> > > > Wanpeng, does the patch below work for you?
> > > 
> > > It will break steal time for full dynticks guest, and there is a
> > > calltrace of thread_group_cputime_adjusted call stack, RIP is
> > > cputime_adjust+0xff/0x130.
> > 
> > How?  This patch is equivalent to passing ULONG_MAX to
> > steal_account_process_time, which you tried to no ill
> > effect before.
> 
> https://lkml.org/lkml/2016/6/8/404/ Paolo original suggested to add
> the max cputime limit to the vtime, when the cpu is running in nohz
> full mode and stop the tick, jiffies will be updated depends on clock
> source instead of clock event device in
> guest(tick_nohz_update_jiffies() callsite, ktime_get()), so it will
> not be affected by lost clock ticks, my patch keeps the limit for
> vtime and remove the limit to non-vtime. However, your patch removes
> the limit for both scenarios and results in the below calltrace for
> vtime.

I understand what it does.

What I would like to understand is WHY enforcing the limit
is the right thing when using vtime, and the wrong thing
in all other scenarios.

Can you explain why you change the limit to ULONG_MAX in
three call sites, but not in the last one?

What is different about the first three, versus the last
one?

Are you sure it should be changed in three places, and
not in eg. two?

This seems like something we should try to understand,
rather than patch up haphazardly.

The changelog of your patch could use an explanation of
why the change is the correct way to go.

> > 
> > Do you have the full call trace?

OK, so you are seeing a divide by zero in cputime_adjust.

Specifically, this would be scale_stime getting passed
a (utime + stime) that adds up to 0. Stranger still, that
only gets called if neither utime or stime is 0, meaning
that one of utime or stime is negative, at the exact same
magnitude as the other.

Looking at thread_group_cputime(), I see some room for
rounding errors.

        do {
                seq = nextseq;
                flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, 
&seq);
                times->utime = sig->utime;
                times->stime = sig->stime;
                times->sum_exec_runtime = sig->sum_sched_runtime;

                for_each_thread(tsk, t) {
                        task_cputime(t, &utime, &stime);
                        times->utime += utime;
                        times->stime += stime;
                        times->sum_exec_runtime +=
task_sched_runtime(t);
                }
                /* If lockless access failed, take the lock. */
                nextseq = 1;
        } while (need_seqretry(&sig->stats_lock, seq));

Specifically, task_cputime calls vtime_delta, which works
off jiffies, while task_sched_runtime works straight off
the sched clock.

I can see how this would lead to a non-zero ->sum_exec_runtime,
while ->utime and/or ->stime are still at zero. This is fine.

What I do not see is how ->utime or ->stime could end up negative,
which is what would be needed to hit that divide by zero.

Unless I am overlooking something...

This would be a good thing to debug.

> [    6.929856] divide error: 0000 [#1] SMP
> [    6.934217] Modules linked in:
> [    6.937759] CPU: 3 PID: 57 Comm: kworker/u8:1 Not tainted 4.7.0+
> #36
> [    6.946105] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [    6.953951] Workqueue: events_unbound
> call_usermodehelper_exec_work
> [    6.965726] task: ffff8e22b9785040 task.stack: ffff8e22b8b64000
> [    6.970820] RIP: 0010:[<ffffffff870c7b4f>]  [<ffffffff870c7b4f>]
> cputime_adjust+0xff/0x130
> [    6.981841] RSP: 0000:ffff8e22b8b67b78  EFLAGS: 00010887
> [    6.985946] RAX: a528afff5ad75000 RBX: ffff8e222e243c18 RCX:
> ffff8e22b8b67c28
> [    7.001166] RDX: 0000000000000000 RSI: 0000000000000296 RDI:
> 0000000000000000
> [    7.008758] RBP: ffff8e22b8b67ba8 R08: 00000000ffffffff R09:
> 00000000a528b000
> [    7.015653] R10: 0000000000000000 R11: 0000000000000000 R12:
> 000000000014a516
> [    7.021376] R13: ffff8e22b8b67bb8 R14: ffff8e222e243c28 R15:
> ffff8e22b8b67c20
> [    7.035498] FS:  0000000000000000(0000) GS:ffff8e22bac00000(0000)
> knlGS:0000000000000000
> [    7.054809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    7.066571] CR2: 00000000ffffffff CR3: 000000007ae06000 CR4:
> 00000000001406e0
> [    7.075162] Stack:
> [    7.090141]  ffff8e22b8b67c28 ffff8e222e371ac0 ffff8e22b8b67c20
> ffff8e22b8b67c28
> [    7.108512]  ffff8e222e371ac0 ffff8e22b8b67cc0 ffff8e22b8b67be8
> ffffffff870c8c01
> [    7.123025]  00000000000e0471 fffffffffffdcf32 000000000014a516
> ffff8e22b9785040
> [    7.140622] Call Trace:
> [    7.153076]  [<ffffffff870c8c01>]
> thread_group_cputime_adjusted+0x41/0x50
> [    7.160807]  [<ffffffff870913bf>] wait_consider_task+0xa4f/0xff0
> [    7.176449]  [<ffffffff87090fc1>] ? wait_consider_task+0x651/0xff0
> [    7.186281]  [<ffffffff87091a3f>] ? do_wait+0xdf/0x320
> [    7.226606]  [<ffffffff87091a7b>] do_wait+0x11b/0x320
> [    7.239670]  [<ffffffff87093014>] SyS_wait4+0x64/0xc0
> [    7.245385]  [<ffffffff87090180>] ? task_stopped_code+0x50/0x50
> [    7.255924]  [<ffffffff870a8470>]
> call_usermodehelper_exec_work+0x70/0xb0
> [    7.263011]  [<ffffffff870acbd0>] process_one_work+0x1e0/0x670
> [    7.273051]  [<ffffffff870acb51>] ? process_one_work+0x161/0x670
> [    7.277991]  [<ffffffff870ad18b>] worker_thread+0x12b/0x4a0
> [    7.286920]  [<ffffffff870ad060>] ? process_one_work+0x670/0x670
> [    7.291745]  [<ffffffff870b4011>] kthread+0x101/0x120
> [    7.296878]  [<ffffffff878c94cf>] ret_from_fork+0x1f/0x40
> [    7.306511]  [<ffffffff870b3f10>] ?
> kthread_create_on_node+0x250/0x250
> [    7.311985] Code: 4d 39 c8 76 c1 4c 89 d0 48 c1 e8 20 48 85 c0 74
> ca 4c 89 c0 49 d1 ea 4d 89 c8 48 d1 e8 49 89 c1 eb 9f 44 89 c8 31 d2
> 49 0f af c0 <49> f7 f2 4d 89 e2 48 39 f8 48 0f 42 c7 49 29 c2 4d 39
> d3
> 76 0b
> [    7.357565] RIP  [<ffffffff870c7b4f>] cputime_adjust+0xff/0x130
> [    7.364633]  RSP <ffff8e22b8b67b78>
> [    7.373247] ---[ end trace 76ca7475a22c5d43 ]---
-- 

All Rights Reversed.
Download attachment "signature.asc" of type "application/pgp-signature" (474 bytes)