linux-kernel - Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM


Message-ID: <20150921155224.GA12938@amt.cnet>
Date:	Mon, 21 Sep 2015 12:52:24 -0300
From:	Marcelo Tosatti <mtosatti@...hat.com>
To:	Radim Krčmář <rkrcmar@...hat.com>
Cc:	linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
	Paolo Bonzini <pbonzini@...hat.com>,
	Luiz Capitulino <lcapitulino@...hat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from
 PVCLOCK_COUNTS_FROM_ZERO.

On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> 2015-09-20 19:57-0300, Marcelo Tosatti:
> > On Fri, Sep 18, 2015 at 05:54:28PM +0200, Radim Krčmář wrote:
> >> This patch series will be disabling PVCLOCK_COUNTS_FROM_ZERO flag and is
> >> RFC because I haven't explored many potential problems or tested it.
> > 
> > The justification to disable PVCLOCK_COUNTS_FROM_ZERO is because you
> > haven't explored potential problems or tested it? Sorry can't parse it.
> > 
> >> 
> >> [1/2] uses a different algorithm in the guest to start counting from 0.
> >> [2/2] stops exposing PVCLOCK_COUNTS_FROM_ZERO in the hypervisor.
> >> 
> >> A viable alternative would be to implement opt-in features in kvm clock.
> >> 
> >> And because we probably only broke one old user (the infamous SLES 10), a
> >> workaround like this is also possible: (but I'd rather not do that)
> > 
> > Please describe why SLES 10 breaks in detail: the state of the guest and
> > the host before the patch, the state of the guest and host after the
> > patch.
> 
> 1) The guest periodically receives an interrupt that is handled by
>    main_timer_handler():
>    a) get time using the kvm clock:
>       1) write the address to MSR_KVM_SYSTEM_TIME
>       2) read tsc and pvclock (tsc_offset, system_time)
>       3) time = tsc - tsc_offset + system_time
>    b) compute time since the last main_timer_handler()
>    c) bump jiffies if enough time has elapsed
> 2) the guest wants to calibrate loops per jiffy [1]:
>    a) read tsc
>    b) loop till jiffies increase
>    c) compute lpj
> 
> Because (1a1) always resets the system_time to 0, we read the same value
> over and over so the condition for (1c) is never true and jiffies remain
> constant.  This is the problem.  A hang happens in (2b) as it is the
> first place that depends on jiffies.
> 
> > What does SLES10 expect?
> 
> That a write to MSR_KVM_SYSTEM_TIME does not reset the system time.
> 
> > Is it counting from zero that breaks SLES10?
> 
> Not by itself, treating MSR_KVM_SYSTEM_TIME as a one-shot initializer did.
> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> while still keeping system time;  we used to allow that, which means an
> ABI breakage.  (And we can't even say that guest's behaviour is against
> the spec ...)

Because this behaviour was not defined.
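
For reference, the hang described in steps (1a1)-(1c) above can be modelled
in a few lines of userspace C. This is only a sketch under assumptions: every
name below (model_host, model_timer_handler, NS_PER_JIFFY, ...) is made up
for illustration and none of it is SLES 10 or KVM code. It assumes HZ=250 as
in the quoted dmesg, and that a write to MSR_KVM_SYSTEM_TIME either leaves
system_time alone or restarts it from zero.

```c
#include <stdint.h>

#define NS_PER_JIFFY 4000000ULL /* HZ=250, as in the quoted dmesg */

/* Hypothetical host-side state; not a real KVM structure. */
struct model_host {
	uint64_t host_ns;   /* host monotonic time */
	uint64_t base_ns;   /* host time that maps to guest system_time 0 */
	int reset_on_write; /* 1: each MSR write restarts counting from zero */
};

/* Hypothetical guest-side state. */
struct model_guest {
	uint64_t last_ns; /* time seen by the previous timer handler run */
	uint64_t jiffies;
};

/* Models step (1a1): the guest writes MSR_KVM_SYSTEM_TIME. */
static void model_write_msr(struct model_host *h)
{
	if (h->reset_on_write)
		h->base_ns = h->host_ns;
}

/* Models steps (1a2)-(1a3): the guest computes the current time. */
static uint64_t model_read_clock(const struct model_host *h)
{
	return h->host_ns - h->base_ns;
}

/* Models main_timer_handler(), steps (1a)-(1c). */
static void model_timer_handler(struct model_guest *g, struct model_host *h)
{
	uint64_t now, elapsed;

	model_write_msr(h);
	now = model_read_clock(h);
	elapsed = now - g->last_ns;
	if (elapsed >= NS_PER_JIFFY) {
		g->jiffies += elapsed / NS_PER_JIFFY;
		g->last_ns = now;
	}
}

/* Runs `ticks` timer interrupts, 4 ms apart; returns the final jiffies. */
static uint64_t model_run(int reset_on_write, int ticks)
{
	struct model_host h = { 1000000000ULL, 0, reset_on_write };
	struct model_guest g = { 0, 0 };
	int i;

	model_write_msr(&h);
	g.last_ns = model_read_clock(&h);
	for (i = 0; i < ticks; i++) {
		h.host_ns += NS_PER_JIFFY;
		model_timer_handler(&g, &h);
	}
	return g.jiffies;
}
```

In this model, model_run(0, 10) advances jiffies once per tick, while
model_run(1, 10) leaves jiffies at 0 because every read returns the same
value, which is the condition that makes the wait in (2b) spin forever.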

Can't you just make the PVCLOCK_COUNTS_FROM_ZERO behaviour conditional on
boot_vcpu_runs_old_kvmclock == false?
The patch would be much simpler.

The problem is, "selecting one read as the initial point" is inherently
racy: that delta is relative to one moment (kvmclock read) at one vcpu,
but must be applied to all vcpus.

Besides:

	1) Stable sched clock in guest does not depend on
	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
	2) You rely on monotonicity across vcpus to perform
	   the 'minus delta that was read on vcpu0' calculation, but
	   monotonicity across vcpus can fail during runtime
	   (say host clocksource goes tsc->hpet due to tsc instability).
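
To make the concern about the vcpu0 delta concrete, here is a small
userspace sketch. It is purely illustrative: the function names and the
per-vcpu skew parameter are assumptions of mine, not anything in the guest
patch. It models a raw clock reading that can differ slightly between
vcpus, and a "count from zero" value obtained by subtracting the delta that
vcpu0 sampled.

```c
#include <stdint.h>

/*
 * Hypothetical raw kvmclock reading on one vcpu. vcpu_skew_ns models a
 * vcpu whose reading is ahead of (positive) or behind (negative) vcpu0.
 */
static uint64_t model_raw_clock(uint64_t host_ns, int64_t vcpu_skew_ns)
{
	return host_ns + (uint64_t)vcpu_skew_ns;
}

/* The 'minus delta that was read on vcpu0' calculation. */
static uint64_t model_clock_from_zero(uint64_t raw_ns, uint64_t delta_ns)
{
	return raw_ns - delta_ns;
}

/*
 * Returns 1 if the first adjusted reading on vcpu1 wraps below zero,
 * i.e. shows up as a huge unsigned value instead of a small one.
 */
static int model_delta_underflows(int64_t vcpu1_skew_ns)
{
	uint64_t host_ns = 5000000000ULL;
	/* vcpu0 selects its own read as the initial point. */
	uint64_t delta = model_raw_clock(host_ns, 0);
	uint64_t t1 = model_clock_from_zero(
			model_raw_clock(host_ns, vcpu1_skew_ns), delta);

	return t1 > UINT64_MAX / 2;
}
```

With no skew, or with vcpu1 ahead of vcpu0, the adjusted value is small
and sane. As soon as vcpu1 reads a raw value below the delta sampled on
vcpu0, the subtraction wraps, so the scheme only works while readings stay
monotonic across vcpus.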


> 
> 
> ---
> 1: I also did disassembly, but the reproducer is easier to paste
>    (couldn't find debuginfo)
>    # qemu-kvm -nographic -kernel vmlinuz-2.6.16.60-0.85.1-default \
>     -serial stdio -monitor /dev/null -append 'console=ttyS0'
>   
>   and you can get a bit further when setting loops per jiffy manually,
>     -serial stdio -monitor /dev/null -append 'console=ttyS0 lpj=12345678'
> 
>   The dmesg for failing run is
>     Initializing CPU#0
>     PID hash table entries: 512 (order: 9, 16384 bytes)
>     kvm-clock: cpu 0, msr 0:3f6041, boot clock
>     kvm_get_tsc_khz: cpu 0, msr 0:e001
>     time.c: Using tsc for timekeeping HZ 250
>     time.c: Using 100.000000 MHz WALL KVM GTOD KVM timer.
>     time.c: Detected 2591.580 MHz processor.
>     Console: colour VGA+ 80x25
>     Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
>     Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
>     Checking aperture...
>     Nosave address range: 000000000009f000 - 00000000000a0000
>     Nosave address range: 00000000000a0000 - 00000000000f0000
>     Nosave address range: 00000000000f0000 - 0000000000100000
>     Memory: 124884k/130944k available (1856k kernel code, 5544k reserved, 812k data, 188k init)
>     [Infinitely querying kvm clock here ...]
> 
>   With '-cpu kvm64,-kvmclock', the next line is
>     Calibrating delay using timer specific routine.. 5199.75 BogoMIPS (lpj=10399519)
> 
>   With 'lpj=10399519',
>     Calibrating delay loop (skipped)... 5199.75 BogoMIPS preset
>     [Manages to get stuck later, in default_idle.]