linux-kernel - Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150921220039.GC2734@potion.brq.redhat.com>
Date:	Tue, 22 Sep 2015 00:00:39 +0200
From:	Radim Krčmář <rkrcmar@...hat.com>
To:	Marcelo Tosatti <mtosatti@...hat.com>
Cc:	linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
	Paolo Bonzini <pbonzini@...hat.com>,
	Luiz Capitulino <lcapitulino@...hat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from
 PVCLOCK_COUNTS_FROM_ZERO.

2015-09-21 17:53-0300, Marcelo Tosatti:
> On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
>> 2015-09-21 12:52-0300, Marcelo Tosatti:
>> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
>> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
>> >>> Is it counting from zero that breaks SLES10?
>> >> 
>> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
>> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
>> >> while still keeping system time;  we used to allow that, which means an
>> >> ABI breakage.  (And we can't even say that guest's behaviour is against
>> >> the spec ...)
>> > 
>> > Because this behaviour was not defined.
>> 
>> It is defined by implementation.
>> 
>> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
>> > boot_vcpu_runs_old_kvmclock == false? 
>> > The patch would be much simpler.
>> 
>> If you mean the hunk in cover letter, I don't like it because we presume
>> that no other guests were broken.
> 
> Yes patch 1.

(I'd call them: a hunk in cover letter [0/2], patch 1 = [1/2], and
 patch 2 = [2/2].)

> Do you have an example of another guest that is broken? 

Yes, the hot-plug in any relatively recent Linux guest.

> Really, assuming things about the value of the msr read when you
> write to the MSR is very awkward. That SuSE code is incorrect, it fails
> on other occasions as well (see
> 54750f2cf042c42b4223d67b1bd20138464bde0e).

Yeah, I even read the SUSE implementation, sad times.

>> I really don't like it 
> 
> Because it changes behaviour that was previously unspecified?

No, because it adds another exemption to already ugly code.
(Talking about the hunk in [0/2].)

>> so I thought about other problems with
>> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?
> 
> Can't unplug VCPU 0 i think.

You can.

>> It doesn't work well ;)
> 
> Can you hot-unplug vcpu 0? 

I can, I did, hot-plug bugged.

>> > The problem is, "selecting one read as the initial point" is inherently
>> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
>> > but must be applied to all vcpus.
>> 
>> I don't think that is a problem.
>> 
>> Kvmclock has a notion of a global system_time in nanoseconds (one value
>> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
>> to propagate system_time into guest VCPUs as precisely as possible with
>> the help of TSC.
>
>Different pairs of values (system_time, tsc reads) are fundamentally
>broken. This is why 
>
>commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
>    x86, paravirt: Add a global synchronization point for pvclock
>
>Exists.
>
>Then to guarantee monotonicity you need to stop those updates
>(or perform the pair update in lock step):
>
>commit d828199e84447795c6669ff0e6c6d55eb9beeff6
>    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag

(Thanks for the commits, split values are the core design that avoids
 modifying rdtsc, so I'll be grad to read its details.)

>> > 	2) You rely on monotonicity across vcpus to perform 
>> > 	   the 'minus delta that was read on vcpu0' calculation, but 
>> > 	   monotonicity across vcpus can fail during runtime
>> >            (say host clocksource goes tsc->hpet due to tsc instability).
>> 
>> That could be a problem, but I'm adding a VCPU independent constant to
>> all reads -- does the new code rely on monoticity in places where the
>> old one didn't?
> 
> The problem is overflow:
> kvmclock non-monotonic across vcpus AND delta subtraction:
> 
> ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
> 1          1                      1                         
> 2          2                      2                        1                

KVM sched clock not used before this point (I even moved the code to
make sure), so there is no problem with monotonicity.

> 3          3                      0                        2            -1
                                                                          ^^^
-1 is the overflow.  Very unlikely to ever happen in Linux, as we now
boot other VCPUs much later than the KVM clock configuration and it can
only happen if VCPU1 is running as the reconfiguration takes place, but
a simple

  if (vcpu[x].sched_clock <= global_sched_clock_offset)
  	return 0;

would take care of it.  The time would stand still for a while, which is
not a huge problem for boot-only scenario.  Other VCPUs couldn't be
reading KVM sched before it was configured, so only operations that
happen before vcpu1sched_clock goes positive are affected.
(We have other problems if the window can be long.)

> 4          4                      1                        3             0
> 5          5                      2                        4             1
> 6          6                      3                        5             2
> 7          7                      4                        6             3
> 
> ptime is "physical time".
> 
> I prefer that the host counts from zero (there are less problems to
> handle).

The main advantage of a hypervisor solution for me is one saved
subtraction and comparison on every sched_clock().

> Again, that SuSE patch is incorrect and a huge hack.

I'm not disputing that.  (Which doesn't justify the breakage.)

> An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
> requests the feature.

Yes, and the alternative needs new paravirt interface.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/