linux-kernel - Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from PVCLOCK_COUNTS_FROM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150921205312.GA17800@amt.cnet>
Date:	Mon, 21 Sep 2015 17:53:12 -0300
From:	Marcelo Tosatti <mtosatti@...hat.com>
To:	Radim Krčmář <rkrcmar@...hat.com>
Cc:	linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
	Paolo Bonzini <pbonzini@...hat.com>,
	Luiz Capitulino <lcapitulino@...hat.com>
Subject: Re: [RFC PATCH 0/2] kvmclock: fix ABI breakage from
 PVCLOCK_COUNTS_FROM_ZERO.

On Mon, Sep 21, 2015 at 10:00:27PM +0200, Radim Krčmář wrote:
> 2015-09-21 12:52-0300, Marcelo Tosatti:
> > On Mon, Sep 21, 2015 at 05:12:10PM +0200, Radim Krčmář wrote:
> >> 2015-09-20 19:57-0300, Marcelo Tosatti:
> >>> Is it counting from zero that breaks SLES10?
> >> 
> >> Not by itself, treating MSR_KVM_SYSTEM_TIME as one-shot initializer did.
> >> The guest wants to write to MSR_KVM_SYSTEM_TIME as much as it likes to,
> >> while still keeping system time;  we used to allow that, which means an
> >> ABI breakage.  (And we can't even say that guest's behaviour is against
> >> the spec ...)
> > 
> > Because this behaviour was not defined.
> 
> It is defined by implementation.
> 
> > Can't you just condition PVCLOCK_COUNTS_FROM_ZERO behaviour on
> > boot_vcpu_runs_old_kvmclock == false? 
> > The patch would be much simpler.
> 
> 
> If you mean the hunk in cover letter, I don't like it because we presume
> that no other guests were broken.

Yes patch 1.

Do you have an example of another guest that is broken? 

Really, assuming things about the value of the msr read when you
write to the MSR is very awkward. That SuSE code is incorrect, it fails
on other occasions as well (see
54750f2cf042c42b4223d67b1bd20138464bde0e).

> I really don't like it 

Because it changes behaviour that was previously unspecified?

> so I thought about other problems with
> PVCLOCK_COUNTS_FROM_ZERO ... have you tried to hot-replug VCPU 0?

Can't unplug VCPU 0 i think.

> It doesn't work well ;)

Can you hot-unplug vcpu 0? 

> We don't want to guess what the guest wants so I'd go for the opt-in
> paravirt feature unless counting from zero can be done in guest alone.

The problem it is tricky (to do the counting inside the guest).

Its much cleaner and less complicated if the host starts counting from
zero.

> 
> > The problem is, "selecting one read as the initial point" is inherently
> > racy: that delta is relative to one moment (kvmclock read) at one vcpu,
> > but must be applied to all vcpus.
> 
> I don't think that is a problem.
> 
> Kvmclock has a notion of a global system_time in nanoseconds (one value
> that defines the time, assigned with VM ioctl KVM_SET_CLOCK) and tries
> to propagate system_time into guest VCPUs as precisely as possible with
> the help of TSC.

Different pairs of values (system_time, tsc reads) are fundamentally
broken. This is why 

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
    x86, paravirt: Add a global synchronization point for pvclock

Exists.

Then to guarantee monotonicity you need to stop those updates
(or perform the pair update in lock step):

commit d828199e84447795c6669ff0e6c6d55eb9beeff6
    KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag


> sched_clock uses kvmclock to get nanoseconds since the system was
> brought up and [1/2] only works with this abstracted ns count.
> A poorly synchronized initial read is irrelevant because all VCPUs will
> be using the same constant offset.
> (We can never know the precise time anyway.)
> 
> > Besides:
> > 
> > 	1) Stable sched clock in guest does not depend on
> > 	   KVM_FEATURE_CLOCKSOURCE_STABLE_BIT.
> 
> Yes, thanks, I will remove that requirement in v1.  (We'd need to
> improve a loss of PVCLOCK_TSC_STABLE_BIT otherwise anyway.)
> 
> Because the clutchy dependency on PVCLOCK_TSC_STABLE_BIT is going away,
> there is now one possible unsigned overflow: in case the clock was very
> imprecise and VCPU1 managed to get smaller system_time than VCPU0 at the
> time of initialization.  It's very unlikely that we'll ever reach legal
> overflow so I can add a condition there.
> 
> > 	2) You rely on monotonicity across vcpus to perform 
> > 	   the 'minus delta that was read on vcpu0' calculation, but 
> > 	   monotonicity across vcpus can fail during runtime
> >            (say host clocksource goes tsc->hpet due to tsc instability).
> 
> That could be a problem, but I'm adding a VCPU independent constant to
> all reads -- does the new code rely on monoticity in places where the
> old one didn't?

The problem is overflow:
kvmclock non-monotonic across vcpus AND delta subtraction:

ptime | vcpu0kvmclock time | vcpu0sched_clock | vcpu1kvmclock time | vcpu1sched_clock
1	     1			   1 		        
2            2		           2			1		
3	     3			   0			2           -1
4	     4			   1			3	     0
5	     5			   2			4	     1
6	     6			   3			5	     2
7	     7			   4			6	     3

ptime is "physical time".

I prefer that the host counts from zero (there are less problems to
handle).

Again, that SuSE patch is incorrect and a huge hack.

An alternative is to enable PVCLOCK_COUNTS_FROM_ZERO _if_ the guest
requests the feature.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/