linux-kernel - Re: [patch 3/3] PTP: add kvm PTP driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170116193343.GB4639@amt.cnet>
Date:   Mon, 16 Jan 2017 17:33:46 -0200
From:   Marcelo Tosatti <mtosatti@...hat.com>
To:     Radim Krcmar <rkrcmar@...hat.com>
Cc:     kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
        Paolo Bonzini <pbonzini@...hat.com>,
        Richard Cochran <richardcochran@...il.com>,
        Miroslav Lichvar <mlichvar@...hat.com>
Subject: Re: [patch 3/3] PTP: add kvm PTP driver

On Mon, Jan 16, 2017 at 06:46:49PM +0100, Radim Krcmar wrote:
> 2017-01-16 15:04-0200, Marcelo Tosatti:
> > On Mon, Jan 16, 2017 at 05:26:53PM +0100, Radim Krcmar wrote:
> >> 2017-01-13 15:40-0200, Marcelo Tosatti:
> >> > On Fri, Jan 13, 2017 at 04:56:58PM +0100, Radim Krcmar wrote:
> >> > > 2017-01-13 10:01-0200, Marcelo Tosatti:
> >> >> > +		version = pvclock_read_begin(src);
> >> >> > +
> >> >> > +		ret = kvm_hypercall2(KVM_HC_CLOCK_OFFSET,
> >> >> > +				     clock_off_gpa,
> >> >> > +				     KVM_CLOCK_OFFSET_WALLCLOCK);
> >> >> > +		if (ret != 0) {
> >> >> > +			pr_err("clock offset hypercall ret %lu\n", ret);
> >> >> > +			spin_unlock(&kvm_ptp_lock);
> >> >> > +			preempt_enable_notrace();
> >> >> > +			return -EOPNOTSUPP;
> >> >> > +		}
> >> >> > +
> >> >> > +		tspec.tv_sec = clock_off.sec;
> >> >> > +		tspec.tv_nsec = clock_off.nsec;
> >> >> > +
> >> >> > +		delta = rdtsc_ordered() - clock_off.tsc;
> >> >> > +
> >> >> > +		offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
> >> >> > +					     src->tsc_shift);
> >> >> > +
> >> >> > +	} while (pvclock_read_retry(src, version));
> >> >> > +
> >> >> > +	preempt_enable_notrace();
> >> >> > +
> >> >> > +	tspec.tv_nsec = tspec.tv_nsec + offset;
> >> >> > +
> >> >> > +	spin_unlock(&kvm_ptp_lock);
> >> >> > +
> >> >> > +	if (tspec.tv_nsec >= NSEC_PER_SEC) {
> >> >> > +		u64 secs = tspec.tv_nsec;
> >> >> > +
> >> >> > +		tspec.tv_nsec = do_div(secs, NSEC_PER_SEC);
> >> >> > +		tspec.tv_sec += secs;
> >> >> > +	}
> >> >> > +
> >> >> > +	memcpy(ts, &tspec, sizeof(struct timespec64));
> >> >> 
> >> >> But the whole idea is of improving the time by reading tsc a bit later
> >> >> is just weird ... why is it better to provide
> >> >> 
> >> >>   tsc + x, time + tsc_delta_to_time(x)
> >> >> 
> >> >> than just
> >> >> 
> >> >>  tsc, time
> >> >> 
> >> >> ?
> >> > 
> >> > Because you want to calculate the value of the host realtime clock 
> >> > at the moment of ptp_kvm_gettime.
> >> > 
> >> > We do:
> >> > 
> >> > 	1. kvm_hypercall.
> >> > 	2. get {sec, nsec, guest_tsc}.
> >> > 	3. kvm_hypercall returns.
> >> > 	4. delay = rdtsc() - guest_tsc.
> >> > 
> >> > Where delay is the delta (measured with the TSC) between points 2 and 4.
> >> 
> >> I see now ... the PTP interface is just not good for our purposes.
> >> We don't return {sec, nsec, guest_tsc}, we just return {sec, nsec} at
> >> some random time in the past.  And to make it a bit more accurate, you
> >> add a best-effort delta before returning, which makes sense.
> > 
> > Not random time in the past. We return {sec, nsec} from the host
> > realtime at the moment the user ran the hypercall. 
> 
> That is what we want, but {sec, nsec} is not anchored to any running
> time, so it is unavoidably late when we return it.
> 
> > Since PTP is very accurate, that "a bit more" counts, yes.
> > 
> >> When we have to depend on pvclock, what are the advantages of not using
> >> the existing pvclock API for wall clock?
> >> (You mentioned some extensions.)
> >> 
> >>   struct pvclock_wall_clock {
> >>   	u32   version;
> >>   	u32   sec;
> >>   	u32   nsec;
> >>   } __attribute__((__packed__));
> >
> >> It gives the wall clock when pvclock was 0, so you just add current
> >> kvmclock and get the host wall clock.  
> > 
> > Well, no. For one, the TSC part of kvmclock: 
> > 
> > 	kvmclock-read = system_timestamp + convert-to-1GHz(rdtsc() - tsc_timestamp)
> > 				           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Drifts relative to UTC. This part can be large.
> > The guests NTP is responsible for fixing
> > that drift of the guests realtime clock (talking about current setup, 
> > without KVM PTP driver).
> > 
> > Now, we want very high precision (less than 1us) for this
> > driver. Very small TSC drifts on a large delta defeat the purpose.
> 
> True.
> 
> >> Without a VM exit.
> > 
> > Huge performance is not an issue. Accuracy (how different from the host
> > realtime clock our "approximation" of the host realtime clock) is.
> 
> Ah, ok.
> 
> >> And how often is ptp_kvm_gettime() usually called?
> > 
> > The PTP_SYS_OFFSET ioctl calls the following code in a loop:
> > 
> > struct ptp_sys_offset {
> >         unsigned int n_samples; /* Desired number of measurements. */
> >         unsigned int rsv[3];    /* Reserved for future use. */
> >         /*
> >          * Array of interleaved system/phc time stamps. The kernel
> >          * will provide 2*n_samples + 1 time stamps, with the last
> >          * one as a system time stamp.
> >          */
> >         struct ptp_clock_time ts[2 * PTP_MAX_SAMPLES + 1];
> > };
> > 
> > #define PTP_MAX_SAMPLES 25 /* Maximum allowed offset measurement
> > samples. */
> > 
> >         case PTP_SYS_OFFSET:
> >                 sysoff = memdup_user((void __user *)arg,
> > sizeof(*sysoff));
> >                 if (IS_ERR(sysoff)) {
> >                         err = PTR_ERR(sysoff);
> >                         sysoff = NULL;
> >                         break;
> >                 }
> >                 if (sysoff->n_samples > PTP_MAX_SAMPLES) {
> >                         err = -EINVAL;
> >                         break;
> >                 }
> >                 pct = &sysoff->ts[0];
> >                 for (i = 0; i < sysoff->n_samples; i++) {
> >                         getnstimeofday64(&ts);
> >                         pct->sec = ts.tv_sec;
> >                         pct->nsec = ts.tv_nsec;
> >                         pct++;
> >                         ptp->info->gettime64(ptp->info, &ts);
> >                         pct->sec = ts.tv_sec;
> >                         pct->nsec = ts.tv_nsec;
> >                         pct++;
> 
> Hm, this loop alternates between system time and ptp time and I'd guess
> that userspace then manipulates deltas of these two readings and applies
> the result to system time.

Yes.

> I'm not even sure that pvclock delta improves the situation anymore --
> it adds an offset to all gettime64() reads, but that can be computed
> from multiple datapoints.  (And the delta also adds uncertainty to
> results.)

Sorry: the value returned from ->gettime64 should be the value of the 
clock at that instant.

If you remove the pvclock delta (lets assume its 2000ns), then you'll
sync getnstimeofday64 to a clock thats 2000ns behind the host clock.

Does that make sense?

> >                 }
> >                 getnstimeofday64(&ts);
> >                 pct->sec = ts.tv_sec;
> >                 pct->nsec = ts.tv_nsec;
> > 
> > How often that ioctl is called depends on the parameters of the Chrony
> > PHC code. Initially (to determine the clock difference Chrony should call it 
> > more frequently, later on it should call it less frequency).
> > 
> > Perhaps once every second initially (the ioctl). I'll confirm with the
> > exact value for my setup and reply to this email.
> 
> Thanks.

Attached. 1551 hypercalls in 38 secs is about 40 times per second.
The ioctl is calling ->gettime64 10 per ioctl, so thats about 4 
ioctls per sec.


View attachment "trace.txt" of type "text/plain" (185239 bytes)