linux-kernel - Re: [RFC PATCH 25/33] Implement timekeeping for Xen

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1153795784.23416.58.camel@cog.beaverton.ibm.com>
Date:	Mon, 24 Jul 2006 19:49:44 -0700
From:	john stultz <johnstul@...ibm.com>
To:	Chris Wright <chrisw@...s-sol.org>
Cc:	linux-kernel@...r.kernel.org, virtualization@...ts.osdl.org,
	xen-devel@...ts.xensource.com,
	Jeremy Fitzhardinge <jeremy@...p.org>, Andi Kleen <ak@...e.de>,
	Andrew Morton <akpm@...l.org>,
	Rusty Russell <rusty@...tcorp.com.au>,
	Zachary Amsden <zach@...are.com>,
	Jeremy Fitzhardinge <jeremy@...source.com>
Subject: Re: [RFC PATCH 25/33] Implement timekeeping for Xen

On Tue, 2006-07-18 at 00:00 -0700, Chris Wright wrote:
> plain text document attachment (xen-time)
> Use the hypervisor as the basis of the guest's time.  This means that
> the hypervisor wallclock time is used to emulate the cmos clock, and
> set the inital system clock at boot.
> 
> It also registers a Xen clocksource, so the system clock is kept in
> sync with the hypervisor's clock.

Interesting. The Andi has been bugging me for a similarly designed
per-cpu TSC clocksource, but just for generic use. I'm a little
skeptical that it will be 100% without error, since anything dealing w/
the TSCs have been nothing but trouble in my mind, but this looks like a
good proving ground for the concept.

It was mentioned to me that the clocksource approach helped cleanup some
of the xen time changes (is that really true? :), but there were still
some outstanding issues (time inconsistencies, perhaps?). I'm just
curious if there are any details about the issues there, or if I
misunderstood?


> This does not implement setting the hypervisor clock; nor does it
> implement non-independent time, so hypervisor wallclock changes will
> not affect the guest.

Hmmm. I'm not sure if I understood that last line or not. I guess I need
to think a bit about CLOCK_REALTIME vs CLOCK_MONOTONIC wrt
hypervisiors. 

I guess the question is "who owns time?" the guest OS (does it have its
own CLOCK_REALTIME, independent of other guests?) or does the
hypervisor? What does NTPd running on a guest actually adjust?


Anyway, some comments and boneheaded questions below.


> diff -r 2657dae4b5cd drivers/xen/core/time.c
> --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> +++ b/drivers/xen/core/time.c	Tue Jul 18 03:40:46 2006 -0400
> @@ -0,0 +1,362 @@
> +/* 
> + * Xen-specific time functions
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/time.h>
> +#include <linux/interrupt.h>
> +#include <linux/clocksource.h>
> +#include <linux/kernel_stat.h>
> +
> +#include <asm/arch_hooks.h>
> +#include <asm/hypervisor.h>
> +
> +#include <xen/evtchn.h>
> +#include <xen/interface/xen.h>
> +#include <xen/interface/vcpu.h>
> +
> +#include "mach_time.h"
> +#include "do_timer.h"
> +
> +
> +/* Permitted clock jitter, in nsecs, beyond which a warning will be printed. */
> +static unsigned long permitted_clock_jitter = 10000000UL; /* 10ms */
> +static int __init __permitted_clock_jitter(char *str)
> +{
> +	permitted_clock_jitter = simple_strtoul(str, NULL, 0);
> +	return 1;
> +}
> +__setup("permitted_clock_jitter=", __permitted_clock_jitter);

permitted_clock_jitter is a little vague and might get confused w/ the
NTP notion of jitter. Is there a better name, or could we get a xen_
prefix there?


> +/* These are perodically updated in shared_info, and then copied here. */
> +struct shadow_time_info {
> +	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
> +	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
> +	u32 tsc_to_nsec_mul;
> +	u32 tsc_to_usec_mul;

Hmmm. Keeping separate cycle->usec and cycle->nsec multipliers is an
interesting optimization. I'd even consider it for the generic
clocksource code, but I suspect recalculating the independent adjustment
factors for both kills the performance benefit.  Have you actually
compaired against the cost of the /1000 going from nsec to usec?

> +	int tsc_shift;
> +	u32 version;

Errr.. Why is a version value necessary?

> +
> +static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
> +
> +/* Keep track of last time we did processing/updating of jiffies and xtime. */
> +static u64 processed_system_time;   /* System time (ns) at last processing. */
> +static DEFINE_PER_CPU(u64, processed_system_time);

Errr. That would confuse me right off. Global and per-cpu values having
the same name?


> +/* How much CPU time was spent blocked and how much was 'stolen'? */
> +static DEFINE_PER_CPU(u64, processed_stolen_time);
> +static DEFINE_PER_CPU(u64, processed_blocked_time);

These seem like more generic accounting structures. Surely other
virtualized arches have something similar? Something that should be
looked into.


> +/* Current runstate of each CPU (updated automatically by the hypervisor). */
> +static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);
> +
> +/* Must be signed, as it's compared with s64 quantities which can be -ve. */
> +#define NS_PER_TICK (1000000000LL/HZ)
> +
> +/*
> + * Reads a consistent set of time-base values from Xen, into a shadow data
> + * area.
> + */
> +static void get_time_values_from_xen(void)
> +{
> +	struct shared_info      *s = HYPERVISOR_shared_info;
> +	struct vcpu_time_info   *src;
> +	struct shadow_time_info *dst;
> +
> +	src = &s->vcpu_info[smp_processor_id()].time;
> +	dst = &per_cpu(shadow_time, smp_processor_id());
> +
> +	do {
> +		dst->version = src->version;
> +		rmb();
> +		dst->tsc_timestamp     = src->tsc_timestamp;
> +		dst->system_timestamp  = src->system_time;
> +		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
> +		dst->tsc_shift         = src->tsc_shift;
> +		rmb();
> +	} while ((src->version & 1) | (dst->version ^ src->version));
> +
> +	dst->tsc_to_usec_mul = dst->tsc_to_nsec_mul / 1000;
> +}
> +
> +static inline int time_values_up_to_date(int cpu)
> +{
> +	struct vcpu_time_info   *src;
> +	struct shadow_time_info *dst;
> +
> +	src = &HYPERVISOR_shared_info->vcpu_info[cpu].time;
> +	dst = &per_cpu(shadow_time, cpu);
> +
> +	rmb();
> +	return (dst->version == src->version);
> +}
> +
> +/*
> + * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
> + * yielding a 64-bit result.
> + */
> +static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
> +{
> +	u64 product;
> +#ifdef __i386__
> +	u32 tmp1, tmp2;
> +#endif
> +
> +	if (shift < 0)
> +		delta >>= -shift;
> +	else
> +		delta <<= shift;

I think there is a shift_right() macro that can avoid this.

Also I'm not sure I follow why you shift before multiply instead of
multiply before shift? Does that not hurt your precision?

> +#ifdef __i386__
> +	__asm__ (
> +		"mul  %5       ; "
> +		"mov  %4,%%eax ; "
> +		"mov  %%edx,%4 ; "
> +		"mul  %5       ; "
> +		"xor  %5,%5    ; "
> +		"add  %4,%%eax ; "
> +		"adc  %5,%%edx ; "
> +		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
> +		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
> +#elif __x86_64__
> +	__asm__ (
> +		"mul %%rdx ; shrd $32,%%rdx,%%rax"
> +		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
> +#else
> +#error implement me!
> +#endif
> +
> +	return product;
> +}

I think we need some generic mul_llxl_ll() wrappers here.


> +
> +static u64 get_nsec_offset(struct shadow_time_info *shadow)
> +{
> +	u64 now, delta;
> +	rdtscll(now);
> +	delta = now - shadow->tsc_timestamp;
> +	return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
> +}

get_nsec_offset is a little generic for a name. I know xen_ prefixes
everywhere are irritating, but maybe something a little more specific
would be a good idea.


> +
> +void do_timer_interrupt_hook(struct pt_regs *regs)
> +{
> +	s64 delta, delta_cpu, stolen, blocked;
> +	u64 sched_time;
> +	int i, cpu = smp_processor_id();
> +	struct shadow_time_info *shadow = &per_cpu(shadow_time, cpu);
> +	struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu);
> +
> +	do {
> +		get_time_values_from_xen();
> +
> +		/* Obtain a consistent snapshot of elapsed wallclock cycles. */
> +		delta = delta_cpu =
> +			shadow->system_timestamp + get_nsec_offset(shadow);
> +		delta     -= processed_system_time;
> +		delta_cpu -= per_cpu(processed_system_time, cpu);
> +
> +		/*
> +		 * Obtain a consistent snapshot of stolen/blocked cycles. We
> +		 * can use state_entry_time to detect if we get preempted here.
> +		 */
> +		do {
> +			sched_time = runstate->state_entry_time;
> +			barrier();
> +			stolen = runstate->time[RUNSTATE_runnable] +
> +				runstate->time[RUNSTATE_offline] -
> +				per_cpu(processed_stolen_time, cpu);
> +			blocked = runstate->time[RUNSTATE_blocked] -
> +				per_cpu(processed_blocked_time, cpu);
> +			barrier();
> +		} while (sched_time != runstate->state_entry_time);
> +	} while (!time_values_up_to_date(cpu));
> +
> +	if ((unlikely(delta < -(s64)permitted_clock_jitter) ||
> +	     unlikely(delta_cpu < -(s64)permitted_clock_jitter))
> +	    && printk_ratelimit()) {
> +		printk("Timer ISR/%d: Time went backwards: "
> +		       "delta=%lld delta_cpu=%lld shadow=%lld "
> +		       "off=%lld processed=%lld cpu_processed=%lld\n",
> +		       cpu, delta, delta_cpu, shadow->system_timestamp,
> +		       (s64)get_nsec_offset(shadow),
> +		       processed_system_time,
> +		       per_cpu(processed_system_time, cpu));
> +		for (i = 0; i < num_online_cpus(); i++)
> +			printk(" %d: %lld\n", i,
> +			       per_cpu(processed_system_time, i));
> +	}
> +
> +	/* System-wide jiffy work. */
> +	while (delta >= NS_PER_TICK) {
> +		delta -= NS_PER_TICK;
> +		processed_system_time += NS_PER_TICK;
> +		do_timer(regs);
> +	}
> +	/*
> +	 * Account stolen ticks.
> +	 * HACK: Passing NULL to account_steal_time()
> +	 * ensures that the ticks are accounted as stolen.
> +	 */
> +	if ((stolen > 0) && (delta_cpu > 0)) {
> +		delta_cpu -= stolen;
> +		if (unlikely(delta_cpu < 0))
> +			stolen += delta_cpu; /* clamp local-time progress */
> +		do_div(stolen, NS_PER_TICK);
> +		per_cpu(processed_stolen_time, cpu) += stolen * NS_PER_TICK;
> +		per_cpu(processed_system_time, cpu) += stolen * NS_PER_TICK;
> +		account_steal_time(NULL, (cputime_t)stolen);
> +	}
> +
> +	/*
> +	 * Account blocked ticks.
> +	 * HACK: Passing idle_task to account_steal_time()
> +	 * ensures that the ticks are accounted as idle/wait.
> +	 */
> +	if ((blocked > 0) && (delta_cpu > 0)) {
> +		delta_cpu -= blocked;
> +		if (unlikely(delta_cpu < 0))
> +			blocked += delta_cpu; /* clamp local-time progress */
> +		do_div(blocked, NS_PER_TICK);
> +		per_cpu(processed_blocked_time, cpu) += blocked * NS_PER_TICK;
> +		per_cpu(processed_system_time, cpu)  += blocked * NS_PER_TICK;
> +		account_steal_time(idle_task(cpu), (cputime_t)blocked);
> +	}
> +
> +	update_process_times(user_mode_vm(regs));
> +}
> +
> +static cycle_t xen_clocksource_read(void)
> +{
> +	struct shadow_time_info *shadow = &per_cpu(shadow_time, smp_processor_id());
> +
> +	get_time_values_from_xen();
> +
> +	return shadow->system_timestamp + get_nsec_offset(shadow);
> +}

Does get_time_values_from_xen() really need to be called on every
clocksource_read call?


> +static void xen_get_wallclock(struct timespec *ts)
> +{
> +	const struct shared_info *s = HYPERVISOR_shared_info;
> +	u32 version;
> +	u64 delta;
> +	struct timespec now;
> +
> +	/* get wallclock at system boot */
> +	do {
> +		version = s->wc_version;
> +		rmb();
> +		now.tv_sec  = s->wc_sec;
> +		now.tv_nsec = s->wc_nsec;
> +		rmb();
> +	} while ((s->wc_version & 1) | (version ^ s->wc_version));
> +
> +	delta = xen_clocksource_read();	/* time since system boot */
> +	delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
> +
> +	now.tv_nsec = do_div(delta, NSEC_PER_SEC);
> +	now.tv_sec = delta;
> +
> +	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
> +}
> +
> +unsigned long mach_get_cmos_time(void)
> +{
> +	struct timespec ts;
> +
> +	xen_get_wallclock(&ts);
> +
> +	return ts.tv_sec;
> +}
> +
> +int mach_set_rtc_mmss(unsigned long now)
> +{
> +	/* do nothing for domU */
> +	return -1;
> +}
> +
> +static void init_cpu_khz(void)
> +{
> +	u64 __cpu_khz = 1000000ULL << 32;
> +	struct vcpu_time_info *info;
> +	info = &HYPERVISOR_shared_info->vcpu_info[0].time;
> +	do_div(__cpu_khz, info->tsc_to_system_mul);
> +	if (info->tsc_shift < 0)
> +		cpu_khz = __cpu_khz << -info->tsc_shift;
> +	else
> +		cpu_khz = __cpu_khz >> info->tsc_shift;
> +}

Err.. That could use some comments. 


> +static struct clocksource xen_clocksource = {
> +	.name = "xen",
> +	.rating = 400,
> +	.read = xen_clocksource_read,
> +	.mask = ~0,
> +	.mult = 1,		/* time directly in nanoseconds */
> +	.shift = 0,
> +	.is_continuous = 1
> +};

Hmmm. The 1/0 mul/shift pair is interesting. Is it expected that NTP
does not ever adjust this clocksource? If not the clocksource_adjust()
function won't do well with this at all, so you might consider something
like:
#define XEN_SHIFT 22
.mult = 1<<XEN_SHIFT
.shift = XEN_SHIFT


> +static void init_missing_ticks_accounting(int cpu)
> +{
> +	struct vcpu_register_runstate_memory_area area;
> +	struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu);
> +
> +	memset(runstate, 0, sizeof(*runstate));
> +
> +	area.addr.v = runstate;
> +	HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, cpu, &area);
> +
> +	per_cpu(processed_blocked_time, cpu) =
> +		runstate->time[RUNSTATE_blocked];
> +	per_cpu(processed_stolen_time, cpu) =
> +		runstate->time[RUNSTATE_runnable] +
> +		runstate->time[RUNSTATE_offline];
> +}

Again, this accounting seems like it could be generically useful.


> +__init void time_init_hook(void)
> +{
> +	get_time_values_from_xen();
> +
> +	processed_system_time = per_cpu(shadow_time, 0).system_timestamp;
> +	per_cpu(processed_system_time, 0) = processed_system_time;
> +
> +	init_cpu_khz();
> +	printk(KERN_INFO "Xen reported: %u.%03u MHz processor.\n",
> +	       cpu_khz / 1000, cpu_khz % 1000);
> +
> +	/* Cannot request_irq() until kmem is initialised. */
> +	late_time_init = setup_cpu0_timer_irq;
> +
> +	init_missing_ticks_accounting(0);
> +
> +	clocksource_register(&xen_clocksource);
> +
> +	/* Set initial system time with full resolution */
> +	xen_get_wallclock(&xtime);
> +	set_normalized_timespec(&wall_to_monotonic,
> +				-xtime.tv_sec, -xtime.tv_nsec);
> +}

Some mention of which functions require to hold what on xtime_lock would
be useful as well (applies to this function as well as the previous ones
already commented on).


My only thoughts after looking at it: Using nanoseconds as a primary
unit is often easier to work with, but less efficient.  So rather then
keeping a tsc_timestamp + system_timestamp in two different units, why
not keep a calculated TSC base that includes the "cycles since boot"
which is adjusted in the same manner internally to Xen as the
system_timestamp is. Then let the timekeeping code do the conversion for
you.

I haven't fully thought about what else it would affect in the above (I
realize stolen_time, etc is in nsecs), but it might be something to
consider.

Am I making any sense or just babbling?

thanks
-john

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/