[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1153795784.23416.58.camel@cog.beaverton.ibm.com>
Date: Mon, 24 Jul 2006 19:49:44 -0700
From: john stultz <johnstul@...ibm.com>
To: Chris Wright <chrisw@...s-sol.org>
Cc: linux-kernel@...r.kernel.org, virtualization@...ts.osdl.org,
xen-devel@...ts.xensource.com,
Jeremy Fitzhardinge <jeremy@...p.org>, Andi Kleen <ak@...e.de>,
Andrew Morton <akpm@...l.org>,
Rusty Russell <rusty@...tcorp.com.au>,
Zachary Amsden <zach@...are.com>,
Jeremy Fitzhardinge <jeremy@...source.com>
Subject: Re: [RFC PATCH 25/33] Implement timekeeping for Xen
On Tue, 2006-07-18 at 00:00 -0700, Chris Wright wrote:
> plain text document attachment (xen-time)
> Use the hypervisor as the basis of the guest's time. This means that
> the hypervisor wallclock time is used to emulate the cmos clock, and
> set the inital system clock at boot.
>
> It also registers a Xen clocksource, so the system clock is kept in
> sync with the hypervisor's clock.
Interesting. The Andi has been bugging me for a similarly designed
per-cpu TSC clocksource, but just for generic use. I'm a little
skeptical that it will be 100% without error, since anything dealing w/
the TSCs have been nothing but trouble in my mind, but this looks like a
good proving ground for the concept.
It was mentioned to me that the clocksource approach helped cleanup some
of the xen time changes (is that really true? :), but there were still
some outstanding issues (time inconsistencies, perhaps?). I'm just
curious if there are any details about the issues there, or if I
misunderstood?
> This does not implement setting the hypervisor clock; nor does it
> implement non-independent time, so hypervisor wallclock changes will
> not affect the guest.
Hmmm. I'm not sure if I understood that last line or not. I guess I need
to think a bit about CLOCK_REALTIME vs CLOCK_MONOTONIC wrt
hypervisiors.
I guess the question is "who owns time?" the guest OS (does it have its
own CLOCK_REALTIME, independent of other guests?) or does the
hypervisor? What does NTPd running on a guest actually adjust?
Anyway, some comments and boneheaded questions below.
> diff -r 2657dae4b5cd drivers/xen/core/time.c
> --- /dev/null Thu Jan 01 00:00:00 1970 +0000
> +++ b/drivers/xen/core/time.c Tue Jul 18 03:40:46 2006 -0400
> @@ -0,0 +1,362 @@
> +/*
> + * Xen-specific time functions
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/time.h>
> +#include <linux/interrupt.h>
> +#include <linux/clocksource.h>
> +#include <linux/kernel_stat.h>
> +
> +#include <asm/arch_hooks.h>
> +#include <asm/hypervisor.h>
> +
> +#include <xen/evtchn.h>
> +#include <xen/interface/xen.h>
> +#include <xen/interface/vcpu.h>
> +
> +#include "mach_time.h"
> +#include "do_timer.h"
> +
> +
> +/* Permitted clock jitter, in nsecs, beyond which a warning will be printed. */
> +static unsigned long permitted_clock_jitter = 10000000UL; /* 10ms */
> +static int __init __permitted_clock_jitter(char *str)
> +{
> + permitted_clock_jitter = simple_strtoul(str, NULL, 0);
> + return 1;
> +}
> +__setup("permitted_clock_jitter=", __permitted_clock_jitter);
permitted_clock_jitter is a little vague and might get confused w/ the
NTP notion of jitter. Is there a better name, or could we get a xen_
prefix there?
> +/* These are perodically updated in shared_info, and then copied here. */
> +struct shadow_time_info {
> + u64 tsc_timestamp; /* TSC at last update of time vals. */
> + u64 system_timestamp; /* Time, in nanosecs, since boot. */
> + u32 tsc_to_nsec_mul;
> + u32 tsc_to_usec_mul;
Hmmm. Keeping separate cycle->usec and cycle->nsec multipliers is an
interesting optimization. I'd even consider it for the generic
clocksource code, but I suspect recalculating the independent adjustment
factors for both kills the performance benefit. Have you actually
compaired against the cost of the /1000 going from nsec to usec?
> + int tsc_shift;
> + u32 version;
Errr.. Why is a version value necessary?
> +
> +static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
> +
> +/* Keep track of last time we did processing/updating of jiffies and xtime. */
> +static u64 processed_system_time; /* System time (ns) at last processing. */
> +static DEFINE_PER_CPU(u64, processed_system_time);
Errr. That would confuse me right off. Global and per-cpu values having
the same name?
> +/* How much CPU time was spent blocked and how much was 'stolen'? */
> +static DEFINE_PER_CPU(u64, processed_stolen_time);
> +static DEFINE_PER_CPU(u64, processed_blocked_time);
These seem like more generic accounting structures. Surely other
virtualized arches have something similar? Something that should be
looked into.
> +/* Current runstate of each CPU (updated automatically by the hypervisor). */
> +static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);
> +
> +/* Must be signed, as it's compared with s64 quantities which can be -ve. */
> +#define NS_PER_TICK (1000000000LL/HZ)
> +
> +/*
> + * Reads a consistent set of time-base values from Xen, into a shadow data
> + * area.
> + */
> +static void get_time_values_from_xen(void)
> +{
> + struct shared_info *s = HYPERVISOR_shared_info;
> + struct vcpu_time_info *src;
> + struct shadow_time_info *dst;
> +
> + src = &s->vcpu_info[smp_processor_id()].time;
> + dst = &per_cpu(shadow_time, smp_processor_id());
> +
> + do {
> + dst->version = src->version;
> + rmb();
> + dst->tsc_timestamp = src->tsc_timestamp;
> + dst->system_timestamp = src->system_time;
> + dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
> + dst->tsc_shift = src->tsc_shift;
> + rmb();
> + } while ((src->version & 1) | (dst->version ^ src->version));
> +
> + dst->tsc_to_usec_mul = dst->tsc_to_nsec_mul / 1000;
> +}
> +
> +static inline int time_values_up_to_date(int cpu)
> +{
> + struct vcpu_time_info *src;
> + struct shadow_time_info *dst;
> +
> + src = &HYPERVISOR_shared_info->vcpu_info[cpu].time;
> + dst = &per_cpu(shadow_time, cpu);
> +
> + rmb();
> + return (dst->version == src->version);
> +}
> +
> +/*
> + * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
> + * yielding a 64-bit result.
> + */
> +static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
> +{
> + u64 product;
> +#ifdef __i386__
> + u32 tmp1, tmp2;
> +#endif
> +
> + if (shift < 0)
> + delta >>= -shift;
> + else
> + delta <<= shift;
I think there is a shift_right() macro that can avoid this.
Also I'm not sure I follow why you shift before multiply instead of
multiply before shift? Does that not hurt your precision?
> +#ifdef __i386__
> + __asm__ (
> + "mul %5 ; "
> + "mov %4,%%eax ; "
> + "mov %%edx,%4 ; "
> + "mul %5 ; "
> + "xor %5,%5 ; "
> + "add %4,%%eax ; "
> + "adc %5,%%edx ; "
> + : "=A" (product), "=r" (tmp1), "=r" (tmp2)
> + : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
> +#elif __x86_64__
> + __asm__ (
> + "mul %%rdx ; shrd $32,%%rdx,%%rax"
> + : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
> +#else
> +#error implement me!
> +#endif
> +
> + return product;
> +}
I think we need some generic mul_llxl_ll() wrappers here.
> +
> +static u64 get_nsec_offset(struct shadow_time_info *shadow)
> +{
> + u64 now, delta;
> + rdtscll(now);
> + delta = now - shadow->tsc_timestamp;
> + return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
> +}
get_nsec_offset is a little generic for a name. I know xen_ prefixes
everywhere are irritating, but maybe something a little more specific
would be a good idea.
> +
> +void do_timer_interrupt_hook(struct pt_regs *regs)
> +{
> + s64 delta, delta_cpu, stolen, blocked;
> + u64 sched_time;
> + int i, cpu = smp_processor_id();
> + struct shadow_time_info *shadow = &per_cpu(shadow_time, cpu);
> + struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu);
> +
> + do {
> + get_time_values_from_xen();
> +
> + /* Obtain a consistent snapshot of elapsed wallclock cycles. */
> + delta = delta_cpu =
> + shadow->system_timestamp + get_nsec_offset(shadow);
> + delta -= processed_system_time;
> + delta_cpu -= per_cpu(processed_system_time, cpu);
> +
> + /*
> + * Obtain a consistent snapshot of stolen/blocked cycles. We
> + * can use state_entry_time to detect if we get preempted here.
> + */
> + do {
> + sched_time = runstate->state_entry_time;
> + barrier();
> + stolen = runstate->time[RUNSTATE_runnable] +
> + runstate->time[RUNSTATE_offline] -
> + per_cpu(processed_stolen_time, cpu);
> + blocked = runstate->time[RUNSTATE_blocked] -
> + per_cpu(processed_blocked_time, cpu);
> + barrier();
> + } while (sched_time != runstate->state_entry_time);
> + } while (!time_values_up_to_date(cpu));
> +
> + if ((unlikely(delta < -(s64)permitted_clock_jitter) ||
> + unlikely(delta_cpu < -(s64)permitted_clock_jitter))
> + && printk_ratelimit()) {
> + printk("Timer ISR/%d: Time went backwards: "
> + "delta=%lld delta_cpu=%lld shadow=%lld "
> + "off=%lld processed=%lld cpu_processed=%lld\n",
> + cpu, delta, delta_cpu, shadow->system_timestamp,
> + (s64)get_nsec_offset(shadow),
> + processed_system_time,
> + per_cpu(processed_system_time, cpu));
> + for (i = 0; i < num_online_cpus(); i++)
> + printk(" %d: %lld\n", i,
> + per_cpu(processed_system_time, i));
> + }
> +
> + /* System-wide jiffy work. */
> + while (delta >= NS_PER_TICK) {
> + delta -= NS_PER_TICK;
> + processed_system_time += NS_PER_TICK;
> + do_timer(regs);
> + }
> + /*
> + * Account stolen ticks.
> + * HACK: Passing NULL to account_steal_time()
> + * ensures that the ticks are accounted as stolen.
> + */
> + if ((stolen > 0) && (delta_cpu > 0)) {
> + delta_cpu -= stolen;
> + if (unlikely(delta_cpu < 0))
> + stolen += delta_cpu; /* clamp local-time progress */
> + do_div(stolen, NS_PER_TICK);
> + per_cpu(processed_stolen_time, cpu) += stolen * NS_PER_TICK;
> + per_cpu(processed_system_time, cpu) += stolen * NS_PER_TICK;
> + account_steal_time(NULL, (cputime_t)stolen);
> + }
> +
> + /*
> + * Account blocked ticks.
> + * HACK: Passing idle_task to account_steal_time()
> + * ensures that the ticks are accounted as idle/wait.
> + */
> + if ((blocked > 0) && (delta_cpu > 0)) {
> + delta_cpu -= blocked;
> + if (unlikely(delta_cpu < 0))
> + blocked += delta_cpu; /* clamp local-time progress */
> + do_div(blocked, NS_PER_TICK);
> + per_cpu(processed_blocked_time, cpu) += blocked * NS_PER_TICK;
> + per_cpu(processed_system_time, cpu) += blocked * NS_PER_TICK;
> + account_steal_time(idle_task(cpu), (cputime_t)blocked);
> + }
> +
> + update_process_times(user_mode_vm(regs));
> +}
> +
> +static cycle_t xen_clocksource_read(void)
> +{
> + struct shadow_time_info *shadow = &per_cpu(shadow_time, smp_processor_id());
> +
> + get_time_values_from_xen();
> +
> + return shadow->system_timestamp + get_nsec_offset(shadow);
> +}
Does get_time_values_from_xen() really need to be called on every
clocksource_read call?
> +static void xen_get_wallclock(struct timespec *ts)
> +{
> + const struct shared_info *s = HYPERVISOR_shared_info;
> + u32 version;
> + u64 delta;
> + struct timespec now;
> +
> + /* get wallclock at system boot */
> + do {
> + version = s->wc_version;
> + rmb();
> + now.tv_sec = s->wc_sec;
> + now.tv_nsec = s->wc_nsec;
> + rmb();
> + } while ((s->wc_version & 1) | (version ^ s->wc_version));
> +
> + delta = xen_clocksource_read(); /* time since system boot */
> + delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
> +
> + now.tv_nsec = do_div(delta, NSEC_PER_SEC);
> + now.tv_sec = delta;
> +
> + set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
> +}
> +
> +unsigned long mach_get_cmos_time(void)
> +{
> + struct timespec ts;
> +
> + xen_get_wallclock(&ts);
> +
> + return ts.tv_sec;
> +}
> +
> +int mach_set_rtc_mmss(unsigned long now)
> +{
> + /* do nothing for domU */
> + return -1;
> +}
> +
> +static void init_cpu_khz(void)
> +{
> + u64 __cpu_khz = 1000000ULL << 32;
> + struct vcpu_time_info *info;
> + info = &HYPERVISOR_shared_info->vcpu_info[0].time;
> + do_div(__cpu_khz, info->tsc_to_system_mul);
> + if (info->tsc_shift < 0)
> + cpu_khz = __cpu_khz << -info->tsc_shift;
> + else
> + cpu_khz = __cpu_khz >> info->tsc_shift;
> +}
Err.. That could use some comments.
> +static struct clocksource xen_clocksource = {
> + .name = "xen",
> + .rating = 400,
> + .read = xen_clocksource_read,
> + .mask = ~0,
> + .mult = 1, /* time directly in nanoseconds */
> + .shift = 0,
> + .is_continuous = 1
> +};
Hmmm. The 1/0 mul/shift pair is interesting. Is it expected that NTP
does not ever adjust this clocksource? If not the clocksource_adjust()
function won't do well with this at all, so you might consider something
like:
#define XEN_SHIFT 22
.mult = 1<<XEN_SHIFT
.shift = XEN_SHIFT
> +static void init_missing_ticks_accounting(int cpu)
> +{
> + struct vcpu_register_runstate_memory_area area;
> + struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu);
> +
> + memset(runstate, 0, sizeof(*runstate));
> +
> + area.addr.v = runstate;
> + HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, cpu, &area);
> +
> + per_cpu(processed_blocked_time, cpu) =
> + runstate->time[RUNSTATE_blocked];
> + per_cpu(processed_stolen_time, cpu) =
> + runstate->time[RUNSTATE_runnable] +
> + runstate->time[RUNSTATE_offline];
> +}
Again, this accounting seems like it could be generically useful.
> +__init void time_init_hook(void)
> +{
> + get_time_values_from_xen();
> +
> + processed_system_time = per_cpu(shadow_time, 0).system_timestamp;
> + per_cpu(processed_system_time, 0) = processed_system_time;
> +
> + init_cpu_khz();
> + printk(KERN_INFO "Xen reported: %u.%03u MHz processor.\n",
> + cpu_khz / 1000, cpu_khz % 1000);
> +
> + /* Cannot request_irq() until kmem is initialised. */
> + late_time_init = setup_cpu0_timer_irq;
> +
> + init_missing_ticks_accounting(0);
> +
> + clocksource_register(&xen_clocksource);
> +
> + /* Set initial system time with full resolution */
> + xen_get_wallclock(&xtime);
> + set_normalized_timespec(&wall_to_monotonic,
> + -xtime.tv_sec, -xtime.tv_nsec);
> +}
Some mention of which functions require to hold what on xtime_lock would
be useful as well (applies to this function as well as the previous ones
already commented on).
My only thoughts after looking at it: Using nanoseconds as a primary
unit is often easier to work with, but less efficient. So rather then
keeping a tsc_timestamp + system_timestamp in two different units, why
not keep a calculated TSC base that includes the "cycles since boot"
which is adjusted in the same manner internally to Xen as the
system_timestamp is. Then let the timekeeping code do the conversion for
you.
I haven't fully thought about what else it would affect in the above (I
realize stolen_time, etc is in nsecs), but it might be something to
consider.
Am I making any sense or just babbling?
thanks
-john
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists