[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20100415122836.27f1e255.randy.dunlap@oracle.com>
Date: Thu, 15 Apr 2010 12:28:36 -0700
From: Randy Dunlap <randy.dunlap@...cle.com>
To: Glauber Costa <glommer@...hat.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org, avi@...hat.com
Subject: Re: [PATCH 5/5] add documentation about kvmclock
On Thu, 15 Apr 2010 14:37:28 -0400 Glauber Costa wrote:
> This patch adds a new file, kvm/kvmclock.txt, describing
> the mechanism we use in kvmclock.
>
> Signed-off-by: Glauber Costa <glommer@...hat.com>
> ---
> Documentation/kvm/kvmclock.txt | 138 ++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 138 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/kvm/kvmclock.txt
>
> diff --git a/Documentation/kvm/kvmclock.txt b/Documentation/kvm/kvmclock.txt
> new file mode 100644
> index 0000000..21008bb
> --- /dev/null
> +++ b/Documentation/kvm/kvmclock.txt
> @@ -0,0 +1,138 @@
> +KVM Paravirtual Clocksource driver
> +Glauber Costa, Red Hat Inc.
> +==================================
> +
> +1. General Description
> +=======================
> +
...
> +
> +2. kvmclock basics
> +===========================
> +
> +When supported by the hypervisor, guests can register a memory page
> +to contain kvmclock data. This page has to be present in guest's address space
> +throughout its whole life. The hypervisor continues to write to it until it is
> +explicitly disabled or the guest is turned off.
> +
> +2.1 kvmclock availability
> +-------------------------
> +
> +Guests that want to take advantage of kvmclock should first check its
> +availability through cpuid.
> +
> +kvm features are presented to the guest in leaf 0x40000001. Bit 3 indicates
> +the present of kvmclock. Bit 0 indicates that kvmclock is present, but the
presence
but it's confusing. Is it bit 3 or bit 0? They seem to indicate the same thing.
> +old MSR set must be used. See section 2.3 for details.
"old MSR set": what does this mean?
> +
> +2.2 kvmclock functionality
> +--------------------------
> +
> +Two MSRs are provided by the hypervisor, controlling kvmclock operation:
> +
> + * MSR_KVM_WALL_CLOCK, value 0x4b564d00 and
> + * MSR_KVM_SYSTEM_TIME, value 0x4b564d01.
> +
> +The first one is only used in rare situations, like boot-time and a
> +suspend-resume cycle. Data is disposable, and after used, the guest
> +may use it for something else. This is hardly a hot path for anything.
> +The Hypervisor fills in the address provided through this MSR with the
> +following structure:
> +
> +struct pvclock_wall_clock {
> + u32 version;
> + u32 sec;
> + u32 nsec;
> +} __attribute__((__packed__));
> +
> +Guest should only trust data to be valid when version haven't changed before
has not
> +and after reads of sec and nsec. Besides not changing, it has to be an even
> +number. Hypervisor may write an odd number to version field to indicate that
> +an update is in progress.
> +
> +MSR_KVM_SYSTEM_TIME, on the other hand, has persistent data, and is
> +constantly updated by the hypervisor with time information. The data
> +written in this MSR contains two pieces of information: the address in which
> +the guests expects time data to be present 4-byte aligned or'ed with an
> +enabled bit. If one wants to shutdown kvmclock, it just needs to write
> +anything that has 0 as its last bit.
> +
> +Time information presented by the hypervisor follows the structure:
> +
> +struct pvclock_vcpu_time_info {
> + u32 version;
> + u32 pad0;
> + u64 tsc_timestamp;
> + u64 system_time;
> + u32 tsc_to_system_mul;
> + s8 tsc_shift;
> + u8 pad[3];
> +} __attribute__((__packed__));
> +
> +The version field plays the same role as with the one in struct
> +pvclock_wall_clock. The other fields, are:
> +
> + a. tsc_timestamp: the guest-visible tsc (result of rdtsc + tsc_offset) of
> + this cpu at the moment we recorded system_time. Note that some time is
CPU (please)
> + inevitably spent between system_time and tsc_timestamp measurements.
> + Guests can subtract this quantity from the current value of tsc to obtain
> + a delta to be added to system_time
to system_time.
> +
> + b. system_time: this is the most recent host-time we could be provided with.
> + host gets it through ktime_get_ts, using whichever clocksource is
> + registered at the moment
moment.
> +
> + c. tsc_to_system_mul: this is the number that tsc delta has to be multiplied
> + by in order to obtain time in nanoseconds. Hypervisor is free to change
> + this value in face of events like cpu frequency change, pcpu migration,
CPU
> + etc.
> +
> + d. tsc_shift: guests must shift
missing text??
> +
> +With this information available, guest calculates current time as:
> +
> + T = kt + to_nsec(tsc - tsc_0)
> +
> +2.3 Compatibility MSRs
> +----------------------
> +
> +Guests running on top of older hypervisors may have to use a different set of
> +MSRs. This is because originally, kvmclock MSRs were exported within a
> +reserved range by accident. Guests should check cpuid leaf 0x40000001 for the
> +presence of kvmclock. If bit 3 is disabled, but bit 0 is enabled, guests can
> +have access to kvmclock functionality through
> +
> + * MSR_KVM_WALL_CLOCK_OLD, value 0x11 and
> + * MSR_KVM_SYSTEM_TIME_OLD, value 0x12.
> +
> +Note, however, that this is deprecated.
> +
> +3. Migration
> +============
> +
> +Two ioctls are provided to aid the task of migration:
> +
> + * KVM_GET_CLOCK and
> + * KVM_SET_CLOCK
> +
> +Their aim is to control an offset that can be summed to system_time, in order
> +to guarantee monotonicity on the time over guest migration. Source host
> +executes KVM_GET_CLOCK, obtaining the last valid timestamp in this host, while
> +destination sets it with KVM_SET_CLOCK. It's the destination responsibility to
> +never return time that is less than that.
---
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists