[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45A0E586.50806@qumranet.com>
Date: Sun, 07 Jan 2007 14:20:22 +0200
From: Avi Kivity <avi@...ranet.com>
To: Ingo Molnar <mingo@...e.hu>
CC: kvm-devel <kvm-devel@...ts.sourceforge.net>,
linux-kernel@...r.kernel.org
Subject: Re: [announce] [patch] KVM paravirtualization for Linux
Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux
> under Linux), which includes support for the hardware cr3-cache feature
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
> http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the
> code is stable enough in my testing and i'd like to have some feedback.
>
> Firstly, here are some numbers:
>
> 2-task context-switch performance (in microseconds, lower is better):
>
> native: 1.11
> ----------------------------------
> Qemu: 61.18
> KVM upstream: 53.01
> KVM trunk: 6.36
> KVM trunk+paravirt/cr3: 1.60
>
> i.e. 2-task context-switch performance is faster by a factor of 4, and
> is now quite close to native speed!
>
Very impressive! The gain probably comes not only from avoiding the
vmentry/vmexit, but also from avoiding the flushing of the global page
tlb entries.
> "hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
>
> native: 0.25
> ----------------------------------
> Qemu: 7.8
> KVM upstream: 2.8
> KVM trunk: 0.55
> KVM paravirt/cr3: 0.36
>
> almost twice as fast.
>
> "hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
>
> native: 0.9
> ----------------------------------
> Qemu: 35.2
> KVM upstream: 9.4
> KVM trunk: 2.8
> KVM paravirt/cr3: 2.2
>
> still a 30% improvement - which isnt too bad considering that 200 tasks
> are context-switching in this workload and the cr3 cache in current CPUs
> is only 4 entries.
>
This is a little too good to be true. Were both runs with the same
KVM_NUM_MMU_PAGES?
I'm also concerned that at this point in time the cr3 optimizations will
only show an improvement in microbenchmarks. In real life workloads a
context switch is usually preceded by an I/O, and with the current sorry
state of kvm I/O the context switch time would be dominated by the I/O time.
> the patchset does the following:
>
> - it provides an ad-hoc paravirtualization hypercall API between a Linux
> guest and a Linux host. (this will be replaced with a proper
> hypercall later on.)
>
> - using the hypercall API it utilizes the "cr3 target cache" feature in
> Intel VMX CPUs, and extends KVM to make use of that cache. This
> feature allows the avoidance of expensive VM exits into hypervisor
> context. (The guest needs to be 'aware' and the cache has to be
> shared between the guest and the hypervisor. So fully emulated OSs
> wont benefit from this feature.)
>
> - a few simpler paravirtualization changes are done for Linux guests: IO
> port delays do not cause a VM exit anymore, the i8259A IRQ controller
> code got simplified (this will be replaced with a proper, hypercall
> based and host-maintained IRQ controller implementation) and TLB
> flushes are more efficient, because no cr3 reads happen which would
> otherwise cause a VM exit. These changes have a visible effect
> already: they reduce qemu's CPU usage when a guest idles in HLT, by
> about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has
> HZ=1000)
>
> Paravirtualization is triggered via the kvm_paravirt=1 boot option (for
> now, this too is ad-hoc) - if that is passed then the KVM guest will
> probe for paravirtualization availability on the hypervisor side - and
> will use it if found. (If the guest does not find KVM-paravirt support
> on the hypervisor side then it will continue as a fully emulated guest.)
>
> Issues: i only tested this on 32-bit VMX. (64-bit should work with not
> too many changes, the paravirt.c bits can be carried over to 64-bit
> almost as-is. But i didnt want to spread the code too wide.)
>
> Comments, suggestions are welcome!
>
> Ingo
>
Some comments below on the code.
> +
> +/*
> + * Special, register-to-cr3 instruction based hypercall API
> + * variant to the KVM host. This utilizes the cr3 filter capability
> + * of the hardware - if this works out then no VM exit happens,
> + * if a VM exit happens then KVM will get the virtual address too.
> + */
> +static void kvm_write_cr3(unsigned long guest_cr3)
> +{
> + struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
> + struct kvm_cr3_cache *cache = ¶_state->cr3_cache;
> + int idx;
> +
> + /*
> + * Check the cache (maintained by the host) for a matching
> + * guest_cr3 => host_cr3 mapping. Use it if found:
> + */
> + for (idx = 0; idx < cache->max_idx; idx++) {
> + if (cache->entry[idx].guest_cr3 == guest_cr3) {
> + /*
> + * Cache-hit: we load the cached host-CR3 value.
> + * This never causes any VM exit. (if it does then the
> + * hypervisor could do nothing with this instruction
> + * and the guest OS would be aborted)
> + */
> + asm volatile("movl %0, %%cr3"
> + : : "r" (cache->entry[idx].host_cr3));
> + goto out;
> + }
> + }
> +
> + /*
> + * Cache-miss. Load the guest-cr3 value into cr3, which will
> + * cause a VM exit to the hypervisor, which then loads the
> + * host cr3 value and updates the cr3_cache.
> + */
> + asm volatile("movl %0, %%cr3" : : "r" (guest_cr3));
> +out:
> + put_cpu_var(para_state);
> +}
Well, you did say it was ad-hoc. For reference, this is how I see the
hypercall API:
- A virtual pci device exports a page through the pci rom interface.
The page contains the hypercall code approriate for the current cpu.
This allows migration to work across different cpu vendors.
- In case the pci rom is discovered too late in the boot process, the
address (gpa) can also be exported via a kvm-specific msr.
- Guest/host communications is by guest physical addressed, as the
virtual->physical translation is much cheaper on the guest (__pa() vs a
page table walk).
>
> +
> +/*
> + * Simplified i8259A controller handling:
> + */
> +static void mask_and_ack_kvm(unsigned int irq)
> +{
> + unsigned int irqmask = 1 << irq;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&i8259A_lock, flags);
> + cached_irq_mask |= irqmask;
> +
> + if (irq & 8) {
> + outb(cached_slave_mask, PIC_SLAVE_IMR);
> + outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> + outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI'
> to master-IRQ2 */
> + } else {
> + outb(cached_master_mask, PIC_MASTER_IMR);
> + /* 'Specific EOI' to master: */
> + outb(0x60+irq, PIC_MASTER_CMD);
> + }
> + spin_unlock_irqrestore(&i8259A_lock, flags);
> +}
Any reason this can't be applied to mainline? There's probably no
downside to native, and it would benefit all virtualization solutions
equally.
> Index: linux/drivers/kvm/kvm.h
> ===================================================================
> --- linux.orig/drivers/kvm/kvm.h
> +++ linux/drivers/kvm/kvm.h
> @@ -165,7 +165,7 @@ struct kvm_mmu {
> int root_level;
> int shadow_root_level;
>
> - u64 *pae_root;
> + u64 *pae_root[KVM_CR3_CACHE_SIZE];
hmm. wouldn't it be simpler to have pae_root always point at the
current root?
>
> Index: linux/drivers/kvm/mmu.c
> ===================================================================
> --- linux.orig/drivers/kvm/mmu.c
> +++ linux/drivers/kvm/mmu.c
> @@ -779,7 +779,7 @@ static int nonpaging_map(struct kvm_vcpu
>
> static void mmu_free_roots(struct kvm_vcpu *vcpu)
> {
> - int i;
> + int i, j;
> struct kvm_mmu_page *page;
>
> #ifdef CONFIG_X86_64
> @@ -793,21 +793,40 @@ static void mmu_free_roots(struct kvm_vc
> return;
> }
> #endif
> - for (i = 0; i < 4; ++i) {
> - hpa_t root = vcpu->mmu.pae_root[i];
> + /*
> + * Skip to the next cr3 filter entry and free it (if it's occupied):
> + */
> + vcpu->cr3_cache_idx++;
> + if (unlikely(vcpu->cr3_cache_idx >= vcpu->cr3_cache_limit))
> + vcpu->cr3_cache_idx = 0;
>
> - ASSERT(VALID_PAGE(root));
> - root &= PT64_BASE_ADDR_MASK;
> - page = page_header(root);
> - --page->root_count;
> - vcpu->mmu.pae_root[i] = INVALID_PAGE;
> + j = vcpu->cr3_cache_idx;
> + /*
> + * Clear the guest-visible entry:
> + */
> + if (vcpu->para_state) {
> + vcpu->para_state->cr3_cache.entry[j].guest_cr3 = 0;
> + vcpu->para_state->cr3_cache.entry[j].host_cr3 = 0;
> + }
> + ASSERT(vcpu->mmu.pae_root[j]);
> + if (VALID_PAGE(vcpu->mmu.pae_root[j][0])) {
> + vcpu->guest_cr3_gpa[j] = INVALID_PAGE;
> + for (i = 0; i < 4; ++i) {
> + hpa_t root = vcpu->mmu.pae_root[j][i];
> +
> + ASSERT(VALID_PAGE(root));
> + root &= PT64_BASE_ADDR_MASK;
> + page = page_header(root);
> + --page->root_count;
> + vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> + }
> }
> vcpu->mmu.root_hpa = INVALID_PAGE;
> }
You keep the page directories pinned here. This can be a problem if a
guest frees a page directory, and then starts using it as a regular
page. kvm sometimes chooses not to emulate a write to a guest page
table, but instead to zap it, which is impossible when the page is
freed. You need to either unpin the page when that happens, or add a
hypercall to let kvm know when a page directory is freed.
There is also a minor problem that changes to the pgd aren't caught by
kvm. It doesn't hurt much as this is PV and we can relax the guest/host
contract a little.
>
> static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
> {
> struct page *page;
> - int i;
> + int i, j;
>
> ASSERT(vcpu);
>
> @@ -1227,17 +1261,22 @@ static int alloc_mmu_pages(struct kvm_vc
> ++vcpu->kvm->n_free_mmu_pages;
> }
>
> - /*
> - * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
> - * Therefore we need to allocate shadow page tables in the first
> - * 4GB of memory, which happens to fit the DMA32 zone.
> - */
> - page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> - if (!page)
> - goto error_1;
> - vcpu->mmu.pae_root = page_address(page);
> - for (i = 0; i < 4; ++i)
> - vcpu->mmu.pae_root[i] = INVALID_PAGE;
> + for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> + /*
> + * When emulating 32-bit mode, cr3 is only 32 bits even on
> + * x86_64. Therefore we need to allocate shadow page tables
> + * in the first 4GB of memory, which happens to fit the DMA32
> + * zone:
> + */
> + page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> + if (!page)
> + goto error_1;
> +
> + ASSERT(!vcpu->mmu.pae_root[j]);
> + vcpu->mmu.pae_root[j] = page_address(page);
> + for (i = 0; i < 4; ++i)
> + vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> + }
Since a pae root uses just 32 bytes, you can store all cache entries in
a single page. Not that it matters much.
> #define ENABLE_INTERRUPTS_SYSEXIT \
> Index: linux/include/linux/kvm.h
> ===================================================================
> --- linux.orig/include/linux/kvm.h
> +++ linux/include/linux/kvm.h
> @@ -238,4 +238,44 @@ struct kvm_dirty_log {
>
> #define KVM_DUMP_VCPU _IOW(KVMIO, 250, int /* vcpu_slot */)
>
> +
> +#define KVM_CR3_CACHE_SIZE 4
> +
> +struct kvm_cr3_cache_entry {
> + u64 guest_cr3;
> + u64 host_cr3;
> +};
> +
> +struct kvm_cr3_cache {
> + struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
> + u32 max_idx;
> +};
> +
> +/*
> + * Per-VCPU descriptor area shared between guest and host. Writable to
> + * both guest and host. Registered with the host by the guest when
> + * a guest acknowledges paravirtual mode.
> + */
> +struct kvm_vcpu_para_state {
> + /*
> + * API version information for compatibility. If there's any support
> + * mismatch (too old host trying to execute too new guest) then
> + * the host will deny entry into paravirtual mode. Any other
> + * combination (new host + old guest and new host + new guest)
> + * is supposed to work - new host versions will support all old
> + * guest API versions.
> + */
> + u32 guest_version;
> + u32 host_version;
> + u32 size;
> + u32 __pad_00;
> +
> + struct kvm_cr3_cache cr3_cache;
> +
> +} __attribute__ ((aligned(PAGE_SIZE)));
> +
> +#define KVM_PARA_API_VERSION 1
> +
> +#define KVM_API_MAGIC 0x87654321
> +
<linux/kvm.h> is the vmm userspace interface. The guest/host interface
should probably go somewhere else.
--
error compiling committee.c: too many arguments to function
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists