lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070107174416.GA14607@elte.hu>
Date:	Sun, 7 Jan 2007 18:44:17 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Avi Kivity <avi@...ranet.com>
Cc:	kvm-devel <kvm-devel@...ts.sourceforge.net>,
	linux-kernel@...r.kernel.org
Subject: Re: [announce] [patch] KVM paravirtualization for Linux


* Avi Kivity <avi@...ranet.com> wrote:

> >2-task context-switch performance (in microseconds, lower is better):
> >
> > native:                       1.11
> > ----------------------------------
> > Qemu:                        61.18
> > KVM upstream:                53.01
> > KVM trunk:                    6.36
> > KVM trunk+paravirt/cr3:       1.60
> >
> >i.e. 2-task context-switch performance is faster by a factor of 4, and 
> >is now quite close to native speed!
> >  
> 
> Very impressive!  The gain probably comes not only from avoiding the 
> vmentry/vmexit, but also from avoiding the flushing of the global page 
> tlb entries.

90% of the win comes from the avoidance of the VM exit. To quantify this 
more precisely i have added an artificial __flush_tlb_global() call to 
after switch_to(), just to see how much impact an extra global flush has 
on the native kernel. Context-switch cost went from 1.11 usecs to 1.65 
usecs. Then i added a __flush_tlb(), which made the cost go to 1.75, 
which means that the global flush component is at around 0.5 usecs.

> >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
> >
> > native:                       0.25
> > ----------------------------------
> > Qemu:                         7.8
> > KVM upstream:                 2.8
> > KVM trunk:                    0.55
> > KVM paravirt/cr3:             0.36
> >
> >almost twice as fast.
> >
> >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
> >
> > native:                       0.9
> > ----------------------------------
> > Qemu:                        35.2
> > KVM upstream:                 9.4
> > KVM trunk:                    2.8
> > KVM paravirt/cr3:             2.2
> >
> >still a 30% improvement - which isnt too bad considering that 200 tasks 
> >are context-switching in this workload and the cr3 cache in current CPUs 
> >is only 4 entries.
> >  
> 
> This is a little too good to be true.  Were both runs with the same 
> KVM_NUM_MMU_PAGES?

yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' 
run should have been labeled as: 'cr3 tree with paravirt turned off'. 
That's not completely 'trunk' but close to it, and all other changes 
(like elimination of unnecessary TLB flushes) are fairly applied to 
both.

i also did a run with much less MMU cache pages of 256, and hackbench 1 
stayed the same, while hackbench 5 numbers started fluctuating badly (i 
think that workload if trashing the MMU cache badly).

> I'm also concerned that at this point in time the cr3 optimizations 
> will only show an improvement in microbenchmarks.  In real life 
> workloads a context switch is usually preceded by an I/O, and with the 
> current sorry state of kvm I/O the context switch time would be 
> dominated by the I/O time.

oh, i agreed completely - but in my opinion accelerating virtual I/O is 
really easy. Accelerating the context-switch path (and basic syscall 
overhead like KVM does) is /hard/. So i wanted to see whether KVM runs 
well in all the hard cases, before looking at the low hanging 
performance fruits in the I/O area =B-)

also note that there's lots of internal reasons why application 
workloads can be heavily context-switching - it's not just I/O that 
generates them. (pipes, critical sections / futexes, etc.) So having 
near-native performance for context-switches is very important.

> >+    if (irq & 8) {
> >+        outb(cached_slave_mask, PIC_SLAVE_IMR);
> >+        outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> >+        outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' 
> >to master-IRQ2 */
> >+    } else {
> >+        outb(cached_master_mask, PIC_MASTER_IMR);
> >+        /* 'Specific EOI' to master: */
> >+        outb(0x60+irq, PIC_MASTER_CMD);
> >+    }
> >+    spin_unlock_irqrestore(&i8259A_lock, flags);
> >+}
> 
> Any reason this can't be applied to mainline?  There's probably no 
> downside to native, and it would benefit all virtualization solutions 
> equally.

this is legacy stuff ...

> >-    u64 *pae_root;
> >+    u64 *pae_root[KVM_CR3_CACHE_SIZE];
> 
> hmm.  wouldn't it be simpler to have pae_root always point at the 
> current root?

does that guarantee that it's available? I wanted to 'pin' the root 
itself this way, to make sure that if a guest switches to it via the 
cache, that it's truly available and a valid root. cr3 addresses are 
non-virtual so this is the only mechanism available to guarantee that 
the host-side memory truly contains a root pagetable.

> >+            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+        }
> >     }
> >     vcpu->mmu.root_hpa = INVALID_PAGE;
> > }
> 
> You keep the page directories pinned here. [...]

yes.

> [...]  This can be a problem if a guest frees a page directory, and 
> then starts using it as a regular page.  kvm sometimes chooses not to 
> emulate a write to a guest page table, but instead to zap it, which is 
> impossible when the page is freed.  You need to either unpin the page 
> when that happens, or add a hypercall to let kvm know when a page 
> directory is freed.

the cache is zapped upon pagefaults anyway, so unpinning ought to be 
possible. Which one would you prefer?

> >-    for (i = 0; i < 4; ++i)
> >-        vcpu->mmu.pae_root[i] = INVALID_PAGE;
> >+    for (j = 0; j < KVM_CR3_CACHE_SIZE; j++) {
> >+        /*
> >+         * When emulating 32-bit mode, cr3 is only 32 bits even on
> >+         * x86_64. Therefore we need to allocate shadow page tables
> >+         * in the first 4GB of memory, which happens to fit the DMA32
> >+         * zone:
> >+         */
> >+        page = alloc_page(GFP_KERNEL | __GFP_DMA32);
> >+        if (!page)
> >+            goto error_1;
> >+
> >+        ASSERT(!vcpu->mmu.pae_root[j]);
> >+        vcpu->mmu.pae_root[j] = page_address(page);
> >+        for (i = 0; i < 4; ++i)
> >+            vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+    }
> 
> Since a pae root uses just 32 bytes, you can store all cache entries 
> in a single page.  Not that it matters much.

yeah - i wanted to extend the current code in a safe way, before 
optimizing it.

> >+#define KVM_API_MAGIC 0x87654321
> >+
> 
> <linux/kvm.h> is the vmm userspace interface.  The guest/host 
> interface should probably go somewhere else.

yeah. kvm_para.h?

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ