Message-ID: <5051CDDD.6040103@redhat.com>
Date: Thu, 13 Sep 2012 15:13:17 +0300
From: Avi Kivity <avi@...hat.com>
To: habanero@...ux.vnet.ibm.com
CC: Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Ingo Molnar <mingo@...hat.com>, Rik van Riel <riel@...hat.com>,
KVM <kvm@...r.kernel.org>, chegu vinod <chegu_vinod@...com>,
LKML <linux-kernel@...r.kernel.org>, X86 <x86@...nel.org>,
Gleb Natapov <gleb@...hat.com>,
Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On 09/11/2012 09:27 PM, Andrew Theurer wrote:
>
> So, having both is probably not a good idea. However, I feel like
> there's more work to be done. With no over-commit (10 VMs), total
> throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some
> overhead, but a reduction to ~4500 is still terrible. By contrast,
> 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> than 8-way VMs with no over-commit (20 vs 10 8-way VMs on an 80
> cpu-thread host). We still have what appear to be scalability
> problems, but now they are not so much in runqueue locks for
> yield_to() as in get_pid_task():
>
> perf on host:
>
> 32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
> 11.60% 115686 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
> 10.28% 102522 qemu-system-x86 [kernel.kallsyms] [k] yield_to
> 9.17% 91507 qemu-system-x86 [kvm] [k] kvm_vcpu_on_spin
> 7.74% 77257 qemu-system-x86 [kvm] [k] kvm_vcpu_yield_to
> 3.56% 35476 qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
> 3.00% 29951 qemu-system-x86 [kvm] [k] __vcpu_run
> 2.93% 29268 qemu-system-x86 [kvm_intel] [k] vmx_vcpu_run
> 2.88% 28783 qemu-system-x86 [kvm] [k] vcpu_enter_guest
> 2.59% 25827 qemu-system-x86 [kernel.kallsyms] [k] __schedule
> 1.40% 13976 qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock_irq
> 1.28% 12823 qemu-system-x86 [kernel.kallsyms] [k] resched_task
> 1.14% 11376 qemu-system-x86 [kvm_intel] [k] vmcs_writel
> 0.85% 8502 qemu-system-x86 [kernel.kallsyms] [k] pick_next_task_fair
> 0.53% 5315 qemu-system-x86 [kernel.kallsyms] [k] native_write_msr_safe
> 0.46% 4553 qemu-system-x86 [kernel.kallsyms] [k] native_load_tr_desc
>
> get_pid_task() uses some rcu functions, wondering how scalable this
> is.... I tend to think of rcu as -not- having issues like this... is
> there an rcu stat/tracing tool which would help identify potential
> problems?
It's not rcu; it's the atomics + cache line bouncing. We're basically
guaranteed to bounce here.
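
To see where the bounce comes from, get_pid_task() is roughly the
following (paraphrased from kernel/pid.c of this era; details may
differ from your exact tree):

struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
{
        struct task_struct *result;

        rcu_read_lock();                 /* cheap, no shared writes */
        result = pid_task(pid, type);    /* rcu_dereference() walk, read-only */
        if (result)
                get_task_struct(result); /* atomic_inc(&result->usage) */
        rcu_read_unlock();
        return result;
}

The rcu read side itself is nearly free; the cost is the atomic_inc()
on the target's task->usage refcount (and the matching put later).
With many vcpu threads spinning in the PLE handler against the same
handful of targets, that one cache line ping-pongs between cpus.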
Here we're finally paying for the ioctl()-based interface. A
syscall-based interface would have a 1:1 correspondence between vcpus
and tasks, so these games would be unnecessary.
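
To make "these games" concrete: because a vcpu is an fd rather than a
task, the directed-yield path has to resolve the vcpu back into a task
on every attempt, roughly like this (paraphrased from
virt/kvm/kvm_main.c; not the exact code):

int kvm_vcpu_yield_to(struct kvm_vcpu *target)
{
        struct task_struct *task;
        int ret = 0;

        /* vcpu->pid is recorded when the run ioctl enters; resolve it
         * back to a task_struct -- the get_pid_task() hot spot above. */
        task = get_pid_task(target->pid, PIDTYPE_PID);
        if (!task)
                return 0;
        if (!(task->flags & PF_VCPU))    /* skip tasks already in guest mode */
                ret = yield_to(task, 1);
        put_task_struct(task);           /* second atomic on the same cache line */
        return ret;
}

With a syscall interface the vcpu would simply be the calling task, so
the target task_struct could be held for the vcpu's lifetime and the
pid lookup plus the get/put pair per candidate would disappear.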
--
error compiling committee.c: too many arguments to function