Message-ID: <20120322153205.GA28570@linux.vnet.ibm.com>
Date: Thu, 22 Mar 2012 21:02:05 +0530
From: Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
To: Ingo Molnar <mingo@...e.hu>
Cc: Peter Zijlstra <peterz@...radead.org>,
Mike Galbraith <efault@....de>,
Suresh Siddha <suresh.b.siddha@...el.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Paul Turner <pjt@...gle.com>
Subject: Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
* Ingo Molnar <mingo@...e.hu> [2012-03-06 10:14:11]:
> > I did some experiments with volanomark and it does turn out to
> > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy
> > benchmark that I am dealing with (Trade) benefits from it.
>
> Does volanomark still do yield(), thereby invoking a random
> shuffle of thread scheduling and pretty much voluntarily
> ejecting itself from most scheduler performance considerations?
>
> If it uses a real locking primitive such as futexes then its
> performance matters more.
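
[For illustration only (not from the original thread): the distinction Ingo
is drawing is that a yield-based wait never sleeps, so the waker produces no
wakeup for the scheduler to place, whereas a futex-based wait goes through
try_to_wake_up() and hence through select_task_rq_fair(), the path that
SD_BALANCE_WAKE affects. The helper names below are hypothetical, and a real
implementation would use proper atomics rather than volatile.]

/* Sketch: two ways a userspace thread can wait on a flag. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <sched.h>

/*
 * Yield-based wait: the task stays runnable and merely reshuffles the
 * runqueue; the scheduler never sees a wakeup it could place well.
 */
static void wait_yield(volatile int *flag)
{
	while (!*flag)
		sched_yield();
}

/*
 * Futex-based wait: the task actually blocks, and the waker's
 * FUTEX_WAKE drives try_to_wake_up() -> select_task_rq_fair(),
 * i.e. the path that wake balancing influences.
 */
static void wait_futex(volatile int *flag)
{
	while (!*flag)
		syscall(SYS_futex, flag, FUTEX_WAIT, 0, NULL, NULL, 0);
}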
Some more interesting results on a more recent tip kernel.

Machine  : 2 quad-core Intel X5570 CPUs w/ HT enabled (16 CPUs)
Kernel   : tip (HEAD at ee415e2)
Guest VM : 2.6.18-kernel-based enterprise Linux guest
Benchmarks are run in two scenarios:
1. BM -> Bare metal. Benchmark is run on bare metal in the root cgroup.
2. VM -> Benchmark is run inside a guest VM. Several CPU hogs (in
   various cgroups) are run on the host. The cgroup setup is as below
   (a sketch for setting such shares follows the listing):
/sys (cpu.shares = 1024, hosts all system tasks)
/libvirt (cpu.shares = 20000)
/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)
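
[For reference, shares like the above are set by writing to each group's
cpu.shares file. A minimal sketch, assuming the cpu controller is mounted
at /sys/fs/cgroup/cpu; the mount point and the helper name are my
assumptions, not taken from this thread.]

#include <stdio.h>

/* Write a cpu.shares value for the given cgroup (path relative to the
 * cpu controller mount point). Returns 0 on success, -1 on error. */
static int set_cpu_shares(const char *cgroup, unsigned long shares)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/cpu/%s/cpu.shares", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lu\n", shares);
	return fclose(f) ? -1 : 0;
}

/* e.g. set_cpu_shares("libvirt", 20000);
 *      set_cpu_shares("libvirt/qemu/VM", 8192); */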
First, the BM (bare metal) scenario:

                tip     tip + patch
volano          1       0.955   (4.5% degradation)
sysbench [n1]   1       0.9984  (0.16% degradation)
tbench 1 [n2]   1       0.9096  (9% degradation)
Now the more interesting VM scenario:

                tip     tip + patch
volano          1       1.29    (29% improvement)
sysbench [n3]   1       2       (100% improvement)
tbench 1 [n4]   1       1.07    (7% improvement)
tbench 8 [n5]   1       1.26    (26% improvement)
httperf [n6]    1       1.05    (5% improvement)
Trade           1       1.31    (31% improvement)
Notes:

n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client.
n3. sysbench was run with 8 threads.
n4. tbench was run on localhost with 1 client.
n5. tbench was run over the network with 8 clients.
n6. httperf was run with a burst length of 100 and wsess of 100,500,0.
So the patch seems to be a wholesale win when VCPU threads are waking up
(in a highly contended environment). One reason could be that the usual
assumption of better cache hits from running a (vcpu) thread on its
prev_cpu does not fully hold here, since a vcpu thread can represent many
different guest threads internally.
Anyway, there are degradations as well (in the bare-metal runs), to
address which I see several possibilities:
1. Do balance-on-wake for vcpu threads only.
2. Document the tuning possibilities for improving performance in
   virtualized environments:
	- Either via sched_domain flags (disable SD_WAKE_AFFINE
	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
	- Or via a new sched_feat(BALANCE_WAKE) tunable (sketched below)
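
[To make the sched_feat option concrete, here is a rough sketch of how such
a tunable could be wired up. This is my illustration, not part of the posted
patch; SCHED_FEAT()/sched_feat() are the existing kernel macros, and the
feature bit would be toggled at runtime via
/sys/kernel/debug/sched_features.]

/* kernel/sched/features.h: hypothetical new feature bit, default off. */
SCHED_FEAT(BALANCE_WAKE, false)

/*
 * kernel/sched/fair.c, in select_task_rq_fair(): instead of enabling
 * SD_BALANCE_WAKE unconditionally in the domain flags, only fall back
 * to the balance path when the feature bit is enabled.
 */
new_cpu = select_idle_sibling(p, prev_cpu);
if (!sched_feat(BALANCE_WAKE) || idle_cpu(new_cpu))
	goto unlock;

sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
cpu = prev_cpu;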
Any other thoughts or suggestions for more experiments?
--
Balance threads on wakeup, letting them run on the least-loaded CPU in the
same cache domain as their prev_cpu (or cur_cpu, if the wake_affine() test
obliges).
Signed-off-by: Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
---
include/linux/topology.h | 4 ++--
kernel/sched/fair.c | 5 ++++-
2 files changed, 6 insertions(+), 3 deletions(-)
Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
| 1*SD_BALANCE_NEWIDLE \
| 1*SD_BALANCE_EXEC \
| 1*SD_BALANCE_FORK \
- | 0*SD_BALANCE_WAKE \
+ | 1*SD_BALANCE_WAKE \
| 1*SD_WAKE_AFFINE \
| 1*SD_SHARE_CPUPOWER \
| 0*SD_POWERSAVINGS_BALANCE \
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
| 1*SD_BALANCE_NEWIDLE \
| 1*SD_BALANCE_EXEC \
| 1*SD_BALANCE_FORK \
- | 0*SD_BALANCE_WAKE \
+ | 1*SD_BALANCE_WAKE \
| 1*SD_WAKE_AFFINE \
| 0*SD_PREFER_LOCAL \
| 0*SD_SHARE_CPUPOWER \
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2766,7 +2766,10 @@ select_task_rq_fair(struct task_struct *
prev_cpu = cpu;
new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ if (idle_cpu(new_cpu))
+ goto unlock;
+ sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+ cpu = prev_cpu;
}
while (sd) {
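
[For readers not looking at fair.c, the resulting wake-affine leg of
select_task_rq_fair() with this patch applied reads roughly as follows.
This is paraphrased from the hunk above; the comments are mine.]

if (affine_sd) {
	if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
		prev_cpu = cpu;

	new_cpu = select_idle_sibling(p, prev_cpu);
	if (idle_cpu(new_cpu))
		goto unlock;	/* found an idle CPU: take it, as before */

	/*
	 * No idle sibling found: rather than settling for a busy CPU,
	 * fall through to the SD_BALANCE_WAKE loop below, restricted
	 * to prev_cpu's last-level-cache domain, so that
	 * find_idlest_group()/find_idlest_cpu() can pick the
	 * least-loaded CPU sharing cache with prev_cpu.
	 */
	sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
	cpu = prev_cpu;
}

while (sd) {
	/* ... existing find_idlest_group()/find_idlest_cpu() loop ... */
}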
--