Date:	Thu, 14 Jun 2012 17:51:02 +0530
From:	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
To:	Avi Kivity <avi@...hat.com>
CC:	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Jeremy Fitzhardinge <jeremy@...p.org>,
	Greg Kroah-Hartman <gregkh@...e.de>,
	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Marcelo Tosatti <mtosatti@...hat.com>, X86 <x86@...nel.org>,
	Gleb Natapov <gleb@...hat.com>, Ingo Molnar <mingo@...hat.com>,
	Attilio Rao <attilio.rao@...rix.com>,
	Virtualization <virtualization@...ts.linux-foundation.org>,
	Xen Devel <xen-devel@...ts.xensource.com>,
	linux-doc@...r.kernel.org, KVM <kvm@...r.kernel.org>,
	Andi Kleen <andi@...stfloor.org>,
	Stefano Stabellini <stefano.stabellini@...citrix.com>,
	Stephan Diestelhorst <stephan.diestelhorst@....com>,
	LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Thomas Gleixner <tglx@...utronix.de>,
	"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>
Subject: Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

On 05/30/2012 04:56 PM, Raghavendra K T wrote:
> On 05/16/2012 08:49 AM, Raghavendra K T wrote:
>> On 05/14/2012 12:15 AM, Raghavendra K T wrote:
>>> On 05/07/2012 08:22 PM, Avi Kivity wrote:
>>>
>>> I could not come up with pv-flush results (also, Nikunj had clarified
>>> that the result was on a non-PLE machine).
>>>
>>>> I'd like to see those numbers, then.
>>>>
>>>> Ingo, please hold on the kvm-specific patches, meanwhile.
> [...]
>> To summarise, with a 32 vcpu guest and nr_thread=32 we get around 27%
>> improvement. In very low/undercommitted systems we may see a very small
>> improvement, or a small, acceptable degradation.
>>
>
> For large guests, the current SPIN_THRESHOLD value, along with
> ple_window, needed some research/experimentation.
>
> [Thanks to Jeremy/Nikunj for inputs and help in result analysis ]
>
> I started with the debugfs spinlock histograms, and ran experiments with
> 32 and 64 vcpu guests, for spin thresholds of 2k, 4k, 8k, 16k, and 32k,
> with 1vm/2vm/4vm, for kernbench, sysbench, ebizzy and hackbench.
> [ the spinlock histogram gives a logarithmic view of lock wait times ]
>
> machine: PLE machine with 32 cores.
>
> Here is the result summary.
> The summary has two parts:
> (1) %improvement w.r.t. the 2k spin threshold,
> (2) improvement w.r.t. the sum of the histogram counts in debugfs (which
> gives a rough indication of contention/cpu time wasted).
>
> For example, 98% for the 4k threshold, kbench, 1 vm would mean a 98%
> reduction in sigma(histogram values) compared to the 2k case.
>
> Result for 32 vcpu guest
> ==========================
> +----------------+-----------+-----------+-----------+-----------+
> | w.r.t Base-2k  |    4k     |    8k     |   16k     |   32k     |
> +----------------+-----------+-----------+-----------+-----------+
> | kbench-1vm     |    44     |    50     |    46     |    41     |
> | SPINHisto-1vm  |    98     |    99     |    99     |    99     |
> | kbench-2vm     |    25     |    45     |    49     |    45     |
> | SPINHisto-2vm  |    31     |    91     |    99     |    99     |
> | kbench-4vm     |   -13     |   -27     |    -2     |    -4     |
> | SPINHisto-4vm  |    29     |    66     |    95     |    99     |
> +----------------+-----------+-----------+-----------+-----------+
> | ebizzy-1vm     |   954     |   942     |   913     |   915     |
> | SPINHisto-1vm  |    96     |    99     |    99     |    99     |
> | ebizzy-2vm     |   158     |   135     |   123     |   106     |
> | SPINHisto-2vm  |    90     |    98     |    99     |    99     |
> | ebizzy-4vm     |   -13     |   -28     |   -33     |   -37     |
> | SPINHisto-4vm  |    83     |    98     |    99     |    99     |
> +----------------+-----------+-----------+-----------+-----------+
> | hbench-1vm     |    48     |    56     |    52     |    64     |
> | SPINHisto-1vm  |    92     |    95     |    99     |    99     |
> | hbench-2vm     |    32     |    40     |    39     |    21     |
> | SPINHisto-2vm  |    74     |    96     |    99     |    99     |
> | hbench-4vm     |    27     |    15     |     3     |   -57     |
> | SPINHisto-4vm  |    68     |    88     |    94     |    97     |
> +----------------+-----------+-----------+-----------+-----------+
> | sysbnch-1vm    |     0     |     0     |     1     |     0     |
> | SPINHisto-1vm  |    76     |    98     |    99     |    99     |
> | sysbnch-2vm    |    -1     |     3     |    -1     |    -4     |
> | SPINHisto-2vm  |    82     |    94     |    96     |    99     |
> | sysbnch-4vm    |     0     |    -2     |    -8     |   -14     |
> | SPINHisto-4vm  |    57     |    79     |    88     |    95     |
> +----------------+-----------+-----------+-----------+-----------+
>
> Result for 64 vcpu guest
> =========================
> +----------------+-----------+-----------+-----------+-----------+
> | w.r.t Base-2k  |    4k     |    8k     |   16k     |   32k     |
> +----------------+-----------+-----------+-----------+-----------+
> | kbench-1vm     |     1     |   -11     |   -25     |    31     |
> | SPINHisto-1vm  |     3     |    10     |    47     |    99     |
> | kbench-2vm     |    15     |    -9     |   -66     |   -15     |
> | SPINHisto-2vm  |     2     |    11     |    19     |    90     |
> +----------------+-----------+-----------+-----------+-----------+
> | ebizzy-1vm     |   784     |  1097     |   978     |   930     |
> | SPINHisto-1vm  |    74     |    97     |    98     |    99     |
> | ebizzy-2vm     |    43     |    48     |    56     |    32     |
> | SPINHisto-2vm  |    58     |    93     |    97     |    98     |
> +----------------+-----------+-----------+-----------+-----------+
> | hbench-1vm     |     8     |    55     |    56     |    62     |
> | SPINHisto-1vm  |    18     |    69     |    96     |    99     |
> | hbench-2vm     |    13     |   -14     |   -75     |   -29     |
> | SPINHisto-2vm  |    57     |    74     |    80     |    97     |
> +----------------+-----------+-----------+-----------+-----------+
> | sysbnch-1vm    |     9     |    11     |    15     |    10     |
> | SPINHisto-1vm  |    80     |    93     |    98     |    99     |
> | sysbnch-2vm    |     3     |     3     |     4     |     2     |
> | SPINHisto-2vm  |    72     |    89     |    94     |    97     |
> +----------------+-----------+-----------+-----------+-----------+
>
> From this, a threshold around 4k-8k seems to be the optimal one. [ This
> is almost in line with the ple_window default. ]
> (The lower the spin threshold, the smaller the percentage of spinlock
> waits we cover, which results in more halt exits/wakeups.)
>
> [ www.xen.org/files/xensummitboston08/LHP.pdf also has good graphical
> detail on covering spinlock waits ]
>
> Beyond the 8k threshold we see no more contention, but that would mean
> we have wasted a lot of cpu time in busy waits.
>
> I will get a PLE machine again, and I'll continue experimenting with
> further tuning of SPIN_THRESHOLD.

Sorry for the delayed response. I was busy with a lot of analysis and
experiments.

I continued my experiments with the spin threshold. Unfortunately I could
not settle on whether the 4k or the 8k threshold is better, since it
depends on the load and the type of workload.

Here are the results for 32 vcpu guests running sysbench and kernbench,
with four 8GB RAM VMs on the same PLE machine, where:

1x: benchmark running on 1 guest
2x: the same benchmark running on 2 guests, and so on

The 1x result is an average over 8*3 runs,
the 2x result over 4*3 runs,
the 3x result over 6*3 runs, and
the 4x result over 4*3 runs.


kernbench
=========
total_job = 2 * number of vcpus
kernbench -f -H -M -o $total_job


+------------+------------+-----------+---------------+---------+
| base       |  pv_4k     | %impr     |   pv_8k       | %impr   |
+------------+------------+-----------+---------------+---------+
| 49.98      |  49.147475 | 1.69393   |   50.575567   | -1.17758|
| 106.0051   |  96.668325 | 9.65857   |   91.62165    | 15.6987 |
| 189.82067  |  181.839   | 4.38942   |   188.8595    | 0.508934|
+------------+------------+-----------+---------------+---------+

sysbench
===========
Ran with num_thread = 2 * number of vcpus

sysbench --num-threads=$num_thread --max-requests=100000 --test=oltp \
	--oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run

32 vcpu
-------

+------------+------------+-----------+---------------+---------+
| base       |  pv_4k     | %impr     |   pv_8k       | %impr   |
+------------+------------+-----------+---------------+---------+
| 16.4109    |  12.109988 | 35.5154   |   12.658113   | 29.6473 |
| 14.232712  |  13.640387 | 4.34244   |   14.16485    | 0.479087|
| 23.49685   |  23.196375 | 1.29535   |   19.024871   | 23.506  |
+------------+------------+-----------+---------------+---------+
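
In case it helps reading the two tables: the raw base/pv numbers are times
(lower is better), and %impr appears to be the gain of the pv run computed
relative to the pv time, i.e. (base - pv) / pv * 100. A quick
self-contained check (just the arithmetic, not code from the patch set):

#include <stdio.h>

/* %impr as it appears to be computed in the tables above */
static double pct_impr(double base, double pv)
{
	return (base - pv) / pv * 100.0;
}

int main(void)
{
	/* first kernbench row: prints ~1.69393 (pv_4k), ~-1.17758 (pv_8k) */
	printf("%.5f %.5f\n",
	       pct_impr(49.98, 49.147475), pct_impr(49.98, 50.575567));
	/* first sysbench row: prints ~35.5154 (pv_4k) */
	printf("%.4f\n", pct_impr(16.4109, 12.109988));
	return 0;
}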

The observations are:

1) The 8k threshold does better for medium overcommit, but there PLE is
in control more than the pv spinlock is.

2) 4k does well for the no-overcommit and high-overcommit cases, and it
also helps more than 8k does on a non-PLE machine. In the medium
overcommit cases we see smaller performance benefits due to the increase
in halt exits (the sketch below shows where those halt exits come from).
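
For anyone not following the series closely, here is a rough,
self-contained sketch of where SPIN_THRESHOLD sits in the spin-then-halt
scheme. This is not the patch code: the lock structure is simplified and
pv_wait()/pv_kick() are stand-ins (stubbed as no-ops) for the real halt
and kick hypercalls.

#include <stdatomic.h>

#define SPIN_THRESHOLD	(1u << 12)	/* e.g. the "4k" case above */

struct ticketlock {
	atomic_uint head;		/* ticket currently being served */
	atomic_uint tail;		/* next ticket to hand out */
};

/* stand-ins for the halt/kick hypercalls, stubbed as no-ops here */
static void pv_wait(struct ticketlock *lock, unsigned int ticket)
{
	(void)lock; (void)ticket;
}

static void pv_kick(struct ticketlock *lock, unsigned int ticket)
{
	(void)lock; (void)ticket;
}

static void ticket_lock(struct ticketlock *lock)
{
	unsigned int ticket = atomic_fetch_add(&lock->tail, 1);
	unsigned int loops;

	for (;;) {
		/* busy-wait for at most SPIN_THRESHOLD iterations */
		for (loops = 0; loops < SPIN_THRESHOLD; loops++)
			if (atomic_load(&lock->head) == ticket)
				return;	/* got the lock while spinning */
		/*
		 * Threshold crossed: halt instead of burning more cpu.
		 * Every trip through here is one of the halt exits noted
		 * above; a smaller threshold means more of them.
		 */
		pv_wait(lock, ticket);
	}
}

static void ticket_unlock(struct ticketlock *lock)
{
	unsigned int next = atomic_fetch_add(&lock->head, 1) + 1;

	/* wake the owner of the next ticket in case it halted */
	pv_kick(lock, next);
}

The tradeoff being measured is exactly that loop bound: a larger
SPIN_THRESHOLD covers more of the lock waits by spinning (fewer halt
exits), at the cost of more wasted cpu time when the lock holder is
preempted.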

I'll continue my analysis.
I have also come up with a directed-yield patch, where we do a directed
yield in the vcpu block path instead of a blind schedule; a rough sketch
of the idea follows. I will do some more experiments with that and post
it as an RFC.
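
Since that RFC is not posted yet, the following is only the concept in a
toy, self-contained form; the names and the "preempted while holding a
lock" heuristic are made up for illustration and are not from the patch.
The point is just the difference between picking an arbitrary runnable
task (blind schedule) and preferring a vcpu of the same guest that is
likely holding the lock.

#include <stddef.h>
#include <stdio.h>

struct vcpu {
	int id;
	int runnable;
	int preempted_in_lock;	/* heuristic: likely lock holder */
};

/*
 * directed = 0: first runnable vcpu, standing in for a blind schedule().
 * directed = 1: prefer a runnable vcpu that was preempted while likely
 * holding a lock, so the blocked waiter's time goes to the task that can
 * actually release it.
 */
static struct vcpu *pick_yield_target(struct vcpu *vcpus, size_t n, int directed)
{
	size_t i;

	if (directed)
		for (i = 0; i < n; i++)
			if (vcpus[i].runnable && vcpus[i].preempted_in_lock)
				return &vcpus[i];

	for (i = 0; i < n; i++)
		if (vcpus[i].runnable)
			return &vcpus[i];

	return NULL;
}

int main(void)
{
	struct vcpu vcpus[] = {
		{ 0, 1, 0 },
		{ 1, 1, 1 },	/* likely lock holder */
		{ 2, 0, 0 },
	};
	struct vcpu *t = pick_yield_target(vcpus, 3, 1);

	/* prints "yield to vcpu 1" rather than blindly picking vcpu 0 */
	printf("yield to vcpu %d\n", t ? t->id : -1);
	return 0;
}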

Let me know if you have any comments/suggestions.

