linux-kernel - Re: [PATCH] locking/osq: Drop the overload of osq lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <D35549A9-AC84-4BAC-AE17-EC98E3162A46@linux.vnet.ibm.com>
Date:	Sun, 26 Jun 2016 15:08:20 +0800
From:	panxinhui <xinhui@...ux.vnet.ibm.com>
To:	Boqun Feng <boqun.feng@...il.com>
Cc:	panxinhui <xinhui@...ux.vnet.ibm.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Pan Xinhui <xinhui.pan@...ux.vnet.ibm.com>,
	linux-kernel@...r.kernel.org, mingo@...hat.com, dave@...olabs.net,
	will.deacon@....com, Waiman.Long@....com, benh@...nel.crashing.org
Subject: Re: [PATCH] locking/osq: Drop the overload of osq lock


> 在 2016年6月26日，14:59，Boqun Feng <boqun.feng@...il.com> 写道：
> 
> On Sun, Jun 26, 2016 at 02:10:57PM +0800, Boqun Feng wrote:
>> On Sun, Jun 26, 2016 at 01:21:04PM +0800, panxinhui wrote:
>>> 
>>>> 在 2016年6月26日，03:20，Peter Zijlstra <peterz@...radead.org> 写道：
>>>> 
>>>> On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
>>>>>>> Would that not have issues where the owner cpu is kept running but the
>>>>>>> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
>>>>>>> case we too want to stop spinning.
>>>>>>> 
>>>>>> 
>>>>> do  you mean that the spinner detect itself had yield out during the
>>>>> big spin loop?
>>>>> 
>>>>> It is very possible to happen.  BUT if spinner(on this vcpu) yield
>>>>> out, the next spinner would break the spin loop.  AND if spinner
>>>>> detect itself yield out once, it’s very possible to get the osq lock
>>>>> soon as long as the ower vcpu is running.
>>>>> 
>>>>> SO I think we need just check the owner vcpu’s yield_count.
>>>> 
>>>> I had a quick look at KVM and it looks like it only has
>>>> kvm_cpu::preempted, which would suggest the interface boqun proposed.
>>>> 
>>>> We'll have to look at many of the other virt platforms as well to see
>>>> what they can do.
>>>> 
>>>> We could also provide _both_ interfaces and a platform can implement
>>>> whichever variant (or both) it can.
>>>> 
>>> the kvm code on ppc has implemented  yield_count.
>>> It let me feel a little relaxed. :)
>>> 
>>> looks like we could  introduce the interface like below.
>>> 
>>> bool vcpu_is_preempted(int cpu)
>>> {
>>> 	return arch_vcpu_is_preempted(cpu);
>>> }
>>> 
>>> #ifdef arch_vcpu_has_yield_count 
>>> bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
>>> {
>>> 	return arch_get_vcpu_yield_count() != yield_count;
>>> }
>>> 
>>> #else
>>> bool vcpu_has_preemtped_once(int cpu, unsigned int yield_count)
>>> {
>>> 	/*just let called know it is preepmpted*/
>>> 	return vcpu_is_preempted(cpu);
>>> }
>>> #endif
>>> 
>> 
>> Unfortunately, on PPC, we compile pseries code along with powernv code
>> in a kernel binary, therefore we need to wire the proper primitives at
>> runtime.
>> 
>> I've cooked the following patch, please note I haven't test this, this
>> is just a POC.
>> 
>> Regards,
>> Boqun
>> 
>> ---------------->8
>> virt: Introduce vcpu preemption detection functions
>> 
>> Lock holder preemption is an serious problem in virtualized environment
>> for locking. Different archs and hypervisors employ different ways to
>> try to solve the problem in different locking mechanisms. And recently
>> Pan Xinhui found a significant performance drop in osq_lock(), which
>> could be solved by providing a mechanism to detect the preemption of
>> vcpu.
>> 
>> Therefore this patch introduces several vcpu preemption detection
>> primitives, which allows locking primtives to save the time of
>> unnecesarry spinning. These primitives act as an abstract layer between
>> locking functions and arch or hypervisor related code.
>> 
>> There are two sets of primitives:
>> 
>> 1.	vcpu_preempt_count() and vcpu_has_preempted(), they must be used
>> 	pairwisely in a same preempt disable critical section. And they
>> 	could detect whether a vcpu preemption happens between them.
>> 
>> 2.	vcpu_is_preempted(), used to check whether other cpu's vcpu is
>> 	preempted.
>> 
>> This patch also implements those primitives on pseries and wire them up.
>> 
>> Signed-off-by: Boqun Feng <boqun.feng@...il.com>
>> ---
>> arch/powerpc/platforms/pseries/Kconfig |  1 +
>> arch/powerpc/platforms/pseries/setup.c | 28 +++++++++++
>> include/linux/vcpu_preempt.h           | 90 ++++++++++++++++++++++++++++++++++
>> 3 files changed, 119 insertions(+)
>> create mode 100644 include/linux/vcpu_preempt.h
>> 
>> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
>> index bec90fb30425..7d24c3e48878 100644
>> --- a/arch/powerpc/platforms/pseries/Kconfig
>> +++ b/arch/powerpc/platforms/pseries/Kconfig
>> @@ -21,6 +21,7 @@ config PPC_PSERIES
>> 	select HOTPLUG_CPU if SMP
>> 	select ARCH_RANDOM
>> 	select PPC_DOORBELL
>> +	select HAS_VCPU_PREEMPTION_DETECTION
>> 	default y
>> 
>> config PPC_SPLPAR
>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>> index 9883bc7ea007..5d4aed54e039 100644
>> --- a/arch/powerpc/platforms/pseries/setup.c
>> +++ b/arch/powerpc/platforms/pseries/setup.c
>> @@ -42,6 +42,7 @@
>> #include <linux/of.h>
>> #include <linux/of_pci.h>
>> #include <linux/kexec.h>
>> +#include <linux/vcpu_preempt.h>
>> 
>> #include <asm/mmu.h>
>> #include <asm/processor.h>
>> @@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
>> 	of_pci_check_probe_only();
>> }
>> 
>> +struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
>> +EXPORT_SYMBOL(vcpu_preempt_ops);
>> +
>> +static long pseries_vcpu_preempt_count(void)
>> +{
>> +	return be32_to_cpu(get_lppaca()->yield_count);
>> +}
>> +
>> +static bool pseries_vcpu_is_preempted(int cpu)
>> +{
>> +	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
>> +}
>> +
>> +static bool pseries_vcpu_has_preempted(long vpc)
>> +{
>> +	return pseries_vcpu_preempt_count() != vpc;
>> +}
>> +
>> +static void __init pseries_setup_vcpu_preempt_ops(void)
>> +{
>> +	vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
>> +	vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
>> +	vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
>> +}
>> +
>> static void __init pSeries_setup_arch(void)
>> {
>> 	set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
>> @@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
>> 				"%ld\n", rc);
>> 		}
>> 	}
>> +
>> +	pseries_setup_vcpu_preempt_ops();
>> }
>> 
>> static int __init pSeries_init_panel(void)
>> diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
>> new file mode 100644
>> index 000000000000..4a88414bb83f
>> --- /dev/null
>> +++ b/include/linux/vcpu_preempt.h
>> @@ -0,0 +1,90 @@
>> +/*
>> + * Primitives for checking the vcpu preemption from the guest.
>> + */
>> +
>> +static long __vcpu_preempt_count(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +static bool __vcpu_has_preempted(long vpc)
>> +{
>> +	return false;
>> +}
>> +
>> +static bool __vcpu_is_preempted(int cpu)
>> +{
>> +	return false;
>> +}
>> +
>> +struct vcpu_preempt_ops {
>> +	/*
>> +	 * Get the current vcpu's "preempt count", which is going to use for
>> +	 * checking whether the current vcpu has ever been preempted
>> +	 */
>> +	long (*preempt_count)(void);
>> +
>> +	/*
>> +	 * Return whether a vcpu is preempted
>> +	 */
>> +	bool (*is_preempted)(int cpu);
>> +
>> +	/*
>> +	 * Given a "vcpu preempt count", Return whether a vcpu preemption ever
>> +	 * happened after the .preempt_count() was called.
>> +	 */
>> +	bool (*has_preempted)(long vpc);
>> +};
>> +
>> +struct vcpu_preempt_ops vcpu_preempt_ops;
>> +
> 
> This should be:
> 
> extern struct vcpu_preempt_ops vcpu_preempt_ops;
> 
> And I tested this one along with modified version of Xinhui's patch.
> 
> The test showed that even in a not over-committed system, the overhead
> of mutex_lock() could go down from 2.58%(w/o patches) to 0.87%(w/
> patches).
> 
> Xinhui, I did also modifiy the title and commit log of your patch, feel
> free to take anything into your patch.
> 

thanks boqun, I will review your patch.

> Regards,
> Boqun
> 
> ----------------->8
> locking/osq: Stop spinning in osq_lock if related vcpus get preempted
> 
> In a virtualized environment, a cpu may hold the osq_lock while its vcpu
> gets preempted, which makes the spinning of other cpus in the osq
> pointless, because it may be quite a while for the holder to get back to
> run and notify its successor to go.
> 
> To fix this, we use the vcpu preemption detection mechanism to check
> whether the holder is preempted out, if so we just stop spinning in the
> osq_lock(). Morever, if the spinner has ever been preempted, it also
> indicates that the spining in osq_lock() is pointless, so we detect this
> and stop spinning too.
> 
> Signed-off-by: Pan Xinhui <xinhui.pan@...ux.vnet.ibm.com>
> ---
> kernel/locking/osq_lock.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 05a37857ab55..bed5a56a6aed 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -1,6 +1,7 @@
> #include <linux/percpu.h>
> #include <linux/sched.h>
> #include <linux/osq_lock.h>
> +#include <linux/vcpu_preempt.h>
> 
> /*
>  * An MCS like lock especially tailored for optimistic spinning for sleeping
> @@ -87,6 +88,8 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> 	struct optimistic_spin_node *prev, *next;
> 	int curr = encode_cpu(smp_processor_id());
> 	int old;
> +	int loops;
> +	long vpc;
> 
> 	node->locked = 0;
> 	node->next = NULL;
> @@ -106,6 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> 	node->prev = prev;
> 	WRITE_ONCE(prev->next, node);
> 
> +	old = old - 1;
> +	vpc = vcpu_preempt_count();
> +
> 	/*
> 	 * Normally @prev is untouchable after the above store; because at that
> 	 * moment unlock can proceed and wipe the node element from stack.
> @@ -118,8 +124,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> 	while (!READ_ONCE(node->locked)) {
> 		/*
> 		 * If we need to reschedule bail... so we can block.
> +		 * An over-committed guest with more vCPUs than pCPUs
> +		 * might fall in this loop and cause a huge overload.
> +		 * This is because vCPU A(prev) hold the osq lock and yield out,
> +		 * vCPU B(node) wait ->locked to be set, IOW, wait till
> +		 * vCPU A run and unlock the osq lock.
> +		 * NOTE that vCPU A and vCPU B might run on same physical cpu.
> 		 */
> -		if (need_resched())
> +		if (need_resched() || vcpu_is_preempted(old) || vcpu_has_preempted(vpc))
> 			goto unqueue;
> 

the prev might change, so we need  read node->prev every loop, then check vcpu preempted.

> 		cpu_relax_lowlatency();