linux-kernel - Re: [PATCH v2] locking/osq: Use optimized spinning loop for arm64

Message-ID: <108e58d2-56f8-5ee2-23a8-f1260e428195@huawei.com>
Date:   Mon, 13 Jan 2020 16:32:14 +0800
From:   yezengruan <yezengruan@...wei.com>
To:     Waiman Long <longman@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Will Deacon <will.deacon@....com>,
        "Catalin Marinas" <catalin.marinas@....com>
CC:     <linux-kernel@...r.kernel.org>,
        <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [PATCH v2] locking/osq: Use optimized spinning loop for arm64

Hi Waiman,

On 2020/1/13 7:58, Waiman Long wrote:
> Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
> for spinlock that can boost performance of sibling threads by putting
> the current cpu to a shallow sleep state that is woken up only when
> the monitored variable changes or an external event happens.
> 
> OSQ has a more complicated spinning loop. Besides the lock value, it
> also checks for need_resched() and vcpu_is_preempted(). The check for
> need_resched() is not a problem as it is only set by the tick interrupt
> handler. That will be detected by the spinning cpu right after iret.
> 
> The vcpu_is_preempted() check, however, is a problem as changes to the
> preempt state of the previous node will not affect the sleep state. For
> ARM64, vcpu_is_preempted is not defined and so is a no-op. To guard
> against future addition of vcpu_is_preempted() to arm64, code is added
> to cause build error when vcpu_is_preempted becomes defined in arm64
> without the corresponding changes in the OSQ spinning code.

I have recently been working on vcpu_is_preempted() support for arm64. A patch set that adds it has been posted [1].

[1] https://lore.kernel.org/linux-arm-kernel/20191226135833.1052-1-yezengruan@huawei.com/
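
To illustrate how the proposed guard would interact with such a patch set, here is a minimal userspace sketch. It assumes the arch override takes the form of a macro named vcpu_is_preempted; the MODEL_* switch and the model_* helper are hypothetical names used only for this illustration. With the switch left undefined (arm64 as described in the quoted patch) it builds and the fallback always returns false. Defining the switch should make the #error fire.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical switch standing in for an arch header that provides its own
 * vcpu_is_preempted(). Leave it undefined to model arm64 as described in
 * the quoted patch; define it to model an arch that adds the hook.
 */
/* #define MODEL_ARCH_HAS_VCPU_IS_PREEMPTED */

/* Marker macro added to the arm64 barrier.h by the quoted patch. */
#define vcpu_is_preempted_not_used

#ifdef MODEL_ARCH_HAS_VCPU_IS_PREEMPTED
/* Assumed override pattern: a real function plus a macro of the hook's name. */
static inline bool model_arch_vcpu_is_preempted(int cpu)
{
	return cpu == 1;	/* made-up policy, for illustration only */
}
#define vcpu_is_preempted(cpu) model_arch_vcpu_is_preempted(cpu)
#endif

/* Guard from the quoted osq_lock.c hunk: both macros defined is a build error. */
#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
#endif

/* Fallback when no arch override exists: never report preemption. */
#ifndef vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}
#endif

/* Small self-check: without an override, no CPU is ever seen as preempted. */
static inline void model_selfcheck(void)
{
	assert(!vcpu_is_preempted(0));
}
```

In other words, once an arm64 vcpu_is_preempted() is provided this way, the guard turns the conflict into a build failure until the OSQ spinning code is adapted.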

> 
> On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
> microbenchmark was run for 10s with and without the patch. The
> performance numbers before patch were:
> 
> Running locktest with mutex [runtime = 10s, load = 1]
> Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
> Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
> 
> After patch, the numbers were:
> 
> Running locktest with mutex [runtime = 10s, load = 1]
> Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
> Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
> 
> So there was about 20% performance improvement.
> 
> Signed-off-by: Waiman Long <longman@...hat.com>
> ---
>  arch/arm64/include/asm/barrier.h | 10 ++++++++++
>  kernel/locking/osq_lock.c        | 25 ++++++++++++-------------
>  2 files changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 7d9cc5ec4971..8eb5f1239885 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -152,6 +152,16 @@ do {									\
>  	VAL;								\
>  })
>  
> +/*
> + * In osq_lock(), smp_cond_load_relaxed() is called with a condition
> + * that includes vcpu_is_preempted(). For arm64, vcpu_is_preempted is not
> + * currently defined. So it is a no-op. If vcpu_is_preempted is defined in
> + * the future, smp_cond_load_relaxed() will not respond to changes in the
> + * preempt state in a timely manner. So code changes will have to be made
> + * to address this deficiency.
> + */
> +#define vcpu_is_preempted_not_used
> +
>  #define smp_cond_load_acquire(ptr, cond_expr)				\
>  ({									\
>  	typeof(ptr) __PTR = (ptr);					\
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 6ef600aa0f47..69ec5161c3cc 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,6 +13,14 @@
>   */
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>  
> +/*
> + * The optimized smp_cond_load_relaxed() spin loop should not be used with
> + * vcpu_is_preempted defined.
> + */
> +#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
> +#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
> +#endif
> +
>  /*
>   * We use the value 0 to represent "no CPU", thus the encoded value
>   * will be the CPU number incremented by 1.
> @@ -134,20 +142,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * cmpxchg in an attempt to undo our queueing.
>  	 */
>  
> -	while (!READ_ONCE(node->locked)) {
> -		/*
> -		 * If we need to reschedule bail... so we can block.
> -		 * Use vcpu_is_preempted() to avoid waiting for a preempted
> -		 * lock holder:
> -		 */
> -		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
> -			goto unqueue;
> -
> -		cpu_relax();
> -	}
> -	return true;
> +	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> +				  vcpu_is_preempted(node_cpu(node->prev))))
> +		return true;
>  
> -unqueue:
> +	/* unqueue */
>  	/*
>  	 * Step - A  -- stabilize @prev
>  	 *
> 
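
The control flow of the rewritten wait in osq_lock() can be modelled in userspace as follows. This is only a sketch under stated assumptions: struct wait_model, model_osq_wait() and the *_after fields are hypothetical names, and a plain polling loop stands in for smp_cond_load_relaxed(), so it does not model the arm64 sleep state at all. It shows only that the loaded value is non-zero when the lock was handed over, and zero when the loop exited for another reason, in which case the caller falls through to the unqueue path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the real kernel symbols. */
struct wait_model {
	int locked;		/* models node->locked */
	bool resched;		/* models need_resched() */
	bool prev_preempted;	/* models vcpu_is_preempted(node_cpu(node->prev)) */
	int grant_after;	/* poll index at which locked becomes 1, -1 = never */
	int resched_after;	/* poll index at which resched becomes true, -1 = never */
};

/*
 * Models:
 *   if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
 *                             vcpu_is_preempted(node_cpu(node->prev))))
 *           return true;
 *   ... unqueue ...
 * Returns true if the lock was handed over, false if the caller must unqueue.
 */
static bool model_osq_wait(struct wait_model *m)
{
	int val;

	/* Refuse configurations in which the loop could never terminate. */
	assert(m->locked || m->resched || m->prev_preempted ||
	       m->grant_after >= 0 || m->resched_after >= 0);

	for (int poll = 0; ; poll++) {
		/* Simulated external events arriving at fixed poll counts. */
		if (poll == m->grant_after)
			m->locked = 1;
		if (poll == m->resched_after)
			m->resched = true;

		val = m->locked;	/* the value the macro calls VAL */
		if (val || m->resched || m->prev_preempted)
			break;
	}

	/* Only a non-zero VAL means the lock was actually acquired. */
	return val != 0;
}
```

For example, a model with grant_after = 3 and resched_after = -1 returns true, while one with grant_after = -1 and resched_after = 2, or one with prev_preempted already set, returns false.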

Thanks,

Zengruan