linux-kernel - Re: [PATCH v2] locking/osq: Use optimized spinning loop for arm64

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <a4233fec-89e7-08e2-01a6-35a2ffb289b1@redhat.com>
Date:   Mon, 13 Jan 2020 08:43:58 -0500
From:   Waiman Long <longman@...hat.com>
To:     Will Deacon <will@...nel.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Will Deacon <will.deacon@....com>,
        Catalin Marinas <catalin.marinas@....com>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        maz@...nel.org
Subject: Re: [PATCH v2] locking/osq: Use optimized spinning loop for arm64

On 1/13/20 6:57 AM, Will Deacon wrote:
> [+Marc]
>
> On Sun, Jan 12, 2020 at 06:58:54PM -0500, Waiman Long wrote:
>> Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
>> for spinlock that can boost performance of sibling threads by putting
>> the current cpu to a shallow sleep state that is woken up only when
>> the monitored variable changes or an external event happens.
>>
>> OSQ has a more complicated spinning loop. Besides the lock value, it
>> also checks for need_resched() and vcpu_is_preempted(). The check for
>> need_resched() is not a problem as it is only set by the tick interrupt
>> handler. That will be detected by the spinning cpu right after iret.
>>
>> The vcpu_is_preempted() check, however, is a problem as changes to the
>> preempt state of of previous node will not affect the sleep state. For
>> ARM64, vcpu_is_preempted is not defined and so is a no-op. To guard
>> against future addition of vcpu_is_preempted() to arm64, code is added
>> to cause build error when vcpu_is_preempted becomes defined in arm64
>> without the corresponding changes in the OSQ spinning code.
>>
>> On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
>> microbenchmark was run for 10s with and without the patch. The
>> performance numbers before patch were:
>>
>> Running locktest with mutex [runtime = 10s, load = 1]
>> Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
>> Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
>>
>> After patch, the numbers were:
>>
>> Running locktest with mutex [runtime = 10s, load = 1]
>> Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
>> Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
>>
>> So there was about 20% performance improvement.
>>
>> Signed-off-by: Waiman Long <longman@...hat.com>
>> ---
>>  arch/arm64/include/asm/barrier.h | 10 ++++++++++
>>  kernel/locking/osq_lock.c        | 25 ++++++++++++-------------
>>  2 files changed, 22 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
>> index 7d9cc5ec4971..8eb5f1239885 100644
>> --- a/arch/arm64/include/asm/barrier.h
>> +++ b/arch/arm64/include/asm/barrier.h
>> @@ -152,6 +152,16 @@ do {									\
>>  	VAL;								\
>>  })
>>  
>> +/*
>> + * In osq_lock(), smp_cond_load_relaxed() is called with a condition
>> + * that includes vcpu_is_preempted(). For arm64, vcpu_is_preempted is not
>> + * currently defined. So it is a no-op. If vcpu_is_preempted is defined in
>> + * the future, smp_cond_load_relaxed() will not response to changes in the
>> + * preempt state in a timely manner. So code changes will have to be made
>> + * to address this deficiency.
>> + */
>> +#define vcpu_is_preempted_not_used
>> +
>>  #define smp_cond_load_acquire(ptr, cond_expr)				\
>>  ({									\
>>  	typeof(ptr) __PTR = (ptr);					\
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 6ef600aa0f47..69ec5161c3cc 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -13,6 +13,14 @@
>>   */
>>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>>  
>> +/*
>> + * The optimized smp_cond_load_relaxed() spin loop should not be used with
>> + * vcpu_is_preempted defined.
>> + */
>> +#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
>> +#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
>> +#endif
> Although I appreciate you going the extra mile for arm64 (thanks!), I think
> this is probably a bit overkill given that I don't plan to merge the series
> from Zengruan any time soon. Instead, how about just defining
> vcpu_is_preempted in arch/arm64/include/asm/spinlock.h with a comment:
>
>
> 	/*
> 	 * Changing this will break osq_lock() thanks to the call inside
> 	 * smp_cond_load_relaxed().
> 	 *
> 	 * See:
> 	 * https://lore.kernel.org/lkml/20200110100612.GC2827@hirez.programming.kicks-ass.net
> 	 */
> 	#define vcpu_is_preempted(cpu)	false
>
>
> So we'll notice that when somebody tries to change it.

Yes, that works for me. I just want to make sure that if any changes to
add vcpu_is_preempted to arm64 in the future will get caught.

Cheers,
Longman