linux-kernel - Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Tue, 7 Nov 2017 11:39:02 -0500
From:   Waiman Long <longman@...hat.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
        Paolo Bonzini <pbonzini@...hat.com>,
        Juergen Gross <jgross@...e.com>,
        Radim Krčmář <rkrcmar@...hat.com>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>
Subject: Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

On 11/07/2017 07:58 AM, Peter Zijlstra wrote:
> On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:
>> Currently, all the lock waiters entering the slowpath will do one
>> lock stealing attempt to acquire the lock. That helps performance,
>> especially in VMs with over-committed vCPUs. However, the current
>> pvqspinlocks still don't perform as good as unfair locks in many cases.
>> On the other hands, unfair locks do have the problem of lock starvation
>> that pvqspinlocks don't have.
>>
>> This patch combines the best attributes of an unfair lock and a
>> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
>> mode. A lock waiter goes into the unfair mode when there are waiters
>> in the wait queue but the pending bit isn't set. Otherwise, it will
>> go into the queued mode waiting in the queue for its turn.
>>
>> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
>> (make -j<n>) was done in a VM with unpinned vCPUs 3 times with the
>> best time selected and <n> is the number of vCPUs available. The build
>> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
>> with various number of vCPUs are as follows:
>>
>>   vCPUs    pvqlock     hybrid pvqlock    unfair lock
>>   -----    -------     --------------    -----------
>>     30      342.1s         329.1s          329.1s
>>     36      314.1s         305.3s          307.3s
>>     45      345.0s         302.1s          306.6s
>>     54      365.4s         308.6s          307.8s
>>     72      358.9s         293.6s          303.9s
>>    108      343.0s         285.9s          304.2s
>>
>> The hybrid pvqspinlock performs better or comparable to the unfair
>> lock.
>>
>> By turning on QUEUED_LOCK_STAT, the table below showed the number
>> of lock acquisitions in unfair mode and queue mode after a kernel
>> build with various number of vCPUs.
>>
>>   vCPUs    queued mode  unfair mode
>>   -----    -----------  -----------
>>     30      9,130,518      294,954
>>     36     10,856,614      386,809
>>     45      8,467,264   11,475,373
>>     54      6,409,987   19,670,855
>>     72      4,782,063   25,712,180
>>
>> It can be seen that as the VM became more and more over-committed,
>> the ratio of locks acquired in unfair mode increases. This is all
>> done automatically to get the best overall performance as possible.
>>
>> The table below shows the kernel build times on a smaller 2-socket
>> 16-core 32-thread E5-2620 v4 system.
>>
>>   vCPUs    pvqlock     hybrid pvqlock    unfair lock
>>   -----    -------     --------------    -----------
>>     16      436.8s         433.4s          435.6s
>>     36      366.2s         364.8s          364.5s
>>     48      423.6s         376.3s          370.2s
>>     64      433.1s         376.6s          376.8s
>>
>> Again, the performance of the hybrid pvqspinlock was comparable to
>> that of the unfair lock.
> You forgot to test a starvation case. And you also forgot to describe
> how they cannot happen. I really cannot remember how all this is
> supposed to work.

Lock starvation shouldn't happen. The pending bit is set by the queue
head to indicate its readiness before spinning on the lock. Once the
pending bit is made visible to all the CPUs, no one can steal the lock
and they will all queued up in the wait queue.

I ran my locking microbenchmark on a 2-socket 36-core system (HT off)
with # of locking threads equals to the number of vCPUs in a kvm VM. The
table below show the the min/mean/max per-thread locking ops done in 5
seconds:

#vCPUs    min/avg/max lockops
36            822,135/881,063/950,363
54            542,435/581,664/625,937
72            397,500/428,177/499,299

It is obvious that there was no lock starvation here.

Cheers,
Longman