Date:	Thu, 18 Jun 2015 18:14:54 -0400
From:	Waiman Long <waiman.long@...com>
To:	Will Deacon <will.deacon@....com>
CC:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, Arnd Bergmann <arnd@...db.de>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Scott J Norton <scott.norton@...com>,
	Douglas Hatch <doug.hatch@...com>
Subject: Re: [PATCH v3 2/2] locking/qrwlock: Don't contend with readers when
 setting _QW_WAITING

On 06/18/2015 08:40 AM, Will Deacon wrote:
> On Thu, Jun 18, 2015 at 02:33:56AM +0100, Waiman Long wrote:
>> On 06/16/2015 02:02 PM, Will Deacon wrote:
>>> On Mon, Jun 15, 2015 at 11:24:03PM +0100, Waiman Long wrote:
>>>> The current cmpxchg() loop in setting the _QW_WAITING flag for writers
>>>> in queue_write_lock_slowpath() will contend with incoming readers
>>>> causing possibly extra cmpxchg() operations that are wasteful. This
>>>> patch changes the code to do a byte cmpxchg() to eliminate contention
>>>> with new readers.
>>>>
>>>> A multithreaded microbenchmark running a 5M read_lock/write_lock loop
>>>> on an 8-socket 80-core Westmere-EX machine running a 4.0-based kernel
>>>> with the qspinlock patch has the following execution times (in ms)
>>>> with and without the patch:
>>>>
>>>> With R:W ratio = 5:1
>>>>
>>>> 	Threads	   w/o patch	with patch	% change
>>>> 	-------	   ---------	----------	--------
>>>> 	   2	     990 	    895		  -9.6%
>>>> 	   3	    2136 	   1912		 -10.5%
>>>> 	   4	    3166	   2830		 -10.6%
>>>> 	   5	    3953	   3629		  -8.2%
>>>> 	   6	    4628	   4405		  -4.8%
>>>> 	   7	    5344	   5197		  -2.8%
>>>> 	   8	    6065	   6004		  -1.0%
>>>> 	   9	    6826	   6811		  -0.2%
>>>> 	  10	    7599	   7599		   0.0%
>>>> 	  15	    9757	   9766		  +0.1%
>>>> 	  20	   13767	  13817		  +0.4%
>>>>
>>>> With a small number of contending threads, this patch can improve
>>>> locking performance by up to 10%. With more contending threads,
>>>> however, the gain diminishes.
>>>>
>>>> Signed-off-by: Waiman Long <Waiman.Long@...com>
>>>> ---
>>>>    kernel/locking/qrwlock.c |   28 ++++++++++++++++++++++++----
>>>>    1 files changed, 24 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
>>>> index d7d7557..559198a 100644
>>>> --- a/kernel/locking/qrwlock.c
>>>> +++ b/kernel/locking/qrwlock.c
>>>> @@ -22,6 +22,26 @@
>>>>    #include <linux/hardirq.h>
>>>>    #include <asm/qrwlock.h>
>>>>
>>>> +/*
>>>> + * This internal data structure is used for optimizing access to some of
>>>> + * the subfields within the atomic_t cnts.
>>>> + */
>>>> +struct __qrwlock {
>>>> +	union {
>>>> +		atomic_t cnts;
>>>> +		struct {
>>>> +#ifdef __LITTLE_ENDIAN
>>>> +			u8 wmode;	/* Writer mode   */
>>>> +			u8 rcnts[3];	/* Reader counts */
>>>> +#else
>>>> +			u8 rcnts[3];	/* Reader counts */
>>>> +			u8 wmode;	/* Writer mode   */
>>>> +#endif
>>>> +		};
>>>> +	};
>>>> +	arch_spinlock_t	lock;
>>>> +};
>>>> +
>>>>    /**
>>>>     * rspin_until_writer_unlock - inc reader count & spin until writer is gone
>>>>     * @lock  : Pointer to queue rwlock structure
>>>> @@ -109,10 +129,10 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
>>>>    	 * or wait for a previous writer to go away.
>>>>    	 */
>>>>    	for (;;) {
>>>> -		cnts = atomic_read(&lock->cnts);
>>>> -		if (!(cnts & _QW_WMASK) &&
>>>> -		    (atomic_cmpxchg(&lock->cnts, cnts,
>>>> -				    cnts | _QW_WAITING) == cnts))
>>>> +		struct __qrwlock *l = (struct __qrwlock *)lock;
>>>> +
>>>> +		if (!READ_ONCE(l->wmode) &&
>>>> +		   (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
>>>>    			break;
>>> Maybe you could also update the x86 implementation of queue_write_unlock
>>> to write the wmode field instead of casting to u8 *?
>>>
>> The queue_write_unlock() function is in the header file. I don't want to
>> expose the internal structure to other files.
> Then I don't see the value in the new data structure -- why not just cast
> to u8 * instead? In my mind, the structure has the advantage of supporting
> both big and little endian systems, but to be useful it would need to be
> available in the header file for architectures that chose to override
> queue_write_unlock.

Casting to (u8 *) directly would require ugly endian-conditional
compilation code in the function. Putting that conditional in the data
structure instead is much easier to read and understand.
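
To illustrate the difference, here is a minimal userspace sketch, not the
kernel code. The names demo_cnts, wmode_by_cast(), wmode_by_field() and
demo_same_byte() are hypothetical, and the GCC/Clang __BYTE_ORDER__ macro
stands in for the kernel's __LITTLE_ENDIAN check. The layout is assumed to
mirror the struct __qrwlock in the patch, with the arch_spinlock_t member
left out.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the union inside the patch's struct __qrwlock. */
union demo_cnts {
	uint32_t cnts;			/* whole lock word */
	struct {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
		uint8_t wmode;		/* writer byte at the lowest address */
		uint8_t rcnts[3];	/* reader count bytes */
#else
		uint8_t rcnts[3];	/* reader count bytes */
		uint8_t wmode;		/* writer byte at the highest address */
#endif
	};
};

static_assert(sizeof(union demo_cnts) == sizeof(uint32_t),
	      "byte view must overlay the word exactly");

/* Variant 1: raw cast. The endian check has to live in the function. */
static uint8_t *wmode_by_cast(uint32_t *cnts)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	return (uint8_t *)cnts;
#else
	return (uint8_t *)cnts + 3;
#endif
}

/* Variant 2: named field. The endian check lives once, in the type. */
static uint8_t *wmode_by_field(union demo_cnts *c)
{
	return &c->wmode;
}

/* Returns 1 if both variants address the low 8 bits of the word. */
int demo_same_byte(void)
{
	union demo_cnts c;

	c.cnts = 0;
	*wmode_by_field(&c) = 0xff;
	return wmode_by_cast(&c.cnts) == wmode_by_field(&c) &&
	       (c.cnts & 0xff) == 0xff && (c.cnts >> 8) == 0;
}
```

Both variants reach the same byte. With the raw cast, every function that
touches the writer byte would need to repeat the #if block, while with the
named field it appears only once.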

> As an aside, I have some patches to get this up and running on arm64
> which would need something like this structure for the big-endian case.

If there are going to be other consumers of the internal structure, I
think it will be worthwhile to put it into the header file directly. I
will update the patch to make that change.
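
As a rough sketch of what a shared structure could allow, the userspace
example below uses the wmode field both for the byte-wide cmpxchg from the
quoted patch and for an unlock that clears the writer byte. It is only an
illustration under stated assumptions: struct demo_qrwlock, the DEMO_*
constants and all function names are hypothetical, the GCC/Clang __atomic
builtins stand in for the kernel's cmpxchg() and atomic_t helpers, and
mixing byte and word atomics on the same location relies on the hardware
behaving as it does on common architectures, not on any guarantee from the
C standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_QW_WAITING	1u		/* assumed to mirror _QW_WAITING */
#define DEMO_QW_WMASK	0xffu		/* writer byte mask */
#define DEMO_QR_BIAS	(1u << 8)	/* one reader, counted above the writer byte */

struct demo_qrwlock {
	union {
		uint32_t cnts;
		struct {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
			uint8_t wmode;
			uint8_t rcnts[3];
#else
			uint8_t rcnts[3];
			uint8_t wmode;
#endif
		};
	};
};

/* Word-wide attempt: fails if anything in the word changed since 'seen'. */
static bool set_waiting_word(struct demo_qrwlock *l, uint32_t seen)
{
	if (seen & DEMO_QW_WMASK)
		return false;
	return __atomic_compare_exchange_n(&l->cnts, &seen,
					   seen | DEMO_QW_WAITING, false,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Byte-wide attempt: only the writer byte is compared and replaced. */
static bool set_waiting_byte(struct demo_qrwlock *l)
{
	uint8_t expected = 0;

	return __atomic_compare_exchange_n(&l->wmode, &expected,
					   DEMO_QW_WAITING, false,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* A reader arriving bumps the reader count in the upper bytes. */
static void reader_arrives(struct demo_qrwlock *l)
{
	__atomic_fetch_add(&l->cnts, DEMO_QR_BIAS, __ATOMIC_ACQUIRE);
}

/* Unlock through the named field, with no (u8 *) cast needed. */
static void write_unlock_byte(struct demo_qrwlock *l)
{
	__atomic_store_n(&l->wmode, 0, __ATOMIC_RELEASE);
}

/*
 * Simulates a reader arriving between the writer's read and its cmpxchg.
 * Bit 0: word cmpxchg succeeded, bit 1: byte cmpxchg succeeded,
 * bit 2: reader count intact after the byte-wide unlock.
 */
int demo_reader_race(void)
{
	struct demo_qrwlock l;
	uint32_t seen;
	int result = 0;

	l.cnts = 0;
	seen = __atomic_load_n(&l.cnts, __ATOMIC_RELAXED);
	reader_arrives(&l);

	if (set_waiting_word(&l, seen))
		result |= 1;
	if (set_waiting_byte(&l))
		result |= 2;

	write_unlock_byte(&l);
	if (__atomic_load_n(&l.cnts, __ATOMIC_RELAXED) == DEMO_QR_BIAS)
		result |= 4;
	return result;
}
```

In this single-threaded simulation the word-wide cmpxchg fails because the
reader count changed underneath it, while the byte-wide cmpxchg is
unaffected, so demo_reader_race() returns 6.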

Cheers,
Longman
