Date:	Wed, 4 May 2016 13:27:07 -0400
From:	Waiman Long <waiman.long@....com>
To:	Davidlohr Bueso <dave@...olabs.net>
CC:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, <linux-kernel@...r.kernel.org>,
	Jason Low <jason.low2@...com>,
	Scott J Norton <scott.norton@....com>,
	Douglas Hatch <doug.hatch@....com>
Subject: Re: [PATCH] locking/rwsem: Add reader-owned state to the owner field

On 05/03/2016 08:21 PM, Davidlohr Bueso wrote:
> On Wed, 27 Apr 2016, Waiman Long wrote:
>
>> This patch adds a new state RWSEM_READER_OWNED to the owner field
>> to indicate that readers currently own the lock. This enables us to
>> address the following 2 issues in the rwsem optimistic spinning code:
>>
>> 1) rwsem_can_spin_on_owner() will disallow optimistic spinning if
>>    the owner field is NULL, which can mean either that readers own
>>    the lock or that the owning writer hasn't set the owner field yet.
>>    In the latter case, we miss the chance to do optimistic spinning.
>>
>> 2) If a reader takes the lock while a writer is spinning, the writer
>>    will continue to spin in the main rwsem_optimistic_spin() loop
>>    because the owner is NULL.
>>
>> Adding the new state allows optimistic spinning to go forward as long
>> as the owner field is not RWSEM_READER_OWNED and the owner, if set, is
>> running, and stops it immediately once that state is reached.
>>
>> On a 4-socket Haswell machine running a 4.6-rc1 based kernel, fio
>> multithreaded randrw and randwrite tests were run on the same file on
>> an XFS partition on top of an NVDIMM. The aggregated bandwidths before
>> and after the patch were as follows:
>>
>>  Test      BW before patch     BW after patch  % change
>>  ----      ---------------     --------------  --------
>>  randrw         988 MB/s          1192 MB/s      +21%
>>  randwrite     1513 MB/s          1623 MB/s      +7.3%
>>
>> The perf profiles of the rwsem_down_write_failed() function in randrw
>> before and after the patch were:
>>
>>   19.95%  5.88%  fio  [kernel.vmlinux]  [k] rwsem_down_write_failed
>>   14.20%  1.52%  fio  [kernel.vmlinux]  [k] rwsem_down_write_failed
>>
>> The actual CPU cycles spent in rwsem_down_write_failed() dropped from
>> 5.88% to 1.52% after the patch.
>
> This all makes sense and addresses the gap between the counter and the
> owner fields. I've also been throwing mmap_sem-intensive workloads at
> it, with good results on page reclaim and some compression benchmarks.
> It would probably also be good to try xfstests, as this directly
> influences dchinner's areas.
>

Yes, I will try running xfstests. However, I do expect good results, 
as I was already exercising the XFS rwsem code in my test.

>>
>> Signed-off-by: Waiman Long <Waiman.Long@....com>
>> ---
>> kernel/locking/rwsem-xadd.c |   34 +++++++++++++---------------------
>> kernel/locking/rwsem.c      |    8 ++++++--
>> kernel/locking/rwsem.h      |   31 +++++++++++++++++++++++++++++++
>> 3 files changed, 50 insertions(+), 23 deletions(-)
>>
>> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
>> index df4dcb8..09e5cf4 100644
>> --- a/kernel/locking/rwsem-xadd.c
>> +++ b/kernel/locking/rwsem-xadd.c
>> @@ -155,6 +155,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>>             /* Last active locker left. Retry waking readers. */
>>             goto try_reader_grant;
>>         }
>> +        rwsem_reader_owned(sem);
>
> This is here because we've just accounted for RWSEM_ACTIVE_READ_BIAS, no?
>

That call is not absolutely necessary, but it gives an early warning 
that the lock is now owned by readers. Otherwise, it may take some time 
before the first woken reader runs and sets the owner field.
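To illustrate the window being closed, here is a paraphrased sketch of 
the reader-grant path (the surrounding __rwsem_do_wake() context is my 
paraphrase, not the patch itself):

	/* __rwsem_do_wake(), reader-grant path (paraphrased sketch): */
	/* sem->count already accounts for the granted reader(s) here. */
	rwsem_reader_owned(sem);	/* flag reader ownership right away */
	/*
	 * Without this call, sem->owner would keep its old value until
	 * the first woken reader actually runs and sets it in down_read(),
	 * so a spinning writer could spin through that whole window.
	 */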

>>     }
>>
>>     /* Grant an infinite number of read locks to the readers at the front
>> @@ -306,16 +307,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
>>
>>     rcu_read_lock();
>>     owner = READ_ONCE(sem->owner);
>> -    if (!owner) {
>> -        long count = READ_ONCE(sem->count);
>> +    if (!rwsem_is_writer_owned(owner)) {
>>         /*
>> -         * If sem->owner is not set, yet we have just recently entered the
>> -         * slowpath with the lock being active, then there is a possibility
>> -         * reader(s) may have the lock. To be safe, bail spinning in these
>> -         * situations.
>> +         * Don't spin if the rwsem is reader-owned.
>>          */
>> -        if (count & RWSEM_ACTIVE_MASK)
>> -            ret = false;
>> +        ret = (owner != RWSEM_READER_OWNED);
>>         goto done;
>>     }
>>
>> @@ -328,8 +324,6 @@ done:
>> static noinline
>> bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
>> {
>> -    long count;
>> -
>>     rcu_read_lock();
>>     while (sem->owner == owner) {
>>         /*
>> @@ -350,16 +344,11 @@ bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
>>     }
>>     rcu_read_unlock();
>>
>> -    if (READ_ONCE(sem->owner))
>> -        return true; /* new owner, continue spinning */
>> -
>>     /*
>> -     * When the owner is not set, the lock could be free or
>> -     * held by readers. Check the counter to verify the
>> -     * state.
>> +     * If there is a new owner or the owner is not set, we continue
>> +     * spinning.
>>      */
>> -    count = READ_ONCE(sem->count);
>> -    return (count == 0 || count == RWSEM_WAITING_BIAS);
>> +    return READ_ONCE(sem->owner) != RWSEM_READER_OWNED;
>> }
>>
>> static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>> @@ -378,7 +367,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>
>>     while (true) {
>>         owner = READ_ONCE(sem->owner);
>> -        if (owner && !rwsem_spin_on_owner(sem, owner))
>> +        if (rwsem_is_writer_owned(owner) &&
>> +           !rwsem_spin_on_owner(sem, owner))
>>             break;
>>
>>         /* wait_lock will be acquired if write_lock is obtained */
>> @@ -391,9 +381,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
>>          * When there's no owner, we might have preempted between the
>>          * owner acquiring the lock and setting the owner field. If
>>          * we're an RT task that will live-lock because we won't let
>> -         * the owner complete.
>> +         * the owner complete. We also quit if the lock is owned by
>> +         * readers.
>>          */
>> -        if (!owner && (need_resched() || rt_task(current)))
>> +        if ((owner == RWSEM_READER_OWNED) ||
>> +           (!owner && (need_resched() || rt_task(current))))
>>             break;
>
> Is this really necessary? rwsem_can_spin_on_owner() already makes sure
> that owner is not RWSEM_READER_OWNED, so at this point we spin as long
> as sem->owner still matches the original owner.

A writer may wait in the OSQ for some time before getting the OSQ lock. 
During that period, a reader may have taken the lock. In that case, 
owner will be NULL in the original code, or RWSEM_READER_OWNED in the 
new code, and the writer effectively spins in the main optimistic-spin 
loop because rwsem_spin_on_owner() is never called. I don't think we 
want to spin on readers, as we have no way to find out whether any of 
them has slept. The reduction in time spent in rwsem_down_write_failed() 
documented in the change log comes from this.
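To make that flow concrete, here is a condensed sketch of 
rwsem_optimistic_spin() after the patch (paraphrased from the hunks 
above; rwsem_try_write_lock_unqueued() and cpu_relax_lowlatency() are 
from the surrounding 4.6-era rwsem-xadd.c, not shown in the diff):

	/* Condensed sketch; OSQ setup/teardown and preemption handling omitted. */
	while (true) {
		owner = READ_ONCE(sem->owner);

		/* Spin while a writer owns the lock and stays running. */
		if (rwsem_is_writer_owned(owner) &&
		    !rwsem_spin_on_owner(sem, owner))
			break;

		/* wait_lock will be acquired if write_lock is obtained */
		if (rwsem_try_write_lock_unqueued(sem)) {
			taken = true;
			break;
		}

		/*
		 * Readers grabbed the lock while we waited in the OSQ:
		 * owner is now RWSEM_READER_OWNED, so bail out instead of
		 * spinning with no way to tell if any reader has slept.
		 */
		if ((owner == RWSEM_READER_OWNED) ||
		    (!owner && (need_resched() || rt_task(current))))
			break;

		cpu_relax_lowlatency();
	}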

>
>>
>>         /*
>> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
>> index c817216..632766a 100644
>> --- a/kernel/locking/rwsem.c
>> +++ b/kernel/locking/rwsem.c
>> @@ -22,6 +22,7 @@ void __sched down_read(struct rw_semaphore *sem)
>>     rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
>>
>>     LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
>> +    rwsem_reader_owned(sem);
>> }
>>
>> EXPORT_SYMBOL(down_read);
>> @@ -33,8 +34,10 @@ int down_read_trylock(struct rw_semaphore *sem)
>> {
>>     int ret = __down_read_trylock(sem);
>>
>> -    if (ret == 1)
>> +    if (ret == 1) {
>>         rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_);
>> +        rwsem_reader_owned(sem);
>> +    }
>>     return ret;
>> }
>>
>> @@ -124,7 +127,7 @@ void downgrade_write(struct rw_semaphore *sem)
>>      * lockdep: a downgraded write will live on as a write
>>      * dependency.
>>      */
>> -    rwsem_clear_owner(sem);
>> +    rwsem_reader_owned(sem);
>>     __downgrade_write(sem);
>> }
>>
>> @@ -138,6 +141,7 @@ void down_read_nested(struct rw_semaphore *sem, int subclass)
>>     rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
>>
>>     LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
>> +    rwsem_reader_owned(sem);
>> }
>>
>> EXPORT_SYMBOL(down_read_nested);
>> diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
>> index 870ed9a..2353ebf 100644
>> --- a/kernel/locking/rwsem.h
>> +++ b/kernel/locking/rwsem.h
>> @@ -1,3 +1,20 @@
>> +/*
>> + * The owner field of the rw_semaphore structure will be set to
>> + * RWSEM_READER_OWNED when a reader grabs the lock. A writer will clear
>> + * the owner field when it unlocks. A reader, on the other hand, will
>> + * not touch the owner field when it unlocks.
>> + *
>> + * In essence, the owner field now has the following 3 states:
>> + *  1) 0
>> + *     - lock is free or the owner hasn't set the field yet
>> + *  2) RWSEM_READER_OWNED
>> + *     - readers currently or previously owned the lock (the lock may
>> + *       actually be free, or a new owner may not have set the field yet)
>> + *  3) Other non-zero value
>> + *     - a writer owns the lock
>> + */
>> +#define RWSEM_READER_OWNED    ((struct task_struct *)1)
>
> I think it looks a lot better if you just define it as 1UL and then
> just cast in rwsem_reader_owned(). That would also save the cast in
> rwsem_is_writer_owned().

That is fine, I can make the change.
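For reference, one plausible shape of that change (a sketch only; the 
body of rwsem_is_writer_owned() isn't in the hunks above, so what 
follows is my assumption of how it would look):

#define RWSEM_READER_OWNED	1UL

static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
{
	/* The single remaining cast lives here. */
	sem->owner = (struct task_struct *)RWSEM_READER_OWNED;
}

static inline bool rwsem_is_writer_owned(struct task_struct *owner)
{
	/* Anything above the reader-owned marker is a real task pointer. */
	return (unsigned long)owner > RWSEM_READER_OWNED;
}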

>
>> +
>> #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>> static inline void rwsem_set_owner(struct rw_semaphore *sem)
>> {
>> @@ -9,6 +26,16 @@ static inline void rwsem_clear_owner(struct rw_semaphore *sem)
>>     sem->owner = NULL;
>> }
>>
>> +static inline void rwsem_reader_owned(struct rw_semaphore *sem)
>
> Nits: rwsem_set_reader_owner()?

How about rwsem_set_reader_owned()? reader_owner kind of looks weird to me.

>
>> +{
>> +    if (sem->owner != RWSEM_READER_OWNED)
>> +        sem->owner = RWSEM_READER_OWNED;
>
> ... and just blindly setting it ought to be fine.
>
> Thanks,
> Davidlohr

I was trying to minimize the number of writes to the rwsem cacheline so 
as to reduce cacheline contention. I am not sure the CPU will be smart 
enough to discard the write when the cacheline already contains the 
value being written.
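For reference, the write-avoidance idiom in question (a sketch using 
the rename discussed above; the READ_ONCE()/WRITE_ONCE() wrappers are 
my addition for illustration, the posted diff uses plain accesses):

static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
{
	/*
	 * Skip the store when the marker is already set, so repeated
	 * reader acquisitions don't keep dirtying the rwsem cacheline.
	 */
	if (READ_ONCE(sem->owner) != RWSEM_READER_OWNED)
		WRITE_ONCE(sem->owner, RWSEM_READER_OWNED);
}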

Cheers,
Longman
