Date:   Fri, 19 Nov 2021 10:52:56 +0800
From:   Muchun Song <songmuchun@...edance.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        Waiman Long <longman@...hat.com>, boqun.feng@...il.com,
        LKML <linux-kernel@...r.kernel.org>,
        Xiongchun duan <duanxiongchun@...edance.com>,
        Qi Zheng <zhengqi.arch@...edance.com>
Subject: Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly
 contended case

On Thu, Nov 18, 2021 at 8:57 PM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:
>
> > By using the above benchmark, the real execution times on an x86-64 system
> > before and after the patch were:
>
> What kind of x86_64 ?

Intel(R) Xeon(R) CPU E5-2650

>
> >
> >                   Before Patch  After Patch
> >    # of Threads      real          real     reduced by
> >    ------------     ------        ------    ----------
> >          1          65,373        65,206       ~0.0%
> >          4          15,467        15,378       ~0.5%
> >         40           6,214         5,528      ~11.0%
> >
> > For the uncontended case, the new down_read_trylock() is the same as
> > before. For the contended cases, the new down_read_trylock() is faster
> > than before. The more contended the lock is, the bigger the speedup.
> >
> > Signed-off-by: Muchun Song <songmuchun@...edance.com>
> > ---
> >  kernel/locking/rwsem.c | 11 ++++-------
> >  1 file changed, 4 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index c51387a43265..ef2b2a3f508c 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
> >
> >       DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
> >
> > -     /*
> > -      * Optimize for the case when the rwsem is not locked at all.
> > -      */
> > -     tmp = RWSEM_UNLOCKED_VALUE;
> > -     do {
> > +     tmp = atomic_long_read(&sem->count);
> > +     while (!(tmp & RWSEM_READ_FAILED_MASK)) {
> >               if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> > -                                     tmp + RWSEM_READER_BIAS)) {
> > +                                                 tmp + RWSEM_READER_BIAS)) {
> >                       rwsem_set_reader_owned(sem);
> >                       return 1;
> >               }
> > -     } while (!(tmp & RWSEM_READ_FAILED_MASK));
> > +     }
> >       return 0;
> >  }
>
> This is weird... so the only difference is that leading load, but given
> contention you'd expect that load to stall, also, given it's a
> non-exclusive load, to get stolen by a competing CPU. Whereas the old
> code would start with a cmpxchg, which obviously will also stall, but
> does an exclusive load.

The old code always assumes the rwsem is unlocked, so the local
variable @tmp is initialized to 0 (RWSEM_UNLOCKED_VALUE). Ideally,
the first cmpxchg succeeds. However, when the rwsem is contended
(e.g. the mmap read lock is almost always held by other threads),
sem->count is rarely RWSEM_UNLOCKED_VALUE, so the first cmpxchg is
bound to fail and it is the second cmpxchg that usually succeeds.
This patch loads sem->count into @tmp first, so the first cmpxchg
usually succeeds. Therefore the patch reduces the number of cmpxchg
executions when running the above benchmark.
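
To make the difference concrete outside the kernel, here is a sketch of
the two fast paths using C11 atomics. The constants and function names
below are made up for illustration and do not match the real RWSEM_*
definitions; only the shape of the loops matters.

#include <stdatomic.h>
#include <stdbool.h>

/* Made-up stand-ins for the kernel constants; values are illustrative. */
#define READER_BIAS      1L
#define READ_FAILED_MASK 0x4L

/* Old scheme: assume the lock is free and let cmpxchg prove us wrong. */
static bool trylock_old(atomic_long *count)
{
        long tmp = 0;   /* RWSEM_UNLOCKED_VALUE */

        do {
                /*
                 * Under contention the first attempt almost always fails,
                 * but the failure refreshes tmp with the observed value,
                 * so the second attempt usually succeeds.
                 */
                if (atomic_compare_exchange_strong_explicit(count, &tmp,
                                tmp + READER_BIAS,
                                memory_order_acquire, memory_order_relaxed))
                        return true;
        } while (!(tmp & READ_FAILED_MASK));

        return false;
}

/*
 * New scheme: read the current value first, so the first cmpxchg
 * usually carries the right expected value and succeeds.
 */
static bool trylock_new(atomic_long *count)
{
        long tmp = atomic_load_explicit(count, memory_order_relaxed);

        while (!(tmp & READ_FAILED_MASK)) {
                if (atomic_compare_exchange_strong_explicit(count, &tmp,
                                tmp + READER_BIAS,
                                memory_order_acquire, memory_order_relaxed))
                        return true;
        }

        return false;
}

Note trylock_old() still terminates because a failed compare-exchange
writes the value it observed back into tmp, just as the kernel's
atomic_long_try_cmpxchg_acquire() does; the difference is only how often
the first locked operation succeeds.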

I added the debug patch below to verify this, and collected the
following statistics with bpftrace (bpftrace -e
'tracepoint:rwsem:rwsem_read_loop /comm == "a.out"/ {
@cnt[args->count] = count(); }').

Before patch:
@cnt[17]: 1
@cnt[11]: 6
@cnt[10]: 14
@cnt[9]: 62
@cnt[8]: 339
@cnt[1]: 1211
@cnt[7]: 1585
@cnt[6]: 7796
@cnt[5]: 34742
@cnt[4]: 158245
@cnt[3]: 921631
@cnt[2]: 25089070

After patch:
@cnt[9]: 1
@cnt[8]: 3
@cnt[7]: 28
@cnt[6]: 151
@cnt[0]: 197
@cnt[5]: 823
@cnt[4]: 5521
@cnt[3]: 39126
@cnt[2]: 328012
@cnt[1]: 25840840

As you can see, the total number of cmpxchg executions is roughly
halved: weighting each bucket by its count gives about 53.8 million
cmpxchg calls before the patch versus about 26.6 million after, because
the common case drops from two attempts to one.

Thanks.

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index ef2b2a3f508c..cdeb1ad1921f 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1246,14 +1246,18 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
        long tmp;
+       int count = 0;

        DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);

        tmp = atomic_long_read(&sem->count);
        while (!(tmp & RWSEM_READ_FAILED_MASK)) {
+               count++;
                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                                                    tmp + RWSEM_READER_BIAS)) {
                        rwsem_set_reader_owned(sem);
+                       if (&current->mm->mmap_lock == sem)
+                               trace_rwsem_read_loop(count);
                        return 1;
                }
        }

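For reference, the hunk above calls trace_rwsem_read_loop(), whose
tracepoint definition is not included in this mail. A minimal definition
along the following lines (the header placement and format string are my
assumption, not part of the posted debug patch) is what the bpftrace
one-liner's args->count relies on:

/* Sketch only: the actual tracepoint definition was not posted. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM rwsem

#if !defined(_TRACE_RWSEM_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_RWSEM_H

#include <linux/tracepoint.h>

TRACE_EVENT(rwsem_read_loop,

        TP_PROTO(int count),

        TP_ARGS(count),

        TP_STRUCT__entry(
                __field(int, count)
        ),

        TP_fast_assign(
                __entry->count = count;
        ),

        TP_printk("cmpxchg attempts=%d", __entry->count)
);

#endif /* _TRACE_RWSEM_H */

/* This part must be outside the include guard. */
#include <trace/define_trace.h>

rwsem.c would additionally need CREATE_TRACE_POINTS and the matching
#include, which are likewise not shown here.
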
>
> And the thinking is that the exclusive load and the presence of the
> cmpxchg loop would keep the line on that CPU for a little while and
> progress is made.
>
> Clearly this isn't working as expected. Also I suppose it would need
> wider testing...
>
