linux-kernel - Re: [PATCH] locking/qspinlock: Optimize pending state waiting for unlock


Message-ID: <CAJF2gTS52jBm7_3c=9i1uPjmV90=42xs4dOs6woA4NnHf4RHgQ@mail.gmail.com>
Date:   Sun, 25 Dec 2022 10:57:48 +0800
From:   Guo Ren <guoren@...nel.org>
To:     Waiman Long <longman@...hat.com>
Cc:     peterz@...radead.org, linux-kernel@...r.kernel.org,
        Guo Ren <guoren@...ux.alibaba.com>,
        Boqun Feng <boqun.feng@...il.com>,
        Will Deacon <will@...nel.org>, Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH] locking/qspinlock: Optimize pending state waiting for unlock

On Sun, Dec 25, 2022 at 9:55 AM Waiman Long <longman@...hat.com> wrote:
>
> On 12/24/22 07:05, guoren@...nel.org wrote:
> > From: Guo Ren <guoren@...ux.alibaba.com>
> >
> > When we're pending, we only care about the locked value. xchg_tail
> > doesn't affect the pending state. That means the hardware thread
> > could stay in a sleep state and leave the rest of the pipeline's
> > execution-unit resources to other hardware threads. This optimization
> > may only work for SMT scenarios because the granularity between
> > cores is a cache block.
Please have a look at the comment I've written.

> >
> > Signed-off-by: Guo Ren <guoren@...ux.alibaba.com>
> > Signed-off-by: Guo Ren <guoren@...nel.org>
> > Cc: Waiman Long <longman@...hat.com>
> > Cc: Peter Zijlstra <peterz@...radead.org>
> > Cc: Boqun Feng <boqun.feng@...il.com>
> > Cc: Will Deacon <will@...nel.org>
> > Cc: Ingo Molnar <mingo@...hat.com>
> > ---
> >   kernel/locking/qspinlock.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index 2b23378775fe..ebe6b8ec7cb3 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -371,7 +371,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >       /*
> >        * We're pending, wait for the owner to go away.
> >        *
> > -      * 0,1,1 -> 0,1,0
> > +      * 0,1,1 -> *,1,0
> >        *
> >        * this wait loop must be a load-acquire such that we match the
> >        * store-release that clears the locked bit and create lock
> Yes, we don't care about the tail.
> > @@ -380,7 +380,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >        * barriers.
> >        */
> >       if (val & _Q_LOCKED_MASK)
> > -             atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
> > +             smp_cond_load_acquire(&lock->locked, !VAL);
> >
> >       /*
> >        * take ownership and clear the pending bit.
>
> We may save an AND operation here which may be a cycle or two.  I
> remember that it may be more costly to load a byte instead of an integer
> in some arches. So it doesn't seem like that much of an optimization
> from my point of view.
Saving the AND is, of course, not the reason for the patch. See my commit
message.

> I know that arm64 will enter a low power state in
> this *cond_load_acquire() loop, but I believe any change in the state of
> the lock cacheline will wake it up. So it doesn't really matter if
> you are checking a byte or an int.
The case I have in mind is SMT hw-threads within the same core, not a
core entering a low-power state. Of course, the granularity between
cores is a cacheline, but the granularity between SMT hw-threads of the
same core could be a byte, handled internally by the LSU. For example,
when a hw-thread yields the resources of the core to other hw-threads,
this patch could help that hw-thread stay in the sleep state and prevent
it from being woken up by another hw-thread's xchg_tail.
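
To make the byte-versus-word point concrete, here is a minimal user-space
sketch. It assumes the usual layout for NR_CPUS < 16K (locked in bits
0-7, pending in bits 8-15, tail in bits 16-31). The helper names are
hypothetical and the real xchg_tail is atomic, so treat this only as an
illustration of which bits change, not as kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed lock-word layout (NR_CPUS < 16K configuration). */
#define Q_LOCKED_MASK   0x000000ffu   /* bits 0-7:   locked byte  */
#define Q_PENDING_VAL   0x00000100u   /* bit 8:      pending flag */
#define Q_TAIL_MASK     0xffff0000u   /* bits 16-31: queue tail   */
#define Q_TAIL_SHIFT    16

/* Hypothetical stand-in for the effect of xchg_tail on the word
 * (non-atomic here): only the tail bits are replaced. */
static uint32_t set_tail(uint32_t val, uint16_t tail)
{
	return (val & ~Q_TAIL_MASK) | ((uint32_t)tail << Q_TAIL_SHIFT);
}

/* The byte that a wait on lock->locked would observe. */
static uint8_t locked_byte(uint32_t val)
{
	return (uint8_t)(val & Q_LOCKED_MASK);
}

/* Old condition: load the whole word, then mask. */
static int word_wait_done(uint32_t val)
{
	return !(val & Q_LOCKED_MASK);
}

/* New condition: load only the locked byte. */
static int byte_wait_done(uint8_t locked)
{
	return !locked;
}

/* Walk through 0,1,1 -> *,1,1 -> *,1,0 and check what each waiter sees. */
static void pending_wait_demo(void)
{
	uint32_t before = Q_PENDING_VAL | 1u;      /* tail=0, pending=1, locked=1 */
	uint32_t queued = set_tail(before, 4);     /* another thread enqueues     */

	assert(queued != before);                            /* the word changed */
	assert(locked_byte(queued) == locked_byte(before));  /* the byte did not */
	assert(!word_wait_done(queued));
	assert(!byte_wait_done(locked_byte(queued)));

	uint32_t unlocked = queued & ~Q_LOCKED_MASK;   /* owner releases the lock */
	assert(word_wait_done(unlocked));
	assert(byte_wait_done(locked_byte(unlocked)));
}
```

Both conditions become true at the same moment, so the result is the
same. The difference is only that the tail update changes the word being
watched but not the byte being watched.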

Finally, from a software-semantics point of view, doesn't the patch make
the wait condition more accurate? (We don't care about the tail here.)

>
> Do you have any other data point to support your optimization claim?
>
> Cheers,
> Longman
>


-- 
Best Regards
 Guo Ren