linux-kernel - Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130613235741.GI5146@linux.vnet.ibm.com>
Date:	Thu, 13 Jun 2013 16:57:41 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Lai Jiangshan <eag0628@...il.com>
Cc:	linux-kernel@...r.kernel.org, mingo@...e.hu, dipankar@...ibm.com,
	akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
	josh@...htriplett.org, niv@...ibm.com, tglx@...utronix.de,
	peterz@...radead.org, rostedt@...dmis.org, Valdis.Kletnieks@...edu,
	dhowells@...hat.com, edumazet@...gle.com, darren@...art.com,
	fweisbec@...il.com, sbw@....edu, torvalds@...ux-foundation.org,
	walken@...gle.com, waiman.long@...com, davidlohr.bueso@...com
Subject: Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
> <paulmck@...ux.vnet.ibm.com> wrote:
> > On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
> >> On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> >> > Breaking up locks is better than implementing high-contention locks, but
> >> > if we must have high-contention locks, why not make them automatically
> >> > switch between light-weight ticket locks at low contention and queued
> >> > locks at high contention?  After all, this would remove the need for
> >> > the developer to predict which locks will be highly contended.
> >> >
> >> > This commit allows ticket locks to automatically switch between pure
> >> > ticketlock and queued-lock operation as needed.  If too many CPUs are
> >> > spinning on a given ticket lock, a queue structure will be allocated
> >> > and the lock will switch to queued-lock operation.  When the lock becomes
> >> > free, it will switch back into ticketlock operation.  The low-order bit
> >> > of the head counter is used to indicate that the lock is in queued mode,
> >> > which forces an unconditional mismatch between the head and tail counters.
> >> > This approach means that the common-case code path under conditions of
> >> > low contention is very nearly that of a plain ticket lock.
> >> >
> >> > A fixed number of queueing structures is statically allocated in an
> >> > array.  The ticket-lock address is used to hash into an initial element,
> >> > but if that element is already in use, it moves to the next element.  If
> >> > the entire array is already in use, continue to spin in ticket mode.
> >> >
> >> > Signed-off-by: Paul E. McKenney <paulmck@...ux.vnet.ibm.com>
> >> > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> >> > [ paulmck: Address Eric Dumazet review feedback. ]
> >> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> >> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> >> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> >> > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> >> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> >> > [ paulmck: Type safety fixes (Steven Rostedt). ]
> >> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> >> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
> >>
> >>
> >> Hi, Paul,
> >>
> >> I simplify the code and remove the search by encoding the index of struct tkt_q_head
> >> into lock->tickets.head.
> >>
> >> A) lock->tickets.head(when queued-lock):
> >> ---------------------------------
> >>  index of struct tkt_q_head |0|1|
> >> ---------------------------------
> >
> > Interesting approach!  It might reduce queued-mode overhead a bit in
> > some cases, though I bet that in the common case the first queue element
> > examined is the right one.  More on this below.
> >
> >> The bit0 = 1 for queued, and the bit1 = 0 is reserved for __ticket_spin_unlock(),
> >> thus __ticket_spin_unlock() will not change the higher bits of lock->tickets.head.
> >>
> >> B) tqhp->head is for the real value of lock->tickets.head.
> >> if the last bit of tqhp->head is 1, it means it is the head ticket when started queuing.
> >
> > But don't you also need the xadd() in __ticket_spin_unlock() to become
> > a cmpxchg() for this to work?  Or is your patch missing your changes to
> > arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
> > the no-contention overhead, which might be counterproductive.  Wouldn't
> > hurt to get measurements, though.
> 
> No, don't need to change __ticket_spin_unlock() in my idea.
> bit1 in the  tickets.head is reserved for __ticket_spin_unlock(),
> __ticket_spin_unlock() only changes the bit1, it will not change
> the higher bits. tkt_q_do_wake() will restore the tickets.head.
> 
> This approach avoids cmpxchg in __ticket_spin_unlock().

Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
need to be atomic in order to handle concurrent invocations of
__ticket_spin_lock()?

Either way, it would be good to see the performance effects of this.

							Thanx, Paul

> > Given the results that Davidlohr posted, I believe that the following
> > optimizations would also provide some improvement:
> >
> > 1.      Move the call to tkt_spin_pass() from __ticket_spin_lock()
> >         to a separate linker section in order to reduce the icache
> >         penalty exacted by the spinloop.  This is likely to be causing
> >         some of the performance reductions in the cases where ticket
> >         locks are not highly contended.
> >
> > 2.      Limit the number of elements searched for in the array of
> >         queues.  However, this would help only if a number of ticket
> >         locks were in queued mode at the same time.
> >
> > 3.      Dynamically allocate the queue array at boot.  This might
> >         also reduce cache pressure, again, at least in cases where
> >         there are a number of ticket locks in queued mode at the
> >         same time.
> >
> > Frederic just reminded me that I owe him some energy-efficiency improvements
> > for adaptive ticks, so I won't get to these very quickly.  Please feel free
> > to take these on -- the patch clearly does well under high contention, so
> > reducing the no-contention penalty could really help.
> >
> >                                                         Thanx, Paul
> >
> >> Thanks,
> >> Lai
> >>
> >>  kernel/tktqlock.c |   51 +++++++++++++--------------------------------------
> >>  1 files changed, 13 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/kernel/tktqlock.c b/kernel/tktqlock.c
> >> index 912817c..1329d0f 100644
> >> --- a/kernel/tktqlock.c
> >> +++ b/kernel/tktqlock.c
> >> @@ -33,7 +33,7 @@ struct tkt_q {
> >>
> >>  struct tkt_q_head {
> >>       arch_spinlock_t *ref;           /* Pointer to spinlock if in use. */
> >> -     s64 head_tkt;                   /* Head ticket when started queuing. */
> >> +     __ticket_t head;                /* Real head when queued. */
> >>       struct tkt_q *spin;             /* Head of queue. */
> >>       struct tkt_q **spin_tail;       /* Tail of queue. */
> >>  };
> >> @@ -77,15 +77,8 @@ static unsigned long tkt_q_hash(arch_spinlock_t *lock)
> >>   */
> >>  static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *lock)
> >>  {
> >> -     int i;
> >> -     int start;
> >> -
> >> -     start = i = tkt_q_hash(lock);
> >> -     do
> >> -             if (ACCESS_ONCE(tkt_q_heads[i].ref) == lock)
> >> -                     return &tkt_q_heads[i];
> >> -     while ((i = tkt_q_next_slot(i)) != start);
> >> -     return NULL;
> >> +     BUILD_BUG_ON(TKT_Q_NQUEUES > (1 << (TICKET_SHIFT - 2)));
> >> +     return &tkt_q_heads[ACCESS_ONCE(lock->tickets.head) >> 2];
> >>  }
> >>
> >>  /*
> >> @@ -101,11 +94,11 @@ static bool tkt_q_try_unqueue(arch_spinlock_t *lock, struct tkt_q_head *tqhp)
> >>
> >>       /* Pick up the ticket values. */
> >>       asold = ACCESS_ONCE(*lock);
> >> -     if ((asold.tickets.head & ~0x1) == asold.tickets.tail) {
> >> +     if (tqhp->head == asold.tickets.tail) {
> >>
> >>               /* Attempt to mark the lock as not having a queue. */
> >>               asnew = asold;
> >> -             asnew.tickets.head &= ~0x1;
> >> +             asnew.tickets.head = tqhp->head;
> >>               if (cmpxchg(&lock->head_tail,
> >>                           asold.head_tail,
> >>                           asnew.head_tail) == asold.head_tail) {
> >> @@ -128,12 +121,9 @@ void tkt_q_do_wake(arch_spinlock_t *lock)
> >>       struct tkt_q_head *tqhp;
> >>       struct tkt_q *tqp;
> >>
> >> -     /*
> >> -      * If the queue is still being set up, wait for it.  Note that
> >> -      * the caller's xadd() provides the needed memory ordering.
> >> -      */
> >> -     while ((tqhp = tkt_q_find_head(lock)) == NULL)
> >> -             cpu_relax();
> >> +     tqhp = tkt_q_find_head(lock);
> >> +     ACCESS_ONCE(lock->tickets.head) -= __TKT_SPIN_INC;
> >> +     ACCESS_ONCE(tqhp->head) = (tqhp->head & ~0x1) + __TKT_SPIN_INC;
> >>
> >>       for (;;) {
> >>
> >> @@ -145,9 +135,7 @@ void tkt_q_do_wake(arch_spinlock_t *lock)
> >>                       return; /* No element, successfully removed queue. */
> >>               cpu_relax();
> >>       }
> >> -     if (ACCESS_ONCE(tqhp->head_tkt) != -1)
> >> -             ACCESS_ONCE(tqhp->head_tkt) = -1;
> >> -     smp_mb(); /* Order pointer fetch and assignment against handoff. */
> >> +     smp_mb(); /* Order modification, pointer fetch and assignment against handoff. */
> >>       ACCESS_ONCE(tqp->cpu) = -1;
> >>  }
> >>  EXPORT_SYMBOL(tkt_q_do_wake);
> >> @@ -169,10 +157,7 @@ bool tkt_q_do_spin(arch_spinlock_t *lock, struct __raw_tickets inc)
> >>        */
> >>       smp_mb();  /* See above block comment. */
> >>
> >> -     /* If there no longer is a queue, leave. */
> >>       tqhp = tkt_q_find_head(lock);
> >> -     if (tqhp == NULL)
> >> -             return false;
> >>
> >>       /* Initialize our queue element. */
> >>       tq.cpu = raw_smp_processor_id();
> >> @@ -180,9 +165,8 @@ bool tkt_q_do_spin(arch_spinlock_t *lock, struct __raw_tickets inc)
> >>       tq.next = NULL;
> >>
> >>       /* Check to see if we already hold the lock. */
> >> -     if (ACCESS_ONCE(tqhp->head_tkt) == inc.tail) {
> >> +     if (ACCESS_ONCE(tqhp->head) == (inc.tail | 0x1)) {
> >>               /* The last holder left before queue formed, we hold lock. */
> >> -             tqhp->head_tkt = -1;
> >>               return true;
> >>       }
> >>
> >> @@ -290,16 +274,14 @@ tkt_q_init_contend(int i, arch_spinlock_t *lock, struct __raw_tickets inc)
> >>                * Record the head counter in case one of the spinning
> >>                * CPUs already holds the lock but doesn't realize it yet.
> >>                */
> >> -             tqhp->head_tkt = asold.tickets.head;
> >> +             tqhp->head = asold.tickets.head | 0x1;
> >>
> >>               /* The low-order bit in the head counter says "queued". */
> >> -             asnew.tickets.head |= 0x1;
> >> +             asnew.tickets.head = (i << 2) + 0x1;
> >>       } while (cmpxchg(&lock->head_tail,
> >>                asold.head_tail,
> >>                asnew.head_tail) != asold.head_tail);
> >>
> >> -     /* Point the queue at the lock and go spin on it. */
> >> -     ACCESS_ONCE(tqhp->ref) = lock;
> >>       return tkt_q_do_spin(lock, inc);
> >>  }
> >>
> >> @@ -321,15 +303,8 @@ bool tkt_q_start_contend(arch_spinlock_t *lock, struct __raw_tickets inc)
> >>        * the lock with the corresponding queue.
> >>        */
> >>       do {
> >> -             /*
> >> -              * Use 0x1 to mark the queue in use, but also avoiding
> >> -              * any spinners trying to use it before we get it all
> >> -              * initialized.
> >> -              */
> >>               if (!tkt_q_heads[i].ref &&
> >> -                 cmpxchg(&tkt_q_heads[i].ref,
> >> -                         NULL,
> >> -                         (arch_spinlock_t *)0x1) == NULL) {
> >> +                 cmpxchg(&tkt_q_heads[i].ref, NULL, lock) == NULL) {
> >>
> >>                       /* Succeeded, now go initialize it. */
> >>                       return tkt_q_init_contend(i, lock, inc);
> >>
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@...r.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/