[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210415090215.GA1015@arm.com>
Date: Thu, 15 Apr 2021 10:02:18 +0100
From: Catalin Marinas <catalin.marinas@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Stafford Horne <shorne@...il.com>, Guo Ren <guoren@...nel.org>,
Christoph Müllner <christophm30@...il.com>,
Palmer Dabbelt <palmer@...belt.com>,
Anup Patel <anup@...infault.org>,
linux-riscv <linux-riscv@...ts.infradead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Guo Ren <guoren@...ux.alibaba.com>,
Will Deacon <will@...nel.org>, Arnd Bergmann <arnd@...db.de>,
jonas@...thpole.se, stefan.kristiansson@...nalahti.fi
Subject: Re: [RFC][PATCH] locking: Generic ticket-lock
(fixed Will's email address)
On Thu, Apr 15, 2021 at 10:09:54AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 15, 2021 at 05:47:34AM +0900, Stafford Horne wrote:
> > > How's this then? Compile tested only on openrisc/simple_smp_defconfig.
> >
> > I did my testing with this FPGA build SoC:
> >
> > https://github.com/stffrdhrn/de0_nano-multicore
> >
> > Note, the CPU timer sync logic uses mb() and is a bit flaky. So missing mb()
> > might be a reason. I thought we had defined mb() and l.msync, but it seems to
> > have gotten lost.
> >
> > With that said I could test out this ticket-lock implementation. How would I
> > tell if its better than qspinlock?
>
> Mostly if it isn't worse, it's better for being *much* simpler. As you
> can see, the guts of ticket is like 16 lines of C (lock+unlock) and you
> only need the behaviour of atomic_fetch_add() to reason about behaviour
> of the whole thing. qspinlock OTOH is mind bending painful to reason
> about.
>
> There are some spinlock tests in locktorture; but back when I had a
> userspace copy of the lot and would measure min,avg,max acquire times
> under various contention loads (making sure to only run a single task
> per CPU etc.. to avoid lock holder preemption and other such 'fun'
> things).
>
> It took us a fair amount of work to get qspinlock to compete with ticket
> for low contention cases (by far the most common in the kernel), and it
> took a fairly large amount of CPUs for qspinlock to really win from
> ticket on the contended case. Your hardware may vary. In particular the
> access to the external cacheline (for queueing, see the queue: label in
> queued_spin_lock_slowpath) is a pain-point and the relative cost of
> cacheline misses for your arch determines where (and if) low contention
> behaviour is competitive.
>
> Also, less variance (the reason for the min/max measure) is better.
> Large variance is typically a sign of fwd progress trouble.
IIRC, one issue we had with ticket spinlocks on arm64 was on big.LITTLE
systems where the little CPUs were always last to get a ticket when
racing with the big cores. That was with load/store exclusives (LR/SC
style) and would have probably got better with atomics but we moved to
qspinlocks eventually (the Juno board didn't have atomics).
(leaving the rest of the text below for Will's convenience)
> That's not saying that qspinlock isn't awesome, but I'm arguing that you
> should get there by first trying all the simpler things. By gradually
> increasing complexity you can also find the problem spots (for your
> architecture) and you have something to fall back to in case of trouble.
>
> Now, the obvious selling point of qspinlock is that due to the MCS style
> nature of the thing it doesn't bounce the lock around, but that comes at
> a cost of having to use that extra cacheline (due to the kernel liking
> sizeof(spinlock_t) == sizeof(u32)). But things like ARM64's WFE (see
> smp_cond_load_acquire()) can shift the balance quite a bit on that front
> as well (ARM has a similar thing but less useful, see it's spinlock.h
> and look for wfe() and dsb_sev()).
>
> Once your arch hits NUMA, qspinlock is probably a win. However, low
> contention performance is still king for most workloads. Better high
> contention behaviour is nice.
--
Catalin
Powered by blists - more mailing lists