Date: Fri, 29 Dec 2023 20:51:46 +0000
From: David Laight <David.Laight@...LAB.COM>
To: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"peterz@...radead.org" <peterz@...radead.org>, "longman@...hat.com"
	<longman@...hat.com>
CC: "mingo@...hat.com" <mingo@...hat.com>, "will@...nel.org"
	<will@...nel.org>, "boqun.feng@...il.com" <boqun.feng@...il.com>, "'Linus
 Torvalds'" <torvalds@...ux-foundation.org>, "'xinhui.pan@...ux.vnet.ibm.com'"
	<xinhui.pan@...ux.vnet.ibm.com>,
	"'virtualization@...ts.linux-foundation.org'"
	<virtualization@...ts.linux-foundation.org>, 'Zeng Heng'
	<zengheng4@...wei.com>
Subject: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

Zeng Heng noted that heavy use of the osq (optimistic spin queue) code
used rather more cpu than might be expected. See:
https://lore.kernel.org/lkml/202312210155.Wc2HUK8C-lkp@intel.com/T/#mcc46eedd1ef22a0d668828b1d088508c9b1875b8

Part of the problem is a pretty much guaranteed cache line reload when
reading node->prev->cpu for the vcpu_is_preempted() check in the wakeup
path, which slows it down even on bare metal.
(On bare metal the hypervisor call is patched out, but its argument is still read.)
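
For reference, the check in question is roughly this (simplified from the
current code):

	/*
	 * Simplified osq_lock() wait loop.  The only use of node->prev here
	 * is reading its ->cpu field (via node_cpu()) as the argument to
	 * vcpu_is_preempted(), and that read is what causes the
	 * near-guaranteed cache line reload of the other cpu's node.
	 */
	while (!READ_ONCE(node->locked)) {
		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
			goto unqueue;
		cpu_relax();
	}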

Careful analysis shows that it isn't necessary to dirty the per-cpu data
in the osq_lock() fast path at all. This may be slightly beneficial.
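
Roughly the idea is the sketch below (not the actual diff; it assumes
node->next is already NULL on entry and that node->cpu is set up on the
cpu's first call):

	/*
	 * Sketch only.  osq_unlock() and the unqueue path both leave
	 * node->next NULL, and node->cpu never changes once initialised,
	 * so the uncontended path need not write the per-cpu node at all.
	 */
	old = atomic_xchg(&lock->tail, curr);
	if (old == OSQ_UNLOCKED_VAL)
		return true;		/* fast path: node never touched */

	prev = decode_cpu(old);
	node->prev = prev;
	node->locked = 0;
	/* Publish to the predecessor; needs release ordering in real code. */
	smp_store_release(&prev->next, node);
	/* ... then spin on node->locked as before ... */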

The code also uses this_cpu_ptr() to get the address of the per-cpu data.
On x86-64 (at least) this is implemented as:
	&per_cpu_data[smp_processor_id()]->member
ie there is a real function call, an array index and an add.
However, if raw_cpu_read() can be used instead (which is typically just an
offset from a register - %gs on x86-64) the code will be faster.
Putting the address of the per-cpu node into the node itself means that
only one cache line needs to be loaded.
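
As an illustration (the 'self' field name is just for this sketch, the
actual patch may differ):

	struct optimistic_spin_node {
		struct optimistic_spin_node *self;	/* this node's own address */
		struct optimistic_spin_node *next, *prev;
		int locked;
		int cpu;		/* encoded cpu number + 1 */
	};
	static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

	struct optimistic_spin_node *node;

	/* Today: smp_processor_id(), an array index and an add. */
	node = this_cpu_ptr(&osq_node);

	/* With the self pointer: a single %gs-relative load on x86-64. */
	node = raw_cpu_read(osq_node.self);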

I can't see a list of per-cpu data initialisation functions, so the fields
are initialised on the first osq_lock() call.
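
Something along these lines (same sketch field names as above):

	/* First osq_lock() call on this cpu: set the fields that never change. */
	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

	if (unlikely(!node)) {
		node = this_cpu_ptr(&osq_node);
		node->cpu = encode_cpu(smp_processor_id());
		node->self = node;
	}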

The last patch avoids the cache line reload when calling vcpu_is_preempted()
by simply saving node->prev->cpu as node->prev_cpu and updating it whenever
node->prev changes.
This is simpler than the patch proposed by Waiman.
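
Roughly (a sketch; here prev_cpu holds the decoded cpu number, the actual
patch may store the encoded value):

	/* Queueing: read prev's cache line once, here, and cache the result. */
	node->prev = prev;
	node->prev_cpu = node_cpu(prev);

	/* Wait loop: no dereference of node->prev any more. */
	while (!READ_ONCE(node->locked)) {
		if (need_resched() || vcpu_is_preempted(READ_ONCE(node->prev_cpu)))
			goto unqueue;
		cpu_relax();
	}

	/*
	 * Anything that rewrites node->prev (eg a predecessor unqueueing
	 * itself) updates node->prev_cpu at the same time.
	 */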

David Laight (5):
  Move the definition of optimistic_spin_node into osq_lock.c
  Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
  Clarify osq_wait_next()
  Optimise per-cpu data accesses.
  Optimise vcpu_is_preempted() check.

 include/linux/osq_lock.h  |  5 ----
 kernel/locking/osq_lock.c | 61 +++++++++++++++++++++------------------
 2 files changed, 33 insertions(+), 33 deletions(-)

-- 
2.17.1

