Date: Fri, 29 Dec 2023 20:51:46 +0000
From: David Laight <David.Laight@...LAB.COM>
To: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"peterz@...radead.org" <peterz@...radead.org>, "longman@...hat.com"
	<longman@...hat.com>
CC: "mingo@...hat.com" <mingo@...hat.com>, "will@...nel.org"
	<will@...nel.org>, "boqun.feng@...il.com" <boqun.feng@...il.com>, "'Linus
 Torvalds'" <torvalds@...ux-foundation.org>, "'xinhui.pan@...ux.vnet.ibm.com'"
	<xinhui.pan@...ux.vnet.ibm.com>,
	"'virtualization@...ts.linux-foundation.org'"
	<virtualization@...ts.linux-foundation.org>, 'Zeng Heng'
	<zengheng4@...wei.com>
Subject: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

Zeng Heng noted that heavy use of the osq (optimistic spin queue) code
used rather more cpu than might be expected. See:
https://lore.kernel.org/lkml/202312210155.Wc2HUK8C-lkp@intel.com/T/#mcc46eedd1ef22a0d668828b1d088508c9b1875b8

Part of the problem is a pretty much guaranteed cache line reload when
reading node->prev->cpu for the vcpu_is_preempted() check in the wakeup
path, which slows it down even on bare metal.
(On bare metal the hypervisor call is patched out, but its argument is still read.)
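
For reference, the check in question is roughly this (simplified from the
current code):

	/*
	 * Simplified osq_lock() wait loop.  The only use of node->prev here
	 * is reading its ->cpu field (via node_cpu()) as the argument to
	 * vcpu_is_preempted(), and that read is what causes the
	 * near-guaranteed cache line reload of the other cpu's node.
	 */
	while (!READ_ONCE(node->locked)) {
		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
			goto unqueue;
		cpu_relax();
	}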

Careful analysis shows that it isn't necessary to dirty the per-cpu data
in the osq_lock() fast path at all. This may be slightly beneficial.
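
Roughly the idea is the sketch below (not the actual diff; it assumes
node->next is already NULL on entry and that node->cpu is set up on the
cpu's first call):

	/*
	 * Sketch only.  osq_unlock() and the unqueue path both leave
	 * node->next NULL, and node->cpu never changes once initialised,
	 * so the uncontended path need not write the per-cpu node at all.
	 */
	old = atomic_xchg(&lock->tail, curr);
	if (old == OSQ_UNLOCKED_VAL)
		return true;		/* fast path: node never touched */

	prev = decode_cpu(old);
	node->prev = prev;
	node->locked = 0;
	/* Publish to the predecessor; needs release ordering in real code. */
	smp_store_release(&prev->next, node);
	/* ... then spin on node->locked as before ... */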

The code also uses this_cpu_ptr() to get the address of the per-cpu data.
On x86-64 (at least) this is implemented as:
	&per_cpu_data[smp_processor_id()]->member
ie there is a real function call, an array index and an add.
However, if raw_cpu_read() can be used instead (which is typically just an
offset from a register - %gs on x86-64) the code will be faster.
Putting the address of the per-cpu node into the node itself means that
only one cache line needs to be loaded.
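
As an illustration (the 'self' field name is just for this sketch, the
actual patch may differ):

	struct optimistic_spin_node {
		struct optimistic_spin_node *self;	/* this node's own address */
		struct optimistic_spin_node *next, *prev;
		int locked;
		int cpu;		/* encoded cpu number + 1 */
	};
	static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

	struct optimistic_spin_node *node;

	/* Today: smp_processor_id(), an array index and an add. */
	node = this_cpu_ptr(&osq_node);

	/* With the self pointer: a single %gs-relative load on x86-64. */
	node = raw_cpu_read(osq_node.self);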

I can't see a list of per-cpu data initialisation functions, so the fields
are initialised on the first osq_lock() call.
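
Something along these lines (same sketch field names as above):

	/* First osq_lock() call on this cpu: set the fields that never change. */
	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

	if (unlikely(!node)) {
		node = this_cpu_ptr(&osq_node);
		node->cpu = encode_cpu(smp_processor_id());
		node->self = node;
	}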

The last patch avoids the cache line reload when calling vcpu_is_preempted()
by simply saving node->prev->cpu as node->prev_cpu and updating it whenever
node->prev changes.
This is simpler than the patch proposed by Waiman.
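
Roughly (a sketch; here prev_cpu holds the decoded cpu number, the actual
patch may store the encoded value):

	/* Queueing: read prev's cache line once, here, and cache the result. */
	node->prev = prev;
	node->prev_cpu = node_cpu(prev);

	/* Wait loop: no dereference of node->prev any more. */
	while (!READ_ONCE(node->locked)) {
		if (need_resched() || vcpu_is_preempted(READ_ONCE(node->prev_cpu)))
			goto unqueue;
		cpu_relax();
	}

	/*
	 * Anything that rewrites node->prev (eg a predecessor unqueueing
	 * itself) updates node->prev_cpu at the same time.
	 */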

David Laight (5):
  Move the definition of optimistic_spin_node into osq_lock.c
  Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
  Clarify osq_wait_next()
  Optimise per-cpu data accesses.
  Optimise vcpu_is_preempted() check.

 include/linux/osq_lock.h  |  5 ----
 kernel/locking/osq_lock.c | 61 +++++++++++++++++++++------------------
 2 files changed, 33 insertions(+), 33 deletions(-)

-- 
2.17.1

