lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 22 Dec 2023 12:40:16 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Zeng Heng' <zengheng4@...wei.com>, "mingo@...hat.com" <mingo@...hat.com>,
	"will@...nel.org" <will@...nel.org>, "peterz@...radead.org"
	<peterz@...radead.org>, "longman@...hat.com" <longman@...hat.com>,
	"boqun.feng@...il.com" <boqun.feng@...il.com>
CC: "xiexiuqi@...wei.com" <xiexiuqi@...wei.com>, "liwei391@...wei.com"
	<liwei391@...wei.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2] locking/osq_lock: Avoid false sharing in
 optimistic_spin_node

From: Zeng Heng
> Sent: 22 December 2023 12:11
> 
> Using the UnixBench test suite, we clearly find that osq_lock() cause
> extremely high overheads with perf tool in the File Copy items:
> 
> Overhead  Shared Object            Symbol
>   94.25%  [kernel]                 [k] osq_lock
>    0.74%  [kernel]                 [k] rwsem_spin_on_owner
>    0.32%  [kernel]                 [k] filemap_get_read_batch
> 
> In response to this, we conducted an analysis and made some gains:
> 
> In the prologue of osq_lock(), it set `cpu` member of percpu struct
> optimistic_spin_node with the local cpu id, after that the value of the
> percpu struct would never change in fact. Based on that, we can regard
> the `cpu` member as a constant variable.
> 
...
> @@ -9,7 +11,13 @@
>  struct optimistic_spin_node {
>  	struct optimistic_spin_node *next, *prev;
>  	int locked; /* 1 if lock acquired */
> -	int cpu; /* encoded CPU # + 1 value */
> +
> +	CACHELINE_PADDING(_pad1_);
> +	/*
> +	 * Stores an encoded CPU # + 1 value.
> +	 * Only read by other cpus, so split into different cache lines.
> +	 */
> +	int cpu;
>  };

Isn't this structure embedded in every mutex and rwsem (etc)?
So that is a significant bloat especially on systems with
large cache lines.

Did you try just moving the initialisation of the per-cpu 'node'
below the first fast-path (uncontended) test in osq_lock()?

OTOH if you really have multiple cpu spinning on the same rwsem
perhaps the test and/or filemap code are really at fault!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ