lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20161119161036.12679-16-nicstange@gmail.com>
Date:   Sat, 19 Nov 2016 17:10:35 +0100
From:   Nicolai Stange <nicstange@...il.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     John Stultz <john.stultz@...aro.org>, linux-kernel@...r.kernel.org,
        Nicolai Stange <nicstange@...il.com>
Subject: [RFC v8 27/28] clockevents: optimize struct clock_event_device layout

Benchmarks have shown that clockevents_adjust_all_freqs() is memory bound.
The members of struct clock_event_device accessed therein, namely
->features, ->mult_adjusted, ->shift, ->mult and ->list are split across
two cachelines.

In what follows, a cacheline size of 64 bytes will be assumed.

I.) sizeof(long) == 4
At least for sizeof(long) == sizeof(void*) == 4, e.g. on ARM, this
situation can be improved by reordering the members of
struct clock_event_device.

This patch exploits the fact that with the current layout, there is some
padding inserted within the first cacheline on ARM: right now, the first
three members, which are of pointer type and 12 bytes in total, are
followed by a ktime_t. Baically being a s64, the latter gets aligned on
an 8 byte boundary. Thus, bytes 12 to 16 are occupied by unused padding.

Furthermore, for sizeof(unsigned long) == 4, this patch series has achieved
another net saving of 4 bytes within the first cacheline so far.

Summarizing: through reordering of the first cacheline's members, a total
space of 8 bytes can be made available, compared to the state before this
series. This allows for the placement of the 12 bytes made up by ->list and
->mult within the first cacheline while moving only one 4 byte member out.

This victim will be ->set_state_periodic() which will be moved to its
->set_state_*() relatives within the second cacheline.

II.) sizeof(long) == 8
For the case of sizeof(long) == sizeof(void*) == 8, the first cacheline's
contents will be unchanged.
The ->list and ->mult members are moved from the third to the start of
the second cacheline. Compared to the situation before this series,
this will move ->broadcast() and ->suspend() from the end of the second
cacheline to the start of the third one.
A nice side effect is that due to the elimination of padding in the third
cacheline, together with the patch ("clockevents: degrade ->retries to
unsigned int"), the struct clock_event_device now fits exactly into a
multiple of three cachelines.

Thus, in struct clock_event_device:
- move the ->event_handler(), ->set_next_event() and ->set_next_ktime()
  in order to avoid padding in front of ->next_event on ARM.
- move the ->list and ->mult members right before the ->retries member.

Benchmark results:
The following measurements have been carried out on a Raspberry Pi 2B
(armv7, 4 cores, 900MHz). The adjustment process has been driven by
periodically injecting a certain offset via adjtimex(2) approximately
every 3.7h. A 'stress --vm 8 --vm-bytes 32M' was running. The runtime of
clockevents_adjust_all_freqs() has been measured.

- Before this patch:
  Mean: 8750.73+-1034.77
  Quantiles:
    0%    25%    50%    75%    100%
  3854   8171   8901   9452   13593 (ns)

- After this patch:
  Mean: 6916.90+-782.63
  Quantiles:
    0%    25%    50%    75%    100%
  3072   6412   6927   7430   10989 (ns)

Signed-off-by: Nicolai Stange <nicstange@...il.com>
---
 include/linux/clockchips.h | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 878a802..17ef940 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -74,10 +74,6 @@ enum clock_event_state {
 
 /**
  * struct clock_event_device - clock event device descriptor
- * @event_handler:	Assigned by the framework to be called by the low
- *			level handler of the event source
- * @set_next_event:	set next event function using a clocksource delta
- * @set_next_ktime:	set next event function using a direct ktime value
  * @next_event:		local storage for the next event in oneshot mode
  * @max_delta_ns:	maximum delta value in ns
  * @max_delta_ticks:	maximum delta value in ticks
@@ -86,6 +82,12 @@ enum clock_event_state {
  * @shift:		nanoseconds to cycles divisor (power of two)
  * @state_use_accessors:current state of the device, assigned by the core code
  * @features:		features
+ * @event_handler:	Assigned by the framework to be called by the low
+ *			level handler of the event source
+ * @set_next_event:	set next event function using a clocksource delta
+ * @set_next_ktime:	set next event function using a direct ktime value
+ * @list:		list head for the management code
+ * @mult:		ns to cycles multiplier stored for reconfiguration
  * @retries:		number of forced programming retries
  * @set_state_periodic:	switch state to periodic
  * @set_state_oneshot:	switch state to oneshot
@@ -95,18 +97,13 @@ enum clock_event_state {
  * @broadcast:		function to broadcast events
  * @name:		ptr to clock event name
  * @min_delta_ticks:	minimum delta value in ticks stored for reconfiguration
- * @mult:		ns to cycles multiplier stored for reconfiguration
  * @rating:		variable to rate clock event devices
  * @irq:		IRQ number (only for non CPU local devices)
  * @bound_on:		Bound on CPU
  * @cpumask:		cpumask to indicate for which CPUs this device works
- * @list:		list head for the management code
  * @owner:		module reference
  */
 struct clock_event_device {
-	void			(*event_handler)(struct clock_event_device *);
-	int			(*set_next_event)(unsigned long evt, struct clock_event_device *);
-	int			(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
 	ktime_t			next_event;
 	u64			max_delta_ns;
 	unsigned long		max_delta_ticks;
@@ -115,6 +112,11 @@ struct clock_event_device {
 	u32			shift;
 	enum clock_event_state	state_use_accessors:8;
 	unsigned int		features:24;
+	void			(*event_handler)(struct clock_event_device *);
+	int			(*set_next_event)(unsigned long evt, struct clock_event_device *);
+	int			(*set_next_ktime)(ktime_t expires, struct clock_event_device *);
+	struct list_head	list;
+	u32			mult;
 	unsigned int		retries;
 
 	int			(*set_state_periodic)(struct clock_event_device *);
@@ -129,12 +131,10 @@ struct clock_event_device {
 
 	const char		*name;
 	unsigned int		min_delta_ticks;
-	u32			mult;
 	int			rating;
 	int			irq;
 	int			bound_on;
 	const struct cpumask	*cpumask;
-	struct list_head	list;
 	struct module		*owner;
 } ____cacheline_aligned;
 
-- 
2.10.2

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ