linux-kernel - Re: [PATCH v5 16/18] timer: Implement the hierarchical pull model

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230323142440.GC2512024@hirez.programming.kicks-ass.net>
Date:   Thu, 23 Mar 2023 15:24:40 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc:     linux-kernel@...r.kernel.org, John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Eric Dumazet <edumazet@...gle.com>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Arjan van de Ven <arjan@...radead.org>,
        "Paul E . McKenney" <paulmck@...nel.org>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Rik van Riel <riel@...riel.com>
Subject: Re: [PATCH v5 16/18] timer: Implement the hierarchical pull model

On Wed, Mar 01, 2023 at 03:17:42PM +0100, Anna-Maria Behnsen wrote:

> diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
> new file mode 100644
> index 000000000000..ceb336e705df
> --- /dev/null
> +++ b/kernel/time/timer_migration.h
> @@ -0,0 +1,123 @@
> +#ifndef _KERNEL_TIME_MIGRATION_H
> +#define _KERNEL_TIME_MIGRATION_H
> +
> +/* Per group capacity. Must be a power of 2! */
> +#define TMIGR_CHILDS_PER_GROUP 8
> +
> +/**
> + * struct tmigr_event - a timer event associated to a CPU
> + * @nextevt:	The node to enqueue an event in the parent group queue
> + * @cpu:	The CPU to which this event belongs
> + * @ignore:	Hint whether the event could be ignored; it is set when
> + *		CPU or group is active;
> + */
> +struct tmigr_event {
> +	struct timerqueue_node	nextevt;
> +	unsigned int		cpu;
> +	int			ignore;
> +};
> +
> +/**
> + * struct tmigr_group - timer migration hierarchy group
> + * @lock:		Lock protecting the event information
> + * @cpus:		Array with CPUs which are member of group; required for
> + *			sibling CPUs; used only when level == 0

That's 32 bytes wasted for every group that isn't 0, maybe stick on the
end and conditionally allocate? Also, afaict it is only ever used during
setup, and as such should not be placed in a hot line, unless you've
done that explicitly as padding, in which case it needs a comment to
that effect.

Also, since it's setup only, why can't you match against:

  per_cpu_ptr(&tmigr_cpu, cpu)->parent

?

> + * @parent:		Pointer to parent group
> + * @list:		List head that is added to per level tmigr_level_list

Also setup only?

> + * @level:		Hierarchy level of group
> + * @numa_node:		Is set to numa node when level < tmigr_crossnode_level;
> + *			otherwise it is set to NUMA_NO_NODE; Required for setup
> + *			only
> + * @num_childs:		Counter of group childs; Required for setup only
> + * @num_cores:		Counter of cores per group; Required for setup only when
> + *			level == 0 and siblings exist

Also setup only, move the end?

> + * @migr_state:		State of group (see struct tmigr_state)
> + * @childmask:		childmask of group in parent group; is set during setup
> + *			never changed; could be read lockless
> + * @events:		Timer queue for child events queued in the group
> + * @groupevt:		Next event of group; it is only reliable when group is
> + *			!active (ignore bit is set when group is active)
> + * @next_expiry:	Base monotonic expiry time of next event of group;
> + *			Used for racy lockless check whether remote expiry is
> + *			required; it is always reliable

This is basically leftmost of @events? A racy lockless check sorta
implies not reliable, comment is confusing.

Also, isn't this identical to @groupevt.nextevt.expiry ?

> + */
> +struct tmigr_group {
> +	raw_spinlock_t		lock;
> +	unsigned int		cpus[TMIGR_CHILDS_PER_GROUP];
> +	struct tmigr_group	*parent;
> +	struct list_head	list;
> +	unsigned int		level;
> +	unsigned int		numa_node;
> +	unsigned int		num_childs;
> +	unsigned int		num_cores;
> +	atomic_t		*migr_state;

Per the other email; why isn't this:

	union tmigr_state	migr_state;

?

> +	u32			childmask;
> +	struct timerqueue_head	events;
> +	struct tmigr_event	groupevt;
> +	u64			next_expiry;
> +};
> +
> +/**
> + * struct tmigr_cpu - timer migration per CPU group
> + * @lock:	Lock protecting tmigr_cpu group information
> + * @online:	Indicates wheter CPU is online

What I'm missing is *why* we're keeping this state. I suspect you need
serialization on tmigr_cpu->lock between hotplug and something.

> + * @idle:	Indicates wheter CPU is idle in timer migration hierarchy
> + * @remote:	Is set when timers of CPU are expired remote

How are these not the same? When idle timers are expired remote, no?

> + * @tmgroup:	Pointer to parent group
> + * @childmask:	childmask of tmigr_cpu in parent group
> + * @cpuevt:	CPU event which could be queued into parent group
> + * @wakeup:	Stores the first timer when the timer migration hierarchy is
> + *		completely idle and remote expiry was done; is returned to
> + *		timer code when tmigr_cpu_deactive() is called and group is
> + *		idle; afterwards a reset to KTIME_MAX is required;
> + */
> +struct tmigr_cpu {
> +	raw_spinlock_t		lock;
> +	int			online;
> +	int			idle;
> +	int			remote;
> +	struct tmigr_group	*tmgroup;
> +	u32			childmask;
> +	struct tmigr_event	cpuevt;
> +	u64			wakeup;
> +};
> +
> +/**
> + * union tmigr_state - state of tmigr_group
> + * @state:	Combined version of the state - only used for atomic
> + * 		read/cmpxchg function
> + * @struct:	Splitted version of the state - only use the struct members to
> + *		update information to stay independant of endianess
> + */
> +union tmigr_state {
> +	u32 state;
> +	/**
> +	 * struct - splitted state of tmigr_group
> +	 * @active:	Contains each childmask bit of active childs
> +	 * @migrator:	Contains childmask of child which is migrator
> +	 * @seq:	Seqence number to prevent race when update in child
> +	 *		group are propagated in wrong order (especially when
> +	 *		migrator changes are involved)
> +	 */
> +	struct {
> +		u8	active;
> +		u8	migrator;

So childmask is the bit set in either of these masks, but it is u32
elsewhere and u8 here. Surely if the target type is u8, then it is best
to keep it consistently u8 elsewhere too, no?

> +		u16	seq;
> +	} __packed;
> +};