Message-ID: <20231206163536.r9DcrsWQ@linutronix.de>
Date: Wed, 6 Dec 2023 17:35:36 +0100
From: Sebastian Siewior <bigeasy@...utronix.de>
To: Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc: linux-kernel@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>,
John Stultz <jstultz@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Eric Dumazet <edumazet@...gle.com>,
"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
Arjan van de Ven <arjan@...radead.org>,
"Paul E . McKenney" <paulmck@...nel.org>,
Frederic Weisbecker <frederic@...nel.org>,
Rik van Riel <riel@...riel.com>,
Steven Rostedt <rostedt@...dmis.org>,
Giovanni Gherdovich <ggherdovich@...e.cz>,
Lukasz Luba <lukasz.luba@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Srinivas Pandruvada <srinivas.pandruvada@...el.com>,
K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model
On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
…
> As long as a CPU is busy it expires both local and global timers. When a
> CPU goes idle it arms for the first expiring local timer. If the first
> expiring pinned (local) timer is before the first expiring movable timer,
> then no action is required because the CPU will wake up before the first
> movable timer expires. If the first expiring movable timer is before the
> first expiring pinned (local) timer, then this timer is queued into a idle
an
> timerqueue and eventually expired by some other active CPU.
s/some other/another ?
…
>
> Signed-off-by: Anna-Maria Behnsen <anna-maria@...utronix.de>
> ---
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index b6c9ac0c3712..ac3e888d053f 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -2103,6 +2104,64 @@ void timer_lock_remote_bases(unsigned int cpu)
…
> +static void timer_use_tmigr(unsigned long basej, u64 basem,
> + unsigned long *nextevt, bool *tick_stop_path,
> + bool timer_base_idle, struct timer_events *tevt)
> +{
> + u64 next_tmigr;
> +
> + if (timer_base_idle)
> + next_tmigr = tmigr_cpu_new_timer(tevt->global);
> + else if (tick_stop_path)
> + next_tmigr = tmigr_cpu_deactivate(tevt->global);
> + else
> + next_tmigr = tmigr_quick_check();
> +
> + /*
> + * If the CPU is the last going idle in timer migration hierarchy, make
> + * sure the CPU will wake up in time to handle remote timers.
> + * next_tmigr == KTIME_MAX if other CPUs are still active.
> + */
> + if (next_tmigr < tevt->local) {
> + u64 tmp;
> +
> + /* If we missed a tick already, force 0 delta */
> + if (next_tmigr < basem)
> + next_tmigr = basem;
> +
> + tmp = div_u64(next_tmigr - basem, TICK_NSEC);
Is this considered a hot path? Asking because u64 divs are nice if they
can be avoided ;)
I guess the original value is from fetch_next_timer_interrupt(). But
then you only need it if the caller (__get_next_timer_interrupt()) has
the `idle' value set. Otherwise the operation is pointless.
Would it somehow work to replace
base_local->is_idle = time_after(nextevt, basej + 1);
with maybe something like
base_local->is_idle = tevt.local > basem + TICK_NSEC
If so you could avoid the `nextevt' maneuver.
> + *nextevt = basej + (unsigned long)tmp;
> + tevt->local = next_tmigr;
> + }
> +}
> +# else
…
> @@ -2132,6 +2190,21 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
> nextevt = fetch_next_timer_interrupt(basej, basem, base_local,
> base_global, &tevt);
>
> + /*
> + * When the when the next event is only one jiffie ahead there is no
If the next event is only one jiffy ahead then there is no
> + * need to call timer migration hierarchy related
> + * functions. @tevt->global will be KTIME_MAX, nevertheless if the next
> + * timer is a global timer. This is also true, when the timer base is
The second sentence is hard to parse.
> + * idle.
> + *
> + * The proper timer migration hierarchy function depends on the callsite
> + * and whether timer base is idle or not. @nextevt will be updated when
> + * this CPU needs to handle the first timer migration hierarchy event.
> + */
> + if (time_after(nextevt, basej + 1))
> + timer_use_tmigr(basej, basem, &nextevt, idle,
> + base_local->is_idle, &tevt);
> +
> /*
> * We have a fresh next event. Check whether we can forward the
> * base.
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> +/*
> + * The timer migration mechanism is built on a hierarchy of groups. The
> + * lowest level group contains CPUs, the next level groups of CPU groups
> + * and so forth. The CPU groups are kept per node so for the normal case
> + * lock contention won't happen across nodes. Depending on the number of
> + * CPUs per node even the next level might be kept as groups of CPU groups
> + * per node and only the levels above cross the node topology.
> + *
> + * Example topology for a two node system with 24 CPUs each.
> + *
> + * LVL 2 [GRP2:0]
> + * GRP1:0 = GRP1:M
> + *
> + * LVL 1 [GRP1:0] [GRP1:1]
> + * GRP0:0 - GRP0:2 GRP0:3 - GRP0:5
> + *
> + * LVL 0 [GRP0:0] [GRP0:1] [GRP0:2] [GRP0:3] [GRP0:4] [GRP0:5]
> + * CPUS 0-7 8-15 16-23 24-31 32-39 40-47
In the CPUS list, the separator between 24-31 and 32-39 is a tab while
the other separators are spaces. Could you please align it with spaces?
Judging from the top you have tabstop=8 but here tabstop=4 looks "nice".
> + *
> + * The groups hold a timer queue of events sorted by expiry time. These
> + * queues are updated when CPUs go in idle. When they come out of idle
> + * ignore flag of events is set.
> + *
Sebastian