linux-kernel - Re: [Question] timers: trigger_dyntick_cpu() vs TIMER

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220725104356.GA2950296@lothringen>
Date:   Mon, 25 Jul 2022 12:43:56 +0200
From:   Frederic Weisbecker <frederic@...nel.org>
To:     Valentin Schneider <vschneid@...hat.com>
Cc:     linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        Arjan van de Ven <arjan@...radead.org>,
        Chris Mason <clm@...com>, Eric Dumazet <edumazet@...gle.com>,
        George Spelvin <linux@...encehorizons.net>,
        Josh Triplett <josh@...htriplett.org>,
        Len Brown <lenb@...nel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Rik van Riel <riel@...riel.com>,
        Marcelo Tosatti <mtosatti@...hat.com>
Subject: Re: [Question] timers: trigger_dyntick_cpu() vs TIMER_DEFERRABLE

On Mon, Jul 25, 2022 at 10:32:42AM +0100, Valentin Schneider wrote:
> Hi,
> 
> I've been incidentally staring at some NOHZ bits related to the timer
> wheels, and trigger_dyntick_cpu() confuses me:
> 
>   static void
>   trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
>   {
>           [...]
>           /*
>            * TODO: This wants some optimizing similar to the code below, but we
>            * will do that when we switch from push to pull for deferrable timers.
>            */
>           if ((timer->flags & TIMER_DEFERRABLE)) {
>                   if (tick_nohz_full_cpu(base->cpu))
>                           wake_up_nohz_cpu(base->cpu);
>                   return;
>           }
>           [...]
>   }
> 
> From what I grok out of get_nohz_timer_target(), under
> timers_migration_enabled we should migrate the timer to an non-idle CPU
> (or at the very least a non-isolated CPU) *before* enqueuing the
> timer.

That's not always the case. For example TIMER_PINNED timers might have
to run on a buzy or isolated CPU.

And note that even when (base->cpu == smp_processor_id()) we want to kick
the current CPU with a self-IPI. This way we force, from IRQ-tail, the
tick to recalculate the next deadline to fire, considering the new enqueued
timer callback.

> Without timers_migration_enabled (or if TIMER_PINNED), I don't see
> anything that could migrate the timer elsewhere, so:
> 
> Why bother kicking a NOHZ CPU for a deferrable timer if it is the next
> expiring one? Per the definition:
> 
>  * @TIMER_DEFERRABLE: A deferrable timer will work normally when the
>  * system is busy, but will not cause a CPU to come out of idle just
>  * to service it; instead, the timer will be serviced when the CPU
>  * eventually wakes up with a subsequent non-deferrable timer.
> 
> I tried to find some discussion over this in LKML, but found nothing.
> v3 of the patch did *not* kick a CPU for a deferrable timer, but v4 (the
> one that ended up merged) did (see below). Patch in question is:
> 
>   a683f390b93f ("timers: Forward the wheel clock whenever possible")

Because TIMER_DEFERRABLE timers should only be deferred when the CPU is
in "nohz-idle". If the CPU runs an actual task with the tick shutdown
("nohz-full"), we should execute those deferrable timers.

Now that's the theory. In practice the deferrable timers are ignored by
both nohz-idle and nohz-full when it comes to compute the next nohz delta.
This is a mistake that is there since the introduction of nohz-full but I've
always been scared to break some user setup while fixing it. Anyway things
should look like this (untested):

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index de192dcff828..5f8ef777a785 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -819,7 +819,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 		 * disabled this also looks at the next expiring
 		 * hrtimer.
 		 */
-		next_tick = get_next_timer_interrupt(basejiff, basemono);
+		next_tick = get_next_timer_interrupt(basejiff, basemono, ts->inidle);
 		ts->next_timer = next_tick;
 	}
 
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 717fcb9fb14a..8279d4e9b7a0 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -574,16 +574,6 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
 	if (!is_timers_nohz_active())
 		return;
 
-	/*
-	 * TODO: This wants some optimizing similar to the code below, but we
-	 * will do that when we switch from push to pull for deferrable timers.
-	 */
-	if (timer->flags & TIMER_DEFERRABLE) {
-		if (tick_nohz_full_cpu(base->cpu))
-			wake_up_nohz_cpu(base->cpu);
-		return;
-	}
-
 	/*
 	 * We might have to IPI the remote CPU if the base is idle and the
 	 * timer is not deferrable. If the other CPU is on the way to idle
@@ -1678,17 +1668,9 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
 	return DIV_ROUND_UP_ULL(nextevt, TICK_NSEC) * TICK_NSEC;
 }
 
-/**
- * get_next_timer_interrupt - return the time (clock mono) of the next timer
- * @basej:	base time jiffies
- * @basem:	base time clock monotonic
- *
- * Returns the tick aligned clock monotonic time of the next pending
- * timer or KTIME_MAX if no timer is pending.
- */
-u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
+static u64 get_next_base_interrupt(struct timer_base *base,
+				   unsigned long basej, u64 basem)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
 	u64 expires = KTIME_MAX;
 	unsigned long nextevt;
 
@@ -1734,6 +1716,32 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 	}
 	raw_spin_unlock(&base->lock);
 
+	return expires;
+}
+
+/**
+ * get_next_timer_interrupt - return the time (clock mono) of the next timer
+ * @basej:	base time jiffies
+ * @basem:	base time clock monotonic
+ * @idle:	is the CPU idle?
+ *
+ * Returns the tick aligned clock monotonic time of the next pending
+ * timer or KTIME_MAX if no timer is pending.
+ */
+u64 get_next_timer_interrupt(unsigned long basej, u64 basem, bool idle)
+{
+	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+	u64 expires;
+
+	expires = get_next_base_interrupt(base, basej, basem);
+	if (!idle) {
+		u64 expires_def;
+
+		base = this_cpu_ptr(&timer_bases[BASE_DEF]);
+		expires_def = get_next_base_interrupt(base, basej, basem);
+		expires = min(expires, expires_def);
+	}
+
 	return cmp_next_hrtimer_event(basem, expires);
 }
 
@@ -1744,15 +1752,15 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
  */
 void timer_clear_idle(void)
 {
-	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
-
 	/*
 	 * We do this unlocked. The worst outcome is a remote enqueue sending
 	 * a pointless IPI, but taking the lock would just make the window for
 	 * sending the IPI a few instructions smaller for the cost of taking
 	 * the lock in the exit from idle path.
 	 */
-	base->is_idle = false;
+	__this_cpu_write(timer_bases[BASE_STD].is_idle, false);
+	if (tick_nohz_full_cpu(smp_processor_id()))
+		__this_cpu_write(timer_bases[BASE_DEF].is_idle, false);
 }
 #endif