linux-kernel - Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87v87a9033.fsf@somnus>
Date: Tue, 30 Jan 2024 18:56:32 +0100
From: Anna-Maria Behnsen <anna-maria@...utronix.de>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
 John Stultz <jstultz@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
 Eric Dumazet <edumazet@...gle.com>, "Rafael J . Wysocki"
 <rafael.j.wysocki@...el.com>, Arjan van de Ven <arjan@...radead.org>,
 "Paul E . McKenney" <paulmck@...nel.org>, Rik van Riel <riel@...riel.com>,
 Steven Rostedt <rostedt@...dmis.org>, Sebastian Siewior
 <bigeasy@...utronix.de>, Giovanni Gherdovich <ggherdovich@...e.cz>, Lukasz
 Luba <lukasz.luba@....com>, "Gautham R . Shenoy" <gautham.shenoy@....com>,
 Srinivas Pandruvada <srinivas.pandruvada@...el.com>, K Prateek Nayak
 <kprateek.nayak@....com>
Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

Frederic Weisbecker <frederic@...nel.org> writes:

> Le Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen a écrit :
>> +int tmigr_requires_handle_remote(void)
>> +{
>> +	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
>> +	struct tmigr_remote_data data;
>> +	unsigned int ret = 0;
>> +	unsigned long jif;
>> +
>> +	if (tmigr_is_not_available(tmc))
>> +		return ret;
>> +
>> +	data.now = get_jiffies_update(&jif);
>> +	data.childmask = tmc->childmask;
>> +	data.firstexp = KTIME_MAX;
>> +	data.tmc_active = !tmc->idle;
>> +	data.check = false;
>> +
>> +	/*
>> +	 * If the CPU is active, walk the hierarchy to check whether a remote
>> +	 * expiry is required.
>> +	 *
>> +	 * Check is done lockless as interrupts are disabled and @tmc->idle is
>> +	 * set only by the local CPU.
>> +	 */
>> +	if (!tmc->idle) {
>> +		__walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
>> +
>> +		if (data.firstexp != KTIME_MAX)
>> +			ret = 1;
>> +
>> +		return ret;
>> +	}
>> +
>> +	/*
>> +	 * If the CPU is idle, check whether the recalculation of @tmc->wakeup
>> +	 * is required. @tmc->wakeup_recalc is set, when the last active CPU
>> +	 * went offline. The last active CPU delegated the handling of the timer
>> +	 * migration hierarchy to another (this) CPU by updating this flag and
>> +	 * sending a reschedule.
>> +	 *
>> +	 * Racy lockless check is valid:
>> +	 * - @tmc->wakeup_recalc is set by the remote CPU before it issues
>> +	 *   reschedule IPI.
>> +	 * - As interrupts are disabled here this CPU will either observe
>> +	 *   @tmc->wakeup_recalc set before the reschedule IPI can be handled or
>> +	 *   it will observe it when this function is called again on return
>> +	 *   from handling the reschedule IPI.
>> +	 */
>> +	if (tmc->wakeup_recalc) {
>> +		__walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
>> +
>> +		if (data.firstexp != KTIME_MAX)
>> +			ret = 1;
>> +
>> +		raw_spin_lock(&tmc->lock);
>> +		WRITE_ONCE(tmc->wakeup, data.firstexp);
>> +		tmc->wakeup_recalc = false;
>> +		raw_spin_unlock(&tmc->lock);
>
> Suppose we have:
>
>             [GRP1:0]
>             migrator = GRP0:1
>             active   = GRP0:0, GRP0:1
>               /                \
>      [GRP0:0]                  [GRP0:1]
>      migrator = CPU 1         migrator = CPU 3
>      active   = CPU 1         active   = CPU 3
>        /         \               /         \
> CPUs  0           1             2           3
>      idle        active        idle        active
>
> CPU 0 and CPU 2 have no timer.
> CPU 1 has a timer in a few millisecs.
>
>             [GRP1:0]
>             migrator = GRP0:1
>             active   = GRP0:1
>               /                \
>      [GRP0:0]                  [GRP0:1]
>      migrator = NONE           migrator = CPU 3
>      active   = NONE           active   = CPU 3
>        /         \                /         \
> CPUs  0           1              2           3
>      idle        idle         idle        active
>
>
> CPU 1 went idle, CPU 3 will take care of CPU 1's timer. Then come two
> things happening at the same time: CPU 0 has a timer interrupt, due to
> RCU callbacks handling for example, and CPU 3 goes offline:
>
> CPU 0                                   CPU 3
> -----                                   -----
>                                         // On top level [GRP1:0], just set migrator = TMIGR_NONE
>                                         tmigr_inactive_up() {
>                                             cpu = cpumask_any_but(cpu_online_mask, cpu);
>                                             //cpu == 0
>                                             tmc_resched = per_cpu_ptr(&tmigr_cpu, CPU 0);
>                                             raw_spin_lock(&tmc_resched->lock);
>                                             tmc_resched->wakeup_recalc = true;
>                                             raw_spin_unlock(&tmc_resched->lock);
> // timer interrupt
> run_local_timers() {
>     tmigr_requires_handle_remote() {
>         data.firstexp = KTIME_MAX;
>         // CPU 0 sees the tmc_resched->wakeup_recalc
>         // latest update
>         if (tmc->wakeup_recalc) {
>             tmigr_requires_handle_remote_up() {
>                 // CPU 0 doesn't see GRP0:0 
>                 // latest update from CPU 1,
>                 // because it has no locking
>                 // and does a racy check.
>         	    if (!tmigr_check_migrator(group, childmask))
>                     return true;
>             }
>             raw_spin_lock(&tmc->lock);
>             WRITE_ONCE(tmc->wakeup, data.firstexp);
>             tmc->wakeup_recalc = false;
>             raw_spin_unlock(&tmc->lock)
>             return 0;
>         }
>                                             // IPI is sent only now
> 		                                    smp_send_reschedule(cpu);
>                                             }
>
>
> There is nothing that prevents CPU 0 from not seeing the hierarchy updates from
> other CPUs since it checks the migrators in a racy way. As a result the timer of
> CPU 1 may be ignored by CPU 0.
>
> You'd need to lock the tmc while calling tmigr_requires_handle_remote_up(), so
> that CPU 0 "inherits" everything that CPU 3 has seen, and that includes changes
> from CPU 1.
>

puhh. ok. But for the !idle case the lockless walk of
tmigr_requires_handle_remote_up() is ok? It's also possible, that the
CPU misses an update of the state - another CPU goes idle and selects
this CPU as the new migrator. And this CPU reads a stale value where the
other CPU is migrator. But this will be revisited on the next
tick. hmm...

>
> But I see that tmigr_cpu_new_timer() does it right. Wouldn't it be possible to
> exlusively let tmigr_cpu_new_timer() handle the wakeup_recalc thing? This is
> going to be called after the end of the IRQ (whether timer interrupt or sched
> IPI) in any case.

Should work, yes. But when a timer has to be handled right away and it
is checked after the end of the IRQ, then the tick might be reprogrammed
so that CPU comes out of idle, or am I wrong?

But, while I'm having a deeper look at the code - I completely destroyed
the logic to use the 'check' value of the tmigr_remote_date struct for
making a decision whether to raise a timer softirq or not... Whenever
the expiry in the top level is !KTIME_MAX, then
tmigr_require_handle_remote returns 1. Oh no. Also the struct member
description is not up to date.

Thanks,

	Anna-Maria