linux-kernel - Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zbb5m0hRHgk59-8z@pavilion.home>
Date: Mon, 29 Jan 2024 02:04:27 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc: linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
	John Stultz <jstultz@...gle.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Eric Dumazet <edumazet@...gle.com>,
	"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
	Arjan van de Ven <arjan@...radead.org>,
	"Paul E . McKenney" <paulmck@...nel.org>,
	Rik van Riel <riel@...riel.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Sebastian Siewior <bigeasy@...utronix.de>,
	Giovanni Gherdovich <ggherdovich@...e.cz>,
	Lukasz Luba <lukasz.luba@....com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>,
	Srinivas Pandruvada <srinivas.pandruvada@...el.com>,
	K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

Le Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen a écrit :
> +int tmigr_requires_handle_remote(void)
> +{
> +	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> +	struct tmigr_remote_data data;
> +	unsigned int ret = 0;
> +	unsigned long jif;
> +
> +	if (tmigr_is_not_available(tmc))
> +		return ret;
> +
> +	data.now = get_jiffies_update(&jif);
> +	data.childmask = tmc->childmask;
> +	data.firstexp = KTIME_MAX;
> +	data.tmc_active = !tmc->idle;
> +	data.check = false;
> +
> +	/*
> +	 * If the CPU is active, walk the hierarchy to check whether a remote
> +	 * expiry is required.
> +	 *
> +	 * Check is done lockless as interrupts are disabled and @tmc->idle is
> +	 * set only by the local CPU.
> +	 */
> +	if (!tmc->idle) {
> +		__walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
> +
> +		if (data.firstexp != KTIME_MAX)
> +			ret = 1;
> +
> +		return ret;
> +	}
> +
> +	/*
> +	 * If the CPU is idle, check whether the recalculation of @tmc->wakeup
> +	 * is required. @tmc->wakeup_recalc is set, when the last active CPU
> +	 * went offline. The last active CPU delegated the handling of the timer
> +	 * migration hierarchy to another (this) CPU by updating this flag and
> +	 * sending a reschedule.
> +	 *
> +	 * Racy lockless check is valid:
> +	 * - @tmc->wakeup_recalc is set by the remote CPU before it issues
> +	 *   reschedule IPI.
> +	 * - As interrupts are disabled here this CPU will either observe
> +	 *   @tmc->wakeup_recalc set before the reschedule IPI can be handled or
> +	 *   it will observe it when this function is called again on return
> +	 *   from handling the reschedule IPI.
> +	 */
> +	if (tmc->wakeup_recalc) {
> +		__walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
> +
> +		if (data.firstexp != KTIME_MAX)
> +			ret = 1;
> +
> +		raw_spin_lock(&tmc->lock);
> +		WRITE_ONCE(tmc->wakeup, data.firstexp);
> +		tmc->wakeup_recalc = false;
> +		raw_spin_unlock(&tmc->lock);

Suppose we have:

            [GRP1:0]
            migrator = GRP0:1
            active   = GRP0:0, GRP0:1
              /                \
     [GRP0:0]                  [GRP0:1]
     migrator = CPU 1         migrator = CPU 3
     active   = CPU 1         active   = CPU 3
       /         \               /         \
CPUs  0           1             2           3
     idle        active        idle        active

CPU 0 and CPU 2 have no timer.
CPU 1 has a timer in a few millisecs.

            [GRP1:0]
            migrator = GRP0:1
            active   = GRP0:1
              /                \
     [GRP0:0]                  [GRP0:1]
     migrator = NONE           migrator = CPU 3
     active   = NONE           active   = CPU 3
       /         \                /         \
CPUs  0           1              2           3
     idle        idle         idle        active


CPU 1 went idle, CPU 3 will take care of CPU 1's timer. Then come two
things happening at the same time: CPU 0 has a timer interrupt, due to
RCU callbacks handling for example, and CPU 3 goes offline:

CPU 0                                   CPU 3
-----                                   -----
                                        // On top level [GRP1:0], just set migrator = TMIGR_NONE
                                        tmigr_inactive_up() {
                                            cpu = cpumask_any_but(cpu_online_mask, cpu);
                                            //cpu == 0
                                            tmc_resched = per_cpu_ptr(&tmigr_cpu, CPU 0);
                                            raw_spin_lock(&tmc_resched->lock);
                                            tmc_resched->wakeup_recalc = true;
                                            raw_spin_unlock(&tmc_resched->lock);
// timer interrupt
run_local_timers() {
    tmigr_requires_handle_remote() {
        data.firstexp = KTIME_MAX;
        // CPU 0 sees the tmc_resched->wakeup_recalc
        // latest update
        if (tmc->wakeup_recalc) {
            tmigr_requires_handle_remote_up() {
                // CPU 0 doesn't see GRP0:0 
                // latest update from CPU 1,
                // because it has no locking
                // and does a racy check.
        	    if (!tmigr_check_migrator(group, childmask))
                    return true;
            }
            raw_spin_lock(&tmc->lock);
            WRITE_ONCE(tmc->wakeup, data.firstexp);
            tmc->wakeup_recalc = false;
            raw_spin_unlock(&tmc->lock)
            return 0;
        }
                                            // IPI is sent only now
		                                    smp_send_reschedule(cpu);
                                            }


There is nothing that prevents CPU 0 from not seeing the hierarchy updates from
other CPUs since it checks the migrators in a racy way. As a result the timer of
CPU 1 may be ignored by CPU 0.

You'd need to lock the tmc while calling tmigr_requires_handle_remote_up(), so
that CPU 0 "inherits" everything that CPU 3 has seen, and that includes changes
from CPU 1.


But I see that tmigr_cpu_new_timer() does it right. Wouldn't it be possible to
exlusively let tmigr_cpu_new_timer() handle the wakeup_recalc thing? This is
going to be called after the end of the IRQ (whether timer interrupt or sched
IPI) in any case.

Thanks.