Message-ID: <20231208181805.BbFDsoJe@linutronix.de>
Date:   Fri, 8 Dec 2023 19:18:05 +0100
From:   Sebastian Siewior <bigeasy@...utronix.de>
To:     Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Eric Dumazet <edumazet@...gle.com>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Arjan van de Ven <arjan@...radead.org>,
        "Paul E . McKenney" <paulmck@...nel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Giovanni Gherdovich <ggherdovich@...e.cz>,
        Lukasz Luba <lukasz.luba@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Srinivas Pandruvada <srinivas.pandruvada@...el.com>,
        K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> + * Required event and timerqueue update after a remote expiry:
> + * -----------------------------------------------------------
> + *
> + * After a remote expiry of a CPU, a walk through the hierarchy updating the
> + * events and timerqueues has to be done when there is a 'new' global timer of
> + * the remote CPU (which is obvious) but also if there is no new global timer,
> + * but the remote CPU is still idle:

After expiring timers of a remote CPU, a walk through the hierarchy
updating the events and timerqueues is required. It is obviously needed
if there is a 'new' global timer, but also if there is no new global
timer while the remote CPU is still idle.

> + * 1. CPU2 is the migrator and does the remote expiry in GRP1:0; expiry of
> + *    evt-CPU0 and evt-CPU1 are equal:

       CPU0 and CPU1 both have a timer expiring at the same time, so both
       have an event enqueued in the timerqueue. CPU2 and CPU3 have no
       global timer pending and CPU2 is the only active CPU and also the
       migrator.

> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:1
> + *                 --> timerqueue = evt-GRP0:0
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = TMIGR_NONE     migrator = CPU2
> + *           active   =                active   = CPU2
> + *           groupevt.ignore = false   groupevt.ignore = true
> + *           groupevt.cpu = CPU0       groupevt.cpu =
> + *           timerqueue = evt-CPU0,    timerqueue =
> + *                        evt-CPU1
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             idle        idle           active      idle
> + *
> + * 2. Remove the first event of the timerqueue in GRP1:0 and expire the timers
> + *    of CPU0 (see evt-GRP0:0->cpu value):

      CPU2 begins to expire remote timers. It starts with its own group
      GRP0:1. GRP0:1 has nothing in its timerqueue and continues with its
      parent, GRP1:0. In GRP1:0 it dequeues the first event, looks at the
      CPU member and expires the pending timer of CPU0.

> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:1
> + *                 --> timerqueue =
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = TMIGR_NONE     migrator = CPU2
> + *           active   =                active   = CPU2
> + *           groupevt.ignore = false   groupevt.ignore = true
> + *       --> groupevt.cpu = CPU0       groupevt.cpu =
> + *           timerqueue = evt-CPU0,    timerqueue =
> + *                        evt-CPU1
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             idle        idle           active      idle
> + *
> + * 3. After the remote expiry CPU0 has no global timer that needs to be
> + *    enqueued. When skipping the walk, the global timer of CPU1 is not handled,
> + *    as the group event of GRP0:0 is not updated and not enqueued into GRP1:0. The
> + *    walk has to be done to update the group events and timerqueues:

     The work isn't over after expiring the timers of CPU0. If we stop
     here, then CPU1's timer has not been expired and the timerqueue of
     GRP0:0 still has an event for CPU0 enqueued which has just been
     processed. So it is required to walk the hierarchy from CPU0's
     point of view and update it accordingly.
     CPU0 will be removed from the timerqueue because it has no pending
     timer. If CPU0 did have a timer pending, it would have to expire
     after CPU1's first timer because all timers from this period have
     just been expired.
     Either way CPU1 will be first in GRP0:0's timerqueue and therefore
     set in the CPU field of the group event which is enqueued in
     GRP1:0's timerqueue.

> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:1
> + *                 --> timerqueue = evt-GRP0:0
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = TMIGR_NONE     migrator = CPU2
> + *           active   =                active   = CPU2
> + *           groupevt.ignore = false   groupevt.ignore = true
> + *       --> groupevt.cpu = CPU1       groupevt.cpu =
> + *       --> timerqueue = evt-CPU1     timerqueue =
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             idle        idle           active      idle
> + *
> + * Now CPU2 (migrator) is able to handle the timer of CPU1 as CPU2 only scans
> + * the timerqueues of GRP0:1 and GRP1:0.

    Now CPU2 continues step 2 at GRP1:0 and will expire the timer of
    CPU1.

> + * The update of step 3 is valid to be skipped, when the remote CPU went offline
> + * in the meantime because an update was already done during inactive path. When
> + * CPU became active in the meantime, update isn't required as well, because
> + * GRP0:0 is now longer idle.

   The hierarchy walk in step 3 can be skipped if the migrator notices
   that a CPU of GRP0:0 is active. The CPU will mark GRP0:0 active and
   take care of the group and any needed updates within the hierarchy.

I skipped the "offline" part because it is not needed. Before the CPU
can go offline, it first has to come out of idle. While going offline it
(probably) won't participate here and the remaining timers will be
migrated to another CPU.

> + */
…
> +
> +typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, void *);
> +
> +static void __walk_groups(up_f up, void *data,
> +			  struct tmigr_cpu *tmc)
> +{
> +	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
> +
> +	do {
> +		WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
> +
> +		if (up(group, child, data))
> +			break;
> +
> +		child = group;
> +		group = group->parent;
> +	} while (group);
> +}
> +
> +static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
> +{
> +	lockdep_assert_held(&tmc->lock);
> +
> +	__walk_groups(up, data, tmc);
> +}

So these two: all users of walk_groups() have tmigr_cpu::lock acquired
while the users of __walk_groups() don't. Also, the `up' function passed
to walk_groups() always takes the same data type, and the data argument
passed to __walk_groups() likewise has a consistent type, but a
different one than the former.

Given the locking situation and the type of the data argument, it looks
like walk_groups() is used for thing#1 and __walk_groups() for thing#2.
Therefore it could make sense to have two separate functions (instead of
walk_groups() and __walk_groups()) to make this distinction explicit.
It is too late now, but maybe later I'll figure out why the one type
requires locking and the other doesn't.

…
> +/*
> + * Return the next event which is already expired of the group timerqueue
> + *
> + * Event, which is returned, is also removed from the queue.
> + */
> +static struct tmigr_event *tmigr_next_expired_groupevt(struct tmigr_group *group,
> +						     u64 now)
> +{
> +	struct tmigr_event *evt = tmigr_next_groupevt(group);
> +
> +	if (!evt || now < evt->nextevt.expires)
> +		return NULL;
> +
> +	/*
> +	 * The event is already expired. Remove it. If it's not the last event,
> +	 * then update all group event related information.
> +	 */

  The event has expired, remove it. Update the group's next expiry time.

> +	if (timerqueue_del(&group->events, &evt->nextevt))
> +		tmigr_next_groupevt(group);
> +	else
> +		WRITE_ONCE(group->next_expiry, KTIME_MAX);

And then you can invoke tmigr_next_groupevt() unconditionally.

> +	return evt;
> +}
> +

Sebastian
