linux-kernel - Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231207180928.FZB319OJ@linutronix.de>
Date:   Thu, 7 Dec 2023 19:09:28 +0100
From:   Sebastian Siewior <bigeasy@...utronix.de>
To:     Anna-Maria Behnsen <anna-maria@...utronix.de>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Eric Dumazet <edumazet@...gle.com>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Arjan van de Ven <arjan@...radead.org>,
        "Paul E . McKenney" <paulmck@...nel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Giovanni Gherdovich <ggherdovich@...e.cz>,
        Lukasz Luba <lukasz.luba@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Srinivas Pandruvada <srinivas.pandruvada@...el.com>,
        K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> + * Protection of the tmigr group state information:
> + * ------------------------------------------------
> + *
> + * The state information with the list of active children and migrator needs to
> + * be protected by a sequence counter. It prevents a race when updates in a

s/a$//

> + * child groups are propagated in changed order. The following scenario
> + * describes what happens without updating the sequence counter:
> + *
> + * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
> + * as GRP0:1 will not change during the scenario):
> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = CPU0           migrator = CPU2
> + *           active   = CPU0           active   = CPU2
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             active      idle           active      idle
> + *
> + *
> + * 1. CPU0 goes idle (changes are updated in GRP0:0; afterwards the current
> + *    states of GRP0:0 and GRP1:0 are stored in the data for walking the
> + *    hierarchy):

   CPU0 goes idle. The state update is performed lock less and group
   wise. In the first step only GRP0:0 has been updated. The update of
   GRP1:0 is pending, the CPU walks through the hierarchy.

> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *       --> migrator = TMIGR_NONE     migrator = CPU2
> + *       --> active   =                active   = CPU2
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *         --> idle        idle           active      idle

> + * 2. CPU1 comes out of idle (changes are update in GRP0:0; afterwards the
> + *    current states of GRP0:0 and GRP1:0 are stored in the data for walking the
> + *    hierarchy):

   While CPU0 goes idle and continues to update the state, CPU1 comes
   out of idle. CPU1 updates GRP0:0. The update for GRP1:0 is pending,
   tge CPU walks through the hierarchy. Both CPUs now walk the hierarchy
   to perform the needed update from their point of view.
   The currently visible state:

> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *       --> migrator = CPU1           migrator = CPU2
> + *       --> active   = CPU1           active   = CPU2
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             idle    --> active         active      idle
> + *
> + * 3. Here comes the change of the order: Propagating the changes of step 2
> + *    through the hierarchy to GRP1:0 - nothing to be done, because GRP0:0
> + *    is already up to date.

    Here is the race condition: CPU1 managed to propagate its changes
    through the hierarchy to GRP1:0 before CPU0 did. The active members
    of GRP1:0 remain unchanged after the update since it is still valid
    from CPU1 current point of view:

         LVL 1            [GRP1:0]
                      --> migrator = GRP0:1
                      --> active   = GRP0:0, GRP0:1
                        /                \
         LVL 0  [GRP0:0]                  [GRP0:1]
                migrator = CPU1           migrator = CPU2
                active   = CPU1           active   = CPU2
                   /         \                /         \
         CPUs     0           1              2           3
                  idle        active         active      idle

[ I take it as the migrator remains set to GRP0:1 by CPU1 but it could
  be changed to GRP0:0. I assume that both fields (migrator+active) are
  changed there via the propagation and the arrow in both fields denotes
  this. ]

> + * 4. Propagating the changes of step 1 through the hierarchy to GRP1:0

     Now CPU0 finally propagates its changes to GRP1:0.

> + *
> + *    LVL 1            [GRP1:0]
> + *                 --> migrator = GRP0:1
> + *                 --> active   = GRP0:1
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = CPU1           migrator = CPU2
> + *           active   = CPU1           active   = CPU2
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             idle        active         active      idle
> + *
> + * Now there is a inconsistent overall state because GRP0:0 is active, but
> + * it is marked as idle in the GRP1:0. This is prevented by incrementing
> + * sequence counter whenever changing the state.

   The race of CPU0 vs CPU1 led to an inconsistent state in GRP1:0.
   CPU1 is active and is correctly listed as active in GRP0:0. However
   GRP1:0 does not have GRP0:0 listed as active which is wrong.
   The sequence counter has been added to avoid inconsistent states
   during updates. The state is updated atomically only if all members,
   including the sequence counter, match the expected value
   (compare-and-exchange).
   Looking back at the previous example with the addition of the
   sequence number: The update as performed by CPU0 in step 4 will fail.
   CPU1 changed the sequence number during the update in step 3 so the
   expected old value (as seen by CPU0 before starting the walk) does
   not match.


Sebastian