lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <875y19ouj2.fsf@somnus>
Date:   Fri, 08 Dec 2023 11:31:13 +0100
From:   Anna-Maria Behnsen <anna-maria@...utronix.de>
To:     Sebastian Siewior <bigeasy@...utronix.de>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        John Stultz <jstultz@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Eric Dumazet <edumazet@...gle.com>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Arjan van de Ven <arjan@...radead.org>,
        "Paul E . McKenney" <paulmck@...nel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Giovanni Gherdovich <ggherdovich@...e.cz>,
        Lukasz Luba <lukasz.luba@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Srinivas Pandruvada <srinivas.pandruvada@...el.com>,
        K Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

Sebastian Siewior <bigeasy@...utronix.de> writes:

> On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
>> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
>> new file mode 100644
>> index 000000000000..05cd8f1bc45d
>> --- /dev/null
>> +++ b/kernel/time/timer_migration.c
>> @@ -0,0 +1,1636 @@
> …
>> + * Protection of the tmigr group state information:
>> + * ------------------------------------------------
>> + *
>> + * The state information with the list of active children and migrator needs to
>> + * be protected by a sequence counter. It prevents a race when updates in a
>
> s/a$//
>
>> + * child groups are propagated in changed order. The following scenario
>> + * describes what happens without updating the sequence counter:
>> + *
>> + * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
>> + * as GRP0:1 will not change during the scenario):
>> + *
>> + *    LVL 1            [GRP1:0]
>> + *                     migrator = GRP0:1
>> + *                     active   = GRP0:0, GRP0:1
>> + *                   /                \
>> + *    LVL 0  [GRP0:0]                  [GRP0:1]
>> + *           migrator = CPU0           migrator = CPU2
>> + *           active   = CPU0           active   = CPU2
>> + *              /         \                /         \
>> + *    CPUs     0           1              2           3
>> + *             active      idle           active      idle
>> + *
>> + *
>> + * 1. CPU0 goes idle (changes are updated in GRP0:0; afterwards the current
>> + *    states of GRP0:0 and GRP1:0 are stored in the data for walking the
>> + *    hierarchy):
>
>    CPU0 goes idle. The state update is performed lock less and group
>    wise. In the first step only GRP0:0 has been updated. The update of
>    GRP1:0 is pending, the CPU walks through the hierarchy.
>
>> + *
>> + *    LVL 1            [GRP1:0]
>> + *                     migrator = GRP0:1
>> + *                     active   = GRP0:0, GRP0:1
>> + *                   /                \
>> + *    LVL 0  [GRP0:0]                  [GRP0:1]
>> + *       --> migrator = TMIGR_NONE     migrator = CPU2
>> + *       --> active   =                active   = CPU2
>> + *              /         \                /         \
>> + *    CPUs     0           1              2           3
>> + *         --> idle        idle           active      idle
>
>> + * 2. CPU1 comes out of idle (changes are update in GRP0:0; afterwards the
>> + *    current states of GRP0:0 and GRP1:0 are stored in the data for walking the
>> + *    hierarchy):
>
>    While CPU0 goes idle and continues to update the state, CPU1 comes
>    out of idle. CPU1 updates GRP0:0. The update for GRP1:0 is pending,
>    tge CPU walks through the hierarchy. Both CPUs now walk the hierarchy
>    to perform the needed update from their point of view.
>    The currently visible state:
>
>> + *
>> + *    LVL 1            [GRP1:0]
>> + *                     migrator = GRP0:1
>> + *                     active   = GRP0:0, GRP0:1
>> + *                   /                \
>> + *    LVL 0  [GRP0:0]                  [GRP0:1]
>> + *       --> migrator = CPU1           migrator = CPU2
>> + *       --> active   = CPU1           active   = CPU2
>> + *              /         \                /         \
>> + *    CPUs     0           1              2           3
>> + *             idle    --> active         active      idle
>> + *
>> + * 3. Here comes the change of the order: Propagating the changes of step 2
>> + *    through the hierarchy to GRP1:0 - nothing to be done, because GRP0:0
>> + *    is already up to date.
>
>     Here is the race condition: CPU1 managed to propagate its changes
>     through the hierarchy to GRP1:0 before CPU0 did. The active members
>     of GRP1:0 remain unchanged after the update since it is still valid
>     from CPU1 current point of view:
>
>          LVL 1            [GRP1:0]
>                       --> migrator = GRP0:1
>                       --> active   = GRP0:0, GRP0:1
>                         /                \
>          LVL 0  [GRP0:0]                  [GRP0:1]
>                 migrator = CPU1           migrator = CPU2
>                 active   = CPU1           active   = CPU2
>                    /         \                /         \
>          CPUs     0           1              2           3
>                   idle        active         active      idle
>
> [ I take it as the migrator remains set to GRP0:1 by CPU1 but it could
>   be changed to GRP0:0. I assume that both fields (migrator+active) are
>   changed there via the propagation and the arrow in both fields denotes
>   this. ]
>
>> + * 4. Propagating the changes of step 1 through the hierarchy to GRP1:0
>
>      Now CPU0 finally propagates its changes to GRP1:0.
>
>> + *
>> + *    LVL 1            [GRP1:0]
>> + *                 --> migrator = GRP0:1
>> + *                 --> active   = GRP0:1
>> + *                   /                \
>> + *    LVL 0  [GRP0:0]                  [GRP0:1]
>> + *           migrator = CPU1           migrator = CPU2
>> + *           active   = CPU1           active   = CPU2
>> + *              /         \                /         \
>> + *    CPUs     0           1              2           3
>> + *             idle        active         active      idle
>> + *
>> + * Now there is a inconsistent overall state because GRP0:0 is active, but
>> + * it is marked as idle in the GRP1:0. This is prevented by incrementing
>> + * sequence counter whenever changing the state.
>
>    The race of CPU0 vs CPU1 led to an inconsistent state in GRP1:0.
>    CPU1 is active and is correctly listed as active in GRP0:0. However
>    GRP1:0 does not have GRP0:0 listed as active which is wrong.
>    The sequence counter has been added to avoid inconsistent states
>    during updates. The state is updated atomically only if all members,
>    including the sequence counter, match the expected value
>    (compare-and-exchange).
>    Looking back at the previous example with the addition of the
>    sequence number: The update as performed by CPU0 in step 4 will fail.
>    CPU1 changed the sequence number during the update in step 3 so the
>    expected old value (as seen by CPU0 before starting the walk) does
>    not match.
>

Thanks a lot for rephrasing the documentation to make it clearer for the
reader! I use your proposal with some minor changes.

Thanks,

	Anna-Maria

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ