Date: Mon, 24 Jun 2024 10:58:26 +0200
From: Anna-Maria Behnsen <anna-maria@...utronix.de>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org,
 Borislav Petkov <bp@...en8.de>, Narasimhan V <Narasimhan.V@....com>
Subject: Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements

Frederic Weisbecker <frederic@...nel.org> writes:

> On Fri, Jun 21, 2024 at 11:37:05AM +0200, Anna-Maria Behnsen wrote:
>> Borislav reported a warning in the timer migration deactivate path
>> 
>>   https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local
>> 
>> Sadly it doesn't reproduce directly. But with a change of timing (by
>> adding a trace printk before the warning), it is possible to trigger the
>> warning reliably, at least in my test setup. The problem here is a racy
>> check against the group->parent pointer. This pointer is also used in
>> other places in the code, and fixing this racy usage is addressed by the
>> first patch.
>> 
>> While working with the code, I saw two things which could be improved
>> (tracing and the update of the per CPU group wakeup value). These
>> improvements are addressed by the other two patches.
>> 
>> Patches are available here:
>> 
>>   https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc
>> 
>> Cc: Frederic Weisbecker <frederic@...nel.org>
>> Cc: Thomas Gleixner <tglx@...utronix.de>
>> Cc: linux-kernel@...r.kernel.org
>> 
>> Thanks,
>> 
>> Anna-Maria
>> 
>> ---
>
> This made me stare at the group creation again and I might have found
> something. Does the following race look plausible to you?

Yes...

>
>                   [GRP0:0]
>                migrator = 0
>                active   = 0
>                nextevt  = KTIME_MAX
>                /         \
>               0         1 .. 7
>           active         idle
>
> 0) The hierarchy has only 8 CPUs (a single node for now) with only CPU 0
>    active.
>
>    
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                                          \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = 0              migrator = TMIGR_NONE
>               active   = 0              active   = NONE
>               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>           active         idle                !online
>
> 1) CPU 8 is booting and creates a new node and a new top. For now it's
>    only connected to GRP0:1, not yet to GRP0:0. Also CPU 8 hasn't called
>    __tmigr_cpu_activate() on itself yet.

NIT: In step 1) CPU 8 is not yet connected to GRP0:1, as the second while
loop is not yet finished, but nevertheless...

>
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = 0              migrator = TMIGR_NONE
>               active   = 0              active   = NONE
>               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>           active         idle                active
>
> 2) CPU 8 connects GRP0:0 to GRP1:0 and observes while in
>    tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
>    prepares to call tmigr_active_up() on it. It hasn't done it yet.

NIT: CPU 8 keeps its state !online until step 5.

>
>
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE        migrator = TMIGR_NONE
>               active   = NONE              active   = NONE
>               nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
>    GRP0:0->lock held, CPU 0 observes GRP1:0 after calling tmigr_update_events()
>    and it propagates the change to the top (no change there and no wakeup
>    programmed since there is no timer).
>
>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE       migrator = TMIGR_NONE
>               active   = NONE             active   = NONE
>               nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 4) Now CPU 8 finally calls tmigr_active_up() to GRP0:0

... oh no. Here it starts to go mad. Good catch!

>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0, GRP0:1
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE       migrator = 8
>               active   = NONE             active   = 8
>               nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 5) And out of tmigr_cpu_online() CPU 8 calls tmigr_active_up() on itself
>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0
>                         nextevt  = T8
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE         migrator = TMIGR_NONE
>               active   = NONE               active   = NONE
>               nextevt  = KTIME_MAX          nextevt  = T8
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                  idle
>
> 6) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
>    But GRP0:0 is not really active, so T8 gets ignored.
>
>
> And if that race looks plausible, does the following fix look good?

Holding the lock will not help, as the state is not protected by the
lock. Step 4) would nevertheless happen. To properly fix this, we need a
memory barrier. Back then I added a comment in tmigr_active_up() explaining
why the barrier is not required there, but without having this corner case
in mind. Or am I wrong?
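
Just to illustrate the point, here is a userspace-ish sketch (made-up
names, pthread/stdatomic instead of the kernel primitives, not the actual
code): the lock only covers the parent pointer setup, while the state word
next to it is changed with lock-free atomics, so taking the lock around the
"is the child active?" check orders nothing against those updates.

#include <pthread.h>
#include <stdatomic.h>

struct grp {
	pthread_mutex_t lock;		/* protects ->parent, NOT ->state */
	struct grp *parent;
	_Atomic unsigned int state;	/* changed lock-free, like migr_state */
};

/* Stand-in for the active propagation: lock-free update of the parent state */
static void mark_parent_active(struct grp *parent, unsigned int childbit)
{
	atomic_fetch_or(&parent->state, childbit);
}

static void connect_child_parent(struct grp *child, struct grp *parent,
				 unsigned int childbit)
{
	pthread_mutex_lock(&child->lock);
	child->parent = parent;		/* this is all the lock really covers */
	pthread_mutex_unlock(&child->lock);

	/*
	 * The idle path can flip child->state with a lock-free cmpxchg at
	 * any point around this load, lock held or not. Acting on this
	 * snapshot is the race; it needs ordering against that lock-free
	 * update, which no lock taken here can provide.
	 */
	if (atomic_load(&child->state))
		mark_parent_active(parent, childbit);
}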

To solve it, we could

a) add a memory barrier also in tmigr_active_up() and always read the
   childstate there (similar to tmigr_inactive_up()), or

b) create a separate tmigr_setup_active_up() function with this memory
   barrier and read the childstate there after the barrier.

I would propose to go with b) so as not to impact active_up().
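
Stripped of the tmigr details, the core of b) is just this loop (again a
userspace-ish sketch with made-up names; the real version with the proper
kernel barriers is in the diff below):

#include <stdatomic.h>
#include <stdint.h>

/*
 * Activate @childbit in @parent only as long as @child still has active
 * members; re-check the child on every retry. The seq_cst atomics here
 * stand in for the acquire read plus the explicit barrier the kernel
 * version needs after a failed try_cmpxchg().
 */
static void setup_active_up(_Atomic uint32_t *parent, _Atomic uint32_t *child,
			    uint32_t childbit)
{
	uint32_t cur = atomic_load(parent);

	for (;;) {
		/* The child group went fully idle in the meantime: nothing to do */
		if (!atomic_load(child))
			return;

		/* Publish the child as active in the parent, or retry */
		if (atomic_compare_exchange_weak(parent, &cur, cur | childbit))
			break;
		/* cur now holds the fresh parent state; the loop re-checks the child */
	}
}

In the kernel version a failed try_cmpxchg() does not imply ordering, hence
the explicit smp_mb__after_atomic() before re-reading the child state.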

I dropped the walk_hierarchy information for tmigr_setup_active_up() and
also do not set the group event ignore bit. Neither should be required,
as this active update can only happen between the new top level and the
formerly top-level group, and the event ignore bit on the top level isn't
taken into account.

This is not tested yet, but it could look like the following. If it
makes sense to you and I didn't miss something else, I would start
testing and functionality verification ;)

---8<----
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -623,12 +623,37 @@ static u64 tmigr_next_groupevt_expires(s
 		return evt->nextevt.expires;
 }
 
+static bool __tmigr_active_up(struct tmigr_group *group, bool *walk_done, union tmigr_state *curstate, u8 childmask)
+{
+	union tmigr_state newstate;
+
+	newstate = *curstate;
+	*walk_done = true;
+
+	if (newstate.migrator == TMIGR_NONE) {
+		newstate.migrator = childmask;
+
+		/* Changes need to be propagated */
+		*walk_done = false;
+	}
+
+	newstate.active |= childmask;
+	newstate.seq++;
+
+	if (atomic_try_cmpxchg(&group->migr_state, &curstate->state, newstate.state)) {
+		trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+		return true;
+	}
+
+	return false;
+}
+
 static bool tmigr_active_up(struct tmigr_group *group,
 			    struct tmigr_group *child,
 			    void *ptr)
 {
-	union tmigr_state curstate, newstate;
 	struct tmigr_walk *data = ptr;
+	union tmigr_state curstate;
 	bool walk_done;
 	u8 childmask;
 
@@ -640,23 +665,10 @@ static bool tmigr_active_up(struct tmigr
 	 */
 	curstate.state = atomic_read(&group->migr_state);
 
-	do {
-		newstate = curstate;
-		walk_done = true;
-
-		if (newstate.migrator == TMIGR_NONE) {
-			newstate.migrator = childmask;
-
-			/* Changes need to be propagated */
-			walk_done = false;
-		}
-
-		newstate.active |= childmask;
-		newstate.seq++;
-
-	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
-
-	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+	for (;;) {
+		if (__tmigr_active_up(group, &walk_done, &curstate, childmask))
+			break;
+	}
 
 	if (walk_done == false)
 		data->childmask = group->childmask;
@@ -1436,6 +1448,35 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
+{
+	union tmigr_state curstate, childstate;
+	bool walk_done;
+
+	/*
+	 * FIXME: Memory barrier is required here as the child state
+	 * could have changed in the meantime
+	 */
+	curstate.state = atomic_read_acquire(&group->migr_state);
+
+	for (;;) {
+		childstate.state = atomic_read(&child->migr_state);
+		if (!childstate.active)
+			return;
+
+		if (__tmigr_active_up(group, &walk_done, &curstate, child->childmask))
+			break;
+
+		/*
+		 * The memory barrier is paired with the cmpxchg() in
+		 * tmigr_inactive_up() to make sure the updates of child and group
+		 * states are ordered. It is required only when the
+		 * try_cmpxchg() in __tmigr_active_up() fails.
+		 */
+		smp_mb__after_atomic();
+	}
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1510,8 +1551,6 @@ static struct tmigr_group *tmigr_get_gro
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent)
 {
-	union tmigr_state childstate;
-
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
@@ -1539,21 +1578,7 @@ static void tmigr_connect_child_parent(s
 	 *   executed with the formerly top level group (child) and the newly
 	 *   created group (parent).
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
-
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	tmigr_setup_active_up(parent, child);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
