linux-kernel - Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2128d714-90f7-1598-b32b-559206fce5de@amd.com>
Date: Fri, 4 Oct 2024 16:40:08 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Klaus Kudielka <klaus.kudielka@...il.com>, Chris Bainbridge
	<chris.bainbridge@...il.com>, Peter Zijlstra <peterz@...radead.org>
CC: <linux-kernel@...r.kernel.org>, <bsegall@...gle.com>,
	<dietmar.eggemann@....com>, <efault@....de>, <juri.lelli@...hat.com>,
	<mgorman@...e.de>, <mingo@...hat.com>, <rostedt@...dmis.org>,
	<tglx@...utronix.de>, <vincent.guittot@...aro.org>, <vschneid@...hat.com>,
	<wuyun.abel@...edance.com>, <youssefesmat@...omium.org>, <spasswolf@....de>,
	<regressions@...ts.linux.dev>, Johannes Weiner <hannes@...xchg.org>, "Linux
 regression tracking (Thorsten Leemhuis)" <regressions@...mhuis.info>,
	"Gautham R. Shenoy" <gautham.shenoy@....com>
Subject: Re: [REGRESSION] Re: [PATCH 17/24] sched/fair: Implement delayed
 dequeue

Hello folks,

On 10/3/2024 11:01 AM, Klaus Kudielka wrote:
> On Sun, 2024-09-22 at 16:45 +0100, Chris Bainbridge wrote:
>> On Fri, Aug 30, 2024 at 02:34:56PM +0200, Bert Karwatzki wrote:
>>> Since linux next-20240820 the following messages appears when booting:
>>>
>>> [    T1] smp: Bringing up secondary CPUs ...
>>> [    T1] smpboot: x86: Booting SMP configuration:
>>> [    T1] .... node  #0, CPUs:        #2  #4  #6  #8 #10 #12 #14  #1
>>> This is the line I'm concerend about:
>>> [    T1] psi: inconsistent task state! task=61:cpuhp/3 cpu=0 psi_flags=4 clear=0 set=4
>>> [    T1]   #3  #5  #7  #9 #11 #13 #15
>>> [    T1] Spectre V2 : Update user space SMT mitigation: STIBP always-on
>>> [    T1] smp: Brought up 1 node, 16 CPUs
>>> [    T1] smpboot: Total of 16 processors activated (102216.16 BogoMIPS)
>>>
>>> I bisected this to commit 152e11f6df29 ("sched/fair: Implement delayed dequeue").
>>> Is this normal or is this something I should worry about?
>>>
>>> Bert Karwatzki
>>
>> I am also getting a similar error on boot, and bisected it to the same commit:
>>
>> [    0.342931] psi: inconsistent task state! task=15:rcu_tasks_trace cpu=0 psi_flags=4 clear=0 set=4
>>
>> #regzbot introduced: 152e11f6df293e816a6a37c69757033cdc72667d
> 
> Just another data point, while booting 6.12-rc1 on a Turris Omnia:
> 
> [    0.000000] Linux version 6.12.0-rc1 (XXX) (arm-linux-gnueabihf-gcc (Debian 14.2.0-1) 14.2.0, GNU ld (GNU Binutils for Debian) 2.43.1) #1 SMP Thu Oct  3 06:59:25 CEST 2024
> [    0.000000] CPU: ARMv7 Processor [414fc091] revision 1 (ARMv7), cr=10c5387d
> [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
> [    0.000000] OF: fdt: Machine model: Turris Omnia
> ...
> [    0.000867] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
> [    0.000876] psi: inconsistent task state! task=2:kthreadd cpu=0 psi_flags=4 clear=0 set=4
> 

Not sure if someone took a stab at this but I haven't seen the "psi:
inconsistent task state" warning with the below diff. I'm not sure if my
approach is right which if why I'm pasting the diff before sending out
an official series. Any comments or testing is greatly appreciated.

The diff is based on:

     git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent

at commit d4ac164bde7a ("sched/eevdf: Fix wakeup-preempt by checking
cfs_rq->nr_running")

My approach was as follows:

o psi_dequeue() relied on psi_sched_switch() to set the PSI flags
   appropriately for a dequeued task. However, psi_sched_switch() used
   "!task_on_rq_queued(prev)" to judge if the prev task is blocked which
   is now untrue with DELAYED_DEQUEUE. Fix it by checking
   "p->se.sched_delayed" as well. I also added a matching check for
   ENQUEUE_DELAYED for psi_enqueue().

o With the above, the warning was put off for a few more seconds but it
   still appeared. I dumped all PSI flag transition along with
   "tsk->se.sched_delayed" to see what trips it and I saw the following
   state changes for the task that finally tripped it:

     psi: task state: task=18:rcu_preempt cpu=0 psi_flags=0 clear=0 set=0 delayed=1
     psi: task state: task=18:rcu_preempt cpu=128 psi_flags=0 clear=0 set=4 delayed=1
     psi: task state: task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0
     psi: inconsistent task state! task=18:rcu_preempt cpu=128 psi_flags=4 clear=0 set=4 delayed=0

   Note that cpu switched with "tsk->se.sched_delayed" still set which
   got me looking at the task migration path. The warning added below
   in "deactivate_task()" tripped without fail, just before the PSI
   warning was logged.

   To prevent migration of a delayed entity (XXX: Is it a good idea?)
   we do a "account_task_dequeue()" in the delayed dequeue case to
   remove the task from the "rq->cfs_list", thus removing it from the
   purview of the load balancer.

o With the above change, I only managed to trip the deactivate_task()
   WARN_ON() and immediately the PSI warning in the NUMA balancing path

     ------------[ cut here ]------------
     p->se.sched_delayed
     WARNING: CPU: 75 PID: 473 at kernel/sched/core.c:2075 deactivate_task+0xa6/0xc0
     Modules linked in: ...
     CPU: 75 UID: 0 PID: 473 Comm: migration/75 Not tainted 6.12.0-rc1-peterz-sched-urgent-psi-fix+ #32
     Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
     Stopper: multi_cpu_stop+0x0/0x110 <- migrate_swap+0xd7/0x150
     RIP: 0010:deactivate_task+0xa6/0xc0
     Code: ...
     RSP: 0018:ffff9f210e12fdc0 EFLAGS: 00010086
     RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
     RDX: ffff90116ffa18c8 RSI: 0000000000000001 RDI: ffff90116ffa18c0
     RBP: ffff8fd2d8559ac0 R08: 0000000000000003 R09: 0000000000000000
     R10: 64656863732e6573 R11: 646579616c65645f R12: ffff90116ffb6500
     R13: ffff90116ffb6500 R14: 0000000000000004 R15: ffff8fd2f28f433c
     FS:  0000000000000000(0000) GS:ffff90116ff80000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 000055748a34e000 CR3: 000000807974e004 CR4: 0000000000f70ef0
     PKRU: 55555554
     Call Trace:
      <TASK>
      ? __warn+0x88/0x130
      ? deactivate_task+0xa6/0xc0
      ? report_bug+0x18e/0x1a0
      ? prb_read_valid+0x1b/0x30
      ? handle_bug+0x5b/0xa0
      ? exc_invalid_op+0x18/0x70
      ? asm_exc_invalid_op+0x1a/0x20
      ? deactivate_task+0xa6/0xc0
      __migrate_swap_task.part.0+0xbe/0x180
      migrate_swap_stop+0x1b6/0x1f0
      multi_cpu_stop+0x6e/0x110
      ? __pfx_multi_cpu_stop+0x10/0x10
      cpu_stopper_thread+0x97/0x160
      ? __pfx_smpboot_thread_fn+0x10/0x10
      smpboot_thread_fn+0xdd/0x1d0
      kthread+0xd3/0x100
      ? __pfx_kthread+0x10/0x10
      ret_from_fork+0x34/0x50
      ? __pfx_kthread+0x10/0x10
      ret_from_fork_asm+0x1a/0x30
      </TASK>
     ---[ end trace 0000000000000000 ]---

   From some logging, I can say the "dst_task" is the one that is
   delayed but I could not go up the stack to find out how it is
   chosen for the swap. For this RFC, I just block the delayed
   entity in "__migrate_swap_task()" and set the "p->wake_cpu"
   to redirect the next wakeup to the appropriate NUMA node.

I haven't encountered any warnings with my machine going for a while now
but I haven't tested any fancy cgroups scenarios yet either; Mileage may
vary :)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..b55b52b081ba 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2014,7 +2014,9 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
  
  	if (!(flags & ENQUEUE_RESTORE)) {
  		sched_info_enqueue(rq, p);
-		psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
+		/* Delayed tasks are considered dequeued by PSI tracking */
+		psi_enqueue(p, ((flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)) ||
+			       (flags & ENQUEUE_DELAYED));
  	}
  
  	p->sched_class->enqueue_task(rq, p, flags);
@@ -2069,6 +2071,9 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
  {
  	SCHED_WARN_ON(flags & DEQUEUE_SLEEP);
  
+	/* Delayed tasks should not be migrated */
+	SCHED_WARN_ON(p->se.sched_delayed);
+
  	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
  	ASSERT_EXCLUSIVE_WRITER(p->on_rq);
  
@@ -3298,9 +3303,21 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
  		struct rq_flags srf, drf;
  
  		src_rq = task_rq(p);
-		dst_rq = cpu_rq(cpu);
-
  		rq_pin_lock(src_rq, &srf);
+
+		if (p->se.sched_delayed) {
+			block_task(src_rq, p, DEQUEUE_DELAYED);
+			rq_unpin_lock(src_rq, &srf);
+
+			/*
+			 * Make it appear we last ran on the preferred
+			 * node. See the comment below.
+			 */
+			p->wake_cpu = cpu;
+			return;
+		}
+
+		dst_rq = cpu_rq(cpu);
  		rq_pin_lock(dst_rq, &drf);
  
  		deactivate_task(src_rq, p, 0);
@@ -6667,7 +6684,14 @@ static void __sched notrace __schedule(int sched_mode)
  
  		migrate_disable_switch(rq, prev);
  		psi_account_irqtime(rq, prev, next);
-		psi_sched_switch(prev, next, !task_on_rq_queued(prev));
+
+		/*
+		 * psi_task_switch() is responsible for clearing TSK_RUNNING
+		 * and TSK_IOWAIT which psi_dequeue() skips for a task going
+		 * to sleep (see comment there). Consider a delayed entity
+		 * as one that has gone to sleep for PSI accounting.
+		 */
+		psi_sched_switch(prev, next, !task_on_rq_queued(prev) || prev->se.sched_delayed);
  
  		trace_sched_switch(preempt, prev, next, prev_state);
  
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab497fafa7be..cf02d202ab0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3661,18 +3661,40 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
  
  #endif /* CONFIG_NUMA_BALANCING */
  
+#ifdef CONFIG_SMP
+
+static void
+account_task_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	struct rq *rq = rq_of(cfs_rq);
+
+	account_numa_enqueue(rq, task_of(se));
+	list_add(&se->group_node, &rq->cfs_tasks);
+}
+
+static void
+account_task_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+	list_del_init(&se->group_node);
+}
+
+#else
+
+static void
+account_task_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
+static void
+account_task_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
+
+#endif
+
  static void
  account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
  	update_load_add(&cfs_rq->load, se->load.weight);
-#ifdef CONFIG_SMP
-	if (entity_is_task(se)) {
-		struct rq *rq = rq_of(cfs_rq);
+	if (entity_is_task(se) && !se->sched_delayed)
+		account_task_enqueue(cfs_rq, se);
  
-		account_numa_enqueue(rq, task_of(se));
-		list_add(&se->group_node, &rq->cfs_tasks);
-	}
-#endif
  	cfs_rq->nr_running++;
  	if (se_is_idle(se))
  		cfs_rq->idle_nr_running++;
@@ -3682,12 +3704,11 @@ static void
  account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
  	update_load_sub(&cfs_rq->load, se->load.weight);
-#ifdef CONFIG_SMP
-	if (entity_is_task(se)) {
-		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
-		list_del_init(&se->group_node);
-	}
-#endif
+
+	/* Delayed tasks are already dequeued the first time */
+	if (entity_is_task(se) && !se->sched_delayed)
+		account_task_dequeue(cfs_rq, se);
+
  	cfs_rq->nr_running--;
  	if (se_is_idle(se))
  		cfs_rq->idle_nr_running--;
@@ -6943,6 +6964,10 @@ requeue_delayed_entity(struct sched_entity *se)
  
  	update_load_avg(cfs_rq, se, 0);
  	se->sched_delayed = 0;
+
+	if (entity_is_task(se))
+		account_task_enqueue(cfs_rq, se);
+
  }
  
  /*
@@ -7190,10 +7215,18 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
   */
  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  {
-	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
+	struct sched_entity *se = &p->se;
+
+	if (!(se->sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
  		util_est_dequeue(&rq->cfs, p);
  
-	if (dequeue_entities(rq, &p->se, flags) < 0) {
+	if (dequeue_entities(rq, se, flags) < 0) {
+		/*
+		 * Remove delayed entity from rq->cfs_tasks list
+		 * to prevent load balancer from migrating it
+		 * away.
+		 */
+		account_task_dequeue(cfs_rq_of(se), se);
  		util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
  		return false;
  	}
--

The above changes were arrived at by experimenting. If there are no
obvious objections, I'll send a clean series after some more testing.
Any and all comments are highly appreciated.
-- 
Thanks and Regards,
Prateek