linux-kernel - Re: [PATCH] RFC: sched: Rework task_sched_runtime to avoid calling update_rq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANDhNCqcbCJNSyrKG5b7vyjmuHUm0kAJmDecqHF-QRZ_EHq=Zw@mail.gmail.com>
Date: Mon, 17 Jun 2024 17:42:25 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Qais Yousef <qyousef@...alina.io>, LKML <linux-kernel@...r.kernel.org>, 
	Thomas Gleixner <tglx@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider <vschneid@...hat.com>, 
	Joel Fernandes <joel@...lfernandes.org>, kernel-team@...roid.com
Subject: Re: [PATCH] RFC: sched: Rework task_sched_runtime to avoid calling update_rq_clock

On Fri, Jun 14, 2024 at 2:48 AM Peter Zijlstra <peterz@...radead.org> wrote:
> Which then gets me something like the (completely untested) below..
>
> Hmm?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..36aed99d6a6c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -724,7 +724,6 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>
>         rq->prev_irq_time += irq_delta;
>         delta -= irq_delta;
> -       psi_account_irqtime(rq->curr, irq_delta);
>         delayacct_irq(rq->curr, irq_delta);
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> @@ -5459,6 +5458,8 @@ void sched_tick(void)
>
>         sched_clock_tick();
>
> +       psi_account_irqtime(curr, NULL, &rq->psi_irq_time);
> +
>         rq_lock(rq, &rf);
>
>         update_rq_clock(rq);
> @@ -6521,6 +6524,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>                 ++*switch_count;
>
>                 migrate_disable_switch(rq, prev);
> +               psi_account_irqtime(prev, next, &rq->psi_irq_time);

FYI: These references to psi_irq_time hit build issues if
CONFIG_IRQ_TIME_ACCOUNTING is disabled.

Also, separately, while I didn't see this earlier testing on physical
devices, when running virtualized, I can pretty easily trip over the
following:

[   65.207340] watchdog: BUG: soft lockup - CPU#0 stuck for 26s!
[kworker/0:3:374]
[   65.211107] irq event stamp: 118664
[   65.212786] hardirqs last  enabled at (118663):
[<ffffffff97a00e46>] asm_sysvec_apic_timer_interrupt+0x16/0x20
[   65.218440] hardirqs last disabled at (118664):
[<ffffffff977fdeca>] sysvec_apic_timer_interrupt+0xa/0xc0
[   65.223074] softirqs last  enabled at (118546):
[<ffffffff9676db78>] __irq_exit_rcu+0x88/0xe0
[   65.227118] softirqs last disabled at (118541):
[<ffffffff9676db78>] __irq_exit_rcu+0x88/0xe0
[   65.231137] CPU: 0 PID: 374 Comm: kworker/0:3 Not tainted
6.10.0-rc4-dirty #4393
[   65.234625] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   65.239089] Workqueue: events psi_avgs_work
[   65.241122] RIP: 0010:collect_percpu_times+0xff/0x310
[   65.243525] Code: b9 02 00 00 00 48 89 df e8 8e a4 01 00 48 8b b4
24 d0 00 00 00 48 89 df e8 5e 9e 01 00 58 45 8b 34 24 41 f6 c6 01 74
0c f3 0
[   65.252926] RSP: 0018:ffff958501263d50 EFLAGS: 00000202
[   65.255433] RAX: 0000000000017b61 RBX: ffff9585b901d848 RCX: 0000000000000006
[   65.258755] RDX: ffffffff967eb6ac RSI: ffffffff9819e305 RDI: ffffffff98177748
[   65.262113] RBP: ffff958501263db0 R08: 0000000000000001 R09: 0000000000000000
[   65.265475] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9585b901d840
[   65.268785] R13: ffff9585b901d884 R14: 0000000000033d8b R15: 0000000000000000
[   65.272146] FS:  0000000000000000(0000) GS:ffff9585b9000000(0000)
knlGS:0000000000000000
[   65.275908] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   65.278629] CR2: 00005631aee8b000 CR3: 0000000116c0e001 CR4: 0000000000370ef0
[   65.282002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   65.285386] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   65.288730] Call Trace:
[   65.289958]  <IRQ>
[   65.290965]  ? watchdog_timer_fn+0x275/0x310
[   65.293185]  ? __pfx_watchdog_timer_fn+0x10/0x10
[   65.295379]  ? __hrtimer_run_queues+0x190/0x3b0
[   65.297795]  ? hrtimer_interrupt+0xf9/0x230
[   65.299782]  ? __sysvec_apic_timer_interrupt+0x82/0x210
[   65.302243]  ? sysvec_apic_timer_interrupt+0x98/0xc0
[   65.304590]  </IRQ>
[   65.305658]  <TASK>
[   65.306708]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[   65.309206]  ? psi_avgs_work+0x3c/0xb0
[   65.311001]  ? collect_percpu_times+0xff/0x310
[   65.313153]  psi_avgs_work+0x3c/0xb0
[   65.314864]  process_one_work+0x1fe/0x700
[   65.316782]  ? lock_is_held_type+0xcd/0x120
[   65.318782]  worker_thread+0x1c7/0x3b0
[   65.320571]  ? __pfx_worker_thread+0x10/0x10
[   65.322626]  kthread+0xe0/0x110
[   65.324103]  ? __pfx_kthread+0x10/0x10
[   65.325853]  ret_from_fork+0x28/0x40
[   65.327512]  ? __pfx_kthread+0x10/0x10
[   65.329255]  ret_from_fork_asm+0x1a/0x30
[   65.331073]  </TASK>
[   65.332119] Kernel panic - not syncing: softlockup: hung tasks

Where collect_percpu_times+0xff/0x310:
__seqprop_sequence at include/linux/seqlock.h:211 (discriminator 2)
(inlined by) get_recent_times at kernel/sched/psi.c:261 (discriminator 2)
(inlined by) collect_percpu_times at kernel/sched/psi.c:359 (discriminator 2)

Which looks like its getting stuck in the seqlock loop, and the only
way I can see that catching right off, is if we're in some sort of
livelock where the calls to psi_account_irqtime(curr, NULL,
&rq->psi_irq_time) is coming in frequently enough to change the seqno
each iteration through the reader.  But from my initial trace_printk
debugging, it seems like from a kworker we enter the loop in
get_recent_times(), hit an irq and somehow never really come back out
of irq context. Though we continue to get ticks on the task and
continue to call psi_account_irqtime().  I was worried we were somehow
getting stuck in the 'while ((group = group->parent));' loop in
psi_account_irqtime(), but that doesn't seem to be the case.

[  238.297094] kworker/-798       0..... 200647713us :
collect_percpu_times: JDB: get_recent_times at top of loop 0!
[  238.301705] kworker/-798       0..... 200647767us :
collect_percpu_times: JDB: get_recent_times done with read (looped:
1)!
[  238.306689] kworker/-798       0..... 200647768us :
collect_percpu_times: JDB: get_recent_times at top of loop 0!
[  238.311313] kworker/-798       0..... 200647769us :
collect_percpu_times: JDB: get_recent_times done with read (looped:
1)!
[  238.316318] kworker/-798       0d..2. 200647786us :
psi_group_change: JDB: psi_group_change seqwrite
<normal behavior above>
[  238.320460] kworker/-10        0..... 200647790us :
collect_percpu_times: JDB: get_recent_times at top of loop 0!
[  238.325131] kworker/-10        0d.h.. 200648408us :
psi_account_irqtime: JDB: psi_account_irqtime seqwrite (loop count: 0)
[  238.330149] kworker/-10        0d.h.. 200649406us :
psi_account_irqtime: JDB: psi_account_irqtime seqwrite (loop count: 0)
[  238.335140] kworker/-10        0d.h.. 200650405us :
psi_account_irqtime: JDB: psi_account_irqtime seqwrite (loop count: 0)
...
With the psi_account_irqtime just repeating there each ms/tick.

I'm still digging a bit here to understand what's going on. But I
wanted to share.

thanks
-john