lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANDhNCrShc0ak2BZN91fR9uj0MM6R20oskoDD-f7++SgPao6vA@mail.gmail.com>
Date: Mon, 17 Jun 2024 21:45:30 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Qais Yousef <qyousef@...alina.io>, LKML <linux-kernel@...r.kernel.org>, 
	Thomas Gleixner <tglx@...utronix.de>, Frederic Weisbecker <frederic@...nel.org>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider <vschneid@...hat.com>, 
	Joel Fernandes <joel@...lfernandes.org>, kernel-team@...roid.com
Subject: Re: [PATCH] RFC: sched: Rework task_sched_runtime to avoid calling update_rq_clock

On Mon, Jun 17, 2024 at 5:42 PM John Stultz <jstultz@...gle.com> wrote:
>
> Also, separately, while I didn't see this earlier testing on physical
> devices, when running virtualized, I can pretty easily trip over the
> following:
>
> [   65.207340] watchdog: BUG: soft lockup - CPU#0 stuck for 26s!
> [kworker/0:3:374]
> [   65.211107] irq event stamp: 118664
> [   65.212786] hardirqs last  enabled at (118663):
> [<ffffffff97a00e46>] asm_sysvec_apic_timer_interrupt+0x16/0x20
> [   65.218440] hardirqs last disabled at (118664):
> [<ffffffff977fdeca>] sysvec_apic_timer_interrupt+0xa/0xc0
> [   65.223074] softirqs last  enabled at (118546):
> [<ffffffff9676db78>] __irq_exit_rcu+0x88/0xe0
> [   65.227118] softirqs last disabled at (118541):
> [<ffffffff9676db78>] __irq_exit_rcu+0x88/0xe0
> [   65.231137] CPU: 0 PID: 374 Comm: kworker/0:3 Not tainted
> 6.10.0-rc4-dirty #4393
> [   65.234625] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   65.239089] Workqueue: events psi_avgs_work
> [   65.241122] RIP: 0010:collect_percpu_times+0xff/0x310
...
> [   65.288730] Call Trace:
> [   65.289958]  <IRQ>
> [   65.290965]  ? watchdog_timer_fn+0x275/0x310
> [   65.293185]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [   65.295379]  ? __hrtimer_run_queues+0x190/0x3b0
> [   65.297795]  ? hrtimer_interrupt+0xf9/0x230
> [   65.299782]  ? __sysvec_apic_timer_interrupt+0x82/0x210
> [   65.302243]  ? sysvec_apic_timer_interrupt+0x98/0xc0
> [   65.304590]  </IRQ>
> [   65.305658]  <TASK>
> [   65.306708]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [   65.309206]  ? psi_avgs_work+0x3c/0xb0
> [   65.311001]  ? collect_percpu_times+0xff/0x310
> [   65.313153]  psi_avgs_work+0x3c/0xb0
> [   65.314864]  process_one_work+0x1fe/0x700
> [   65.316782]  ? lock_is_held_type+0xcd/0x120
> [   65.318782]  worker_thread+0x1c7/0x3b0
> [   65.320571]  ? __pfx_worker_thread+0x10/0x10
> [   65.322626]  kthread+0xe0/0x110
> [   65.324103]  ? __pfx_kthread+0x10/0x10
> [   65.325853]  ret_from_fork+0x28/0x40
> [   65.327512]  ? __pfx_kthread+0x10/0x10
> [   65.329255]  ret_from_fork_asm+0x1a/0x30
> [   65.331073]  </TASK>
> [   65.332119] Kernel panic - not syncing: softlockup: hung tasks

As I've been digging on this some more, I'm still a bit baffled.  The
kworker seems to get stuck once the irq lands and we don't seem to go
back to the workqueue work that was in progress. That cpu ends up not
doing anything but occasionally handling ticks until we crash from the
softlockup watchdog.

Interestingly, if I move the psi_account_irqtime() call to be under
the rq_lock in sched_tick(), as I suggested earlier, I have yet to
reproduce the problem.
But I'm still trying to piece together exactly why.

thanks
-john

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ