linux-kernel - Re: [regression] suspend stress test stalls within 30 minutes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAJZ5v0iHoU7dHxzL5ryehZT7unZiapoiBzPo1d=wsffNGqcS7Q@mail.gmail.com>
Date: Fri, 17 May 2024 21:00:33 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Kalle Valo <kvalo@...nel.org>
Cc: Dave Hansen <dave.hansen@...el.com>, Borislav Petkov <bp@...en8.de>, 
	Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>, Thomas Gleixner <tglx@...utronix.de>, 
	Ingo Molnar <mingo@...hat.com>, Dave Hansen <dave.hansen@...ux.intel.com>, 
	"Rafael J. Wysocki" <rafael@...nel.org>, x86@...nel.org, linux-pm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, regressions@...ts.linux.dev, 
	Jeff Johnson <quic_jjohnson@...cinc.com>
Subject: Re: [regression] suspend stress test stalls within 30 minutes

On Fri, May 17, 2024 at 8:37 PM Kalle Valo <kvalo@...nel.org> wrote:
>
> Dave Hansen <dave.hansen@...el.com> writes:
>
> > On 5/17/24 10:15, Kalle Valo wrote:
> >> Borislav Petkov <bp@...en8.de> writes:
> >>> There might be some #GP or so in the logs in case we've managed to f*ck
> >>> up microcode application which emulates that IBRS MSR bit and the
> >>> actual toggling or so when suspending...
> >> So the weird part is that when the bug happens (ie. suspend stalls) I
> >> can access the box normally using ssh and I don't see anything special
> >> in dmesg. Below is a full copy of dmesg output after the suspend
> >> stalled. Do note that I copied this dmesg before I updated microcode so
> >> it will still show the old microcode version.
> >>
> >> Let me know if you need more info.
> >
> > Kalle, could you remind us what we're seeing here?  Does this show 30
> > working rtcwake tests followed by a failure at "rtcwake test 31" where
> > the system failed to suspend?
>
> Correct. So basically what I do is that I start the nuc box, ssh into it
> and run:
>
> sudo su
> for i in {1..400}; do echo "rtcwake test $i" > /dev/kmsg; rtcwake -m mem -s 10; sleep 10; done
>
> Here's the start of first loop:
>
> [   54.945105] rtcwake test 1
> [   55.162603] PM: suspend entry (deep)
> [   55.168875] Filesystems sync: 0.006 seconds
> [   55.182427] Freezing user space processes
> [   55.191498] Freezing user space processes completed (elapsed 0.008 seconds)
> [   55.191711] OOM killer disabled.
> [   55.191805] Freezing remaining freezable tasks
> [   55.193507] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [   55.194056] printk: Suspending console(s) (use no_console_suspend to debug)
> [   55.244962] e1000e: EEE TX LPI TIMER: 00000011
>
> Now I leave the box to run it's test. I come back later to see that the
> for loop has stalled and the box is not going into suspend gain. I ssh
> into the machine and see this in dmesg:
>
> [  449.061525] rtcwake test 31
> [  449.176854] PM: suspend entry (deep)
> [  449.179072] Filesystems sync: 0.002 seconds

This means that ksys_sync_helper() has run, so it blocks somewhere in
enter_state() around suspend_prepare().

Can please echo 1 (as root) to /sys/power/pm_debug_messages and retest?

This should allow us to see more in the log.