Message-ID: <87o79cjjik.fsf@kernel.org>
Date: Sat, 11 May 2024 21:22:43 +0300
From: Kalle Valo <kvalo@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 "Rafael J. Wysocki" <rafael@...nel.org>
Cc: x86@...nel.org, linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
    regressions@...ts.linux.dev,
    Jeff Johnson <quic_jjohnson@...cinc.com>
Subject: [regression] suspend stress test stalls within 30 minutes

Hi,

I have a weird problem with suspend. Somewhere around v6.9-rc4 (not sure
exactly) I started seeing our ath11k Wi-Fi driver suspend tests fail
randomly. I have been investigating this for some time, and it now
looks like it's somehow related to the CPU_MITIGATIONS Kconfig option
and has nothing to do with wireless.

The simplified test case I have is to run suspend and resume in a loop
like this (Wi-Fi modules are not loaded):

for i in {1..400}; do echo "rtcwake test $i" > /dev/kmsg; rtcwake -m mem -s 10; sleep 10; done

If CPU_MITIGATIONS is enabled I usually see suspend stalling within 30
minutes. If I disable CPU_MITIGATIONS using menuconfig I don't see the bug.

When the bug happens, I see this in kernel.log and suspend stalls:

[  361.716546] PM: suspend entry (deep)
[  361.722558] Filesystems sync: 0.005 seconds
[  624.222721] kworker/dying (2519) used greatest stack depth: 22240 bytes left
[  633.897857] loop0: detected capacity change from 0 to 8

If I leave the system alone for several minutes, nothing more happens.
What is really strange is that as soon as I run 'sudo shutdown -h now',
suspend immediately unstalls and continues, like this:

[  847.631147] Freezing user space processes
[  847.649590] Freezing user space processes completed (elapsed 0.016 seconds)
[  847.650710] OOM killer disabled.
[  847.651799] Freezing remaining freezable tasks
[  847.654618] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  847.663757] printk: Suspending console(s) (use no_console_suspend to debug)
[  847.710060] e1000e: EEE TX LPI TIMER: 00000011
[  847.852370] ACPI: EC: interrupt blocked
[  847.899416] ACPI: PM: Preparing to enter system sleep state S3
[  847.933433] ACPI: EC: event blocked
[  847.933437] ACPI: EC: EC stopped
[  847.933441] ACPI: PM: Saving platform NVS memory
[  847.933817] Disabling non-boot CPUs ...
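
To put a number on the stall from a saved log, a minimal sketch is
below. The function name stall_gaps is hypothetical, and it assumes the
log keeps the printk timestamps in the "[ seconds.micros]" form shown
above. For each "PM: suspend entry" line it prints the delay, in
seconds, until the next "Freezing user space processes" line.

```shell
# stall_gaps [FILE...]: print the delay between each "PM: suspend entry"
# line and the following "Freezing user space processes" line.
# Reads stdin when no FILE is given. (Hypothetical helper name.)
stall_gaps() {
    awk '
        # Return the first "seconds.micros" number on the line, which is
        # the printk timestamp in the log format assumed here.
        function ts() {
            match($0, /[0-9]+\.[0-9]+/)
            return substr($0, RSTART, RLENGTH) + 0
        }
        /PM: suspend entry/ { start = ts(); pending = 1; next }
        pending && /Freezing user space processes/ {
            printf "%.3f\n", ts() - start
            pending = 0
        }
    ' "$@"
}
```

Fed the two excerpts above (361.716546 and 847.631147), this prints
485.915, so the stall in this run lasted about eight minutes.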

And now the system goes into suspend state as it should. If I press
the power button on the device, the system resumes and after that
shuts down (as expected, because I ran the shutdown command). This
behaviour is consistent; I see it every time the suspend bug happens.

The test setup is a several-year-old Intel NUC x86 system; more info
below.

Any recommendations on how I should debug this further? I tried to
bisect this earlier but that failed, most likely because I hadn't yet
realised that this is related to CPU_MITIGATIONS and might have messed
up the config settings during the bisect.
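
One way to guard a repeated bisect against config drift, as a sketch
only: check the generated .config at every step and tell "git bisect
run" to skip the commit (exit code 125) when the option is not enabled.
The function name bisect_config_ok is hypothetical. It also assumes the
option carries the same name at every commit in the bisect range, which
may not hold if the symbol was renamed along the way; the second
argument lets a different name be passed in that case.

```shell
# bisect_config_ok CONFIG [SYMBOL]: return 0 if CONFIG (a kernel .config)
# has CONFIG_<SYMBOL>=y, otherwise 125, which "git bisect run" treats as
# "cannot test this commit, skip it". SYMBOL defaults to CPU_MITIGATIONS.
# (Hypothetical helper name.)
bisect_config_ok() {
    sym=${2:-CPU_MITIGATIONS}
    if grep -q "^CONFIG_${sym}=y\$" "$1"; then
        return 0
    fi
    return 125
}
```

A bisect script could call this right after "make olddefconfig" and
exit with its return value before building, so that a commit with the
wrong config is skipped instead of being marked good by mistake.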

Kalle

DMI: Intel(R) Client Systems NUC8i7HVK/NUC8i7HVB, BIOS HNKBLi70.86A.0067.2021.0528.1339 05/28/2021

Ubuntu 20.04.6 LTS (GNU/Linux 6.9.0-rc7+ x86_64)

systemd 245.4-4ubuntu3.23 running in system mode. (+PAM +AUDIT +SELINUX
+IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS
+ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
default-hierarchy=hybrid)

I verified that I see this on the latest commit from Linus' tree:

cf87f46fd34d Merge tag 'drm-fixes-2024-05-11' of https://gitlab.freedesktop.org/drm/kernel

Here's the diff between broken and working .config:

$ diffconfig broken.config works.config 
-CALL_PADDING y
-CALL_THUNKS y
-CALL_THUNKS_DEBUG n
-HAVE_CALL_THUNKS y
-MITIGATION_CALL_DEPTH_TRACKING y
-MITIGATION_GDS_FORCE y
-MITIGATION_IBPB_ENTRY y
-MITIGATION_IBRS_ENTRY y
-MITIGATION_PAGE_TABLE_ISOLATION y
-MITIGATION_RETHUNK y
-MITIGATION_RETPOLINE y
-MITIGATION_RFDS y
-MITIGATION_SLS y
-MITIGATION_SPECTRE_BHI y
-MITIGATION_SRSO y
-MITIGATION_UNRET_ENTRY y
-PREFIX_SYMBOLS y
 CPU_MITIGATIONS y -> n
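
Since turning off CPU_MITIGATIONS drops a whole group of options at
once, a possible next step is to keep it enabled and disable the
individual MITIGATION_* options one at a time. A small sketch for
pulling that list out of the diffconfig output is below; the function
name removed_mitigations is hypothetical, and it assumes the "-NAME y"
line format shown above.

```shell
# removed_mitigations: read diffconfig output on stdin and print the
# names of the MITIGATION_* options that were removed with value "y".
# (Hypothetical helper name.)
removed_mitigations() {
    awk '$1 ~ /^-MITIGATION_/ && $2 == "y" { print substr($1, 2) }'
}
```

It would be used as "diffconfig broken.config works.config |
removed_mitigations". Each name it prints could then be turned off in
the broken config with the in-tree scripts/config tool (--disable),
followed by "make olddefconfig" to resolve dependencies.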
