linux-kernel - x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE

Message-ID: <CAK8P3a1mkHEjRJgJPsRy+kuN=48=JEDJAeR2z9n+O71qbJ8hSA@mail.gmail.com>
Date:   Thu, 2 Jun 2022 11:19:59 +0200
From:   Arnd Bergmann <arnd@...nel.org>
To:     Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Len Brown <len.brown@...el.com>,
        Ricardo Neri <ricardo.neri-calderon@...ux.intel.com>
Cc:     "Rafael J. Wysocki" <rafael@...nel.org>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Amit Kucheria <amitk@...nel.org>,
        Zhang Rui <rui.zhang@...el.com>, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK?

I have a Xeon W-2265 (family 6, model 85, stepping 7) that started
constantly spewing messages from the therm_throt driver after one
core overheated:

May 31 13:57:54 kernel: [15512.209474] unchecked MSR access error: WRMSR to 0x19c (tried to write 0x0000000000002a80) at rIP: 0xffffffff9f67f974 (native_write_msr+0x4/0x20)
May 31 13:57:54 kernel: [15512.209486] Call Trace:
May 31 13:57:54 kernel: [15512.209488]  <TASK>
May 31 13:57:54 kernel: [15512.209489]  ? throttle_active_work+0xea/0x1f0
May 31 13:57:54 kernel: [15512.209498]  process_one_work+0x21d/0x3c0
May 31 13:57:54 kernel: [15512.209502]  worker_thread+0x4d/0x3f0
May 31 13:57:54 kernel: [15512.209505]  ? process_one_work+0x3c0/0x3c0
May 31 13:57:54 kernel: [15512.209508]  kthread+0x127/0x150
May 31 13:57:54 kernel: [15512.209510]  ? set_kthread_struct+0x40/0x40
May 31 13:57:54 kernel: [15512.209513]  ret_from_fork+0x1f/0x30
...
May 31 13:57:59 kernel: [15517.333445] CPU11: Core temperature is above threshold, cpu clock is throttled (total events = 3)

I could not find CPU model specific documentation for this register,
but I see that in [1], the bits 13 through 15 are marked as reserved
in some cases but not others. Manually writing the value 0xa80
instead of 0x2a80 from user space makes the warnings stop, so
my guess is that this CPU does not support the 0x2000 bit:

$ sudo  wrmsr -p 11 0x19c 0xa80 ; dmesg
[177764.874555] msr: Write to unrecognized MSR 0x19c by wrmsr (pid: 142969).
[177764.874560] msr: See https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about for details.
[177765.371180] CPU11: Core temperature/speed normal (total events = 42)
[177765.371180] CPU23: Core temperature/speed normal (total events = 42)

I have not tried the patch below, but I think this would address it on my
system, while likely breaking other machines. Any ideas what the
correct fix is?

      Arnd

diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index 8352083b87c7..620d7f4c013e 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -196,7 +196,7 @@ static const struct attribute_group thermal_attr_group = {
 #define THERM_THROT_POLL_INTERVAL      HZ
 #define THERM_STATUS_PROCHOT_LOG       BIT(1)

-#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) | BIT(7) | BIT(9) | BIT(11) | BIT(13) | BIT(15))
+#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) | BIT(7) | BIT(9) | BIT(11))
 #define THERM_STATUS_CLEAR_PKG_MASK  (BIT(1) | BIT(3) | BIT(5) | BIT(7) | BIT(9) | BIT(11))

 static void clear_therm_status_log(int level)

[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf