linux-kernel - Re: x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE


Message-ID: <21b7d5a3de39e9eee4ccda48ad0c66d31b1fe7d1.camel@linux.intel.com>
Date:   Thu, 02 Jun 2022 13:10:27 -0700
From:   srinivas pandruvada <srinivas.pandruvada@...ux.intel.com>
To:     Arnd Bergmann <arnd@...nel.org>
Cc:     Len Brown <len.brown@...el.com>,
        Ricardo Neri <ricardo.neri-calderon@...ux.intel.com>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Amit Kucheria <amitk@...nel.org>,
        Zhang Rui <rui.zhang@...el.com>, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK?

On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote:
> On Thu, Jun 2, 2022 at 6:25 PM srinivas pandruvada
> <srinivas.pandruvada@...ux.intel.com> wrote:
> > On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote:
> > > On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada
> > > <srinivas.pandruvada@...ux.intel.com> wrote:
> > > > 
> > > > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote:
> > > > > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that started
> > > > > > constantly spewing messages from the therm_throt driver after one
> > > > > > core overheated:
> > > > > 
> > > > > I think this is a Cascade Lake system. Have you tried the latest
> > > > > microcode?
> > > 
> > > > Thanks for your quick reply. I have installed the latest microcode
> > > > 0x5003302 now (manually, because the version provided by the distro
> > > > was still using version 0x5003102).
> > > 
> > > After that, I tried writing the value 0x2a80 from userspace, and
> > > that did not cause a trap, so I assume that fixed it.
> > > 
> > Thanks for reporting.
> > I am aware of this issue; it should be fixed by a microcode update.
> 
> I wonder how common this problem is. Would it help to add a driver
> workaround like this?
This issue affects only certain SKUs. The others already work as
expected. These are important log bits for debugging, so we don't want
to clear them in this path. Printing a warning for the affected CLX
stepping is fine, but without clearing the unrelated bits 13 and 15.
A read-modify-write that updates only the bits of interest should
always work. Writing 1s to this register should be a NOP.

Thanks,
Srinivas

> 
> diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> index 8352083b87c7..acb402e56796 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level)
> 
>         rdmsrl(msr, msr_val);
>         msr_val &= mask;
> -       wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> +       if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) {
> +               /* work around Cascade Lake SKZ57 erratum */
> +               printk_once(KERN_WARNING "Failed to update IA32_THERM_STATUS, "
> +                                       "please upgrade microcode\n");
> +               wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
> +                       ~BIT(13) & ~BIT(15));
> +       }
>  }
> 
>  static void get_therm_status(int level, bool *proc_hot, u8 *temp)
> 
>         Arnd
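
The read-modify-write approach described in the reply can be illustrated with a small userspace model. This is only a sketch, not the driver code: `fake_therm_status`, `fake_rdmsr()`, `fake_wrmsr()`, `clear_log_bit()` and `LOG_BITS_MASK` are hypothetical names, and the model assumes that the odd-numbered bits 1 to 15 are sticky log bits which are cleared by writing 0 and left unchanged by writing 1, as the reply suggests.

```c
#include <stdint.h>

/* Hypothetical stand-in for IA32_THERM_STATUS; a real driver would use
 * rdmsrl()/wrmsrl() on the actual MSR instead of this variable. */
static uint64_t fake_therm_status;

/* Assumption for this model: bits 1, 3, ..., 15 are sticky log bits. */
#define LOG_BITS_MASK 0xAAAAu
#define PROCHOT_LOG   (1u << 1)

static uint64_t fake_rdmsr(void)
{
	return fake_therm_status;
}

/* Model of the write semantics: writing 0 to a log bit clears it,
 * writing 1 is a no-op, and all other bits ignore writes. */
static void fake_wrmsr(uint64_t val)
{
	fake_therm_status &= (val | ~(uint64_t)LOG_BITS_MASK);
}

/* Read-modify-write: only the bit of interest is written as 0. Every
 * other bit is written back as read, so set log bits stay set. */
static uint64_t clear_log_bit(uint64_t bit)
{
	uint64_t val = fake_rdmsr();

	fake_wrmsr(val & ~bit);
	return fake_rdmsr();
}
```

In this model, calling `clear_log_bit(PROCHOT_LOG)` with bits 1, 13 and 15 set leaves bits 13 and 15 untouched, which is the behavior the reply asks for. It does not model the erratum itself, where the write traps on affected parts.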