Message-ID: <21b7d5a3de39e9eee4ccda48ad0c66d31b1fe7d1.camel@linux.intel.com>
Date:   Thu, 02 Jun 2022 13:10:27 -0700
From:   srinivas pandruvada <srinivas.pandruvada@...ux.intel.com>
To:     Arnd Bergmann <arnd@...nel.org>
Cc:     Len Brown <len.brown@...el.com>,
        Ricardo Neri <ricardo.neri-calderon@...ux.intel.com>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Amit Kucheria <amitk@...nel.org>,
        Zhang Rui <rui.zhang@...el.com>, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK?

On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote:
> On Thu, Jun 2, 2022 at 6:25 PM srinivas pandruvada
> <srinivas.pandruvada@...ux.intel.com> wrote:
> > On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote:
> > > On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada
> > > <srinivas.pandruvada@...ux.intel.com> wrote:
> > > > 
> > > > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote:
> > > > > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that started
> > > > > > constantly spewing messages from the therm_throt driver after one
> > > > > > core overheated:
> > > > > 
> > > > > I think this is a Cascade Lake system. Have you tried the latest
> > > > > microcode?
> > > 
> > > > Thanks for your quick reply. I have installed the latest microcode
> > > > 0x5003302 now (manually, because the version provided by the distro
> > > > was still using version 0x5003102).
> > > 
> > > After that, I tried writing the value 0x2a80 from userspace, and
> > > that did not cause a trap, so I assume that fixed it.
> > > 
> > Thanks for reporting.
> > I am aware of this issue; it should be fixed by a microcode update.
> 
> I wonder how common this problem is. Would it help to add a driver
> workaround like this?
This issue affects only certain SKUs; the others already work as
expected. These log bits are important for debugging, so we don't want to
clear them in this path. Printing a warning for the affected CLX stepping
is fine, but without clearing the unrelated bits 13 and 15.
A read-modify-write that updates only the bits of interest should always
work. Writing 1s to this register should be a NOP.
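
To illustrate the read-modify-write point, here is a minimal user-space
sketch in C. It is not the driver code: it simulates a status register
whose odd-numbered bits 1..15 are sticky log bits that are cleared by
writing 0 and left alone by writing 1. The names sim_msr, sim_rdmsr,
sim_wrmsr, LOG_MASK, LOG_PROCHOT and clear_log_bit are hypothetical, and
the bit layout is an assumption modeled on the mask discussed in this
thread.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1ULL << (n))

/* Assumed layout: odd bits 1..15 are sticky log bits, others are status. */
#define LOG_MASK (BIT(1) | BIT(3) | BIT(5) | BIT(7) | \
                  BIT(9) | BIT(11) | BIT(13) | BIT(15))
#define LOG_PROCHOT BIT(1) /* hypothetical name for the log bit to clear */

/* Simulated register contents; a stand-in for the real MSR. */
static uint64_t sim_msr;

static uint64_t sim_rdmsr(void)
{
	return sim_msr;
}

/*
 * Simulated write: status bits are unaffected, and a log bit stays set
 * only if it was already set and the written value has a 1 there.
 * Writing 1 to a log bit is therefore a no-op; writing 0 clears it.
 */
static void sim_wrmsr(uint64_t val)
{
	sim_msr = (sim_msr & ~LOG_MASK) | (sim_msr & val & LOG_MASK);
}

/* Read-modify-write: clear one log bit, write back the others as read. */
static void clear_log_bit(uint64_t bit)
{
	uint64_t val = sim_rdmsr();

	val &= LOG_MASK;       /* keep only the log bits */
	sim_wrmsr(val & ~bit); /* 0 for the target bit, others unchanged */
}
```

With bits 0, 1, 13 and 15 set beforehand, clear_log_bit(LOG_PROCHOT)
leaves bits 0, 13 and 15 set in this simulation, so the unrelated log
bits survive for later debugging.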

Thanks,
Srinivas

> 
> diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> index 8352083b87c7..acb402e56796 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level)
> 
>         rdmsrl(msr, msr_val);
>         msr_val &= mask;
> -       wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);
> +       if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) {
> +               /* work around Cascade Lake SKZ57 erratum */
> +               printk_once(KERN_WARNING "Failed to update IA32_THERM_STATUS, "
> +                                       "please upgrade microcode\n");
> +               wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG &
> +                       ~BIT(13) & ~BIT(15));
> +       }
>  }
> 
>  static void get_therm_status(int level, bool *proc_hot, u8 *temp)
> 
>         Arnd
