Message-ID: <20240827134706.GA719384@yaz-khff2.amd.com>
Date: Tue, 27 Aug 2024 09:47:06 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: Thomas Gleixner <tglx@...utronix.de>, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, tony.luck@...el.com, x86@...nel.org,
avadhut.naik@....com, john.allen@....com,
boris.ostrovsky@...cle.com
Subject: Re: [PATCH] x86/MCE: Prevent CPU offline for SMCA CPUs with non-core banks

On Tue, Aug 27, 2024 at 02:50:40PM +0200, Borislav Petkov wrote:
> On August 26, 2024 3:20:57 PM GMT+02:00, Yazen Ghannam <yazen.ghannam@....com> wrote:
> >On Sun, Aug 25, 2024 at 01:16:37PM +0200, Thomas Gleixner wrote:
> >> On Wed, Aug 21 2024 at 09:00, Yazen Ghannam wrote:
> >> > Logical CPUs in AMD Scalable MCA (SMCA) systems can manage non-core
> >> > banks. Each of these banks represents unique and separate hardware
> >> > located within the system. Each bank is managed by a single logical CPU;
> >> > they are not shared. Furthermore, the "CPU to MCA bank" assignment
> >> > cannot be modified at run time.
> >> >
> >> > The MCE subsystem supports run time CPU hotplug. Many vendors have
> >> > non-core MCA banks, so MCA settings are not cleared when a CPU is
> >> > offlined for these vendors.
> >> >
> >> > Even though the non-core MCA banks remain enabled, MCA errors will not
> >> > be handled (reported, cleared, etc.) on SMCA systems when the managing
> >> > CPU is offline.
> >> >
> >> > Check if a CPU manages non-core MCA banks and, if so, prevent it from
> >> > being taken offline.
> >>
> >> Which in turn breaks hibernation and kexec...
> >>
> >
> >Right, good point.
> >
> >Maybe this change can apply only to a user-initiated (sysfs) case?
> >
> >Thanks,
> >Yazen
> >
>
> Or, you can simply say that the MCE cannot be processed because the user took the managing CPU offline.
>
I found that we have the option of not populating the "cpuN/online"
file. This would prevent a user from offlining a CPU through sysfs, but
it shouldn't prevent the system from doing what it needs (e.g.
hibernation and kexec).

This is already done for CPU0, and for other cases, I think.

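For illustration only, here is a minimal userspace sketch of how the
effect would look from the user's side. It is not part of the proposed
patch. It assumes the usual sysfs layout, /sys/devices/system/cpu/cpuN/online,
and the helper names cpu_online_path() and cpu_user_offlinable() are
made up for this example. If the kernel does not populate the "online"
file for a CPU, the check simply reports that the user cannot offline
that CPU through sysfs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: build the sysfs path of the "online" control
 * file for a logical CPU. Returns 0 on success, -1 if the path does
 * not fit in the buffer.
 */
static int cpu_online_path(int cpu, char *buf, size_t len)
{
	int n;

	assert(buf != NULL && len > 0);

	n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);
	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/*
 * Hypothetical helper: return 1 if the "online" file exists for this
 * CPU (so a user could request offlining via sysfs), 0 otherwise.
 * An absent file is what "not populating cpuN/online" looks like
 * from userspace.
 */
static int cpu_user_offlinable(int cpu)
{
	char path[64];

	if (cpu_online_path(cpu, path, sizeof(path)))
		return 0;

	return access(path, F_OK) == 0;
}
```

A tool could call cpu_user_offlinable(N) before trying to write 0 to the
file and skip any CPU for which it returns 0.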
> What is this actually really fixing anyway?

There are times when a user wants to take CPUs offline due to software
licensing. This change would prevent the user from unintentionally
offlining CPUs that manage non-core banks, which would affect MCA
handling.

Thanks,
Yazen