Message-ID: <20240816140853.GB29375@yaz-khff2.amd.com>
Date: Fri, 16 Aug 2024 10:08:53 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
tony.luck@...el.com, x86@...nel.org, avadhut.naik@....com,
john.allen@....com
Subject: Re: [PATCH 7/9] x86/mce: Unify AMD DFR handler with MCA Polling
On Tue, Jun 04, 2024 at 01:05:28PM +0200, Borislav Petkov wrote:
> On Thu, May 23, 2024 at 10:56:39AM -0500, Yazen Ghannam wrote:
> > +static bool smca_log_poll_error(struct mce *m, u32 *status_reg)
>
> That handing of *status_reg back'n'forth just to clear it in the end is
> not nice. Let's get rid of it:
>
> ---
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 0a9cff329487..a0ba82fe6de3 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -669,7 +669,7 @@ static void reset_thr_limit(unsigned int bank)
>
> DEFINE_PER_CPU(unsigned, mce_poll_count);
>
> -static bool smca_log_poll_error(struct mce *m, u32 *status_reg)
> +static bool smca_log_poll_error(struct mce *m, u32 status_reg)
> {
> /*
> * If this is a deferred error found in MCA_STATUS, then clear
> @@ -686,8 +686,8 @@ static bool smca_log_poll_error(struct mce *m, u32 *status_reg)
> * If the MCA_DESTAT register has valid data, then use
> * it as the status register.
> */
> - *status_reg = MSR_AMD64_SMCA_MCx_DESTAT(m->bank);
> - m->status = mce_rdmsrl(*status_reg);
> + status_reg = MSR_AMD64_SMCA_MCx_DESTAT(m->bank);
> + m->status = mce_rdmsrl(status_reg);
>
> if (!(m->status & MCI_STATUS_VAL))
> return false;
> @@ -695,6 +695,8 @@ static bool smca_log_poll_error(struct mce *m, u32 *status_reg)
> if (m->status & MCI_STATUS_ADDRV)
> m->addr = mce_rdmsrl(MSR_AMD64_SMCA_MCx_DEADDR(m->bank));
>
> + mce_wrmsrl(status_reg, 0);
> +
I had to think on this for a while. The reason to clear the status
register at the very end is to make sure another error doesn't come in
and overwrite all the "aux" registers before we grab them.

***BUT*** the reason we are going down this path is because another
(higher priority) error *did* overwrite everything. And we're trying to
gather any leftover data. So all the "aux" registers are already
out-of-sync.

I don't think we can solve this in software. We'd need all the state
registers to be duplicated in hardware. We have status and address which
seem to be enough.
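
To make the ordering argument above concrete, here is a minimal user-space
sketch, not kernel code. Everything prefixed with sim_ is hypothetical and
exists only for illustration. It assumes a simplified hardware model in
which a new deferred error is latched only while the valid bit is clear;
real hardware overwrite rules are more involved. The two bit positions
mirror MCI_STATUS_VAL and MCI_STATUS_ADDRV.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_VAL   (1ULL << 63)   /* mirrors MCI_STATUS_VAL */
#define STATUS_ADDRV (1ULL << 58)   /* mirrors MCI_STATUS_ADDRV */

/* Hypothetical stand-ins for MCA_DESTAT and MCA_DEADDR of one bank. */
struct sim_bank {
	uint64_t destat;
	uint64_t deaddr;
};

/* What the poller hands to the logging code. */
struct sim_record {
	uint64_t status;
	uint64_t addr;
};

/*
 * Assumed hardware model: a new error is latched only when the valid
 * bit is clear. Otherwise the registers keep the earlier error.
 */
static bool sim_latch(struct sim_bank *b, uint64_t addr)
{
	if (b->destat & STATUS_VAL)
		return false;
	b->destat = STATUS_VAL | STATUS_ADDRV;
	b->deaddr = addr;
	return true;
}

/* A second error arriving while the poller is still reading. */
static void sim_inject(struct sim_bank *b)
{
	sim_latch(b, 0x2000);
}

/*
 * Read status, then the address, and clear status either before or
 * after the address read. 'mid' runs between the two reads to model
 * an error arriving in that window.
 */
static bool sim_poll(struct sim_bank *b, struct sim_record *r,
		     bool clear_first, void (*mid)(struct sim_bank *))
{
	assert(b && r);

	r->status = b->destat;
	if (!(r->status & STATUS_VAL))
		return false;

	if (clear_first)
		b->destat = 0;

	if (mid)
		mid(b);

	if (r->status & STATUS_ADDRV)
		r->addr = b->deaddr;

	/* Clearing last keeps the slot occupied during the reads. */
	if (!clear_first)
		b->destat = 0;

	return true;
}
```

With clear_first false the injected error is refused and the record keeps
the address that belongs to the status it read. With clear_first true the
record pairs the first error's status with the second error's address,
which is the mismatch that clearing last is meant to avoid.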
I'll see if this can be simplified even further.

Thanks,
Yazen