[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Yuo3dioqb9mDAOcT@gmail.com>
Date: Wed, 3 Aug 2022 10:53:10 +0200
From: Ingo Molnar <mingo@...nel.org>
To: Jane Chu <jane.chu@...cle.com>
Cc: tony.luck@...el.com, bp@...en8.de, tglx@...utronix.de,
mingo@...hat.com, dave.hansen@...ux.intel.com, x86@...nel.org,
linux-edac@...r.kernel.org, dan.j.williams@...el.com,
linux-kernel@...r.kernel.org, hch@....de, nvdimm@...ts.linux.dev
Subject: Re: [PATCH v7] x86/mce: retrieve poison range from hardware
* Jane Chu <jane.chu@...cle.com> wrote:
> With Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine
s/Commit/commit
> poison granularity") that changed nfit_handle_mce() callback to report
> badrange according to 1ULL << MCI_MISC_ADDR_LSB(mce->misc), it's been
> discovered that the mce->misc LSB field is 0x1000 bytes, hence injecting
> 2 back-to-back poisons and the driver ends up logging 8 badblocks,
> because 0x1000 bytes is 8 512-byte.
>
> Dan Williams noticed that apei_mce_report_mem_error() hardcode
> the LSB field to PAGE_SHIFT instead of consulting the input
> struct cper_sec_mem_err record. So change to rely on hardware whenever
> support is available.
>
> Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8502@oracle.com
>
> Reviewed-by: Dan Williams <dan.j.williams@...el.com>
> Reviewed-by: Ingo Molnar <mingo@...nel.org>
> Signed-off-by: Jane Chu <jane.chu@...cle.com>
> ---
> arch/x86/kernel/cpu/mce/apei.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index 717192915f28..8ed341714686 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -29,15 +29,26 @@
> void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
> {
> struct mce m;
> + int lsb;
>
> if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
> return;
>
> + /*
> + * Even if the ->validation_bits are set for address mask,
> + * to be extra safe, check and reject an error radius '0',
> + * and fall back to the default page size.
> + */
> + if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> + lsb = find_first_bit((void *)&mem_err->physical_addr_mask, PAGE_SHIFT);
> + else
> + lsb = PAGE_SHIFT;
> +
> mce_setup(&m);
> m.bank = -1;
> /* Fake a memory read error with unknown channel */
> m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
> - m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
> + m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
LGTM.
I suppose this wants to go upstream via the tree the bug came from (NVDIMM
tree? ACPI tree?), or should we pick it up into the x86 tree?
Thanks,
Ingo
Powered by blists - more mailing lists