Message-ID: <62d1bd546fce2_242d29499@dwillia2-xfh.jf.intel.com.notmuch>
Date: Fri, 15 Jul 2022 12:17:40 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Jane Chu <jane.chu@...cle.com>,
Dan Williams <dan.j.williams@...el.com>,
"hch@...radead.org" <hch@...radead.org>,
"vishal.l.verma@...el.com" <vishal.l.verma@...el.com>,
"dave.jiang@...el.com" <dave.jiang@...el.com>,
"ira.weiny@...el.com" <ira.weiny@...el.com>,
"nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] acpi/nfit: badrange report spill over to clean range

[ add Tony ]

Jane Chu wrote:
> On 7/14/2022 6:19 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> >> one call for each poison with accurate address.
> >>
> >> Also, short ARS would find 2 poisons.
> >>
> >> I attached the console output, my annotation is prefixed with "<==".
> >
> > [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600 <== 2nd poison @ 0x600
> > [29078.642200] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffff00
> >
> > Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
> > is seeing a 256-byte address mask?
>
> Good question! One would think both GHES reporting and
> nfit_handle_mce() are consuming the same mce record...
> Who might know?

Did some grepping...

Have a look at: apei_mce_report_mem_error()

"The call is coming from inside the house!"

Luckily we do not need to contact a BIOS engineer to get this fixed.

> > Sigh, is this "firmware-first" causing the kernel to get bad information
> > via the native mechanisms?
> >
> > I would expect that if this test was truly worried about minimizing BIOS
> > latency it would disable firmware-first error reporting. I wonder if
> > that fixes the observed problem?
>
> Could you elaborate on firmware-first error please? What are the
> possible consequences disabling it? and how to disable it?

With my Linux kernel developer hat on, firmware-first error handling is
really only useful for supporting legacy operating systems that do not
have native machine check handling, or for platforms that have bugs that
would otherwise cause OS native error handling to fail. Otherwise, for
modern Linux, firmware-first error handling is pure overhead and a
source of bugs.

In this case the bug is in the Linux code that translates the ACPI event
back into an MCE record.
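
To make the granularity mismatch concrete, here is a minimal userspace
sketch using the numbers from the console log quoted above. It is not
the kernel code: the helper names (mask_to_shift, affected_range) and
the struct are made up for illustration. It assumes the
physical_address_mask is a contiguous mask whose trailing zero bits give
the error granularity, and it shows how much wider the reported range
becomes if the translation falls back to a 4K (shift of 12) granularity
instead of the 256 bytes the CPER record describes.

```c
#include <stdint.h>

/* Hypothetical helper: count the trailing zero bits of an address mask.
 * A mask of 0xffffffffffffff00 yields 8, i.e. 256-byte granularity. */
static unsigned int mask_to_shift(uint64_t mask)
{
	unsigned int shift = 0;

	if (mask == 0)
		return 64;	/* no address bits valid at all */

	while (!(mask & 1)) {
		mask >>= 1;
		shift++;
	}
	return shift;
}

/* Hypothetical container for the range that ends up marked as bad. */
struct bad_range {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical helper: the aligned range implied by an address and a
 * granularity shift. Assumes shift < 64. */
static struct bad_range affected_range(uint64_t addr, unsigned int shift)
{
	struct bad_range r;

	r.len = 1ULL << shift;
	r.start = addr & ~(r.len - 1);
	return r;
}
```

With the logged address 0x00000040a0602600, the 256-byte mask gives a
range starting at 0x40a0602600 that is 0x100 bytes long, while a 4K
granularity gives a range starting at 0x40a0602000 that is 0x1000 bytes
long, which is how a badrange report can spill over into clean
neighboring blocks.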