[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YetdZBbt0P/QBHDr@agluck-desk2.amr.corp.intel.com>
Date: Fri, 21 Jan 2022 17:27:00 -0800
From: "Luck, Tony" <tony.luck@...el.com>
To: Jane Chu <jane.chu@...cle.com>
Cc: "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
"bp@...en8.de >> Borislav Petkov" <bp@...en8.de>,
"djwong@...nel.org" <djwong@...nel.org>,
"willy@...radead.org" <willy@...radead.org>,
"nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: Phantom PMEM poison issue
On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to
> > provision daxfs, say a poison was consumed by a load from a user thread,
> > and then daxfs takes action and clears the poison, confirmed by "ndctl
> > -NM".
> >
> > Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
> > the ghost of the previous poison will surface, and it takes
> > unload/reload the libnvdimm drivers in order to drive the phantom poison
> > away, confirmed by ARS.
> >
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> >
> > Here is the relevant console message after injected 8 poisons in one
> > page via
> > # ndctl inject-error namespace0.0 -n 2 -B 8210
>
> There is a cut-n-paste error, the above line should be
> "# ndctl inject-error namespace0.0 -n 8 -B 8210"
You say "in one page" here. What is the page size?
>
> -jane
>
> > then, cleared them all, and wait for 5+ hours, notice the time stamp.
> > BTW, the system is idle otherwise.
> >
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
> > 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
> > fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
> > Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem
> > 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
> > [mem 0x1850602000-0x1850602fff]
This error is reported in PFN=1850602 (at offset 0x400 = 1K)
> >
> > At this point,
> > # ndctl list -NMu -r 0
> > {
> > "dev":"namespace0.0",
> > "mode":"fsdax",
> > "map":"dev",
> > "size":"15.75 GiB (16.91 GB)",
> > "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> > "sector_size":4096,
> > "align":2097152,
> > "blockdev":"pmem0"
> > }
> >
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]: section_type: memory error
> > [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
This error is in the following page: PFN=1850603 (at offset 0x200 = 512b)
Is that what you mean by "phantom error" ... from a different
address from those that were injected?
-Tony
Powered by blists - more mailing lists