linux-kernel - Re: Phantom PMEM poison issue

Message-ID: <YetdZBbt0P/QBHDr@agluck-desk2.amr.corp.intel.com>
Date:   Fri, 21 Jan 2022 17:27:00 -0800
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Jane Chu <jane.chu@...cle.com>
Cc:     "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
        "bp@...en8.de >> Borislav Petkov" <bp@...en8.de>,
        "djwong@...nel.org" <djwong@...nel.org>,
        "willy@...radead.org" <willy@...radead.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: Phantom PMEM poison issue

On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to
> > provision daxfs, say a poison was consumed by a load from a user thread,
> > and then daxfs takes action and clears the poison, confirmed by "ndctl
> > -NM".
> > 
> > Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
> > the ghost of the previous poison will surface, and it takes
> > unload/reload the libnvdimm drivers in order to drive the phantom poison
> > away, confirmed by ARS.
> > 
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> > 
> > Here is the relevant console message after injected 8 poisons in one
> > page via
> >     # ndctl inject-error namespace0.0 -n 2 -B 8210
> 
> There is a cut-n-paste error, the above line should be
>    "# ndctl inject-error namespace0.0 -n 8 -B 8210"

You say "in one page" here. What is the page size? 
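
A rough sketch of why the page size matters here. It assumes that the
-B and -n arguments count 512-byte blocks, that pages are 4 KiB, and
that block 0 of the namespace starts on a page boundary. None of that
is confirmed in this thread, and the helper names are made up for
illustration.

```python
# Sketch only: which 4 KiB pages do "count" injected blocks touch?
# Assumptions (not confirmed in this thread): blocks are 512 bytes,
# pages are 4 KiB, and block 0 starts on a page boundary.
BLOCK_SIZE = 512
PAGE_SIZE = 4096


def pages_touched(first_block, count):
    """Return page indexes (relative to block 0) covered by the blocks."""
    start = first_block * BLOCK_SIZE
    end = start + count * BLOCK_SIZE - 1  # last injected byte
    return list(range(start // PAGE_SIZE, end // PAGE_SIZE + 1))


def offset_in_page(block):
    """Byte offset of the start of a block within its page."""
    return (block * BLOCK_SIZE) % PAGE_SIZE
```

Under those assumptions, pages_touched(8210, 8) covers two consecutive
pages, and offset_in_page(8210) is 0x400, so the eight blocks would not
all sit in one 4 KiB page.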
> 
> -jane
> 
> > then, cleared them all, and wait for 5+ hours, notice the time stamp.
> > BTW, the system is idle otherwise.
> > 
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
> > 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
> > fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
> > Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem
> > 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
> > [mem 0x1850602000-0x1850602fff]

This error is reported in PFN=0x1850602 (at offset 0x400 = 1 KiB)

> > 
> > At this point,
> > # ndctl list -NMu -r 0
> > {
> >     "dev":"namespace0.0",
> >     "mode":"fsdax",
> >     "map":"dev",
> >     "size":"15.75 GiB (16.91 GB)",
> >     "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> >     "sector_size":4096,
> >     "align":2097152,
> >     "blockdev":"pmem0"
> > }
> > 
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]:   section_type: memory error
> > [21352.020572] {2}[Hardware Error]:   physical_address: 0x0000001850603200

This error is in the following page: PFN=0x1850603 (at offset 0x200 = 512 bytes)

Is that what you mean by "phantom error" ... from a different
address from those that were injected?
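
The PFN and offset arithmetic can be checked with a small sketch. It
assumes 4 KiB pages (PAGE_SHIFT of 12); the function name is only for
illustration, and the two addresses are the ones from the logs quoted
in this mail.

```python
# Sketch only: split a physical address into (PFN, offset in page).
PAGE_SHIFT = 12  # assumes 4 KiB pages


def split_phys_addr(addr):
    """Return (pfn, offset) for a physical address."""
    pfn = addr >> PAGE_SHIFT
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return pfn, offset


MCE_ADDR = 0x1850602400   # from the "mce: Uncorrected hardware memory error" line
APEI_ADDR = 0x1850603200  # from the APEI "physical_address" line
```

split_phys_addr(MCE_ADDR) gives PFN 0x1850602 with offset 0x400, and
split_phys_addr(APEI_ADDR) gives PFN 0x1850603 with offset 0x200. The
two addresses are 0xe00 bytes apart, so they land in adjacent pages.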

-Tony