Date:   Sat, 22 Jan 2022 01:51:41 +0000
From:   "Tsaur, Erwin" <erwin.tsaur@...el.com>
To:     "Luck, Tony" <tony.luck@...el.com>,
        "chu, jane" <jane.chu@...cle.com>
CC:     "Williams, Dan J" <dan.j.williams@...el.com>,
        "bp@...en8.de >> Borislav Petkov" <bp@...en8.de>,
        "djwong@...nel.org" <djwong@...nel.org>,
        "willy@...radead.org" <willy@...radead.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: Phantom PMEM poison issue

Hi Jane,

Is the phantom error a poison that was injected and then cleared, but somehow shows up again?
How does "daxfs takes action and clears the poison" happen: through mailbox commands or through writes?
Also, how are you doing ARS?

Erwin

-----Original Message-----
From: Luck, Tony <tony.luck@...el.com> 
Sent: Friday, January 21, 2022 5:27 PM
To: chu, jane <jane.chu@...cle.com>
Cc: Williams, Dan J <dan.j.williams@...el.com>; bp@...en8.de >> Borislav Petkov <bp@...en8.de>; djwong@...nel.org; willy@...radead.org; nvdimm@...ts.linux.dev; linux-kernel@...r.kernel.org
Subject: Re: Phantom PMEM poison issue

On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to 
> > provision daxfs, say a poison was consumed by a load from a user 
> > thread, and then daxfs takes action and clears the poison, confirmed 
> > by "ndctl -NM".
> > 
> > Now, depending on luck, after some time (from a few seconds to 5+ 
> > hours) the ghost of the previous poison will surface, and it takes 
> > unloading and reloading the libnvdimm drivers to drive the phantom 
> > poison away, confirmed by ARS.
> > 
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> > 
> > Here is the relevant console message after injecting 8 poisons in one 
> > page via
> >     # ndctl inject-error namespace0.0 -n 2 -B 8210
> 
> There is a cut-n-paste error, the above line should be
>    "# ndctl inject-error namespace0.0 -n 8 -B 8210"

You say "in one page" here. What is the page size? 
> 
> -jane
> 
> > then cleared them all and waited 5+ hours. Notice the timestamps.
> > BTW, the system is idle otherwise.
> > 
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype [mem 0x1850602000-0x1850602fff]

This error is reported in PFN=0x1850602 (at offset 0x400 = 1 KiB)
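
The PFN and offset above come from splitting the reported physical address at the page boundary. A minimal sketch of that arithmetic, assuming 4 KiB pages (an assumption, though it is consistent with the [mem 0x1850602000-0x1850602fff] range in the log). The helper name `split_phys_addr` is hypothetical:

```python
PAGE_SHIFT = 12              # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def split_phys_addr(addr):
    """Return (pfn, offset_in_page) for a physical address."""
    return addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)


# Address taken from the "mce: Uncorrected hardware memory error" line.
pfn, offset = split_phys_addr(0x1850602400)
print(hex(pfn), hex(offset))  # 0x1850602 0x400
```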

> > 
> > At this point,
> > # ndctl list -NMu -r 0
> > {
> >     "dev":"namespace0.0",
> >     "mode":"fsdax",
> >     "map":"dev",
> >     "size":"15.75 GiB (16.91 GB)",
> >     "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> >     "sector_size":4096,
> >     "align":2097152,
> >     "blockdev":"pmem0"
> > }
> > 
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]:   section_type: memory error
> > [21352.020572] {2}[Hardware Error]:   physical_address: 0x0000001850603200

This error is in the following page: PFN=0x1850603 (at offset 0x200 = 512 bytes)

Is that what you mean by "phantom error" ... an error at a different address from those that were injected?
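
One way to check this is to compare the later address against the span that was injected. A rough sketch, resting on assumptions that are not confirmed in this thread: that `-B` counts 512-byte blocks, that pages are 4 KiB, and that the first injected block sits at the address from the first mce line (0x1850602400). The helper names `injected_span` and `block_index` are hypothetical:

```python
BLOCK_SIZE = 512   # assumption: inject-error blocks are 512 bytes
PAGE_SHIFT = 12    # assumption: 4 KiB pages


def injected_span(first_addr, nblocks):
    """Half-open byte range [start, end) covered by nblocks blocks."""
    return first_addr, first_addr + nblocks * BLOCK_SIZE


def block_index(addr, first_addr, nblocks):
    """Index of the injected block containing addr, or None if outside."""
    start, end = injected_span(first_addr, nblocks)
    if start <= addr < end:
        return (addr - start) // BLOCK_SIZE
    return None


first = 0x1850602400   # assumption: first injected block (from the mce line)
late = 0x1850603200    # address from the later APEI report

start, end = injected_span(first, 8)
pages = sorted({a >> PAGE_SHIFT for a in range(start, end, BLOCK_SIZE)})
print([hex(p) for p in pages])        # ['0x1850602', '0x1850603']
print(block_index(late, first, 8))    # 7

# Block 8210 from the inject-error command, as an offset within a page,
# assuming the namespace data area starts page-aligned.
print(hex((8210 * BLOCK_SIZE) & 0xfff))  # 0x400
```

Under those assumptions the 8 blocks straddle a page boundary, and the later address would fall in the last injected block rather than at a new location. If any of the assumptions does not hold, neither does that reading.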

-Tony
