Date:   Sat, 22 Jan 2022 02:32:43 +0000
From:   Jane Chu <jane.chu@...cle.com>
To:     "Tsaur, Erwin" <erwin.tsaur@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>
CC:     "Williams, Dan J" <dan.j.williams@...el.com>,
        "bp@...en8.de >> Borislav Petkov" <bp@...en8.de>,
        "djwong@...nel.org" <djwong@...nel.org>,
        "willy@...radead.org" <willy@...radead.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: Phantom PMEM poison issue

On 1/21/2022 5:51 PM, Tsaur, Erwin wrote:
> Hi Jane,
> 
> Is a phantom error a poison that was injected and then cleared, but somehow shows up again?
> How is "daxfs takes action and clears the poison" done: by mailbox or by writes?
> Also how are you doing ARS?

The phantom shows up as soon as this console message shows up:
    [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
from 'ghes'.

The poisons were cleared via pmem_clear_poison().

ARS was run as
   "ndctl start-scrub; ndctl wait-scrub -p 30"

thanks,
-jane


> 
> Erwin
> 
> -----Original Message-----
> From: Luck, Tony <tony.luck@...el.com>
> Sent: Friday, January 21, 2022 5:27 PM
> To: chu, jane <jane.chu@...cle.com>
> Cc: Williams, Dan J <dan.j.williams@...el.com>; bp@...en8.de >> Borislav Petkov <bp@...en8.de>; djwong@...nel.org; willy@...radead.org; nvdimm@...ts.linux.dev; linux-kernel@...r.kernel.org
> Subject: Re: Phantom PMEM poison issue
> 
> On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
>> On 1/21/2022 4:31 PM, Jane Chu wrote:
>>> On a bare-metal Intel platform with DCPMEM installed and configured to
>>> provision daxfs, say a poison was consumed by a load from a user
>>> thread, and then daxfs takes action and clears the poison, confirmed
>>> by "ndctl list -NM".
>>>
>>> Now, depending on luck, after some time (from a few seconds to 5+
>>> hours) the ghost of the previous poison will surface, and it takes
>>> unloading/reloading the libnvdimm drivers to drive the phantom
>>> poison away, confirmed by ARS.
>>>
>>> Turns out, the issue is quite reproducible with the latest stable Linux.
>>>
>>> Here is the relevant console message after injecting 8 poisons in one
>>> page via
>>>      # ndctl inject-error namespace0.0 -n 2 -B 8210
>>
>> There is a cut-n-paste error, the above line should be
>>     "# ndctl inject-error namespace0.0 -n 8 -B 8210"
> 
> You say "in one page" here. What is the page size?
>>
>> -jane
>>
>>> then cleared them all and waited for 5+ hours; notice the time stamps.
>>> BTW, the system is idle otherwise.
>>>
>>> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at 1850602400
>>> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to fsdax_poison_v1:8457 due to hardware memory corruption
>>> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: Recovered
>>> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
>>> -1850603000 uncached-minus<->write-back
>>> [ 2439.769984] x86/PAT: memtype_reserve failed [mem 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
>>> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
>>> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype [mem 0x1850602000-0x1850602fff]
> 
> This error is reported in PFN=1850602 (at offset 0x400 = 1K)
> 
>>>
>>> At this point,
>>> # ndctl list -NMu -r 0
>>> {
>>>      "dev":"namespace0.0",
>>>      "mode":"fsdax",
>>>      "map":"dev",
>>>      "size":"15.75 GiB (16.91 GB)",
>>>      "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>>>      "sector_size":4096,
>>>      "align":2097152,
>>>      "blockdev":"pmem0"
>>> }
>>>
>>> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
>>> [21352.001528] {2}[Hardware Error]: event severity: recoverable
>>> [21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
>>> [21352.014156] {2}[Hardware Error]:   section_type: memory error
>>> [21352.020572] {2}[Hardware Error]:   physical_address: 0x0000001850603200
> 
> This error is in the following page: PFN=1850603 (at offset 0x200 = 512 bytes)
> 
> Is that what you mean by "phantom error" ... from a different address from those that were injected?
> 
> -Tony
> 
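
For anyone following the address arithmetic in the quoted analysis, here is a rough sketch of how the two reported physical addresses split into PFN and in-page offset, and how many pages a run of injected blocks would cover. It assumes 4 KiB pages, 512-byte blocks for "ndctl inject-error -B", and a page-aligned start of the namespace data area; none of these are confirmed in this thread. The helper names (split_phys_addr, block_span, pages_touched) are made up for illustration, and the page indices from pages_touched() are relative to the namespace data start, not physical PFNs.

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB base pages
PAGE_SIZE = 1 << PAGE_SHIFT
BLOCK_SIZE = 512                # assumption: -B counts 512-byte blocks


def split_phys_addr(addr):
    """Split a physical address into (pfn, offset within the page)."""
    return addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)


def block_span(first_block, count):
    """Byte range [start, end) covered by `count` blocks from `first_block`."""
    start = first_block * BLOCK_SIZE
    return start, start + count * BLOCK_SIZE


def pages_touched(first_block, count):
    """Page indices (relative to a page-aligned data start) the run covers."""
    start, end = block_span(first_block, count)
    return list(range(start // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1))


if __name__ == "__main__":
    # The two addresses from the console log in this thread.
    for addr in (0x1850602400, 0x1850603200):
        pfn, off = split_phys_addr(addr)
        print(f"addr={addr:#x} pfn={pfn:#x} offset={off:#x}")

    # The corrected injection command: -B 8210 -n 8.
    start, end = block_span(8210, 8)
    print(f"bytes [{start:#x}, {end:#x}) pages {pages_touched(8210, 8)}")
```

Under those assumptions the 8-block run starts at offset 0x400 within a page and is 4096 bytes long, so it would cover two adjacent pages rather than one. Whether that matches the actual layout depends on the real page size and namespace data offset, which is what the page-size question above is asking.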
