linux-kernel - Re: Phantom PMEM poison issue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0aab10e0-6f1d-d67b-8aef-7e3d18e26544@oracle.com>
Date:   Sat, 22 Jan 2022 00:40:18 +0000
From:   Jane Chu <jane.chu@...cle.com>
To:     "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        "bp@...en8.de >> Borislav Petkov" <bp@...en8.de>
CC:     "djwong@...nel.org" <djwong@...nel.org>,
        "willy@...radead.org" <willy@...radead.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: Phantom PMEM poison issue

On 1/21/2022 4:31 PM, Jane Chu wrote:
> On baremetal Intel platform with DCPMEM installed and configured to
> provision daxfs, say a poison was consumed by a load from a user thread,
> and then daxfs takes action and clears the poison, confirmed by "ndctl
> -NM".
> 
> Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
> the ghost of the previous poison will surface, and it takes
> unload/reload the libnvdimm drivers in order to drive the phantom poison
> away, confirmed by ARS.
> 
> Turns out, the issue is quite reproducible with the latest stable Linux.
> 
> Here is the relevant console message after injected 8 poisons in one
> page via
>     # ndctl inject-error namespace0.0 -n 2 -B 8210

There is a cut-n-paste error, the above line should be
   "# ndctl inject-error namespace0.0 -n 8 -B 8210"

-jane

> then, cleared them all, and wait for 5+ hours, notice the time stamp.
> BTW, the system is idle otherwise.
> 
> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
> 1850602400
> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
> fsdax_poison_v1:8457 due to hardware memory corruption
> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
> Recovered
> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> -1850603000 uncached-minus<->write-back
> [ 2439.769984] x86/PAT: memtype_reserve failed [mem
> 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
> [mem 0x1850602000-0x1850602fff]
> 
> At this point,
> # ndctl list -NMu -r 0
> {
>     "dev":"namespace0.0",
>     "mode":"fsdax",
>     "map":"dev",
>     "size":"15.75 GiB (16.91 GB)",
>     "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>     "sector_size":4096,
>     "align":2097152,
>     "blockdev":"pmem0"
> }
> 
> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [21352.001528] {2}[Hardware Error]: event severity: recoverable
> [21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
> [21352.014156] {2}[Hardware Error]:   section_type: memory error
> [21352.020572] {2}[Hardware Error]:   physical_address: 0x0000001850603200
> [21352.027958] {2}[Hardware Error]:   physical_address_mask:
> 0xffffffffffffff00
> [21352.035827] {2}[Hardware Error]:   node: 0 module: 1
> [21352.041466] {2}[Hardware Error]:   DIMM location: /SYS/MB/P0 D6
> [21352.048277] Memory failure: 0x1850603: recovery action for dax page:
> Recovered
> [21352.056346] mce: [Hardware Error]: Machine check events logged
> [21352.056890] EDAC skx MC0: HANDLING MCE MEMORY ERROR
> [21352.056892] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255:
> 0xbc0000000000009f
> [21352.056894] EDAC skx MC0: TSC 0x0
> [21352.056895] EDAC skx MC0: ADDR 0x1850603200
> [21352.056897] EDAC skx MC0: MISC 0x8c
> [21352.056898] EDAC skx MC0: PROCESSOR 0:0x50656 TIME 1642758243 SOCKET
> 0 APIC 0x0
> [21352.056909] EDAC MC0: 1 UE memory read error on
> CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1850603
> offset:0x200 grain:32 -  err_code:0x0000:0x009f [..]
> 
> And now,
> 
> # ndctl list -NMu -r 0
> {
>     "dev":"namespace0.0",
>     "mode":"fsdax",
>     "map":"dev",
>     "size":"15.75 GiB (16.91 GB)",
>     "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>     "sector_size":4096,
>     "align":2097152,
>     "blockdev":"pmem0",
>     "badblock_count":1,
>     "badblocks":[
>       {
>         "offset":8217,
>         "length":1,
>         "dimms":[
>           "nmem0"
>         ]
>       }
>     ]
> }
> 
> According to my limited research, when ghes_proc_in_irq() is fired to
> report a delayed UE and it calls memory_failure() to take the page out
> and causes driver to record a badblock record, and that's how the
> phantom poison appeared.
> 
> Note, 1 phantom poison for 8 injected poisons, so, not an accurate
> phantom representation.
> 
> But that aside, it seems that the GHES mechanism and the synchronous MCE
> handling is totally at odds with each other, and that cannot be correct.
> 
> What is the right thing to do to fix the issue? Should memory_failure
> handler second-guess the GHES report?  Should the synchronous MCE
> handling mechanism manage to tell the firmware that so-and-so memory UE
> has been cleared and hence clear the record in firmware?  Other ideas?
> 
> 
> Thanks!
> -jane