linux-kernel - Phantom PMEM poison issue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <1737f707-7743-4898-37d4-03098d7aaa57@oracle.com>
Date:   Sat, 22 Jan 2022 00:31:04 +0000
From:   Jane Chu <jane.chu@...cle.com>
To:     "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        "bp@...en8.de >> Borislav Petkov" <bp@...en8.de>
CC:     "djwong@...nel.org" <djwong@...nel.org>,
        "willy@...radead.org" <willy@...radead.org>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Phantom PMEM poison issue

On baremetal Intel platform with DCPMEM installed and configured to 
provision daxfs, say a poison was consumed by a load from a user thread, 
and then daxfs takes action and clears the poison, confirmed by "ndctl 
-NM".

Now, depends on the luck, after sometime(from a few seconds to 5+ hours) 
the ghost of the previous poison will surface, and it takes 
unload/reload the libnvdimm drivers in order to drive the phantom poison 
away, confirmed by ARS.

Turns out, the issue is quite reproducible with the latest stable Linux.

Here is the relevant console message after injected 8 poisons in one 
page via
   # ndctl inject-error namespace0.0 -n 2 -B 8210
then, cleared them all, and wait for 5+ hours, notice the time stamp. 
BTW, the system is idle otherwise.

[ 2439.742296] mce: Uncorrected hardware memory error in user-access at 
1850602400
[ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to 
fsdax_poison_v1:8457 due to hardware memory corruption
[ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: 
Recovered
[ 2439.769949] mce: [Hardware Error]: Machine check events logged
-1850603000 uncached-minus<->write-back
[ 2439.769984] x86/PAT: memtype_reserve failed [mem 
0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
[ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
[ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype 
[mem 0x1850602000-0x1850602fff]

At this point,
# ndctl list -NMu -r 0
{
   "dev":"namespace0.0",
   "mode":"fsdax",
   "map":"dev",
   "size":"15.75 GiB (16.91 GB)",
   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
   "sector_size":4096,
   "align":2097152,
   "blockdev":"pmem0"
}

[21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic 
Hardware Error Source: 1
[21352.001528] {2}[Hardware Error]: event severity: recoverable
[21352.007838] {2}[Hardware Error]:  Error 0, type: recoverable
[21352.014156] {2}[Hardware Error]:   section_type: memory error
[21352.020572] {2}[Hardware Error]:   physical_address: 0x0000001850603200
[21352.027958] {2}[Hardware Error]:   physical_address_mask: 
0xffffffffffffff00
[21352.035827] {2}[Hardware Error]:   node: 0 module: 1
[21352.041466] {2}[Hardware Error]:   DIMM location: /SYS/MB/P0 D6
[21352.048277] Memory failure: 0x1850603: recovery action for dax page: 
Recovered
[21352.056346] mce: [Hardware Error]: Machine check events logged
[21352.056890] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[21352.056892] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 
0xbc0000000000009f
[21352.056894] EDAC skx MC0: TSC 0x0
[21352.056895] EDAC skx MC0: ADDR 0x1850603200
[21352.056897] EDAC skx MC0: MISC 0x8c
[21352.056898] EDAC skx MC0: PROCESSOR 0:0x50656 TIME 1642758243 SOCKET 
0 APIC 0x0
[21352.056909] EDAC MC0: 1 UE memory read error on 
CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1850603 
offset:0x200 grain:32 -  err_code:0x0000:0x009f [..]

And now,

# ndctl list -NMu -r 0
{
   "dev":"namespace0.0",
   "mode":"fsdax",
   "map":"dev",
   "size":"15.75 GiB (16.91 GB)",
   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
   "sector_size":4096,
   "align":2097152,
   "blockdev":"pmem0",
   "badblock_count":1,
   "badblocks":[
     {
       "offset":8217,
       "length":1,
       "dimms":[
         "nmem0"
       ]
     }
   ]
}

According to my limited research, when ghes_proc_in_irq() is fired to 
report a delayed UE and it calls memory_failure() to take the page out 
and causes driver to record a badblock record, and that's how the 
phantom poison appeared.

Note, 1 phantom poison for 8 injected poisons, so, not an accurate 
phantom representation.

But that aside, it seems that the GHES mechanism and the synchronous MCE 
handling is totally at odds with each other, and that cannot be correct.

What is the right thing to do to fix the issue? Should memory_failure 
handler second-guess the GHES report?  Should the synchronous MCE 
handling mechanism manage to tell the firmware that so-and-so memory UE 
has been cleared and hence clear the record in firmware?  Other ideas?


Thanks!
-jane