linux-kernel - Re: [PATCH] acpi/nfit: badrange report spill over to clean range

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8b13446a-65ac-0cbd-6c17-0f9e1cfbb048@oracle.com>
Date:   Thu, 14 Jul 2022 23:22:15 +0000
From:   Jane Chu <jane.chu@...cle.com>
To:     Dan Williams <dan.j.williams@...el.com>,
        "hch@...radead.org" <hch@...radead.org>,
        "vishal.l.verma@...el.com" <vishal.l.verma@...el.com>,
        "dave.jiang@...el.com" <dave.jiang@...el.com>,
        "ira.weiny@...el.com" <ira.weiny@...el.com>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] acpi/nfit: badrange report spill over to clean range

On 7/13/2022 5:24 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 7/12/2022 5:48 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>>>> granularity") changed nfit_handle_mce() callback to report badrange for
>>>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
>>>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>>>> populated with Intel DCPMEM v2 dimms, it appears that
>>>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
>>>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
>>>> 8 poisons.
>>>>
>>>> [29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
>>>> [..]
>>>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
>>>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>>>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>>>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>>>> [..]
>>>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
>>>> [..]
>>>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>>>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>>>> [..]
>>>> {
>>>>     "dev":"namespace0.0",
>>>>     "mode":"fsdax",
>>>>     "map":"dev",
>>>>     "size":33820770304,
>>>>     "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>>>>     "sector_size":512,
>>>>     "align":2097152,
>>>>     "blockdev":"pmem0",
>>>>     "badblock_count":8,
>>>>     "badblocks":[
>>>>       {
>>>>         "offset":8208,
>>>>         "length":8,
>>>>         "dimms":[
>>>>           "nmem0"
>>>>         ]
>>>>       }
>>>>     ]
>>>> }
>>>>
>>>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
>>>> radius and shouldn't be used.  More over, as each injected poison is being
>>>> reported independently, any alignment under 512-byte appear works:
>>>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
>>>> or 512-byte.
>>>>
>>>> To get around this issue, 512-bytes is chosen as the alignment because
>>>>     a. it happens to be the badblock granularity,
>>>>     b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
>>>>     c. architecture agnostic
>>>
>>> I am failing to see the kernel bug? Yes, you injected less than 8
>>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
>>> that's not the kernel's fault, that's the hardware. What happens when
>>> hardware really does detect 8 blocks of consective poison and this
>>> implementation decides to only record 1 at a time?
>>
>> In that case, there will be 8 reports of the poisons by APEI GHES,
> 
> Why would there be 8 reports for just one poison consumption event?

I meant to say there would be 8 calls to the nfit_handle_mce() callback,
one call for each poison with accurate address.

Also, short ARS would find 2 poisons.

I attached the console output, my annotation is prefixed with "<==".

It is from these information I concluded that no poison will be missed
in terms of reporting.

> 
>> ARC scan will also report 8 poisons, each will get to be added to the
>> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
> 
> Right, that's what I'm saying about the proposed change, trim the
> reported poison by what is return from a "short" ARS. Recall that
> short-ARS just reads from a staging buffer that the BIOS knows about, it
> need not go all the way to hardware.

Okay, that confirms my understanding of your proposal. More below.

> 
>> In the above 2 poison example, the poison in 0x00000040a0602400 and in
>> 0x00000040a0602600 were separately reported.
> 
> Separately reported, each with a 4K alignment?

Yes,  and so twice nfit_handle_mce() call
nvdimm_bus_add_badrange() with addr: 0x40a0602000, length: 0x1000
complete overlap.

> 
>>> It seems the fix you want is for the hardware to report the precise
>>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
>>> that precision in this case.
>>
>> That field describes a 4K range even for a single poison, it confuses
>> people unnecessarily.
> 
> I agree with you on the problem statement, it's the fix where I have
> questions.
> 
>>> However, the ARS engine likely can return the precise error ranges so I
>>> think the fix is to just use the address range indicated by 1UL <<
>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>> scrub request to ask the device for the precise error list.
>>
>> You mean for nfit_handle_mce() callback to issue a short ARS per each
>> poison report over a 4K range
> 
> Over a L1_CACHE_BYTES range...
> 
>> in order to decide the precise range as a workaround of the hardware
>> issue?  if there are 8 poisoned detected, there will be 8 short ARS,
>> sure we want to do that?
> 
> Seems ok to me, short ARS is meant to be cheap. I would hope there are
> no latency concerns in this path.

Yeah, accumulated latency is the concern here. I know the upstream
user call stack has timeout for getting a response for certain
database transaction.  And the test folks might inject dozens of
back-to-back poisons to adjacent pages...
> 
>> also, for now, is it possible to log more than 1 poison per 512byte
>> block?
> 
> For the badrange tracking, no. So this would just be a check to say
> "Yes, CPU I see you think the whole 4K is gone, but lets double check
> with more precise information for what gets placed in the badrange
> tracking".

Okay, process-wise, this is what I am seeing -

- for each poison, nfit_handle_mce() issues a short ARS given (addr, 
64bytes)
- and short ARS returns to say that's actually (addr, 256bytes),
- and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes) 
anyway.

The precise badrange from short ARS is lost in the process, given the
time spent visiting the BIOS, what's the gain?  Could we defer the
precise badrange until there is consumer of the information?

thanks!
-jane





View attachment "poison_console_msg.txt" of type "text/plain" (3463 bytes)