linux-kernel - Re: [PATCH] acpi/nfit: badrange report spill over to clean range

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <62d0bbb19ed1f_1643dc294cc@dwillia2-xfh.jf.intel.com.notmuch>
Date:   Thu, 14 Jul 2022 17:58:25 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Jane Chu <jane.chu@...cle.com>,
        Dan Williams <dan.j.williams@...el.com>,
        "hch@...radead.org" <hch@...radead.org>,
        "vishal.l.verma@...el.com" <vishal.l.verma@...el.com>,
        "dave.jiang@...el.com" <dave.jiang@...el.com>,
        "ira.weiny@...el.com" <ira.weiny@...el.com>,
        "nvdimm@...ts.linux.dev" <nvdimm@...ts.linux.dev>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] acpi/nfit: badrange report spill over to clean range

Jane Chu wrote:
> On 7/13/2022 5:24 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> On 7/12/2022 5:48 PM, Dan Williams wrote:
> >>> Jane Chu wrote:
> >>>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
> >>>> granularity") changed nfit_handle_mce() callback to report badrange for
> >>>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
> >>>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
> >>>> populated with Intel DCPMEM v2 dimms, it appears that
> >>>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
> >>>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
> >>>> 8 poisons.
> >>>>
> >>>> [29076.590281] {3}[Hardware Error]:   physical_address: 0x00000040a0602400
> >>>> [..]
> >>>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
> >>>> [29076.627519] mce: [Hardware Error]: Machine check events logged
> >>>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >>>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >>>> [..]
> >>>> [29078.634817] {4}[Hardware Error]:   physical_address: 0x00000040a0602600
> >>>> [..]
> >>>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >>>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >>>> [..]
> >>>> {
> >>>>     "dev":"namespace0.0",
> >>>>     "mode":"fsdax",
> >>>>     "map":"dev",
> >>>>     "size":33820770304,
> >>>>     "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
> >>>>     "sector_size":512,
> >>>>     "align":2097152,
> >>>>     "blockdev":"pmem0",
> >>>>     "badblock_count":8,
> >>>>     "badblocks":[
> >>>>       {
> >>>>         "offset":8208,
> >>>>         "length":8,
> >>>>         "dimms":[
> >>>>           "nmem0"
> >>>>         ]
> >>>>       }
> >>>>     ]
> >>>> }
> >>>>
> >>>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
> >>>> radius and shouldn't be used.  More over, as each injected poison is being
> >>>> reported independently, any alignment under 512-byte appear works:
> >>>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
> >>>> or 512-byte.
> >>>>
> >>>> To get around this issue, 512-bytes is chosen as the alignment because
> >>>>     a. it happens to be the badblock granularity,
> >>>>     b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
> >>>>     c. architecture agnostic
> >>>
> >>> I am failing to see the kernel bug? Yes, you injected less than 8
> >>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
> >>> that's not the kernel's fault, that's the hardware. What happens when
> >>> hardware really does detect 8 blocks of consective poison and this
> >>> implementation decides to only record 1 at a time?
> >>
> >> In that case, there will be 8 reports of the poisons by APEI GHES,
> > 
> > Why would there be 8 reports for just one poison consumption event?
> 
> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> one call for each poison with accurate address.
> 
> Also, short ARS would find 2 poisons.
> 
> I attached the console output, my annotation is prefixed with "<==".
> 
> It is from these information I concluded that no poison will be missed
> in terms of reporting.
> 
> > 
> >> ARC scan will also report 8 poisons, each will get to be added to the
> >> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
> > 
> > Right, that's what I'm saying about the proposed change, trim the
> > reported poison by what is return from a "short" ARS. Recall that
> > short-ARS just reads from a staging buffer that the BIOS knows about, it
> > need not go all the way to hardware.
> 
> Okay, that confirms my understanding of your proposal. More below.
> 
> > 
> >> In the above 2 poison example, the poison in 0x00000040a0602400 and in
> >> 0x00000040a0602600 were separately reported.
> > 
> > Separately reported, each with a 4K alignment?
> 
> Yes,  and so twice nfit_handle_mce() call
> nvdimm_bus_add_badrange() with addr: 0x40a0602000, length: 0x1000
> complete overlap.
> 
> > 
> >>> It seems the fix you want is for the hardware to report the precise
> >>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> >>> that precision in this case.
> >>
> >> That field describes a 4K range even for a single poison, it confuses
> >> people unnecessarily.
> > 
> > I agree with you on the problem statement, it's the fix where I have
> > questions.
> > 
> >>> However, the ARS engine likely can return the precise error ranges so I
> >>> think the fix is to just use the address range indicated by 1UL <<
> >>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> >>> scrub request to ask the device for the precise error list.
> >>
> >> You mean for nfit_handle_mce() callback to issue a short ARS per each
> >> poison report over a 4K range
> > 
> > Over a L1_CACHE_BYTES range...
> > 
> >> in order to decide the precise range as a workaround of the hardware
> >> issue?  if there are 8 poisoned detected, there will be 8 short ARS,
> >> sure we want to do that?
> > 
> > Seems ok to me, short ARS is meant to be cheap. I would hope there are
> > no latency concerns in this path.
> 
> Yeah, accumulated latency is the concern here.

An actual concern in practice? I.e. you see the round trip call to the
BIOS violating application latency concerns?

> I know the upstream user call stack has timeout for getting a response
> for certain database transaction.

I would expect that's on the order of seconds, no?

> And the test folks might inject dozens of back-to-back poisons to
> adjacent pages...

I am not sure the extreme scenarios of test folks should be guiding
kernel design, the interfaces need to make sense first, and then the
performance can be fixed up if kernel architecture causes practical
problems.

> > 
> >> also, for now, is it possible to log more than 1 poison per 512byte
> >> block?
> > 
> > For the badrange tracking, no. So this would just be a check to say
> > "Yes, CPU I see you think the whole 4K is gone, but lets double check
> > with more precise information for what gets placed in the badrange
> > tracking".
> 
> Okay, process-wise, this is what I am seeing -
> 
> - for each poison, nfit_handle_mce() issues a short ARS given (addr, 
> 64bytes)

Why would the short-ARS be performed over a 64-byte span when the MCE
reported a 4K aligned event?

> - and short ARS returns to say that's actually (addr, 256bytes),
> - and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes) 
> anyway.

Right, I am reacting to the fact that the patch is picking 512 as an
arbtitrary blast radius. It's ok to expand the blast radius from
hardware when, for example, recording a 64-byte MCE in badrange which
only understands 512 byte records, but it's not ok to take a 4K MCE and
trim it to 512 bytes without asking hardware for a more precise report.

Recall that the NFIT driver supports platforms that may not offer ARS.
In that case the 4K MCE from the CPU is all that the driver gets and
there is no data source for a more precise answer.

So the ask is to avoid trimming the blast radius of MCE reports unless
and until a short-ARS says otherwise.

> The precise badrange from short ARS is lost in the process, given the
> time spent visiting the BIOS, what's the gain?

Generic support for not under-recording poison on platforms that do not
support ARS.

> Could we defer the precise badrange until there is consumer of the
> information?

Ideally the consumer is immediate and this precise information can make
it to the filesystem which might be able to make a better decision about
what data got clobbered.

See dax_notify_failure() infrastructure currently in linux-next that can
convey poison events to filesystems. That might be a path to start
tracking and reporting precise failure information to address the
constraints of the badrange implementation.