lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6780610bc33e9_9b92294cd@dwillia2-mobl3.amr.corp.intel.com.notmuch>
Date: Thu, 9 Jan 2025 15:51:39 -0800
From: Dan Williams <dan.j.williams@...el.com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>, Borislav Petkov
	<bp@...en8.de>
CC: Shiju Jose <shiju.jose@...wei.com>, "linux-edac@...r.kernel.org"
	<linux-edac@...r.kernel.org>, "linux-cxl@...r.kernel.org"
	<linux-cxl@...r.kernel.org>, "linux-acpi@...r.kernel.org"
	<linux-acpi@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"tony.luck@...el.com" <tony.luck@...el.com>, "rafael@...nel.org"
	<rafael@...nel.org>, "lenb@...nel.org" <lenb@...nel.org>,
	"mchehab@...nel.org" <mchehab@...nel.org>, "dan.j.williams@...el.com"
	<dan.j.williams@...el.com>, "dave@...olabs.net" <dave@...olabs.net>,
	"dave.jiang@...el.com" <dave.jiang@...el.com>, "alison.schofield@...el.com"
	<alison.schofield@...el.com>, "vishal.l.verma@...el.com"
	<vishal.l.verma@...el.com>, "ira.weiny@...el.com" <ira.weiny@...el.com>,
	"david@...hat.com" <david@...hat.com>, "Vilas.Sridharan@....com"
	<Vilas.Sridharan@....com>, "leo.duran@....com" <leo.duran@....com>,
	"Yazen.Ghannam@....com" <Yazen.Ghannam@....com>, "rientjes@...gle.com"
	<rientjes@...gle.com>, "jiaqiyan@...gle.com" <jiaqiyan@...gle.com>,
	"Jon.Grimm@....com" <Jon.Grimm@....com>, "dave.hansen@...ux.intel.com"
	<dave.hansen@...ux.intel.com>, "naoya.horiguchi@....com"
	<naoya.horiguchi@....com>, "james.morse@....com" <james.morse@....com>,
	"jthoughton@...gle.com" <jthoughton@...gle.com>, "somasundaram.a@....com"
	<somasundaram.a@....com>, "erdemaktas@...gle.com" <erdemaktas@...gle.com>,
	"pgonda@...gle.com" <pgonda@...gle.com>, "duenwen@...gle.com"
	<duenwen@...gle.com>, "gthelen@...gle.com" <gthelen@...gle.com>,
	"wschwartz@...erecomputing.com" <wschwartz@...erecomputing.com>,
	"dferguson@...erecomputing.com" <dferguson@...erecomputing.com>,
	"wbs@...amperecomputing.com" <wbs@...amperecomputing.com>,
	"nifan.cxl@...il.com" <nifan.cxl@...il.com>, tanxiaofei
	<tanxiaofei@...wei.com>, "Zengtao (B)" <prime.zeng@...ilicon.com>, "Roberto
 Sassu" <roberto.sassu@...wei.com>, "kangkang.shen@...urewei.com"
	<kangkang.shen@...urewei.com>, wanghuiqiang <wanghuiqiang@...wei.com>,
	Linuxarm <linuxarm@...wei.com>
Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature

Jonathan Cameron wrote:
> Ok. Best path is drop the available range support then (so no min_ max_ or
> anything to replace them for now).

I think less is more in this case.

The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
wide for userspace to have a chance at writing a competent tool. At
least I am struggling with where to even begin with those ABIs if I was
asked to write a tool. Does a tool already exist for those?

Some questions that read on those ABIs are:

1/ What if the platform has translation between HPA (CXL decode) and SPA
(physical addresses reported in trace points that PIO and DMA see)?

2/ What if memory is interleaved across repair domains? 

3/ What if the device does not use DDR terminology / topology terms for
repair?

I expect the flow rasdaemon would want is that the current PFA (leaky
bucket Pre-Failure Analysis) decides that the number of soft-offlines it
has performed exceeds some threshold and it wants to attempt to repair
memory.

However, what is missing today for volatile memory is that some failures
can be repaired with in-band writes and some failures need heavier
hammers like Post-Package-Repair to actively swap in whole new banks of
memory. So don't we need something like "soft-offline-undo" on the way
to PPR?

So, yes, +1 to simpler for now where software effectively just needs to
deal with a handful of "region repair" buttons and the semantics of
those are coarse and sub-optimal. Wait for a future where a tool author
says, "we have had good success getting bulk offlined pages back into
service, but now we need this specific finer grained kernel interface to
avoid wasting spare banks prematurely".

Anything more complex than a set of /sys/devices/system/memory/
devices has a /sys/bus/edac/devices/devX/repair button, feels like a
generation ahead of where the initial sophistication needs to lie.

That said, I do not closely follow ras tooling to say whether someone
has already identified the critical need for a fine grained repair ABI?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ