linux-kernel - Re: [PATCH v15 11/15] EDAC: Add memory repair control feature

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20241115121415.00005c76@huawei.com>
Date: Fri, 15 Nov 2024 12:14:15 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Borislav Petkov <bp@...en8.de>
CC: Shiju Jose <shiju.jose@...wei.com>, "linux-edac@...r.kernel.org"
	<linux-edac@...r.kernel.org>, "linux-cxl@...r.kernel.org"
	<linux-cxl@...r.kernel.org>, "linux-acpi@...r.kernel.org"
	<linux-acpi@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"tony.luck@...el.com" <tony.luck@...el.com>, "rafael@...nel.org"
	<rafael@...nel.org>, "lenb@...nel.org" <lenb@...nel.org>,
	"mchehab@...nel.org" <mchehab@...nel.org>, "dan.j.williams@...el.com"
	<dan.j.williams@...el.com>, "dave@...olabs.net" <dave@...olabs.net>,
	"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
	"sudeep.holla@....com" <sudeep.holla@....com>, "jassisinghbrar@...il.com"
	<jassisinghbrar@...il.com>, "dave.jiang@...el.com" <dave.jiang@...el.com>,
	"alison.schofield@...el.com" <alison.schofield@...el.com>,
	"vishal.l.verma@...el.com" <vishal.l.verma@...el.com>, "ira.weiny@...el.com"
	<ira.weiny@...el.com>, "david@...hat.com" <david@...hat.com>,
	"Vilas.Sridharan@....com" <Vilas.Sridharan@....com>, "leo.duran@....com"
	<leo.duran@....com>, "Yazen.Ghannam@....com" <Yazen.Ghannam@....com>,
	"rientjes@...gle.com" <rientjes@...gle.com>, "jiaqiyan@...gle.com"
	<jiaqiyan@...gle.com>, "Jon.Grimm@....com" <Jon.Grimm@....com>,
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
	"naoya.horiguchi@....com" <naoya.horiguchi@....com>, "james.morse@....com"
	<james.morse@....com>, "jthoughton@...gle.com" <jthoughton@...gle.com>,
	"somasundaram.a@....com" <somasundaram.a@....com>, "erdemaktas@...gle.com"
	<erdemaktas@...gle.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
	"duenwen@...gle.com" <duenwen@...gle.com>, "gthelen@...gle.com"
	<gthelen@...gle.com>, "wschwartz@...erecomputing.com"
	<wschwartz@...erecomputing.com>, "dferguson@...erecomputing.com"
	<dferguson@...erecomputing.com>, "wbs@...amperecomputing.com"
	<wbs@...amperecomputing.com>, "nifan.cxl@...il.com" <nifan.cxl@...il.com>,
	tanxiaofei <tanxiaofei@...wei.com>, "Zengtao (B)" <prime.zeng@...ilicon.com>,
	"Roberto Sassu" <roberto.sassu@...wei.com>, "kangkang.shen@...urewei.com"
	<kangkang.shen@...urewei.com>, wanghuiqiang <wanghuiqiang@...wei.com>,
	Linuxarm <linuxarm@...wei.com>
Subject: Re: [PATCH v15 11/15] EDAC: Add memory repair control feature

Hi Borislav,

I'll just jump in on one element.

> > This will work for the CXL PPR feature where the result of the query operation for resources  availability
> > return to the command, however for the CXL memory sparing features,  the result of the query resources 
> > availability command returned later in a Memory Sparing Event Record from the device. 
> > Userspace shall issue repair operation with the attributes values received on the Memory Sparing trace event.
> > Thus for the CXL memory sparing feature, query for resources availability and repair operation 
> > cannot be combined.  
> 
> What happens if the resources availability changes between the query and the
> start of the repair operation?
>
Short answer, you get an error return. 

The query is an optional step / optimization. You can just skip it.
There is no point in  querying if you are going to immediately issue the command to repair
(as that will report an error if you can't do it). 

A typical flow where it might be useful is:
1) Lots of corrected errors reported on a particular part of the memory.
2) OS decides enough is enough, that row/bank/nibble should be replaced.
3) Before doing so it checks it can actually replace it - otherwise maybe we will be disrupting a
   gigantic page or similar where the perf cost of just off lining is higher than we want.
4) After query the page is offlined etc (may or may not be necessary depending on the
   hardware design - we may be able to do it 'live').
5) 'Try' to repair. Hopefully no one raced with us and used up the remaining resources.
  Given this is typically only driven by something like RASDaemon that race should be
  a corner case only (very unlikely)
6) If repair fails can just bring the memory back - but this dance was expensive and
   we will carry on working with less than ideal memory (probably schedule some
   real maintenance to swap out the device).
7) If repair succeeds bring the memory back as now we have shiny new memory.

We could drop the query for now and bring it back later once more of the surrounding
infrastructure becomes clearer.  To me it's a useful feature, but I appreciate
this is early days and we shouldn't always try for all the bells and whistles on
day 1.

> The cat catches fire?
Dog person? :) Just a nice normal error return to indicate no resources.

Jonathan

>