Message-ID: <6786bc792c66a_20f3294ce@dwillia2-xfh.jf.intel.com.notmuch>
Date: Tue, 14 Jan 2025 11:35:21 -0800
From: Dan Williams <dan.j.williams@...el.com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>, Dan Williams
<dan.j.williams@...el.com>
CC: Borislav Petkov <bp@...en8.de>, Shiju Jose <shiju.jose@...wei.com>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "tony.luck@...el.com" <tony.luck@...el.com>,
"rafael@...nel.org" <rafael@...nel.org>, "lenb@...nel.org" <lenb@...nel.org>,
"mchehab@...nel.org" <mchehab@...nel.org>, "dave@...olabs.net"
<dave@...olabs.net>, "dave.jiang@...el.com" <dave.jiang@...el.com>,
"alison.schofield@...el.com" <alison.schofield@...el.com>,
"vishal.l.verma@...el.com" <vishal.l.verma@...el.com>, "ira.weiny@...el.com"
<ira.weiny@...el.com>, "david@...hat.com" <david@...hat.com>,
"Vilas.Sridharan@....com" <Vilas.Sridharan@....com>, "leo.duran@....com"
<leo.duran@....com>, "Yazen.Ghannam@....com" <Yazen.Ghannam@....com>,
"rientjes@...gle.com" <rientjes@...gle.com>, "jiaqiyan@...gle.com"
<jiaqiyan@...gle.com>, "Jon.Grimm@....com" <Jon.Grimm@....com>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"naoya.horiguchi@....com" <naoya.horiguchi@....com>, "james.morse@....com"
<james.morse@....com>, "jthoughton@...gle.com" <jthoughton@...gle.com>,
"somasundaram.a@....com" <somasundaram.a@....com>, "erdemaktas@...gle.com"
<erdemaktas@...gle.com>, "pgonda@...gle.com" <pgonda@...gle.com>,
"duenwen@...gle.com" <duenwen@...gle.com>, "gthelen@...gle.com"
<gthelen@...gle.com>, "wschwartz@...erecomputing.com"
<wschwartz@...erecomputing.com>, "dferguson@...erecomputing.com"
<dferguson@...erecomputing.com>, "wbs@...amperecomputing.com"
<wbs@...amperecomputing.com>, "nifan.cxl@...il.com" <nifan.cxl@...il.com>,
tanxiaofei <tanxiaofei@...wei.com>, "Zengtao (B)" <prime.zeng@...ilicon.com>,
Roberto Sassu <roberto.sassu@...wei.com>, "kangkang.shen@...urewei.com"
<kangkang.shen@...urewei.com>, wanghuiqiang <wanghuiqiang@...wei.com>,
Linuxarm <linuxarm@...wei.com>
Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature

Jonathan Cameron wrote:
> On Fri, 10 Jan 2025 14:49:03 -0800
> Dan Williams <dan.j.williams@...el.com> wrote:
[..]
> > This is where you lose me. The error record is a point in time snapshot
> > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
> > model for memory operations is based on coordinating with the kernel's
> > understanding of how that SPA is *currently* being used.
>
> Whilst it is being used I agree. Key is to only do disruptive / data
> changing actions when it is not being used.
Sure.
> > The kernel can not just take userspace's word for it that potentially
> > data changing, or temporary loss-of-use operations are safe to execute
> > just because once upon a time userspace saw an error record that
> > implicated a given SPA in the past, especially over reboot.
>
> There are two cases (discoverable from hardware)
>
> 1) Non disruptive. No security concern as the device guarantees to
> not interrupt traffic and the memory contents are copied to the new
> location. Basically software never knows it happened.
> 2) Disruptive. We only allow this if the memory is offline. In the CXL case
> the CXL specific code must check no memory on the device is online so
> we aren't disrupting anything. The other implementation we have code
> for (will post after this lands) has finer granularity constraints and only
> the page needs to be offline.
> As it is offline the content is not preserved anyway. We may need to add extra
> constraints along with future support for temporal persistence / sharing but
> we can do that as part of adding that support in general.
> (Personally I think in those cases memory repair is a job for the out of
> band management anyway).
>
> In neither case am I seeing a security concern. Am I missing something?
s/security/system-integrity/
1/ Hardware engineers may have a different definition of "non-disruptive"
than software. See the history around hibernate_quiet_exec() to work
around the disruption of latency spikes. If this is poorly specified
across vendors we are going to wish that we did not build a "take
userspace's word for it" interface.
2/ Yes, if the device is not actively decoding any in use memory feel
free to run destructive operations on the device. However, is sysfs the
right interface for "submit multi-parameter atomic operation with
transient result"? I lean heavily into sysfs, but ioctl and netlink have
a role to play in scenarios like this. Otherwise userspace can inject
error records back into the kernel with the expectation that the kernel
can only accept the DIMM address and not any of the translation data
which might be stale.
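
To make the "multi-parameter atomic operation" point concrete, here is a
minimal sketch (purely illustrative, not an existing or proposed UAPI) of
a single-shot request where the repair parameters and the error record
they were derived from travel together, so the kernel can validate them
as a unit instead of trusting a sequence of independent sysfs stores:

#include <linux/types.h>

/* Hypothetical layout, e.g. carried by one ioctl() or netlink message;
 * all field names are illustrative. */
struct mem_repair_request {
	__u64 dpa;		/* device physical address to repair */
	__u32 bank;		/* topology hints from the error record */
	__u32 rank;
	__u32 row;
	__u32 column;
	__u64 record_id;	/* which error record this cites */
	__u64 record_time;	/* lets the kernel judge staleness */
	__u32 flags;		/* e.g. acknowledge offline-only repair */
	__u32 reserved;
};

The transport is not the point; the point is that the kernel sees the
whole request and its provenance in one operation.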
[..]
> > Again, the repair control assumes that the kernel can just trust
> > userspace to get it right. When the kernel knows the SPA implications it
> > can add safety like "you are going to issue sparing on deviceA that will
> > temporarily take deviceA offline. CXL subsystem tells me deviceA is
> > interleaved with deviceB in SPA so the whole SPA range needs to be
> > offline before this operation proceeds". That is not something that
> > userspace can reliably coordinate.
>
> Absolutely the kernel has to enforce this. Same way we protect against
> poison injection in some cases. Right now the enforcement is slightly
> wrong (Shiju is looking at it again) as we were enforcing at the wrong
> granularity (specific dpa, not device). Identifying that hole is a good
> outcome of this discussion making us take another look.
>
> Enforcing this is one of the key jobs of the CXL specific driver.
> We considered doing it in the core, but the granularity differences
> between our initial few examples meant we decided on specific driver
> implementations of the checks for now.
Which specific driver? Is this not just a callback provided via the EDAC
registration interface to say "sparing allowed"?
Yes, this needs to avoid the midlayer mistake, but I expect more CXL
memory exporting devices can live with the CXL core's determination of
whether HDM decode is live or not.
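
As a minimal sketch of the callback idea (names are illustrative, not the
interface proposed in this series): the endpoint driver registers a
predicate, the core refuses a disruptive repair unless it returns true,
and for CXL that predicate is backed by the core's knowledge of whether
any HDM decode on the device is live:

struct device;

struct mem_repair_ops {
	/*
	 * Called before a disruptive repair is issued; return true only
	 * if no in-use memory is currently decoded by this device (for
	 * CXL, no committed HDM decoder backing an online region).
	 */
	bool (*repair_allowed)(struct device *dev, unsigned long long dpa);
};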
> > > > 3/ What if the device does not use DDR terminology / topology terms for
> > > > repair?
> > >
> > > Then we provide the additional interfaces assuming they correspond to well
> > > known terms. If someone is using a magic key then we can get grumpy
> > > with them, but that can also be supported.
> > >
> > > Mostly I'd expect a new technology to overlap a lot of the existing
> > > interface and maybe add one or two more; which layer in the stack for
> > > HBM for instance.
> >
> > The concern is the assertion that sysfs needs to care about all these
> > parameters vs an ABI that says "repair errorX". If persistence and
> > validity of error records is the concern lets build an ABI for that and
> > not depend upon trust in userspace to properly coordinate memory
> > integrity concerns.
>
> It doesn't have to. It just has to ensure that the memory device is in the correct
> state. So check, not coordinate. At a larger scale, coordination is already doable
> (subject to races that we must avoid by holding locks), tear down the regions
> so there are no mappings on the device you want to repair. Don't bring them
> up again until after you are done.
>
> The main use case is probably do it before you bring the mappings up, but
> same result.
Agree.
>
> >
> > >
> > > The main alternative is where the device takes an HPA / SPA / DPA. We have one
> > > driver that does that queued up behind this series that uses HPA. PPR uses
> > > DPA. In that case userspace first tries to see if it can repair by HPA then
> > > DPA and if not moves on to see if it can use the fuller description.
> > > We will see devices supporting HPA / DPA (which to use depends on when you
> > > are doing the operation and what has been configured) but mostly I'd expect
> > > either HPA/DPA or fine grained on a given repair instance.
> > >
> > > HPA only works if the address decoders are always configured (so not on CXL).
> > > What is actually happening in that case is typically that a firmware is
> > > involved that can look up address decoders etc, and map the control HPA
> > > to Bank / row etc to issue the actual low level commands. This keeps
> > > the memory mapping completely secret rather than exposing it in error
> > > records.
> > >
> > > >
> > > > I expect the flow rasdaemon would want is that the current PFA (leaky
> > > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > > > has performed exceeds some threshold and it wants to attempt to repair
> > > > memory.
> > >
> > > Sparing may happen prior to the point where we'd have done a soft offline
> > > if non disruptive (whether it is can be read from another bit of the
> > > ABI). Memory repair might be much less disruptive than soft-offline!
> > > I rather hope memory manufacturers build that, but I'm aware of at least
> > > one case where they didn't and the memory must be offline.
> >
> > That's a good point, spare before offline makes sense.
>
> If transparent and resources not constrained.
> Very much not if we have to tear down the memory first.
>
> >
> > [..]
> > > However, there are other usecases where this isn't needed which is why
> > > that isn't a precursor for this series.
> > >
> > > Initial enablement targets two situations:
> > > 1) Repair can be done in non disruptive way - no need to soft offline at all.
> >
> > Modulo needing to quiesce access over the sparing event?
>
> Absolutely. This is only doable in devices that don't need to quiesce.
>
> >
> > > 2) Repair can be done at boot before memory is onlined or on admin
> > > action to take the whole region offline, then repair specific chunks of
> > > memory before bringing it back online.
> >
> > Which is userspace racing the kernel to online memory?
>
> If you are doing this scheme you don't automatically online memory. So
> both are in userspace control and can be easily sequenced.
> If you aren't auto onlining then buy devices with hard PPR and do it by offlining
> manually, repairing and rebooting. Or buy devices that don't need to quiesce
> and cross your fingers the dodgy ram doesn't throw an error before you get
> that far. Little choice if you decide to online right at the start as normal
> memory.
>
> >
> > > > So, yes, +1 to simpler for now where software effectively just needs to
> > > > deal with a handful of "region repair" buttons and the semantics of
> > > > those are coarse and sub-optimal. Wait for a future where a tool author
> > > > says, "we have had good success getting bulk offlined pages back into
> > > > service, but now we need this specific finer grained kernel interface to
> > > > avoid wasting spare banks prematurely".
> > >
> > > Depends on where you think that interface is. I can absolutely see that
> > > as a control to RAS Daemon. Option 2 above, region is offline, repair
> > > all dodgy looking fine grained buckets.
> > >
> > > Note though that a suboptimal repair may mean permanent use of very rare
> > > resources. So there needs to be a control at the finest granularity as well.
> > > Which order those get added to userspace tools doesn't matter to me.
> > >
> > > If you mean that interface in kernel it brings some non trivial requirements.
> > > The kernel would need all of:
> > > 1) Tracking interface for all error records so the reverse map from region
> > > to specific bank / row etc is available for a subset of entries. The
> > > kernel would need to know which of those are important (soft offline
> > > might help in that use case, otherwise that means decision algorithms
> > > are in kernel or we have fine grained queue for region repair in parallel
> > > with soft-offline).
> > > 2) A way to inject the reverse map information from a userspace store
> > > (to deal with reboot etc).
> >
> > Not a way to inject the reverse map information, a way to inject the
> > error records and assert that memory topology changes have not
> > invalidated those records.
>
> There is no way to tell that the topology hasn't changed.
> For the reasons above, I don't think we care. Instead of trying to stop
> userspace repairing the wrong memory, make sure it is safe for it to do that.
> (The kernel is rarely in the business of preventing the slightly stupid)
If the policy is "error records with SPA from the current boot can be
trusted" and "userspace requests based on error records from outside the
current boot must only be submitted against known-offline memory", then I
think we are aligned.
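
Spelling that policy out as a sketch (illustrative names only, not code
from this series): a record captured during the current boot is trusted
to describe the current mapping; anything older may only drive repair of
memory the kernel can verify is offline.

#include <linux/types.h>

struct repair_record {
	u64 boot_id;	/* boot that produced the record */
	u64 spa;	/* SPA as recorded at that time */
};

static bool repair_record_usable(const struct repair_record *rec,
				 u64 current_boot_id,
				 bool target_memory_offline)
{
	if (rec->boot_id == current_boot_id)
		return true;		/* live record, mapping still valid */
	return target_memory_offline;	/* stale record: offline memory only */
}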
> > > That sounds a lot harder to deal with than relying on the userspace program
> > > that already does the tracking across boots.
> >
> > I am stuck behind the barrier of userspace must not assume it knows
> > better than the kernel about the SPA impact of a DIMM sparing
> > event. The kernel needs evidence either live records from within the
> > same kernel boot or validated records from a previous boot.
>
> I think this is the wrong approach. The operation must be 'safe'.
> With that in place we absolutely can let userspace assume it knows better than
> the kernel.
Violent agreement? The operation must be safe, yes; the next question is
what the criteria are for kernel management of that safety. Offline-only
repair is a great place to be.