Message-ID: <2687702.9iZYToFQE1@tjmaciei-mobl5>
Date: Sat, 12 Nov 2022 18:35:23 -0800
From: Thiago Macieira <thiago.macieira@...el.com>
To: Borislav Petkov <bp@...en8.de>, "Luck, Tony" <tony.luck@...el.com>
Cc: "Joseph, Jithu" <jithu.joseph@...el.com>,
"hdegoede@...hat.com" <hdegoede@...hat.com>,
"markgross@...nel.org" <markgross@...nel.org>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
"Raj, Ashok" <ashok.raj@...el.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"platform-driver-x86@...r.kernel.org"
<platform-driver-x86@...r.kernel.org>,
"patches@...ts.linux.dev" <patches@...ts.linux.dev>,
"Shankar, Ravi V" <ravi.v.shankar@...el.com>,
"Jimenez Gonzalez, Athenas" <athenas.jimenez.gonzalez@...el.com>,
"Mehta, Sohil" <sohil.mehta@...el.com>
Subject: Re: [PATCH v2 12/14] platform/x86/intel/ifs: Add current_batch sysfs entry

On Saturday, 12 November 2022 15:32:47 PST Luck, Tony wrote:
> > Because if this is going to be run during downtime, as Thiago says, then
> > you can just as well use debugfs for this. And then there's no need to
> > cast any API in stone and so on.
>
> Did Thiago say “during downtime”? I think
> he talked about some users opportunistic
> use of scan tests. But that’s far from only
> during downtime. We fully expect CSPs to
> run these scans periodically on production
> machines.

Let me clarify. I did not mean full system downtime for maintenance, but I did
mean that there's a gap in consumer workload, for both threads of one or more
cores. As Tony said, it should have little observable effect on any other core,
meaning an IFS run can be scheduled *as* any other workload (albeit a
privileged one) for a subset of the machine, while the rest of the system
remains in production. This allows them a lot of flexibility and is the reason
I am talking about containers, with the implied constraint that the
container's view of the filesystem is narrower than the kernel's.

There'll be some coordination required to get all cores to have run all tests,
but it should be doable over a period of time, and I'm thinking days, not
years. That should still be short enough to detect a defect or wear-out before
any real workload is impacted by it.
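
To make the coordination point concrete, here is a rough sketch of the bookkeeping an orchestrator might do. It is not part of the IFS driver or of this patch series. The class name, the method names and the 7-day window are all made up for illustration, and actually launching a scan on a core is left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CoverageTracker:
    """Hypothetical record of which (core, batch) pairs were scanned recently."""
    cores: list[int]
    batches: list[int]
    # Assumed target window; "days, not years" is all the text above says.
    window: timedelta = timedelta(days=7)
    last_run: dict[tuple[int, int], datetime] = field(default_factory=dict)

    def record(self, core: int, batch: int, when: datetime) -> None:
        """Note that `batch` finished running on `core` at time `when`."""
        self.last_run[(core, batch)] = when

    def pending(self, now: datetime) -> list[tuple[int, int]]:
        """Pairs that were never run, or were run before the window started."""
        cutoff = now - self.window
        return [
            (core, batch)
            for core in self.cores
            for batch in self.batches
            if self.last_run.get((core, batch), datetime.min) < cutoff
        ]

    def next_for(self, idle_cores: set[int], now: datetime) -> dict[int, int]:
        """Pick at most one pending batch for each core that is currently idle."""
        chosen: dict[int, int] = {}
        for core, batch in self.pending(now):
            if core in idle_cores and core not in chosen:
                chosen[core] = batch
        return chosen
```

The idea is that whatever schedules workloads calls next_for() with the cores that have a gap right now, runs the chosen batch there, and calls record() afterwards. Coverage is complete for the window once pending() returns an empty list.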

If an issue is detected, the admin can decide whether to offline the core(s)
reporting problems but keep the rest serving workloads and generating revenue,
or offline the entire machine for full maintenance and to run more invasive and
time-consuming tests.
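
As an illustration of that decision, a minimal sketch follows. The policy function and its 25% threshold are invented for the example, since the real choice is up to the admin. The only kernel interface used is the standard CPU hotplug file, /sys/devices/system/cpu/cpuN/online, and the sketch works on logical CPU numbers, so taking a whole core out means offlining each of its sibling threads.

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")  # standard CPU hotplug location


def decide(failed_cpus, total_cpus, max_fraction=0.25):
    """Hypothetical policy: choose what to do with CPUs that failed a scan.

    Returns (action, cpus), where action is one of "keep", "offline-cpus"
    or "offline-machine". max_fraction is an assumed threshold, not anything
    defined by the IFS driver.
    """
    failed = sorted(set(failed_cpus))
    if not failed:
        return ("keep", [])
    if len(failed) / total_cpus > max_fraction:
        return ("offline-machine", failed)
    return ("offline-cpus", failed)


def offline_cpu(cpu, sysfs_root=CPU_SYSFS):
    """Take one logical CPU offline by writing 0 to its hotplug file.

    Needs root. sysfs_root is a parameter only so the function can be
    exercised against a scratch directory.
    """
    (Path(sysfs_root) / f"cpu{cpu}" / "online").write_text("0\n")


def apply_decision(failed_cpus, total_cpus, sysfs_root=CPU_SYSFS):
    """Offline the failing CPUs if the policy says so; return the action."""
    action, cpus = decide(failed_cpus, total_cpus)
    if action == "offline-cpus":
        for cpu in cpus:
            offline_cpu(cpu, sysfs_root)
    return action
```

An "offline-machine" result is deliberately left to the caller: draining the node and scheduling full maintenance belongs to the fleet manager, not to a script on the box.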

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
Cloud Software Architect - Intel DCAI Cloud Engineering