Message-ID: <208647816.nNe6ejF2h0@tjmaciei-mobl5>
Date: Sat, 12 Nov 2022 10:21:35 -0800
From: Thiago Macieira <thiago.macieira@...el.com>
To: Jithu Joseph <jithu.joseph@...el.com>,
Borislav Petkov <bp@...en8.de>
Cc: hdegoede@...hat.com, markgross@...nel.org, tglx@...utronix.de,
mingo@...hat.com, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, gregkh@...uxfoundation.org, ashok.raj@...el.com,
tony.luck@...el.com, linux-kernel@...r.kernel.org,
platform-driver-x86@...r.kernel.org, patches@...ts.linux.dev,
ravi.v.shankar@...el.com, athenas.jimenez.gonzalez@...el.com,
sohil.mehta@...el.com
Subject: Re: [PATCH v2 12/14] platform/x86/intel/ifs: Add current_batch sysfs entry
On Saturday, 12 November 2022 08:26:28 PST Borislav Petkov wrote:
> > Introduce "current_batch" file which accepts a number. Writing a
> > number to the current_batch file would load the test image file by name
> > ff-mm-ss-<xy>.scan, where <xy> is the number written to the
> > "current_batch" file in hex. Range check of the input is done to verify
> > it not greater than 0xff.
>
> Dunno - sounds silly to me. Means one needs to go and look up which
> files are there and echo those batch numbers into sysfs and so on.
Not exactly. That's what this file is there for. It allows the algorithm to
read the current batch file, add 1, then echo back. If the load succeeds, the
batch exists; if not, then the algorithm should simply go back to 0.
That's what we're implementing here:
https://github.com/opendcdiag/opendcdiag/pull/163
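To illustrate that read, add 1, write back loop, here is a minimal Python
sketch. It is not the code from the pull request. It assumes that
current_batch lives in the sysfs directory quoted elsewhere in this thread,
that reading it yields a number int() can parse with base 0, and that a
failed load shows up as a failed write; advance_batch is a hypothetical name.

```python
from pathlib import Path

# Assumption: directory taken from the sysfs path quoted in this thread.
IFS_DIR = Path("/sys/devices/virtual/misc/intel_ifs_0")


def advance_batch(ifs_dir: Path = IFS_DIR) -> int:
    """Try to load the batch after the current one; fall back to batch 0."""
    batch_file = ifs_dir / "current_batch"
    try:
        # Base 0 accepts both "3" and "0x03"; the exact read-back format
        # is an assumption of this sketch.
        current = int(batch_file.read_text().strip(), 0)
    except ValueError:
        current = -1  # nothing parseable was read: start from batch 0

    candidate = current + 1
    if candidate <= 0xFF:  # the patch rejects values greater than 0xff
        try:
            batch_file.write_text(f"{candidate:#x}\n")
            return candidate  # the write was accepted, so the batch exists
        except OSError:
            pass  # load failed: assume there is no such batch

    batch_file.write_text("0\n")  # go back to the first batch
    return 0
```

The caller never needs to list /lib/firmware: the only feedback it uses is
whether the write was accepted.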
> What I would do is make it real trivial for the user so that latter can
> simply do:
>
> for f in $(ls /lib/firmware/intel/ifs_0/*.scan);
> do
> echo $f > /sys/devices/virtual/misc/intel_ifs_0/test_file
> done
>
> and simply supply the full filename.
Unfortunately, there are other limitations that make such a simple algorithm
not possible in the first place.
First, there's the question of the ability to see into /lib/firmware. I'm not a
kernel dev but I'm told that request_firmware() only operates on the root
container's filesystem view. We're expecting that the application may get
deployed as a container (with full privileges so it can write to /sys, sure),
so it won't be able to see the host system's /lib to know what files are
available. It could "guess" at the file names, based on the current processor's
family/model/stepping and a natural number, but that's sub-optimal.
That is, unless the driver were allowed to load any file named by the
application, from its own view of the filesystem, which would permit the
firmware files to be distributed inside the container.
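To make the "guessing" concrete: a sketch of how an application could build
candidate names from the ff-mm-ss-<xy>.scan pattern in the patch description.
Only the pattern and the 0xff limit come from the patch; reading ff, mm and ss
as family, model and stepping in two-digit lowercase hex is my assumption, and
guess_scan_filename is a hypothetical helper.

```python
def guess_scan_filename(family: int, model: int, stepping: int, batch: int) -> str:
    """Build a candidate test image name of the form ff-mm-ss-xy.scan."""
    if not 0 <= batch <= 0xFF:
        raise ValueError("batch must be in the range 0..0xff")
    # Assumption: every field is rendered as two lowercase hex digits.
    return f"{family:02x}-{model:02x}-{stepping:02x}-{batch:02x}.scan"


# Example with made-up CPU identification values.
print(guess_scan_filename(0x06, 0x8F, 0x04, 1))
```

Even with such a helper, the application still cannot tell which of the 256
possible names actually exist on the host without trying to load them.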
Second, for electrical reasons, we expect that certain processor generations
will need a timeout between tests before testing can be done again on a given
core, whether the same batch or the next one. This timeout can be in the
order of many minutes, which is longer than any hyperscaler is willing to
allocate for a system self-test hogging a core or the whole system, just
waiting. For example, let's say that the timeout is 15 minutes and there are 4
batches: this means the whole testing procedure takes one hour, even though
the actual downtime for each core was less than 1 second. This is lost
revenue.
Instead, they wish the next available maintenance window to simply resume
testing at the point where the last one stopped. These windows need not be
scheduled; they can also be opportunistic, when the orchestrator determines
the machine or a subset of one is going to be idle. That's what the algorithm
in the pull request above implements: if the current_batch's result was
"untested", it is attempted again, otherwise it tries the next one, rolling
back to 0 if the loading failed. This removes the need to know anything about
the timeout on the current processor or even whether there is one, or how many
batches there are.
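The decision made at the start of each maintenance window can be sketched as
a small pure function. This is a simplified illustration, not the code from
the pull request: choose_batch and the try_load callback are hypothetical
names, and try_load stands in for "write the number to current_batch and
report whether the load succeeded".

```python
from typing import Callable


def choose_batch(current: int, last_status: str,
                 try_load: Callable[[int], bool]) -> int:
    """Return the batch that should be tested in this maintenance window."""
    if last_status == "untested":
        # The previous attempt did not run (for example, the core was still
        # inside its timeout), so the same batch is attempted again.
        return current

    candidate = current + 1
    if candidate <= 0xFF and try_load(candidate):
        return candidate  # the next batch exists and is now loaded

    # Loading failed or the range is exhausted: roll back to batch 0.
    try_load(0)
    return 0
```

Nothing here depends on the length of the timeout or on the number of
batches; both are discovered implicitly from the "untested" result and from
the failed load.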
> So the kernel would simply open it, sanity-check it, if it passes, it
> would run it - otherwise it would ignore it.
>
> A usability win-win.
--
Thiago Macieira - thiago.macieira (AT) intel.com
Cloud Software Architect - Intel DCAI Cloud Engineering