Date:   Tue, 15 Mar 2022 16:10:59 +0000
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Greg KH <gregkh@...uxfoundation.org>
CC:     "Joseph, Jithu" <jithu.joseph@...el.com>,
        "hdegoede@...hat.com" <hdegoede@...hat.com>,
        "markgross@...nel.org" <markgross@...nel.org>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "bp@...en8.de" <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        "corbet@....net" <corbet@....net>,
        "andriy.shevchenko@...ux.intel.com" 
        <andriy.shevchenko@...ux.intel.com>,
        "Raj, Ashok" <ashok.raj@...el.com>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "platform-driver-x86@...r.kernel.org" 
        <platform-driver-x86@...r.kernel.org>,
        "patches@...ts.linux.dev" <patches@...ts.linux.dev>,
        "Shankar, Ravi V" <ravi.v.shankar@...el.com>,
        "Williams, Dan J" <dan.j.williams@...el.com>
Subject: RE: [RFC 00/10] Introduce In Field Scan driver

> Again, I have no idea what you are doing at all with this driver, nor
> what you want to do with it.
>
> Start over please.

TL;DR: silicon ages, and some of the things that break don't have parity/ECC checks.
So systems start behaving erratically. If you are lucky they crash. If you are less lucky
they give incorrect results.

There's a paper (and even an 11-minute video) that describes the research by
Google on this.
https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf  
(https://www.youtube.com/watch?v=QMF3rqhjYuM)    

> What is the hardware you have to support?

The feature is first available in Sapphire Rapids (Xeon, coming later this year).

> What is the expectation from userspace with regards to using the
> hardware?

Expectation from users is that they can run these tests frequently (many times
per day) to catch silicon that has developed faults quickly and take action to
isolate the cores that have issues.

On HT-enabled systems both threads that share a core need to be put into
test mode together. The current version of the tests takes around 50 milliseconds
(so many workloads don't need much prep; those with high sensitivity
to latency would need to do some additional userspace task binding to make
sure those workloads were moved to another core while the h/w test runs).
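
To illustrate the sibling requirement, here is a rough userspace sketch of how one
might find which logical CPUs share a core, so that latency-sensitive tasks can be
moved off all of them before a test. It reads the standard sysfs topology file
thread_siblings_list. The helper names parse_cpu_list and core_siblings are
hypothetical and are not part of the proposed driver.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel cpulist string such as "0,56" or "2-3" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def core_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return all logical CPUs sharing a physical core with `cpu` (hypothetical helper)."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())
```

Every CPU returned by core_siblings(n) would be unavailable to normal work for the
duration of a test on that core.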

There are three outcomes from running a test:

1) The test passes all stages.
2) The test did not complete (for a variety of reasons, e.g. power states)
3) The test indicates failure. Recommendation is to run one more time in case
    the failure was transient, e.g. caused by a neutron/alpha strike.
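
A minimal sketch of how a userspace policy might act on those three outcomes. Everything
here is an assumption for illustration: the Outcome enum, the check_core name, the
run_test callback (standing in for whatever interface the driver ends up exposing), the
retry limit for incomplete runs, and the choice to treat a pass after one failure as
transient.

```python
from enum import Enum


class Outcome(Enum):
    PASS = 1        # test passed all stages
    INCOMPLETE = 2  # test did not complete (e.g. power states)
    FAIL = 3        # test indicated failure


def check_core(run_test, max_incomplete_retries=3):
    """Run the test via the hypothetical run_test() callback and classify the core.

    Returns "healthy", "isolate" or "unknown".
    """
    failures = 0
    incompletes = 0
    while True:
        outcome = run_test()
        if outcome is Outcome.PASS:
            return "healthy"
        if outcome is Outcome.FAIL:
            failures += 1
            if failures >= 2:
                return "isolate"  # failed again on the re-run
            continue              # run one more time in case it was transient
        incompletes += 1
        if incompletes > max_incomplete_retries:
            return "unknown"      # assumed limit; try again later
```

For example, a run sequence of FAIL followed by PASS yields "healthy", while two
consecutive FAIL results yield "isolate".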

-Tony
