linux-kernel - Re: [PATCH 1/3] x86/MCE/AMD: Provide an "Unknown" MCA bank type

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Ya+LukojuewlomeF@yaz-ubuntu>
Date:   Tue, 7 Dec 2021 16:28:42 +0000
From:   Yazen Ghannam <yazen.ghannam@....com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
        tony.luck@...el.com, x86@...nel.org,
        Smita.KoralahalliChannabasappa@....com, mukul.joshi@....com,
        alexander.deucher@....com, william.roche@...cle.com
Subject: Re: [PATCH 1/3] x86/MCE/AMD: Provide an "Unknown" MCA bank type

On Fri, Dec 03, 2021 at 11:17:45PM +0100, Borislav Petkov wrote:
> On Fri, Dec 03, 2021 at 02:00:15AM +0000, Yazen Ghannam wrote:
> > The AMD MCA Thresholding sysfs interface populates directories for each
> > bank and thresholding block. The name used for each directory is looked
> > up in a table of known bank types. However, new bank types won't match
> > in this list and will return NULL for the name. This will cause the
> > machinecheck sysfs interface to fail to be populated.
> > 
> > Set new and unknown MCA bank types to the "unknown" type. Also,
> > ensure that the bank's thresholding block directories have unique names.
> > This will ensure that the machinecheck sysfs interface can be
> > initialized.
> 
> What is the advantage of having a sysfs directory structure headed with
> an "unknown" entry vs not having that structure at all when the kernel
> runs on a machine for which it has not been enabled yet?
> 
> IOW, if those new banks would need additional enablement, what's the
> point of having "unknown" on older kernels which do not have any
> functionality?
> 
> IOW, how does this:
> 
> /sys/devices/system/machinecheck/machinecheck0/unknown/unknown/
> ├── error_count
> ├── interrupt_enable
> └── threshold_limit
> 
> help a user?

Yeah, I see your point.

> 
> Btw, looking at the current layout:
> 
> ...
> ├── insn_fetch
> │   └── insn_fetch
> │       ├── error_count
> │       ├── interrupt_enable
> │       └── threshold_limit
> ├── l2_cache
> │   └── l2_cache
> │       ├── error_count
> │       ├── interrupt_enable
> │       └── threshold_limit
> ...
> 
> we have those names repeated which looks wonky and useless too. I'd
> expect them to be:
> 
> ...
> ├── insn_fetch
> │   ├── error_count
> │   ├── interrupt_enable
> │   └── threshold_limit
> ├── l2_cache
> │   ├── error_count
> │   ├── interrupt_enable
> │   └── threshold_limit
> ...
> 
> Can we fix that too pls?
> 

Sure thing. But I don't think removing the second directory will be okay. The
layout is "bank"/"block". If the "block" has special use like DRAM ECC, or L3
Cache on older systems, then it'll have a unique name. Otherwise, the block
will take the name of the bank.

I think the more robust solution is to drop the unique names and use generic
names like "bank"/"block". A new file called "type" can be introduced into the
directory structure, and this can return the name of the bank/block. New bank
types will return "<null>" for the "type", but the directory structure should
remain the same and functional.

I've seen this in other sysfs interfaces like cpuidle,
e.g. /sys/devices/system/cpu/cpu0/cpuidle/stateX

The "blockX/type" file is like the "stateX/desc" file. Or the "type" file can
be called "desc", since it's a description of what the bank or block
represent.

Here are a couple of examples:

/sys/devices/system/machinecheck/machinecheck0/
├── th_bank0
│   ├── type ("Instruction Fetch")
│   └── th_block0
│       ├── type ("All Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
├── th_bank1
│   ├── type ("Northbridge")
│   ├── th_block0
│   │   ├── type ("DRAM Errors")
│   │   ├── error_count
│   │   ├── interrupt_enable
│   │   └── threshold_limit
│   └── th_block1
│       ├── type ("Link Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
...

OR

/sys/devices/system/machinecheck/machinecheck0/thresholding
├── bank0
│   ├── desc ("Instruction Fetch")
│   └── block0
│       ├── desc ("All Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
├── bank1
│   ├── desc ("Northbridge")
│   ├── block0
│   │   ├── desc ("DRAM Errors")
│   │   ├── error_count
│   │   ├── interrupt_enable
│   │   └── threshold_limit
│   └── block1
│       ├── desc ("Link Errors")
│       ├── error_count
│       ├── interrupt_enable
│       └── threshold_limit
...

I'm inclined to the second option, since it keeps all the thresholding
functionality under a single directory.

What do you think?

Thanks,
Yazen