linux-kernel - RE: [PATCH] x86/mce: Dynamically size space for machine check records

Message-ID: <SJ1PR11MB6083F8541423267B852C510CFC212@SJ1PR11MB6083.namprd11.prod.outlook.com>
Date: Wed, 6 Mar 2024 22:07:38 +0000
From: "Luck, Tony" <tony.luck@...el.com>
To: "Naik, Avadhut" <avadnaik@....com>, Borislav Petkov <bp@...en8.de>
CC: "Mehta, Sohil" <sohil.mehta@...el.com>, "x86@...nel.org" <x86@...nel.org>,
	"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"yazen.ghannam@....com" <yazen.ghannam@....com>, Avadhut Naik
	<avadhut.naik@....com>
Subject: RE: [PATCH] x86/mce: Dynamically size space for machine check records

> > +   mce_numrecords = max(80, num_possible_cpus() * 4);
>
> Per Boris's below suggestion, shouldn't this be:
>       mce_numrecords = max(80, num_possible_cpus() * 16);
>
> >>    min(4*PAGE_SIZE, num_possible_cpus() * PAGE_SIZE);
> >
> > max() ofc.
> >
> >> There's a sane minimum and one page pro logical CPU should be fine on
> >> pretty much every configuration...
>
> 4 MCE records per CPU equates to 1024 bytes, considering the genpool intrinsic
> behavior you explained in the other subthread.

Picking a good number of records per CPU may be more art than science. Boris
is right that a page per CPU shouldn't cause any significant issue to systems with
many CPUs, because they should have copious amounts of memory to make a
balanced configuration. But 16 records per CPU feels way too high to me. The
theoretical limit in a single scan of machine check banks on Intel is 32 (since
Intel never has more than 32 banks). But those banks cover diverse h/w devices
and it seems improbable that all, or even most, of them would log errors at the
same time, with all CPUs on all sockets doing the same.

After I posted the version with num_possible_cpus() * 4 I began to wonder whether
"2" would be enough.
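
To make the trade-off concrete, here is a minimal user-space sketch of the sizing arithmetic under discussion. It is not the patch itself. The helper names are hypothetical, and the 256 bytes of pool space per record is an assumption inferred from the "4 MCE records per CPU equates to 1024 bytes" remark quoted above. The floor of 80 records and the per-CPU multipliers (2, 4, 16) are the values mentioned in this thread.

```c
#include <stddef.h>

/* Assumption: each record consumes 256 bytes of pool space, inferred from
 * "4 MCE records per CPU equates to 1024 bytes" in the quoted text. */
#define RECORD_POOL_BYTES 256u

/* Minimum number of records from the proposed max(80, ...) expression. */
#define MIN_RECORDS 80u

/* Hypothetical helper mirroring max(80, num_possible_cpus() * N). */
unsigned int mce_numrecords_for(unsigned int possible_cpus,
                                unsigned int records_per_cpu)
{
    unsigned int scaled = possible_cpus * records_per_cpu;

    return scaled > MIN_RECORDS ? scaled : MIN_RECORDS;
}

/* Pool space in bytes implied by that record count (hypothetical helper). */
size_t mce_pool_bytes_for(unsigned int possible_cpus,
                          unsigned int records_per_cpu)
{
    return (size_t)mce_numrecords_for(possible_cpus, records_per_cpu) *
           RECORD_POOL_BYTES;
}
```

Under these assumptions, a 256-CPU system would get 512 records (128 KiB) at 2 per CPU, 1024 records (256 KiB) at 4 per CPU, and 4096 records (1 MiB, i.e. one 4 KiB page per CPU) at 16 per CPU. The floor of 80 records only matters on small systems: it applies up to 40 CPUs at 2 per CPU, up to 20 CPUs at 4 per CPU, and up to 5 CPUs at 16 per CPU.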

> Apart from this, tested the patch on a couple of AMD systems. Didn't observe any
> issues.

Thanks very much for testing.

-Tony