linux-kernel - Re: [PATCH] x86/MCE: Remove MCP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20161107180853.4uxlvtoychzhwr2q@pd.tnic>
Date:   Mon, 7 Nov 2016 19:08:54 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     linux-edac <linux-edac@...r.kernel.org>, X86 ML <x86@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] x86/MCE: Remove MCP_TIMESTAMP

On Mon, Nov 07, 2016 at 05:48:46PM +0000, Luck, Tony wrote:
> > So, get rid of all that and simply log an MCE with a TSC value always.
> > Simplifies the code a bit too.
> 
> I'm not necessarily opposed to this ... but there was once some logic behind when
> logged TSC, and when we didn't.  Essentially we wanted the TSC when we were
> logging from #CMCI or #MC .... because the detection of the error was fresh, and
> wanted as much precision on the logged time as possible to compare with logged
> errors from other banks/cpus. This might allow us to distinguish multiple errors logged
> in the same #CMCI, from errors logged in separate #CMCI a tenth of a second apart.
> 
> If we found the error while polling, we didn’t want to provide a false sense of precision.
> The error could have been logged up to five minutes previously (or when logging
> errors during the initial poll of the banks an arbitrary time in the past).

Right, looks like we've lost that logic:

Functions calling this function: machine_check_poll

  File         Function                  Line
0 mce-inject.c raise_poll                  57 machine_check_poll(0, &b);
1 mce.c        mce_timer_fn              1358 machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
2 mce.c        __mcheck_cpu_init_generic 1508 machine_check_poll(MCP_UC | m_fl, &all_banks);
3 mce_intel.c  mce_intel_cmci_poll        133 if (machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned)))
4 mce_intel.c  intel_threshold_interrupt  253 machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
5 mce_intel.c  cmci_recheck               345 machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));

So the TSC timestamp will be possibly inexact now in mce_timer_fn(),
__mcheck_cpu_init_generic(), mce_intel_cmci_poll() and cmci_recheck().

Should we bother and add a flag to struct mce - maybe somewhere in the
padding __u8 pad; - to denote that the logged TSC may not be exact?

Mind you, there's also

	m->time = get_seconds();

which also collects time and which could also be possibly inexact.

One other possibility would be to use ->time and write ->tsc *only*
when exact - i.e., in the handler - and this is then enough info about
timing.

->time will give you somewhere around where it happened and ->tsc - only
if set - will give you exact, well, *timestamp* :)

This sounds like a pretty straightforward logic to me...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.