linux-kernel - Re: [PATCH 09/10] MCE: run through processors with more severe problems first

Message-ID: <BANLkTi=sLTFtVKdHryRiPvKfj1_Pzc3AEA@mail.gmail.com>
Date:	Mon, 13 Jun 2011 20:04:58 -0700
From:	Tony Luck <tony.luck@...el.com>
To:	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Borislav Petkov <bp@...64.org>,
	linux-kernel@...r.kernel.org, "Huang, Ying" <ying.huang@...el.com>,
	Avi Kivity <avi@...hat.com>
Subject: Re: [PATCH 09/10] MCE: run through processors with more severe
 problems first

2011/6/13 Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>:
> BTW in case of "no_way_out" events, we don't clear banks because they
> could be carried over to the next boot (expecting logged as bootlog).
> So we may need to have some trick for some known cases; e.g. ignore
> observed AR by bystanders, anyway.

Yes. The overall plan is that we should leave the machine check banks
alone for fatal errors (so that the BIOS, or the next OS after the
reset, can do something with them). Non-fatal errors can be handled,
logged and cleared.  But this leaves us in a pickle if we initially
think we can handle an error, and later decide that we can't.  Leaving
errors for too long in the machine check banks has its own problems
too - overwrite rules mean that two errors in the same bank, each
non-fatal on its own, may become a fatal error for the OS.
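
A minimal sketch of the overwrite problem, as a toy model rather than
kernel code. The three bit positions follow the architectural
MCi_STATUS layout; the toy_* names and the severity grades are made up
for illustration, and the overwrite rule is simplified (the SDM's real
table has more cases). It shows how a bank left uncleared can turn two
individually recoverable errors into one the OS has to treat as fatal,
because OVER says information was lost.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the architectural MCi_STATUS register layout. */
#define MCI_STATUS_VAL  (1ULL << 63)  /* bank holds a valid error */
#define MCI_STATUS_OVER (1ULL << 62)  /* another error arrived and was lost */
#define MCI_STATUS_UC   (1ULL << 61)  /* uncorrected error */

/* Hypothetical grades, for this sketch only. */
enum toy_severity { TOY_NONE, TOY_CORRECTED, TOY_RECOVERABLE, TOY_FATAL };

/*
 * Toy model of hardware logging a new error into a bank.
 * Simplified rule: an empty bank just takes the error; an occupied
 * bank gets OVER set, and an uncorrected error replaces a corrected
 * one while otherwise the first error is kept.
 */
static uint64_t toy_log_error(uint64_t bank, bool uncorrected)
{
	uint64_t incoming = MCI_STATUS_VAL | (uncorrected ? MCI_STATUS_UC : 0);

	if (!(bank & MCI_STATUS_VAL))
		return incoming;               /* bank was clear */

	if ((bank & MCI_STATUS_UC) || !uncorrected)
		return bank | MCI_STATUS_OVER; /* keep first, note the loss */

	return incoming | MCI_STATUS_OVER;     /* UC overwrites corrected */
}

/* Assumed grading: an uncorrected error with lost information is fatal. */
static enum toy_severity toy_grade(uint64_t bank)
{
	if (!(bank & MCI_STATUS_VAL))
		return TOY_NONE;
	if (!(bank & MCI_STATUS_UC))
		return TOY_CORRECTED;
	if (bank & MCI_STATUS_OVER)
		return TOY_FATAL;
	return TOY_RECOVERABLE;
}
```

In this model, one uncorrected error grades as recoverable, but a second
one logged before the bank is cleared grades as fatal.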

>> +     u64     mask = MCG_STATUS_MCIP;
>
> Why do you check the MCG_STATUS_MCIP too here?
> What happens if there is a problematic cpu that could not read
> MCG register properly so indicates "PANIC with !MCIP"?

You figured out the answer later - but perhaps I should have given better
clues in the comments. I think that the !MCIP panic is a "can't
happen" case.

>> +             cpu = cpumask_next(cpu, cpu_possible_mask);
>
> possible? online?

The old code has "for_each_possible_cpu" when scanning through
mces_seen - and I didn't want to change this functionality at this
point.
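
The idea in the subject line, combined with the MCIP check discussed
above, can be sketched roughly as follows. This is not the code from
the patch: struct toy_seen and toy_pick_next() are hypothetical
stand-ins for the per-cpu mces_seen records and the selection loop,
and a plain array index stands in for walking cpu_possible_mask. Only
the MCG_STATUS_MCIP bit position is taken from the architecture.

```c
#include <stdint.h>

#define MCG_STATUS_MCIP (1ULL << 2)  /* machine check in progress */

/* Hypothetical stand-in for what each cpu recorded on check-in. */
struct toy_seen {
	uint64_t mcgstatus;  /* MCG_STATUS value seen by that cpu */
	int severity;        /* worst severity that cpu found */
	int handled;         /* already had its turn */
};

/*
 * Scan every possible cpu and return the one that checked in
 * (MCIP set), has not yet been handled, and reported the highest
 * severity. Returns -1 when nobody is left. Slots without MCIP are
 * skipped, which is how cpus that never arrived drop out of the scan.
 */
static int toy_pick_next(struct toy_seen *seen, int ncpus)
{
	int best = -1;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (seen[cpu].handled)
			continue;
		if (!(seen[cpu].mcgstatus & MCG_STATUS_MCIP))
			continue;
		if (best < 0 || seen[cpu].severity > seen[best].severity)
			best = cpu;
	}
	if (best >= 0)
		seen[best].handled = 1;
	return best;
}
```

Calling toy_pick_next() repeatedly yields the checked-in cpus from most
to least severe, regardless of whether the scan covers possible or
online cpus, as long as stale slots have MCIP clear.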

> Ah, I guess you assumed that all cpus checked in should have
> mces_seen with MCIP while offline cpus have cleaned mces_seen.
>
> Though we know there might be races with cpu hotplug, now we
> already use num_online_cpus() in this rendezvous mechanism,
> it is OK to use cpu_online_mask here at the moment, I think.
>
> Or we should invent new, better rendezvous mechanism...

Eventually we need something better. Currently we may do some
very strange things if someone has taken cpus offline (since they
will still arrive at the rendezvous and we'll get more than
num_online_cpus() showing up).  Ditto if someone hot-added another cpu
board but hasn't yet brought the cpus online. Or if we booted with
fewer than all cpus via a kernel command line argument. Etc.
Unfortunately I don't have good ideas on how to do this better -
ideally we'd have some very small time interval in which to expect
that cpus would arrive at the handler. But the SDM gives no guidance
on this.
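
To make the counting problem concrete, here is a small sketch under the
assumption that the rendezvous compares an arrival counter against an
expected count taken from num_online_cpus(). The enum and toy_check()
are hypothetical names for illustration, not existing kernel
interfaces, and the timeout is the kind of bounded wait wished for
above, not something the SDM specifies.

```c
/* Hypothetical outcomes of polling the rendezvous state. */
enum toy_rendezvous {
	TOY_WAITING,    /* still expecting more cpus */
	TOY_COMPLETE,   /* exactly the expected number arrived */
	TOY_TIMED_OUT,  /* gave up waiting for stragglers */
	TOY_OVERRUN     /* more cpus arrived than we counted on */
};

/*
 * arrived:    cpus that have entered the handler so far
 * expected:   the count we planned for (e.g. online cpus)
 * waited_ns / timeout_ns: elapsed and maximum wait, in nanoseconds
 */
static enum toy_rendezvous toy_check(int arrived, int expected,
				     long waited_ns, long timeout_ns)
{
	if (arrived > expected)
		return TOY_OVERRUN;
	if (arrived == expected)
		return TOY_COMPLETE;
	if (waited_ns >= timeout_ns)
		return TOY_TIMED_OUT;
	return TOY_WAITING;
}
```

With 8 cpus physically present and 6 online, this reports complete as
soon as 6 have arrived, and the remaining 2 then push it into the
overrun case; a test for strict equality would never see that state
coming.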

-Tony