linux-kernel - Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check

Date:	Fri, 11 Jul 2014 22:10:12 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Havard Skinnemoen <hskinnemoen@...gle.com>
Cc:	Tony Luck <tony.luck@...il.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Ewout van Bekkum <ewout@...gle.com>,
	linux-edac <linux-edac@...r.kernel.org>
Subject: Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for
 small check_interval values.

I'm going to reply with multiple mails so that we can keep the things
separate and not let replies grow out of proportion.

On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
> So a short burst of CMCIs would send us instantly into polling mode,
> which would probably be suboptimal if things are quiet after that.
> Counting is a lot more robust against this.

Yes, but CMCI_STORM_THRESHOLD is arbitrary too. How is getting 15 CMCIs
per second an interrupt storm? Apparently boxes can handle a couple of
hundred CMCIs per second just fine...

> If we see two errors every 2 seconds (for example due to a bug causing
> us to see duplicate MCEs), we'd ping-pong back and forth between CMCI
> and polling mode on every error, polling 51 times per second on
> average. This seems a lot more expensive than just staying in CMCI
> mode. And we risk losing information if there are instead, say, 4
> errors every 2 seconds.
> 
> > After a second where we haven't seen any errors, we switch back to CMCI.
> > check_interval relaxes back to 5 min and all gets to its normal boring
> > existence. Otherwise, we enter storm mode quickly again.
> 
> Since the storm detection is now independent of check_interval, we
> don't need to place any restrictions on it right?

OK, so my initial storm detection was dumb. But the way we count now
is pulled out of thin air too.

Instead, the criteria should probably be something like: what is the
number of CMCIs per second which we can process while leaving system
operation relatively unaffected? Anything above that number constitutes
a CMCI storm.

Now, how we'll come up with an answer to that question is a whole
other story...
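
To make the proposed criterion concrete, here is a minimal userspace
sketch of rate-based storm detection. It is not the kernel's mce code:
struct cmci_rate, cmci_rate_event(), cmci_rate_quiet() and the
max_per_sec tunable are all hypothetical names, and the value to put in
max_per_sec is exactly the open question above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the kernel's mce code. */
struct cmci_rate {
	uint64_t window_start_ms;  /* start of the current one-second window */
	uint64_t last_event_ms;    /* time of the most recent CMCI */
	unsigned int count;        /* CMCIs seen in the current window */
	unsigned int max_per_sec;  /* assumed tunable: rate the box tolerates */
};

/* Record one CMCI at now_ms; return true if the tolerated rate is exceeded. */
static bool cmci_rate_event(struct cmci_rate *r, uint64_t now_ms)
{
	/* Start a fresh window once the current one is a second old. */
	if (now_ms - r->window_start_ms >= 1000) {
		r->window_start_ms = now_ms;
		r->count = 0;
	}
	r->last_event_ms = now_ms;
	r->count++;
	return r->count > r->max_per_sec;
}

/* True once a full second has passed since the last recorded CMCI. */
static bool cmci_rate_quiet(const struct cmci_rate *r, uint64_t now_ms)
{
	return now_ms - r->last_event_ms >= 1000;
}
```

A caller would switch to polling when cmci_rate_event() returns true and
go back to CMCI mode when cmci_rate_quiet() returns true, which matches
the "a second where we haven't seen any errors" rule quoted above.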

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.