Date:	Fri, 11 Jul 2014 17:10:07 -0700
From:	Havard Skinnemoen <hskinnemoen@...gle.com>
To:	Borislav Petkov <bp@...en8.de>
Cc:	Tony Luck <tony.luck@...il.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Ewout van Bekkum <ewout@...gle.com>,
	linux-edac <linux-edac@...r.kernel.org>
Subject: Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for
 small check_interval values.

On Fri, Jul 11, 2014 at 1:22 PM, Borislav Petkov <bp@...en8.de> wrote:
> On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
>> > * max number of CMCIs per second a system can sustain comfortably,
>> > i.e. the 100 above
>>
>> What's the definition of "fine"? 1% performance hit? 10%? How can we
>> make that decision without knowing how hard the users are pushing
>> their systems?
>
> Those are definitely uncharted territories we're moving into, so yes,
> answering that won't be easy.
>
> Let's try it: if the amount of time we spend per second in the CMCI
> handler becomes higher than the amount of time we spend polling for that
> same second, then we might just as well poll and save us the interrupt
> load.
>
> So, for example, with a 10ms poll interval and a single-poll duration
> of 2ms, polling costs 100 * 2ms = 200ms per second; if time spent in
> CMCI exceeds 200ms for that second, we switch to polling.
> Hypothetical numbers, of course.

200ms per second means we're using 20% of that CPU. I'd say that's
definitely too much. But I like the general approach.
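
To make the comparison concrete, here's roughly what I have in mind
(untested userspace sketch, all names made up):

  #include <stdbool.h>
  #include <stdint.h>

  #define POLL_INTERVAL_MS   10  /* hypothetical poll interval */
  #define POLL_DURATION_MS    2  /* hypothetical cost of one poll */

  /* What one second of polling at the above rate would cost: 200ms. */
  #define POLL_COST_PER_SEC_MS \
          ((1000 / POLL_INTERVAL_MS) * POLL_DURATION_MS)

  /* Accumulated by the CMCI handler, reset once a second. */
  static uint64_t cmci_ms_this_sec;

  static bool cmci_storm_detected(void)
  {
          /* Interrupts now cost more than polling would: storm. */
          return cmci_ms_this_sec > POLL_COST_PER_SEC_MS;
  }

That way the 200ms threshold falls out of the poll parameters instead
of being hard-coded.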

> Can we measure it on every system? Probably not. So we need to
> approximate it somehow.
>
>>
>> > * total polling duration during storm, i.e. the 1 second above
>> >
>> > and if those are chosen generously for every system out there, then we
>> > don't need to dynamically adjust the polling interval.
>>
>> I'm not sure how we can be generous when there's a tradeoff involved.
>> If we make the interval "generously low", we end up hurting
>> performance. If we make it "generously high", we'll lose information.
>
> Yeah, by "generous" I meant choosing values that fit all systems. But
> I realize now that this is a dumb idea. Maybe we could measure it on
> each system: read the TSC on CMCI entry and exit and thus get an
> average CMCI duration...

Sounds interesting. A few things that may need more thought:

1. What percentage of CPU is OK to use before we consider it a storm?

2. How do we map that number to polling mode, where we may not see all
the errors? If we get it wrong, we may end up bouncing between
interrupt and polling mode at a very high rate.

3. If we go for a fixed polling rate, how do we make sure it doesn't
require more CPU than what we determined in (1)? See the sketch below.
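
For (1) and (3) together, building on the TSC idea above, I picture
something like this (untested sketch; the assumption that one poll
costs about as much as one average CMCI is made up, as are all the
names):

  #include <stdint.h>

  #define CPU_BUDGET_PCT  1  /* hypothetical answer to (1): 1% */

  static uint64_t cmci_cycles_total;
  static uint64_t cmci_count;

  /* Fed with TSC values sampled at CMCI handler entry and exit. */
  static void cmci_account(uint64_t tsc_entry, uint64_t tsc_exit)
  {
          cmci_cycles_total += tsc_exit - tsc_entry;
          cmci_count++;
  }

  static uint64_t cmci_avg_cycles(void)
  {
          return cmci_count ? cmci_cycles_total / cmci_count : 0;
  }

  /*
   * Slowest poll interval (in TSC cycles) that stays inside the
   * budget, assuming one poll costs about one average CMCI: spacing
   * polls avg * 100 / budget apart caps the overhead at
   * CPU_BUDGET_PCT percent.
   */
  static uint64_t poll_interval_cycles(void)
  {
          return cmci_avg_cycles() * 100 / CPU_BUDGET_PCT;
  }

That would at least make (3) follow mechanically from whatever we pick
for (1).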

Havard