Message-ID: <4FBD9BAA.7070902@linux.intel.com>
Date:	Thu, 24 May 2012 10:23:38 +0800
From:	Chen Gong <gong.chen@...ux.intel.com>
To:	"Luck, Tony" <tony.luck@...el.com>
CC:	Thomas Gleixner <tglx@...utronix.de>,
	"bp@...64.org" <bp@...64.org>, "x86@...nel.org" <x86@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] x86: auto poll/interrupt mode switch for CMC to stop
 CMC storm

On 2012/5/24 4:53, Luck, Tony wrote:
>> If that's the case, then I really can't understand the 5 CMCIs per
>> second threshold for defining the storm and switching to poll mode.
>> I'd rather expect 5 of them in a row.
> We don't have a lot of science to back up the "5" number (and
> can change it to conform to any better numbers if someone has
> some real data).
>
> My general approximation for DRAM corrected error rates is
> "one per gigabyte per month, plus or minus two orders of
>  magnitude". So if I saw 1600 errors per month on a 16GB
> workstation, I'd think that was a high rate - but still
> plausible from natural causes (especially if the machine
> was some place 5000 feet above sea level with a lot less
> atmosphere to block neutrons). That only amounts to a couple
> of errors per hour. So five in a second is certainly a storm!
>
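Writing that estimate out as arithmetic, purely as a sanity check of
the numbers above (none of this is in the patch):

#include <stdio.h>

int main(void)
{
	double per_month = 16.0 * 1.0 * 100.0;	  /* 16GB * 1/GB/month * 100x worst case */
	double per_hour  = per_month / (30 * 24); /* ~2.2 errors per hour */
	double per_sec   = per_hour / 3600;	  /* ~0.0006 errors per second */

	printf("worst natural rate: %.0f/month, %.1f/hour, %.4f/sec\n",
	       per_month, per_hour, per_sec);
	printf("5/sec is ~%.0f times that\n", 5.0 / per_sec);
	return 0;
}

So even the pessimistic natural rate is thousands of times below a
5-per-second threshold.
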
> Looking at this from another perspective ... how many
> CMCIs can we take per second before we start having a
> noticeable impact on system performance? The RT answer
> may be quite a small number; the generic throughput
> computing answer might be several hundred per second.
>
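To put rough numbers on that overhead question, assuming (purely for
illustration, not a measured figure) that a CMCI handler costs on the
order of 20 microseconds:

#include <stdio.h>

int main(void)
{
	const double handler_us = 20.0;			/* assumed cost, not measured */
	const int rates[] = { 5, 100, 500, 5000 };	/* CMCIs per second */
	int i;

	for (i = 0; i < 4; i++)
		printf("%5d/sec -> ~%.2f%% of one CPU\n",
		       rates[i], rates[i] * handler_us / 1e6 * 100.0);
	return 0;
}

On that assumption, several hundred CMCIs per second is roughly 1% of
one CPU.
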
> The situation we are trying to avoid is a stuck bit on
> some very frequently accessed piece of memory generating
> a solid stream of CMCIs that makes the system unusable. In
> this case the question is how long we let the storm
> rage before we turn off CMCI to get some real work done.
>
> Once we are in polling mode, we do lose data on the location
> of some corrected errors. But I don't think that this is
> too serious. If there are few errors, we want to know about
> them all. If there are so many that we have difficulty
> counting them all - then sampling from a subset will
> give us reasonable data most of the time (the exception
> being the case where we have one error source that is
> 100,000 times as noisy as some other sources that we'd
> still like to keep tabs on ... we'll need a *lot* of samples
> to see the quieter error sources amongst the noise).
>
> So I think there are justifications for numbers in the
> 2..1000 range. We could punt it to the user by making
> it configurable/tunable ... but I think we already have
> too many tunables that end-users don't have enough information
> to really set in meaningful ways to meet their actual
> needs - so I'd prefer to see some "good enough" number
> that meets the needs, rather than yet another /sys/...
> file that people can tweak.
>
> -Tony

Thanks very much for your elaboration, Tony. You gave far more detail
than I could have :-).

Hi Thomas, yes, 5 is admittedly an arbitrary value and I can't give you
much hard proof, although I did find some people to help test on real
platforms. All I can say is that it works on our internal test bench,
but I really hope people will run this patch on their actual machines
and send me feedback, so I can decide what value is proper, or whether
we need a tunable switch. For now, as Tony said, there are already too
many switches for end users, so I don't want to add another one.
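
For reference, the heuristic itself is simple. Below is a standalone
sketch of the idea only; the names and structure are invented for
illustration and this is not the code in mce.c:

#include <stdio.h>

#define CMC_STORM_THRESHOLD 5	/* the "5" under discussion */

enum mode { MODE_INTERRUPT, MODE_POLL };

static enum mode cur_mode = MODE_INTERRUPT;
static unsigned long window_start;	/* timestamp in whole seconds */
static unsigned int count;

/* called once per corrected error, with a timestamp in whole seconds */
static void cmc_event(unsigned long now)
{
	if (now != window_start) {	/* start a new one-second window */
		window_start = now;
		count = 0;
	}
	if (++count >= CMC_STORM_THRESHOLD && cur_mode == MODE_INTERRUPT) {
		cur_mode = MODE_POLL;
		printf("t=%lus: %u CMCs in one second, switch to poll mode\n",
		       now, count);
	}
}

int main(void)
{
	/* two errors in second 0, then a burst of six in second 3 */
	unsigned long events[] = { 0, 0, 3, 3, 3, 3, 3, 3 };
	unsigned int i;

	for (i = 0; i < sizeof(events) / sizeof(events[0]); i++)
		cmc_event(events[i]);

	printf("final mode: %s\n",
	       cur_mode == MODE_POLL ? "poll" : "interrupt");
	return 0;
}

The real code also has to switch back to CMCI once the storm subsides
(after a quiet polling interval), which the sketch above leaves out.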

BTW, I will update the description in the next version.

Hi Boris, when I wrote this code I wasn't thinking about whether it is
specific to Intel or AMD. I simply noted that it should be generic for
the x86 platform, and all the related code, which lives in mce.c, is
generic too, so I think it is fine to place this code in mce.c.

