Message-ID: <20100126160913.GD6567@basil.fritz.box>
Date:	Tue, 26 Jan 2010 17:09:13 +0100
From:	Andi Kleen <andi@...stfloor.org>
To:	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>
Cc:	Borislav Petkov <petkovbb@...glemail.com>,
	Andi Kleen <andi@...stfloor.org>, Ingo Molnar <mingo@...e.hu>,
	mingo@...hat.com, hpa@...or.com, linux-kernel@...r.kernel.org,
	tglx@...utronix.de, Andreas Herrmann <andreas.herrmann3@....com>,
	linux-tip-commits@...r.kernel.org,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Mauro Carvalho Chehab <mchehab@...radead.org>,
	Aristeu Rozanski <aris@...hat.com>,
	Doug Thompson <norsk5@...oo.com>,
	Huang Ying <ying.huang@...el.com>,
	Arjan van de Ven <arjan@...radead.org>
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to
	mce_cpu_specific_poll

On Tue, Jan 26, 2010 at 06:06:26PM +0900, Hidetoshi Seto wrote:
> How about having a system file which can be maintained with kernel,
> e.g. like /proc/hwinfo, /sys/devices/platform/hwinfo, or directory
> with some files like /somewhere/hwinfo/{dmi,acpi,pci,...} etc.?

Why not do that in user space?

In fact it's often already done.

Just because we're kernel programmers doesn't mean that everything
needs to be solved inside the kernel.
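
As a rough illustration of the user space route, here is a minimal
sketch that collects a few hardware identification fields from the DMI
files the kernel already exports under /sys/class/dmi/id. The function
name read_hwinfo and the choice of fields are assumptions made for this
example; they are not part of any existing tool.

```python
import os

# sysfs directory where Linux exports DMI identification strings
DMI_DIR = "/sys/class/dmi/id"


def read_hwinfo(base=DMI_DIR,
                fields=("sys_vendor", "product_name", "bios_version")):
    """Collect a few DMI fields (hypothetical helper for illustration).

    Fields that are missing or unreadable are skipped instead of
    raising, since not every platform exports every file.
    """
    info = {}
    for name in fields:
        path = os.path.join(base, name)
        try:
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                info[name] = f.read().strip()
        except OSError:
            continue
    return info


if __name__ == "__main__":
    for key, value in read_hwinfo().items():
        print(f"{key}: {value}")
```

The same pattern would extend to other sources (ACPI tables, PCI) by
adding readers next to this one, all without new kernel interfaces.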

> >> And since it's kernel
> >> based it cannot do most of the interesting reactions. And it doesn't
> >> have a usable interface to add user events.
> >>
> >> And yes having all that crap in syslog is completely useless, unless
> >> you're debugging code.
> > 
> > So basically, IMHO we need:
> > 
> > 1. Resilient error reporting that reliably pushes decoded error info to
> > userspace and/or network. That one might be tricky to do but we'll get
> > there.
> 
> I think it would be better to think of "error" as a subset of "event",
> which could be reported if of interest but otherwise filtered.
> The use of TRACE_EVENT() for mce events aims at such an approach, at least.

The whole trace event infrastructure right now is not really
aimed at, or useful for, the "always on low overhead background
monitoring" that standard error handling requires.

In principle it could probably be fixed (although I'm a bit
sceptical about the "low overhead" part), but I suspect the result
would be neither optimized for error handling nor optimized
for performance monitoring anymore. They simply have
very different requirements.

When you do full event tracing anyway it makes some sense to get events
for errors too, but that's quite a different use case.

For the "standard" error handling I think we're better off with
something optimized for the job.

> > 2. Error severity grading and acting upon each type accordingly. This
> > might need to be vendor-specific.
> 
> I think you mean severity grading in kernel.
> Even if hardware reported an error and graded it as corrected, kernel
> can escalate the severity, likely based on some threshold.

I don't think the kernel should do that. It's very much a policy
decision, and those are best kept as near the administrator
as possible (= user space).

That is, in some cases it might make sense to have limited thresholds
in the kernel, but I suspect those cases are few. Mostly it would
be the case when the hardware itself already keeps these counters.
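
To make the user space policy idea concrete, here is a minimal sketch
of a sliding-window threshold on corrected errors. The class name
ErrorThreshold and the limit and window parameters are hypothetical and
chosen for this example only; this is not how mcelog or any existing
daemon implements it. The administrator would pick the limit and the
window, and the kernel would only deliver the events.

```python
from collections import deque


class ErrorThreshold:
    """Trip when more than `limit` events fall within `window` seconds.

    Hypothetical helper for illustration; timestamps are passed in
    explicitly so the policy can be tested without a real clock.
    """

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = deque()

    def record(self, now):
        """Record one corrected error at time `now` (seconds).

        Returns True if the threshold is exceeded after this event.
        """
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.limit


if __name__ == "__main__":
    policy = ErrorThreshold(limit=2, window=10.0)
    for t in (0.0, 1.0, 2.0, 20.0):
        if policy.record(t):
            print(f"t={t}: threshold exceeded, escalate")
        else:
            print(f"t={t}: below threshold")
```

Because the policy lives outside the kernel, changing the limit or
swapping in a different scheme needs no reboot and no kernel patch.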

> 
> > 3. Proper error format suiting all types of errors.
> 
> As mentioned in Andi's PDF, the CPER format is one good candidate
> available today, I think.

Yes for hardware errors. It's definitely not perfect and somewhat
overdesigned, but I'm not sure we could come up with a much better one.
At least a subset of it with some extensions might do. Also in some
cases the error is already in this format.

The advantage of it is that it's at least well understood and documented.

> > 4. Vendor-specific hooks where it is needed for in-kernel handling of
> > certain errors (L3 cache index disable, for example).
> 
> Some difficulty would be there to add such hook in the UE handling path,
> but anyway we can have it for the CE path.  Just need to be organized.
> 
> > 5. Error thresholding, representation, etc all done in userspace (maybe
> > even on a different machine).
> 
> (...BTW, how about putting mcelog tree under the /tools, Andi?)

I don't see the advantage. Linux has always been a collection
of packages, not a single unified big tree.  Also my current
impression is that the in-tree user space builds don't work
very well.

-Andi

-- 
ak@...ux.intel.com -- Speaking for myself only.