Message-ID: <20101026101536.GC16552@elte.hu>
Date: Tue, 26 Oct 2010 12:15:36 +0200
From: Ingo Molnar <mingo@...e.hu>
To: Huang Ying <ying.huang@...el.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, Len Brown <lenb@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Andi Kleen <andi@...stfloor.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
Borislav Petkov <petkovbb@...glemail.com>,
"H. Peter Anvin" <hpa@...or.com>, Don Zickus <dzickus@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mauro Carvalho Chehab <mchehab@...hat.com>,
"Luck, Tony" <tony.luck@...el.com>
Subject: Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error
Source POLL/IRQ/NMI notification type support
* Huang Ying <ying.huang@...el.com> wrote:
> Hi, Thomas,
>
> On Tue, 2010-10-26 at 12:53 +0800, Thomas Gleixner wrote:
> > Len,
> >
> > On Mon, 25 Oct 2010, Len Brown wrote:
> >
> > > > NAKed-by: Ingo Molnar <mingo@...e.hu>
> > >
> > > Everybody knows that Linux has a lot to learn about RAS.
> > >
> > > I think to catch up, we need to play to Linux's strengths
> > > of continuous improvement. If we halt patches in this area
> > > then we could wait forever for the "perfect design".
> >
> > it's not about perfect design. It's about creating new user space
> > ABIs. The patches introduce another error reporting user space ABI
> > with an ad hoc "fits the needs" design.
> >
> > This is my major point of objection.
> >
> > I agree that Linux needs improvement on the RAS side, but does this
> > lack of features justify a new user space ABI which is totally
> > disconnected to existing RAS facilities ?
> >
> > No, it does not. It's not our problem that Intel wasted time on
> > creating another character device driver to report errors to user
> > space. The time spent to do so would have been sufficient to do a
> > proper integration into the existing infrastructure.
> >
> > I would not care at all if these patches would just introduce some
> > weird in kernel interfaces as we can clean that up at will. But
> > introducing a new user space ABI is setting the disconnect of RAS
> > related facilities into stone.
> >
> > From Kconfig:
> >
> > EDAC is designed to report errors in the core system.
> > These are low-level errors that are reported in the CPU or
> > supporting chipset or other subsystems:
> > memory errors, cache errors, PCI errors, thermal throttling, etc..
> > If unsure, select 'Y'.
> >
> > So please explain why your error reporting is so different from the
> > above that it justifies a separate facility. And you better come up
> > with a real good explanation other than we looked at EDAC and it did
> > not fit our needs.
>
> As far as I know, EDAC guys plan to use some other "perfect interface" in the
> future. So I think the current state is really waiting for the "perfect design".

Not sure what you mean by this, but Boris has posted links to his latest patch-set
in this thread, see:

  http://kerneltrap.org/mailarchive/linux-kernel/2010/8/6/4603847

The Git coordinates are:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git, branch tip/perf/parse-events

The 'persistent events' facility he has prototyped there appears to be a good
potential match for the ERST store.

It would be very useful to have another feature there: to mark persistent events as
'dump into syslog on bootup', so that for example the contents of the ERST log could
be dumped right on bootup. [but ERST would not be the only persistent event that
could be marked like that.]

Note that we don't need/want other ABI accesses to the ERST log (i.e. we don't want
/dev/erst-dbg), because we want the benefits of the generalization: tooling (RAS and
other tooling) should learn how to deal with persistent events - not learn how to
deal with ERST logs ... or with warm bootup RAM-embedded logs ... or to deal with
kcrash embedded kernel logs ... etc.

There are many obvious advantages to implementing it like that: there's no need to
special-code ERST-to-printk or ERST-to-whatever-other-facility cross links - it
would be part of a generic/uniform event logging facility to begin with. ERST would
only implement its own, narrow, hardware-specific event accessor methods - nothing
else. Basically a small 'event driver'. This would be the smallest,
easiest-to-maintain approach - with no facility duplication and no fragmentation.

It's certainly more work as well _for the first such example_ - but from that point
on any new hardware facility can be added with ease, and those too will fit into
existing tooling in a very natural way.

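To make the 'event driver' split a bit more concrete, here is a rough user-space
C sketch of the shape described above. Every name in it (pevent_source,
pevent_register, pevent_dump_on_boot, PEVENT_DUMP_ON_BOOT, the canned "ERST"
records) is hypothetical and made up for illustration. It is not the persistent
events prototype referenced above and not an existing kernel API; it only shows
a generic layer that owns the logging while a hardware-specific source supplies
nothing but an accessor method.

```c
/* Hypothetical sketch only: none of these names exist in the kernel. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag modelling the 'dump into syslog on bootup' marking. */
#define PEVENT_DUMP_ON_BOOT 0x1u

struct pevent_record {
	char msg[64];
};

/* The narrow, hardware-specific part: a name, flags and one accessor. */
struct pevent_source {
	const char *name;
	unsigned int flags;
	/* Copy the next stored record into *rec. Returns 1 if a record was
	 * read, 0 once the store is exhausted. */
	int (*read_next)(struct pevent_record *rec);
	struct pevent_source *next;
};

static struct pevent_source *pevent_sources;

/* Generic layer: add a source to the global list. */
static void pevent_register(struct pevent_source *src)
{
	assert(src && src->name && src->read_next);
	src->next = pevent_sources;
	pevent_sources = src;
}

/* Generic layer: drain every source marked dump-on-boot into one log.
 * Returns the number of records written. */
static int pevent_dump_on_boot(FILE *log)
{
	int n = 0;

	for (struct pevent_source *s = pevent_sources; s; s = s->next) {
		struct pevent_record rec;

		if (!(s->flags & PEVENT_DUMP_ON_BOOT))
			continue;
		while (s->read_next(&rec)) {
			fprintf(log, "%s: %s\n", s->name, rec.msg);
			n++;
		}
	}
	return n;
}

/* Stand-in "ERST" source: two canned records instead of firmware access. */
static const char *fake_erst_log[] = {
	"corrected memory error",
	"PCIe link error",
};
static size_t fake_erst_pos;

static int fake_erst_read_next(struct pevent_record *rec)
{
	if (fake_erst_pos >= sizeof(fake_erst_log) / sizeof(fake_erst_log[0]))
		return 0;
	snprintf(rec->msg, sizeof(rec->msg), "%s", fake_erst_log[fake_erst_pos++]);
	return 1;
}

static struct pevent_source fake_erst_source = {
	.name = "erst",
	.flags = PEVENT_DUMP_ON_BOOT,
	.read_next = fake_erst_read_next,
};
```

Calling pevent_register(&fake_erst_source) followed by
pevent_dump_on_boot(stdout) drains both canned records through the generic
path. A second source (say, a RAM-embedded log) would only need its own
read_next function; the dump logic and any tooling on top stay unchanged.
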
So please help out with the persistent events work. If you need any pointers we'd be
glad to help.

Thanks,

	Ingo