linux-kernel - Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to mce_cpu_specific

Message-ID: <20100222094739.GA20844@elte.hu>
Date:	Mon, 22 Feb 2010 10:47:39 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Borislav Petkov <petkovbb@...glemail.com>, mingo@...hat.com,
	hpa@...or.com, linux-kernel@...r.kernel.org, andi@...stfloor.org,
	tglx@...utronix.de, Andreas Herrmann <andreas.herrmann3@....com>,
	Hidetoshi Seto <seto.hidetoshi@...fujitsu.com>,
	linux-tip-commits@...r.kernel.org,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Mauro Carvalho Chehab <mchehab@...radead.org>,
	Aristeu Rozanski <aris@...hat.com>,
	Doug Thompson <norsk5@...oo.com>,
	Huang Ying <ying.huang@...el.com>,
	Arjan van de Ven <arjan@...radead.org>,
	Mauro Carvalho Chehab <mchehab@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Arnaldo Carvalho de Melo <acme@...hat.com>
Subject: Re: [tip:x86/mce] x86, mce: Rename cpu_specific_poll to
 mce_cpu_specific_poll


* Borislav Petkov <petkovbb@...glemail.com> wrote:

> From: Ingo Molnar <mingo@...e.hu>
> Date: Tue, Feb 16, 2010 at 10:02:15PM +0100
> Hi,
> 
> > I like it.
> > 
> > You can do it as a 'perf hw' subcommand - or start off a fork as the 'hw' 
> > utility, if you'd like to maintain it separately. It would have a daemon 
> > component as well, to receive and log hardware events continuously, to 
> > trigger policy action, etc.
> > 
> > I'd suggest you start to do it in small steps, always having something that 
> > works - and extend it gradually.
> 
> I had the chance to meditate over the weekend a bit more on the whole
> RAS thing after rereading all the discussion points more carefully.
> Here are some aspects I think are important which I'd like to drop here
> rather sooner than later so that we're in sync and don't waste time
> implementing the wrong stuff:
> 
> * Critical errors: we need to switch to a console and dump decoded error 
> there at least, before panicking. Nowadays, almost everyone has a camera 
> with which that information can be extracted from the screen. I'm afraid we 
> won't be able to send the error over a network since climbing up the TCP 
> stack takes relatively long and we cannot risk error propagation...? We 
> could try to do it on a core which is not affected by the error though as a 
> last step in the sequence...
>
> I think this is much more user-friendly than the current panicking which is 
> never seen when running X except when the user has a serial/netconsole 
> sending to some other machine.

Yep.

> All other non-that-critical errors are copied to userspace over a mmapped 
> buffer and then the uspace daemon is being poked with a uevent to dump the 
> error/signal over network/parse its contents and do policy stuff.

If you use perf here you get the events and can poll() the event channel. 
User-space can decide which events to listen in on. uevent/user-notifier is a 
bit clumsy for that.
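
As a rough illustration of the poll() model: a user-space consumer only has to
wait on a file descriptor and read when it becomes readable. The sketch below
is an assumption-laden stand-in, not perf code. It uses an ordinary pipe in
place of a real event fd (which would come from perf_event_open(2)), and
wait_for_events() is a hypothetical helper name.

```python
import os
import select

def wait_for_events(fds, timeout_ms):
    """Block until a descriptor is readable; return the readable ones."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    return [fd for fd, mask in poller.poll(timeout_ms) if mask & select.POLLIN]

# Stand-in for an event channel: the write end of a pipe plays the kernel.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"mce_record")

ready = wait_for_events([read_fd], timeout_ms=100)
payload = os.read(read_fd, 64) if read_fd in ready else b""

os.close(read_fd)
os.close(write_fd)
```

A daemon would register one descriptor per event it cares about, which is how
user-space gets to choose what to listen to.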

> * receive commands by syscall, also for hw config: I like the idea of 
> sending commands to the kernel over a syscall, we can reuse perf 
> functionality here and make those reused bits generic.
>
> * do not bind to error format etc: not a big fan of slaving to an error 
> format - just dump error info into the buffer and let userspace format it. 
> We can do the formatting if we absolutely have to.


If you use perf and tracepoints to shape the event log format then this is all 
taken care of already, you get structured event format descriptors in 
/debug/tracing/events/*. For example there's already an MCE tracepoint in the 
upstream kernel today (for thermal events):

phoenix:/home/mingo> cat /debug/tracing/events/mce/mce_record/format 
name: mce_record
ID: 28
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;
	field:int common_lock_depth;	offset:8;	size:4;	signed:1;

	field:u64 mcgcap;	offset:16;	size:8;	signed:0;
	field:u64 mcgstatus;	offset:24;	size:8;	signed:0;
	field:u8 bank;	offset:32;	size:1;	signed:0;
	field:u64 status;	offset:40;	size:8;	signed:0;
	field:u64 addr;	offset:48;	size:8;	signed:0;
	field:u64 misc;	offset:56;	size:8;	signed:0;
	field:u64 ip;	offset:64;	size:8;	signed:0;
	field:u8 cs;	offset:72;	size:1;	signed:0;
	field:u64 tsc;	offset:80;	size:8;	signed:0;
	field:u64 walltime;	offset:88;	size:8;	signed:0;
	field:u32 cpu;	offset:96;	size:4;	signed:0;
	field:u32 cpuid;	offset:100;	size:4;	signed:0;
	field:u32 apicid;	offset:104;	size:4;	signed:0;
	field:u32 socketid;	offset:108;	size:4;	signed:0;
	field:u8 cpuvendor;	offset:112;	size:1;	signed:0;

print fmt: "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", REC->cpu, REC->mcgcap, REC->mcgstatus, REC->bank, REC->status, REC->addr, REC->misc, REC->cs, REC->ip, REC->tsc, REC->cpuvendor, REC->cpuid, REC->walltime, REC->socketid, REC->apicid

tools/perf/util/trace-event-parse.c contains the above structured format 
descriptor parsing code, and can turn it into records that you can read out 
from C code - and provides all sorts of standard functionality over it.
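
To make the idea concrete, here is a minimal sketch of what consuming such a
format descriptor involves. It is written in Python for brevity and is not the
trace-event-parse.c implementation: parse_format() and decode_record() are
hypothetical names, only plain integer fields are handled, and the raw record
is fabricated for the example.

```python
import re

# Matches lines such as: "field:u64 status;  offset:40;  size:8;  signed:0;"
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*"
    r"size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)

def parse_format(text):
    """Return {field_name: (offset, size, is_signed)} from a format file."""
    fields = {}
    for m in FIELD_RE.finditer(text):
        name = m.group("decl").split()[-1]  # last token of the declaration
        fields[name] = (int(m.group("offset")), int(m.group("size")),
                        m.group("signed") == "1")
    return fields

def decode_record(fields, raw, byteorder="little"):
    """Slice a raw record by offset/size and decode each field as an integer."""
    return {
        name: int.from_bytes(raw[off:off + size], byteorder, signed=signed)
        for name, (off, size, signed) in fields.items()
    }

# Three fields copied from the mce_record descriptor above.
SAMPLE = (
    "\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;\n"
    "\tfield:u8 bank;\toffset:32;\tsize:1;\tsigned:0;\n"
    "\tfield:u64 status;\toffset:40;\tsize:8;\tsigned:0;\n"
)
fields = parse_format(SAMPLE)

# Fabricated record bytes, only to exercise the decoder.
raw = bytearray(48)
raw[4:8] = (-1).to_bytes(4, "little", signed=True)
raw[32] = 4
raw[40:48] = (0xB200000000000175).to_bytes(8, "little")
record = decode_record(fields, bytes(raw))
```

The point is that the descriptor alone is enough to decode a record, so
user-space does not need a hard-coded struct for each event type.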

I'd strongly suggest reusing that - we _really_ want health monitoring and 
general system performance monitoring to share a single facility: they are 
one and the same thing, just seen from different viewpoints.

In other words: 'system component failure' is another metric of 'system 
performance', so there are strong synergies all around.

> * can also configure hw: The tool can also send commands over the syscall to 
> configure certain aspects of the hardware, like:
> 
> - disable L3 cache indices which are faulty
> - enable/disable MCE error sources: toggle MCi_CTL, MCi_CTL_MASK bits
> - disable whole DIMMs: F2x[1, 0][5C:40][CSEnable]
> - control ECC checking
> - enable/disable powering down of DRAM regions for power savings
> - set memory clock frequency
> - some other relevant aspects of hw/CPU configuration

Once the hardware's structure is enumerated (into a tree/hierarchy), and events 
are attached to individual components, then 'commands' are the next logical 
step: they are methods of a given component/object.

One such method could be 'injection' functionality btw: to simulate rare 
hardware failures and to make sure policy logic is ready for all 
eventualities.
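
One possible shape for this, sketched in Python purely as an illustration:
a tree of components, each carrying its events and a table of commands, with
'inject' as just another command. Nothing here is an existing kernel or perf
interface. The Component class, the "cpu0/l3" path, and the event and command
names are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Component:
    """A node in the enumerated hardware tree (hypothetical model)."""
    name: str
    children: Dict[str, "Component"] = field(default_factory=dict)
    events: List[str] = field(default_factory=list)
    commands: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def add(self, child):
        self.children[child.name] = child
        return child

    def find(self, path):
        """Resolve a slash-separated path relative to this node."""
        node = self
        for part in path.split("/"):
            node = node.children[part]
        return node

    def call(self, command, *args):
        """Invoke a command, i.e. a method of this component."""
        return self.commands[command](*args)

root = Component("hw")
cpu0 = root.add(Component("cpu0"))
l3 = cpu0.add(Component("l3", events=["cache_ecc_error"]))

# Commands are attached per component; the lambdas only report what
# a real implementation would do.
l3.commands["disable_index"] = lambda idx: f"l3: index {idx} disabled"
l3.commands["inject"] = lambda event: f"l3: injected {event}"

result = root.find("cpu0/l3").call("inject", "cache_ecc_error")
```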

But ... while that is clearly the 'big grand' end goal, the panacea of RAS 
design, I'd suggest starting with a small but useful base and picking the 
low-hanging fruit first - then working towards this end goal. This is how 
perf is developed/maintained as well.

So I'd suggest starting with _something_ that other people can try, have a 
look at and extend, for example something that replaces basic mcelog 
functionality. That alone should be fairly easy and immediately gives it a 
short-term purpose. It would also be highly beneficial to the x86 code to get 
rid of the mcelog abomination.

> * keep all info in sysfs so that no tool is needed for accessing it,
> similar to ftrace: All knobs needed for user interaction should appear
> redundantly as sysfs files/dirs so that configuration/query can be done
> "by hand" even when the hw tool is missing

Please share this code with perf. Profiling needs the same kind of 'hardware 
structure' enumeration - combined with 'software component enumeration'.

Currently we have that info in /debug/tracing/events/. Some hw structure is in 
there as well, but not much - most of it is kernel subsystem event structure.

sysfs would be an option but IMO it's even better to put ftrace's 
/debug/tracing/events/ hierarchy into a separate eventfs - and extend it with 
'hardware structure' details.

This would not only crystallise the RAS purpose, but would nicely extend perf 
as well. With every hardware component you add from the RAS angle we'd get new 
events for tracing/profiling use as well - and vice versa. There's no reason 
why RAS should be limited to hw component failure events: a RAS policy action 
could be defined over OOM events too for example, or over checksum failures in 
network packets - etc.
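
A minimal sketch of what 'a policy action defined over events' could look
like, assuming events arrive as a name plus a dict of fields. PolicyEngine is
a hypothetical class. The "mce:mce_record" name refers to the tracepoint
shown earlier, while "oom:kill" and "net:csum_fail" and their fields are
invented placeholders, not existing tracepoints.

```python
from collections import defaultdict

class PolicyEngine:
    """Maps event names to policy actions, regardless of event origin."""

    def __init__(self):
        self._actions = defaultdict(list)

    def on(self, event_name, action):
        self._actions[event_name].append(action)

    def dispatch(self, event_name, payload):
        """Run every action registered for the event; return their results."""
        return [action(payload) for action in self._actions.get(event_name, [])]

engine = PolicyEngine()
# Hardware and software events go through the same mechanism.
engine.on("mce:mce_record", lambda ev: f"offline cpu {ev['cpu']}")
engine.on("oom:kill", lambda ev: f"restart {ev['comm']}")
engine.on("net:csum_fail", lambda ev: f"flag {ev['dev']}")

actions = engine.dispatch("oom:kill", {"comm": "httpd"})
```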

RAS is not just about hardware, and profiling isn't just about software. We 
want event logging to be a unified design - there are big advantages to that.

So please go for an integrated design. The easiest and most useful way for 
that would be to factor out /debug/tracing/events/ into /eventfs.

> * gradually move pieces of RAS code into kernel proper: important 
> codepaths/aspects from the HW which are being queried often (e.g., DIMM 
> population and config) should be moved gradually into the kernel proper.

Yeah. Good plans.

	Ingo