Message-ID: <1351014187.8467.24.camel@gandalf.local.home>
Date: Tue, 23 Oct 2012 13:43:07 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Pawel Moll <pawel.moll@....com>
Cc: Amit Daniel Kachhap <amit.kachhap@...aro.org>,
Zhang Rui <rui.zhang@...el.com>,
Viresh Kumar <viresh.kumar@...aro.org>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
Jean Delvare <khali@...ux-fr.org>,
Guenter Roeck <linux@...ck-us.net>,
Frederic Weisbecker <fweisbec@...il.com>,
Ingo Molnar <mingo@...e.hu>, Jesper Juhl <jj@...osbits.net>,
Thomas Renninger <trenn@...e.de>,
Jean Pihet <jean.pihet@...oldbits.com>,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
lm-sensors@...sensors.org, linaro-dev@...ts.linaro.org
Subject: Re: [RFC] Energy/power monitoring within the kernel
On Tue, 2012-10-23 at 18:30 +0100, Pawel Moll wrote:
>
> === Option 1: Trace event ===
>
> This seems to be the "cheapest" option. Simply defining a trace event
> that can be generated by a hwmon (or any other) driver makes the
> interesting data immediately available to any ftrace/perf user. Of
> course it doesn't really help with the cpufreq case, but it seems to be
> a good place to start.
>
> The question is how to define it... I've come up with two prototypes:
>
> = Generic hwmon trace event =
>
> This one allows any driver to generate a trace event whenever any
> "hwmon attribute" (measured value) gets updated. The rate at which the
> updates happen can be controlled by already existing "update_interval"
> attribute.
>
> 8<-------------------------------------------
> TRACE_EVENT(hwmon_attr_update,
> TP_PROTO(struct device *dev, struct attribute *attr, long long input),
> TP_ARGS(dev, attr, input),
>
> TP_STRUCT__entry(
> __string( dev, dev_name(dev))
> __string( attr, attr->name)
> __field( long long, input)
> ),
>
> TP_fast_assign(
> __assign_str(dev, dev_name(dev));
> __assign_str(attr, attr->name);
> __entry->input = input;
> ),
>
> TP_printk("%s %s %lld", __get_str(dev), __get_str(attr), __entry->input)
> );
> 8<-------------------------------------------
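
From the driver side this would just be one call in the update/show
path. A hypothetical fragment, only to show where the tracepoint hook
would go (all the foo_* names and the trace header path are made up):

8<-------------------------------------------
/* Hypothetical hwmon driver fragment; the foo_* names are made up and
 * only illustrate where the proposed trace_hwmon_attr_update() call
 * would be placed.
 */
#include <linux/device.h>
#include <linux/hwmon.h>
#include <linux/hwmon-sysfs.h>
#include <trace/events/hwmon.h>		/* would carry the event above */

struct foo_data {
	struct device *hwmon_dev;
	long long temp1_input;		/* millidegrees C */
};

static ssize_t foo_show_temp(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	struct foo_data *data = dev_get_drvdata(dev);

	return sprintf(buf, "%lld\n", data->temp1_input);
}

static SENSOR_DEVICE_ATTR(temp1_input, S_IRUGO, foo_show_temp, NULL, 0);

/* called whenever the driver refreshes its measurements */
static void foo_update_temp(struct foo_data *data, long long val)
{
	data->temp1_input = val;

	trace_hwmon_attr_update(data->hwmon_dev,
				&sensor_dev_attr_temp1_input.dev_attr.attr,
				val);
}
8<-------------------------------------------
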
>
> It generates an ftrace message like this:
>
> <...>212.673126: hwmon_attr_update: hwmon4 temp1_input 34361
>
> One issue with this is that some external knowledge is required to
> relate a number to a processor core. Or maybe it's not an issue at all
> because it should be left for the user(space)?
If that external knowledge can be encoded in a userspace tool using the
data given here, I see no issue with this.
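
For example, something like this rough userspace sketch could resolve
the "hwmon4 temp1_input" strings from the event into a chip name and
channel label. Keep in mind the label attribute is optional, and some
drivers keep these files under the device/ subdirectory rather than
directly under the class directory:

8<-------------------------------------------
/* Minimal userspace sketch: given the "hwmon4" / "temp1_input" strings
 * from the hwmon_attr_update event, look up the chip name and the
 * (optional) channel label via the standard hwmon sysfs attributes.
 */
#include <stdio.h>
#include <string.h>

static int read_sysfs_str(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */
	return 0;
}

int main(void)
{
	/* values as they would arrive in the trace event */
	const char *dev = "hwmon4";
	const char *attr = "temp1_input";
	char path[256], name[64], label[64];

	snprintf(path, sizeof(path), "/sys/class/hwmon/%s/name", dev);
	if (read_sysfs_str(path, name, sizeof(name)))
		snprintf(name, sizeof(name), "unknown");

	/* "temp1_input" -> optional "temp1_label", e.g. "Core 0" */
	snprintf(path, sizeof(path), "/sys/class/hwmon/%s/temp1_label", dev);
	if (read_sysfs_str(path, label, sizeof(label)))
		snprintf(label, sizeof(label), "(no label)");

	printf("%s %s: chip %s, label %s\n", dev, attr, name, label);
	return 0;
}
8<-------------------------------------------
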
>
> = CPU power/energy/temperature trace event =
>
> This one is designed to emphasize the relation between the measured
> value (whether it is energy, temperature or any other physical
> phenomena, really) and CPUs, so it is quite specific (too specific?)
>
> 8<-------------------------------------------
> TRACE_EVENT(cpus_environment,
> TP_PROTO(const struct cpumask *cpus, long long value, char unit),
> TP_ARGS(cpus, value, unit),
>
> TP_STRUCT__entry(
> __array( unsigned char, cpus, sizeof(struct cpumask))
> __field( long long, value)
> __field( char, unit)
> ),
>
> TP_fast_assign(
> memcpy(__entry->cpus, cpus, sizeof(struct cpumask));
Copying the entire cpumask seems like overkill, especially on machines
with 4096 CPUs.
> __entry->value = value;
> __entry->unit = unit;
> ),
>
> TP_printk("cpus %s %lld[%c]",
> __print_cpumask((struct cpumask *)__entry->cpus),
> __entry->value, __entry->unit)
> );
> 8<-------------------------------------------
>
> And the equivalent ftrace message is:
>
> <...>127.063107: cpus_environment: cpus 0,1,2,3 34361[C]
>
> It's a cpumask, not just a single cpu id, because the sensor may measure
> the value per set of CPUs, e.g. the temperature of the whole silicon die
> (so all the cores) or the energy consumed by a subset of cores (this
> is my particular use case - two meters monitor a cluster of two
> processors and a cluster of three processors, all working as an SMP
> system).
>
> Of course the cpus __array could actually be a special __cpumask field
> type (I've just hacked the __print_cpumask helper so far). And I've just
> realised that the unit field should actually be a string to allow unit
> prefixes to be specified (the above should obviously be "34361[mC]"
> not "[C]"). Also - excuse the "cpus_environment" name - this was the
> best I was able to come up with at the time and I'm eager to accept
> any alternative suggestions :-)
Perhaps a field that records only the relevant subset of cpus would be
better. That way we don't waste the ring buffer with lots of zeros. I'm
guessing it will usually be a contiguous group of cpus and not a
scattered list? Of course, I've seen boxes where the cpu numbering
jumped from core to core. That is, cpu 0 was on core 1, cpu 1 was on
core 2, and then it repeated: cpu 8 on core 1, cpu 9 on core 2, etc.
But still, this could be compressed somehow.
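
For instance, sizing the field by nr_cpu_ids instead of the compile-time
NR_CPUS, via a dynamic array, would already help. Just a rough sketch,
still leaning on your hacked __print_cpumask helper (which would then
have to limit itself to nr_cpu_ids bits):

8<-------------------------------------------
TRACE_EVENT(cpus_environment,
	TP_PROTO(const struct cpumask *cpus, long long value, char unit),
	TP_ARGS(cpus, value, unit),

	TP_STRUCT__entry(
		/* only nr_cpu_ids worth of bits, not the full NR_CPUS mask */
		__dynamic_array(unsigned long, cpus,
				BITS_TO_LONGS(nr_cpu_ids))
		__field(	long long,	value)
		__field(	char,		unit)
	),

	TP_fast_assign(
		memcpy(__get_dynamic_array(cpus), cpumask_bits(cpus),
		       BITS_TO_LONGS(nr_cpu_ids) * sizeof(unsigned long));
		__entry->value = value;
		__entry->unit = unit;
	),

	/* __print_cpumask (the hacked helper from the prototype above)
	 * must only walk nr_cpu_ids bits here */
	TP_printk("cpus %s %lld[%c]",
		__print_cpumask((struct cpumask *)__get_dynamic_array(cpus)),
		__entry->value, __entry->unit)
);
8<-------------------------------------------
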
I'll let others comment on the rest.
-- Steve