linux-kernel - Re: [RFC][PATCH] x86, mce, therm_throt: Optimize notifications of thermal throttle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191105144411.GC28418@zn.tnic>
Date:   Tue, 5 Nov 2019 15:44:11 +0100
From:   Borislav Petkov <bp@...en8.de>
To:     Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
Cc:     tony.luck@...el.com, tglx@...utronix.de, mingo@...hat.com,
        hpa@...or.com, bberg@...hat.com, x86@...nel.org,
        linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
        hdegoede@...hat.com, ckellner@...hat.com
Subject: Re: [RFC][PATCH] x86, mce, therm_throt: Optimize notifications of
 thermal throttle

On Thu, Oct 24, 2019 at 05:19:24PM -0700, Srinivas Pandruvada wrote:
> Some modern systems have very tight thermal tolerances. Because of this
> they may cross thermal thresholds when running normal workloads (even
> during boot). The CPU hardware will react by limiting power/frequency
> and using duty cycles to bring the temperature back into normal range.
> 
> Thus users may see a "critical" message about the "temperature above
> threshold" which is soon followed by "temperature/speed normal". These
> messages are rate limited, but still may repeat every few minutes.

rate-limited

> A test run on a laptop with Intel 8th Gen i5 core for two hours with a
> workload resulted in 20K+ thermal interrupts per CPU for core level and
> another 20K+ interrupts at package level. The kernel logs were full of
> throttling messages.
> 
> Brief background, on why there are so many thermal interrupts in a
> modern system:
> From IvyBridge, there is another offset called TCC offset is introduced.

That sentence needs fixing.

> This adds an offset to the real PROCHOT temperature target. So effectively
> this interrupt is generated much before the PROCHOT. There will be several
> very short throttling by the processor using adaptive thermal monitoring
  ^^^^^^^^^^^^^^^^^^^^^^

"There will be several very short throttling" reads funny.

> at this threshold, instead of more aggressive action close to PROCHOT.
> This offset is configured by OEMs and some tend to be more conservative
> than others. So logging such events just generates noise in the logs.
> 
> The real value of these threshold interrupts, is to debug problems with
> the external cooling solutions and performance issues due to excessive
> throttling.
> 
> So the solution here:
> - Show in the current thermal_throttle folder, the maximum time for one
> throttling event and total amount of time, the system was in throttling
> state.
> - Don't log short excursions.
> - Log only when, in spite of thermal throttling the temperature is rising.
> This is done by monitoring temperature trend using three point moving
> average. On the high threshold interrupt trigger a delayed workqueue,
> which monitors the threshold violation log bit, calculates moving moving

What is the "threshold violation log bit" ? THERM_STATUS_PROCHOT_LOG ?

s/moving moving/moving/

Please read your commit message before sending to check whether it makes
any sense. Commit messages are not write-only.

> average and logs when temperature trend is raising. When the log bit is
> clear and temperature is below threshold temperature, it will print
> "Normal" message. Once a high threshold event is logged, it rate limits
> number of log messages.
> - Reduce the logging severity to warning.

I already took the reducing printk severity patch, you'd need to redo
yours ontop of tip/master.

> With this patch, on the same test laptop, no warnings are printed in logs
> as the max time the processor could bring the temperature under control is
> only 280 ms.
> 
> This implementation is done with the inputs from Alan Cox and Tony Luck.
> 
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
> ---
>  arch/x86/kernel/cpu/mce/therm_throt.c | 193 +++++++++++++++++++++++---
>  1 file changed, 172 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/therm_throt.c b/arch/x86/kernel/cpu/mce/therm_throt.c
> index 6e2becf547c5..e06f8d475207 100644
> --- a/arch/x86/kernel/cpu/mce/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mce/therm_throt.c
> @@ -47,8 +47,19 @@ struct _thermal_state {
>  	bool			new_event;
>  	int			event;
>  	u64			next_check;
> +	u64			last_interrupt_time;
> +	struct delayed_work	therm_work;
>  	unsigned long		count;
>  	unsigned long		last_count;
> +	unsigned long		max_time_ms;
> +	unsigned long		total_time_ms;
> +	int			rate_control_active;

That wants to be a bool judging by the context it is used in.

> +	int			level;
> +	int			sample_index;
> +	int			sample_count;
> +	int			average;
> +	int			baseline_temp;
> +	u8			temp_samples[3];

All these new members (and old members) should be documented what they
are. Like, what is "max_time_ms", for example? I can find out from the
sysfs func names below but having comments explaining what those are is
much better.

>  };
>  
>  struct thermal_state {
> @@ -121,8 +132,22 @@ define_therm_throt_device_one_ro(package_throttle_count);
>  define_therm_throt_device_show_func(package_power_limit, count);
>  define_therm_throt_device_one_ro(package_power_limit_count);
>  
> +define_therm_throt_device_show_func(core_throttle, max_time_ms);
> +define_therm_throt_device_one_ro(core_throttle_max_time_ms);
> +
> +define_therm_throt_device_show_func(package_throttle, max_time_ms);
> +define_therm_throt_device_one_ro(package_throttle_max_time_ms);
> +
> +define_therm_throt_device_show_func(core_throttle, total_time_ms);
> +define_therm_throt_device_one_ro(core_throttle_total_time_ms);
> +
> +define_therm_throt_device_show_func(package_throttle, total_time_ms);
> +define_therm_throt_device_one_ro(package_throttle_total_time_ms);
> +
>  static struct attribute *thermal_throttle_attrs[] = {
>  	&dev_attr_core_throttle_count.attr,
> +	&dev_attr_core_throttle_max_time_ms.attr,
> +	&dev_attr_core_throttle_total_time_ms.attr,
>  	NULL
>  };
>  
> @@ -135,6 +160,95 @@ static const struct attribute_group thermal_attr_group = {
>  #define CORE_LEVEL	0
>  #define PACKAGE_LEVEL	1
>  
> +#define THERM_THROT_POLL_INTERVAL	HZ
> +#define THERM_STATUS_PROCHOT_LOG	BIT(1)
> +
> +static void therm_throt_clear_therm_status_log(int level)

That's a static function, called only once so prepending its name with
the "therm_throt_" prefix is pointless and doesn't help readability.
Having simply

	clear_therm_status_log(state->level);

in the code is clearer and shorter. Ditto for the other helper functions.

> +{
> +	u64 msr_val;
> +	int msr;
> +
> +	msr = (level == CORE_LEVEL) ? MSR_IA32_THERM_STATUS :
> +			MSR_IA32_PACKAGE_THERM_STATUS;

Make that a normal if-else statement for better readability:

	if (level == CORE_LEVEL)
		msr = MSR_IA32_THERM_STATUS;
	else
		msr = MSR_IA32_PACKAGE_THERM_STATUS;

> +	rdmsrl(msr, msr_val);

Is that rdmsrl() always going to succeed here or you need to handle a
possible error it returns?

> +	wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG);

Same here.

> +}
> +
> +static void therm_throt_get_therm_status(int level, int *proc_hot, int *temp)
> +{
> +	u64 msr_val;
> +	int msr;
> +
> +	msr = (level == CORE_LEVEL) ? MSR_IA32_THERM_STATUS :
> +			MSR_IA32_PACKAGE_THERM_STATUS;
> +	rdmsrl(msr, msr_val);
> +	*proc_hot = msr_val & THERM_STATUS_PROCHOT_LOG ? 1 : 0;
> +	*temp = (msr_val >> 16) & 0x7F;

Same comments as for the therm_throt_clear_therm_status_log() function above.

...

> @@ -178,27 +292,23 @@ static void therm_throt_process(bool new_event, int event, int level)
>  	if (new_event)
>  		state->count++;
>  
> -	if (time_before64(now, state->next_check) &&
> -			state->count != state->last_count)
> -		return;
> +	if (event == THERMAL_THROTTLING_EVENT) {

Save an indentation level:

	if (event != THERMAL_THROTTLING_EVENT)
		return;

	/* next statement starts here */

> +		if (new_event && !state->last_interrupt_time) {
> +			int hot;
>  
> -	state->next_check = now + CHECK_INTERVAL;
> -	state->last_count = state->count;
> +			therm_throt_get_therm_status(state->level, &hot, &state->baseline_temp);
>  
> -	/* if we just entered the thermal event */
> -	if (new_event) {
> -		if (event == THERMAL_THROTTLING_EVENT)
> -			pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",
> -				this_cpu,
> -				level == CORE_LEVEL ? "Core" : "Package",
> -				state->count);
> -		return;
> -	}
> -	if (old_event) {
> -		if (event == THERMAL_THROTTLING_EVENT)
> -			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
> -				level == CORE_LEVEL ? "Core" : "Package");
> -		return;
> +			state->last_interrupt_time = now;
> +			schedule_delayed_work_on(this_cpu, &state->therm_work, THERM_THROT_POLL_INTERVAL);
> +		} else if (old_event && state->last_interrupt_time) {
> +			unsigned long throttle_time;
> +
> +			throttle_time = jiffies_delta_to_msecs(now - state->last_interrupt_time);
> +			if (throttle_time > state->max_time_ms)
> +				state->max_time_ms = throttle_time;
> +			state->total_time_ms += throttle_time;
> +			state->last_interrupt_time = 0;
> +		}
>  	}
>  }
>  

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette