linux-kernel - Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages


Message-ID: <20191015085257.GE2311@hirez.programming.kicks-ass.net>
Date:   Tue, 15 Oct 2019 10:52:57 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     Borislav Petkov <bp@...en8.de>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        tglx@...utronix.de, mingo@...hat.com, hpa@...or.com,
        bberg@...hat.com, x86@...nel.org, linux-edac@...r.kernel.org,
        linux-kernel@...r.kernel.org, hdegoede@...hat.com,
        ckellner@...hat.com
Subject: Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal
 throttle messages

On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote:
> On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote:
> > This description is already *begging* for this delay value to be
> > automatically set by the kernel. Putting yet another knob in front of
> > the user who doesn't have a clue most of the time shows one more time
> > that we haven't done our job properly by asking her to know what we
> > already do.
> > 
> > IOW, a simple history feedback mechanism which sets the timeout based on
> > the last couple of values is much smarter. The thing would have a max
> > value, of course, which, when exceeded should mean an anomaly, etc, but
> > almost anything else is better than merely asking the user to make an
> > educated guess.
> 
> You need a plausible start point for the "when to worry the user"
> message.  Maybe that is your "max value"?
> 
> So if the system has a couple of excursions above temperature lasting
> 1 second and then 2 seconds ... would you like to see those ignored
> (because they are below the initial max)? But now that we have a couple
> of data points, pick some new value to be the threshold for reporting?
> 
> What value should we pick (based on 1 sec, then 2 sec)?
> 
> I would be worried that it would self tune to the point where it
> does report something that it really didn't need to (e.g. as a result
> of a few consecutive very short excursions).

I'm guessing Boris is thinking of a simple IIR like avg filter.

	avg = avg + (sample-avg) / 4

And then only print when sample > 2*avg. If you initialize that with
some appropriately large value, it should settle down into what is
'normal' for that particular piece of hardware.
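
A minimal sketch of that filter in plain C, to make the arithmetic concrete.
The struct and function names are hypothetical and not taken from the patch;
millisecond units, the choice to compare the sample against the average
*before* folding it in, and the use of truncating integer division are all
assumptions, since the description above does not pin them down.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state: running average of throttle durations. */
struct throttle_avg {
	uint64_t avg_ms;
};

/* Seed the average with an appropriately large value (caller's choice). */
static void throttle_avg_init(struct throttle_avg *t, uint64_t initial_ms)
{
	t->avg_ms = initial_ms;
}

/*
 * Fold one sample into the average and say whether it is worth printing.
 * Assumption: the sample is compared against the average as it was
 * before this update.
 */
static bool throttle_avg_update(struct throttle_avg *t, uint64_t sample_ms)
{
	/* only print when sample > 2*avg */
	bool report = sample_ms > 2 * t->avg_ms;

	/* avg = avg + (sample - avg) / 4, in signed arithmetic */
	int64_t delta = (int64_t)sample_ms - (int64_t)t->avg_ms;

	t->avg_ms = (uint64_t)((int64_t)t->avg_ms + delta / 4);
	return report;
}
```

Seeded with 4000 ms, excursions of 1000 ms and then 2000 ms are not reported
and pull the average down to 3250 and then 2938; a following 10000 ms
excursion exceeds 2*2938 and is reported. Because the division truncates
toward zero, the average stops moving once it is within 3 ms of a steady
sample value.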

Still, I'm boggled by the whole idea that hitting critical hard throttle
is considered 'normal' at all.

> We also need to take into account the "typical sampling interval"
> for user space thermal control software.

Why is control of critical thermal crud in userspace? That seems like a
massive design fail.