linux-kernel - Re: [PATCH] x86/aperfmperf: Fix arch_scale_freq


Message-ID: <YxdfO/5/Yfodm18i@hirez.programming.kicks-ass.net>
Date:   Tue, 6 Sep 2022 16:54:51 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Yair Podemsky <ypodemsk@...hat.com>
Cc:     x86@...nel.org, tglx@...utronix.de, mingo@...hat.com,
        rafael.j.wysocki@...el.com, pauld@...hat.com, frederic@...nel.org,
        ggherdovich@...e.cz, linux-kernel@...r.kernel.org, lenb@...nel.org,
        vschneid@...hat.com, jlelli@...hat.com, mtosatti@...hat.com,
        ppandit@...hat.com, alougovs@...hat.com, lcapitul@...hat.com,
        nsaenz@...nel.org
Subject: Re: [PATCH] x86/aperfmperf: Fix arch_scale_freq_tick() on tickless
 systems

On Thu, Aug 04, 2022 at 04:17:28PM +0300, Yair Podemsky wrote:
> In order for the scheduler to be frequency invariant we measure the
> ratio between the maximum cpu frequency and the actual cpu frequency.
> During long tickless periods of time the calculations that keep track
> of that might overflow, in the function scale_freq_tick():
> 
> if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
> 	goto error;
> 
> eventually forcing the kernel to disable the feature with the
> message "Scheduler frequency invariance went wobbly, disabling!".
> Let's avoid that by detecting long tickless periods and bypassing
> the calculation for that tick.
> 
> This calculation updates the value of arch_freq_scale, used by the
> capacity-aware scheduler to correct cpu duty cycles:
> task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) /
> max_frequency(cpu))
> 
> However, consider a long tickless period. It should take about 60 minutes
> for a tickless CPU running at 5GHz to trigger the acnt overflow, so we
> pick 10 minutes as a staleness threshold to be on the safe side.
> In our testing it took over 30 minutes for the overflow to happen,
> but since it's frequency/platform dependent we choose a smaller value
> to be on the safe side.
> 
> Fixes: e2b0d619b400 ("x86, sched: check for counters overflow in frequency invariant accounting")
> Signed-off-by: Yair Podemsky <ypodemsk@...hat.com>
> ---
>  arch/x86/kernel/cpu/aperfmperf.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
> index 1f60a2b27936..dfe356034a60 100644
> --- a/arch/x86/kernel/cpu/aperfmperf.c
> +++ b/arch/x86/kernel/cpu/aperfmperf.c
> @@ -23,6 +23,13 @@
>  
>  #include "cpu.h"
>  
> +/*
> + * Samples older than 10 minutes should not be processed,
> + * This time is long enough to prevent unneeded drops of data
> + * But short enough to prevent overflows
> + */
> +#define MAX_SAMPLE_AGE_NOHZ	((unsigned long)HZ * 600)
> +
>  struct aperfmperf {
>  	seqcount_t	seq;
>  	unsigned long	last_update;
> @@ -373,6 +380,7 @@ static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
>  void arch_scale_freq_tick(void)
>  {
>  	struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
> +	unsigned long last  = s->last_update;
>  	u64 acnt, mcnt, aperf, mperf;
>  
>  	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
> @@ -392,7 +400,12 @@ void arch_scale_freq_tick(void)
>  	s->mcnt = mcnt;
>  	raw_write_seqcount_end(&s->seq);
>  
> -	scale_freq_tick(acnt, mcnt);
> +	/*
> +	 * Avoid calling scale_freq_tick() when the last update was too long ago,
> +	 * as it might overflow during calculation.
> +	 */
> +	if ((jiffies - last) <= MAX_SAMPLE_AGE_NOHZ)
> +		scale_freq_tick(acnt, mcnt);
>  }

All this patch does is avoid the warning; but afaict it doesn't make it
behave in a sane way.

I'm thinking that on nohz_full cpus you don't have load balancing, I'm
also thinking that on nohz_full cpus you don't have DVFS.

So *why* the heck are we setting this stuff to random values ? Should
you not simply kill the entire thing for nohz_full cpus?