linux-kernel - Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <195c95b2-f47c-f3d0-5663-97dd4c929ea4@arm.com>
Date:   Fri, 7 Apr 2023 10:31:30 +0200
From:   Pierre Gondois <pierre.gondois@....com>
To:     Yang Shi <yang@...amperecomputing.com>
Cc:     viresh.kumar@...aro.org, scott@...amperecomputing.com,
        linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
        "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid
 high error

Hello Yang,

On 4/6/23 23:52, Yang Shi wrote:
> 
> 
> On 4/5/23 10:57 AM, Pierre Gondois wrote:
>>
>>
>> On 4/4/23 21:07, Yang Shi wrote:
>>>
>>>
>>> On 3/29/23 11:43 AM, Rafael J. Wysocki wrote:
>>>> On Tue, Mar 28, 2023 at 9:39 PM Yang Shi
>>>> <yang@...amperecomputing.com> wrote:
>>>>> When testing CPPC cpufreq on our platform, we noticed the error may
>>>>> be quite
>>>>> high and the high error may happen quite often.  For example, on a
>>>>> platform
>>>>> with a maximum frequency of 2.8GHz when the CPUs were fully loaded
>>>>> (100% load),
>>>>> we saw cpuinfo_cur_freq may show 4GHz, it means the error is >
>>>>> 40%.  And the
>>>>> high error (> 1%) happened 256 times out of 2127 samples (sampled
>>>>> every 3
>>>>> seconds) in an approximate 2hrs test.
>>>> The description above is a bit cryptic IMV.  For example, it is not
>>>> particularly clear what "high error" means.
>>>>
>>>>> We tried to enlarge the delay, and tested with 100us, 1ms and
>>>>> 10ms.  The
>>>>> below is the results.
>>>>>
>>>>> 100us:
>>>>> The highest error is 4GHz, 22 times out of 3623 samples
>>>>>
>>>>> 1ms:
>>>>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>>>>
>>>>> 10ms:
>>>>> No high error anymore
>>>>>
>>>>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to
>>>>> avoid
>>>>> high measurement errors.
>>>>>
>>>>> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
>>>>> ---
>>>>>     drivers/cpufreq/cppc_cpufreq.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>>>> b/drivers/cpufreq/cppc_cpufreq.c
>>>>> index 022e3555407c..c2bf65448d3d 100644
>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>> @@ -851,7 +851,7 @@ static unsigned int
>>>>> cppc_cpufreq_get_rate(unsigned int cpu)
>>>>>            if (ret)
>>>>>                    return ret;
>>>>>
>>>>> -       udelay(2); /* 2usec delay between sampling */
>>>>> +       mdelay(10); /* 10msec delay between sampling */
>>>> This function can be called with interrupts off, so it cannot spin
>>>> for 10 ms.
>>>
>>> Per Pierre's comment, the delay may still be ms. Is it still too long? A
>>> quick look at the code shows cpufreq_quick_get() is the only caller with
>>> irq off IIRC. So can we have another callback for it, for example,
>>> get_quick() which does spin for shorter time (for example, keep 2us
>>> delay). Then have ->get() callback use longer delay?
>>
>> -
>> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
>> paragraph,
>> and assuming we are in the 'Autonomous Selection is not enabled' case,
>> the OS is supposed to write (not read) the delta between successive
>> reads of the
>> counter, so using this field as is would be bending the definition I
>> think.
>>
>> -
>> It is correct that the "Time Window Register" field specifies a value
>> in ms,
>> but it seems a long time to wait for with irqs off.
> 
> AFAIK, our platforms don't support "Time Window Register".
> 
>>
>> -
>> Theoretically, the perf/ref counters should accumulate to allow computing
>> a correct frequency. Is it possible to know how these counters are
>> accessed ?
>> Is it through PCC channels and there is some undesired delay between the
>> reads of the perf/ref counters ?
> 
> The counters are implemented via mmio instead of PCC channels. So the
> cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.
> 
>>
>> -
>> About making the delay:
>>      max(cppc_cpufreq_get_transition_delay_us(), Time Winder Register)
>> I think it would be good to know why the values of the counters
>> don't accumulate correctly, ideally by getting a trace where a frequency
>> above the maximum frequency is computed, and with the timestamps at which
>> the counters are read.
>> If the values are coming from PCC channels / the firmware, it might be
>> difficult
>> to get.
> 
> I wrote a bpftrace script to trace the below data:
>       - The CPU number
>       - The frequency
>       - The start and end timestamp of the first cppc_get_perf_ctrs() call
>       - The duration/latency of the first cppc_get_perf_ctrs() call
>       - The start and end timestamp of the second cppc_get_perf_ctrs() call
>       - The duration/latency of the second cppc_get_perf_ctrs() call
> 
> The typical logs look like below.
> Good
> CPU: 1
> Freq: 2801485KHz
> First:  2489382384  2489387084 4700ns
> Second: 2489390824  2489394024  3200ns
> --------------------------------------------------
> CPU:    2
> Freq:   2797956KHz
> First:  2490406524  2490411204  4680ns
> Second: 2490414764  2490417684  2920ns
> 
> Bad:
> CPU:    55
> Freq:   3969372KHz
> First:  875659868  875721568  61700ns
> Second: 875725148  875727708  2560ns
> --------------------------------------------------
> CPU: 65
> Freq: 3829744KHz
> First:  3854951136  3854995896 44760ns
> Second: 3854999416  3855002696 3280ns
> --------------------------------------------------
> CPU: 21
> Freq: 4279242KHz
> First:  240834204  240910484 76280ns
> Second: 240914264  240916944  2680ns
> 
> 
> The first line is cpu number, the second line is frequency returned by
> cppc_cpufreq_get_rate(), the third line is the start and end timestamps
> and duration of the first cppc_get_perf_ctrs(), the fourth line is the
> start and end timestamps and duration of the second cppc_get_perf_ctrs().
> 
> So per the log I think we can tell basically the longer the duration the
> higher the error. The 2us delay is not long enough to offset the impact
> from unexpected latency of reading the counters.
> 
> In the worst case the frequency is 4279242KHz, comparing 2800000KHz the
> error is over 50%. So the delay should be 4ms ~ 5ms in order to offset
> the impact from reading the counters if I do the math correctly.
> 
> Hope the trace data is helpful to diagnose the problem.


Thanks for the data. I was thinking the following was happening:

  cppc_get_perf_ctrs()[0]                    cppc_get_perf_ctrs()[1]
/                    \                    /                         \
ref[0]    delivered[0]                    ref[1]          delivered[1]
   |            |                              |                  |
   v            v                              v                  v
----------------------------------------------------------------------> time
    <-delta[0]-> <-------------2us------------> <----delta[1]---->

If delta[0] is really different from delta[1] like above, then the
reference and delivered counters would have accumulated during different
intervals, resulting in a wrong frequency.
If more/less than 2us elapse between the two cppc_get_perf_ctrs() calls,
then it shouldn't have any impact. So waiting ~10ms should theoretically
not solve the issue.

freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])

If the counters are accessed through mmio, I don't see anything that would
make delta[x] vary when calling cppc_get_perf_ctrs(), cf. cpc_read().
Do you know if the address represents real counters or a place in memory
representing something else ?

Would it be possible to try setting the CPU frequency to one unique value
and get a serie of values like:
[timestamp, ref_counter_value, deliverd_counter_value]

This would allow to check that the counters are accumulating at a valid
pace. Also you said there were frequencies above the maximum value, but
are there also frequencies below the minimum value ?

Regards,
Pierre