linux-kernel - Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid high error

Message-ID: <3e239024-91d8-ea06-25a4-631496576319@os.amperecomputing.com>
Date:   Fri, 7 Apr 2023 15:19:25 -0700
From:   Yang Shi <yang@...amperecomputing.com>
To:     Pierre Gondois <pierre.gondois@....com>
Cc:     viresh.kumar@...aro.org, scott@...amperecomputing.com,
        linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
        "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid
 high error



On 4/7/23 1:31 AM, Pierre Gondois wrote:
> Hello Yang,
>
> On 4/6/23 23:52, Yang Shi wrote:
>>
>>
>> On 4/5/23 10:57 AM, Pierre Gondois wrote:
>>>
>>>
>>> On 4/4/23 21:07, Yang Shi wrote:
>>>>
>>>>
>>>> On 3/29/23 11:43 AM, Rafael J. Wysocki wrote:
>>>>> On Tue, Mar 28, 2023 at 9:39 PM Yang Shi
>>>>> <yang@...amperecomputing.com> wrote:
>>>>>> When testing CPPC cpufreq on our platform, we noticed that the error
>>>>>> may be quite high and may occur quite often.  For example, on a
>>>>>> platform with a maximum frequency of 2.8GHz, when the CPUs were fully
>>>>>> loaded (100% load), cpuinfo_cur_freq could show 4GHz, which means the
>>>>>> error is > 40%.  The high error (> 1%) happened 256 times out of 2127
>>>>>> samples (sampled every 3 seconds) in a test of approximately 2 hours.
>>>>> The description above is a bit cryptic IMV.  For example, it is not
>>>>> particularly clear what "high error" means.
>>>>>
>>>>>> We tried to enlarge the delay, and tested with 100us, 1ms and 10ms.
>>>>>> The results are below.
>>>>>>
>>>>>> 100us:
>>>>>> The highest error is 4GHz, 22 times out of 3623 samples
>>>>>>
>>>>>> 1ms:
>>>>>> The highest error is 3.3GHz, 3 times out of 2814 samples
>>>>>>
>>>>>> 10ms:
>>>>>> No high error anymore
>>>>>>
>>>>>> Increase the measurement delay in cppc_cpufreq_get_rate to 10ms to
>>>>>> avoid high measurement errors.
>>>>>>
>>>>>> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
>>>>>> ---
>>>>>>     drivers/cpufreq/cppc_cpufreq.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>>>>>> index 022e3555407c..c2bf65448d3d 100644
>>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>>> @@ -851,7 +851,7 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
>>>>>>            if (ret)
>>>>>>                    return ret;
>>>>>>
>>>>>> -       udelay(2); /* 2usec delay between sampling */
>>>>>> +       mdelay(10); /* 10msec delay between sampling */
>>>>> This function can be called with interrupts off, so it cannot spin
>>>>> for 10 ms.
>>>>
>>>> Per Pierre's comment, the delay may still be in ms. Is that still too
>>>> long? A quick look at the code shows cpufreq_quick_get() is the only
>>>> caller with irqs off IIRC. So can we have another callback for it, for
>>>> example get_quick(), which spins for a shorter time (for example, keeps
>>>> the 2us delay), and then have the ->get() callback use the longer delay?
>>>
>>> -
>>> According to the same ACPI 6.5 s8.4.6.1.2.5 "Time Window Register"
>>> paragraph, and assuming we are in the 'Autonomous Selection is not
>>> enabled' case, the OS is supposed to write (not read) the delta between
>>> successive reads of the counter, so using this field as is would be
>>> bending the definition, I think.
>>>
>>> -
>>> It is correct that the "Time Window Register" field specifies a value
>>> in ms, but that seems a long time to wait with irqs off.
>>
>> AFAIK, our platforms don't support "Time Window Register".
>>
>>>
>>> -
>>> Theoretically, the perf/ref counters should accumulate to allow
>>> computing a correct frequency. Is it possible to know how these
>>> counters are accessed? Is it through PCC channels, with some undesired
>>> delay between the reads of the perf/ref counters?
>>
>> The counters are implemented via mmio instead of PCC channels. So the
>> cpc_read() calls should go to ACPI_ADR_SPACE_SYSTEM_MEMORY IIRC.
>>
>>>
>>> -
>>> About making the delay:
>>>      max(cppc_cpufreq_get_transition_delay_us(), Time Window Register)
>>> I think it would be good to know why the values of the counters
>>> don't accumulate correctly, ideally by getting a trace where a
>>> frequency above the maximum frequency is computed, along with the
>>> timestamps at which the counters are read.
>>> If the values are coming from PCC channels / the firmware, it might be
>>> difficult to get.
>>
>> I wrote a bpftrace script to trace the following data:
>>       - The CPU number
>>       - The frequency
>>       - The start and end timestamps of the first cppc_get_perf_ctrs() call
>>       - The duration/latency of the first cppc_get_perf_ctrs() call
>>       - The start and end timestamps of the second cppc_get_perf_ctrs() call
>>       - The duration/latency of the second cppc_get_perf_ctrs() call
>>
>> Typical logs look like the following.
>> Good:
>> CPU:    1
>> Freq:   2801485KHz
>> First:  2489382384  2489387084  4700ns
>> Second: 2489390824  2489394024  3200ns
>> --------------------------------------------------
>> CPU:    2
>> Freq:   2797956KHz
>> First:  2490406524  2490411204  4680ns
>> Second: 2490414764  2490417684  2920ns
>>
>> Bad:
>> CPU:    55
>> Freq:   3969372KHz
>> First:  875659868  875721568  61700ns
>> Second: 875725148  875727708  2560ns
>> --------------------------------------------------
>> CPU:    65
>> Freq:   3829744KHz
>> First:  3854951136  3854995896  44760ns
>> Second: 3854999416  3855002696  3280ns
>> --------------------------------------------------
>> CPU:    21
>> Freq:   4279242KHz
>> First:  240834204  240910484  76280ns
>> Second: 240914264  240916944  2680ns
>>
>>
>> The first line is the CPU number, the second line is the frequency
>> returned by cppc_cpufreq_get_rate(), the third line is the start and end
>> timestamps and duration of the first cppc_get_perf_ctrs() call, and the
>> fourth line is the same for the second cppc_get_perf_ctrs() call.
>>
>> So per the log, I think we can tell that the longer a
>> cppc_get_perf_ctrs() call takes, the higher the error. The 2us delay is
>> not long enough to offset the impact of unexpected latency when reading
>> the counters.
>>
>> In the worst case the frequency is 4279242KHz; compared with 2800000KHz,
>> the error is over 50%. So the delay should be 4ms ~ 5ms in order to offset
>> the impact of reading the counters, if I did the math correctly.
>>
>> Hope the trace data is helpful to diagnose the problem.
>
>
> Thanks for the data. I was thinking the following was happening:
>
> cppc_get_perf_ctrs()[0]                      cppc_get_perf_ctrs()[1]
> /                    \                      /                        \
> ref[0]    delivered[0]                      ref[1]        delivered[1]
>   |            |                              |                  |
>   v            v                              v                  v
> ----------------------------------------------------------------------> time
>    <-delta[0]-> <-------------2us------------> <----delta[1]---->
>
> If delta[0] is really different from delta[1] like above, then the
> reference and delivered counters would have accumulated during different
> intervals, resulting in a wrong frequency.

Yeah, it looks like so.

> If more/less than 2us elapse between the two cppc_get_perf_ctrs() calls,
> then it shouldn't have any impact. So waiting ~10ms should theoretically
> not solve the issue.

I'm not sure whether the 10ms delay really resolved the issue, but it 
did reduce the magnitude of the error.
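
To put rough numbers on that, here is a back-of-the-envelope sketch in user-space C, not kernel code. It assumes the simple model from the diagram above: both counters tick at constant rates, so the relative error of the computed frequency is about |delta[1] - delta[0]| divided by the time between the two reference counter reads. The helper names `err_permille()` and `min_window_ns()` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Relative error of the computed frequency, in parts per thousand, under
 * the assumed model: error = skew / window, where
 *   skew   = |delta[1] - delta[0]|
 *   window = time between the two reference counter reads.
 */
static uint64_t err_permille(uint64_t skew_ns, uint64_t window_ns)
{
	assert(window_ns != 0);
	return skew_ns * 1000 / window_ns;
}

/* Smallest window (in ns) that keeps the error at or below max_permille. */
static uint64_t min_window_ns(uint64_t skew_ns, uint64_t max_permille)
{
	assert(max_permille != 0);
	/* Round up so the bound is actually met. */
	return (skew_ns * 1000 + max_permille - 1) / max_permille;
}
```

With a hypothetical 60us skew, `err_permille(60000, 10000000)` gives 6 (0.6%) for a 10ms window, and `min_window_ns(60000, 10)` gives 6000000, i.e. 6ms to stay within 1%. In this model a longer delay does not remove the skew, it only dilutes it.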

BTW, I don't see irqs being disabled when reading cpuinfo_cur_freq, so it
looks like interrupts could easily cause the difference between
delta[0] and delta[1]. And that difference seems to matter.

And the counters are accessed through an interconnect on our platform, 
so the interconnect congestion may result in the difference as well.

>
> freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0])
>
> If the counters are accessed through mmio, I don't see anything that
> would make delta[x] vary when calling cppc_get_perf_ctrs(), cf. cpc_read().
> Do you know if the address represents real counters or a place in memory
> representing something else?

The address does represent real counters.
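
For illustration, a minimal user-space sketch of the formula above and of how unequal delta[0] and delta[1] skew it. This is not the kernel code: `sample_at()`, `est_khz()` and the timestamps are hypothetical, the counters are assumed to tick at perfectly constant rates, and the read order (ref first, then delivered) simply follows the diagram.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only, not kernel code. */
struct ctr_sample {
	uint64_t ref;        /* reference counter value */
	uint64_t delivered;  /* delivered counter value */
};

/*
 * Hypothetical sampler: the reference counter is read at t_ref_ns and the
 * delivered counter at t_del_ns. Both counters are assumed to tick at a
 * constant rate given in kHz (1 kHz = 1 tick per 1000000 ns).
 */
static struct ctr_sample sample_at(uint64_t t_ref_ns, uint64_t t_del_ns,
				   uint64_t ref_khz, uint64_t cpu_khz)
{
	struct ctr_sample s;

	s.ref = t_ref_ns * ref_khz / 1000000;
	s.delivered = t_del_ns * cpu_khz / 1000000;
	return s;
}

/* freq = ref_freq * (delivered[1] - delivered[0]) / (ref[1] - ref[0]) */
static uint64_t est_khz(uint64_t ref_khz, struct ctr_sample s0,
			struct ctr_sample s1)
{
	uint64_t dref = s1.ref - s0.ref;

	assert(dref != 0);
	return ref_khz * (s1.delivered - s0.delivered) / dref;
}
```

With a 1GHz reference and a CPU running at 2800000KHz, sampling at (0ns, 3000ns) and then (10000ns, 13000ns) gives delta[0] == delta[1] and the estimate is 2800000KHz. Moving the last delivered read to 17000ns (delta[1] = 7000ns) makes the estimate 3920000KHz, i.e. 40% too high, which is the 4000ns skew divided by the 10000ns between the reference reads.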

>
> Would it be possible to try setting the CPU frequency to one unique value
> and get a series of values like:
> [timestamp, ref_counter_value, delivered_counter_value]

Could you please elaborate regarding "setting the CPU frequency to one 
unique value"? What value is unique?

>
> This would allow checking that the counters are accumulating at a valid
> pace. Also, you said there were frequencies above the maximum value, but
> are there also frequencies below the minimum value?

I've never seen the frequency below the minimum value.
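
Once such a series is collected, one possible way to check the pace is sketched below in user-space C. The record layout `struct ctr_rec` and the helper `count_bad_intervals()` are hypothetical, the timestamps are assumed to be in ns and strictly increasing, and only the delivered counter is checked here (the reference counter could be checked the same way).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One hypothetical trace record: [timestamp, ref_counter, delivered_counter]. */
struct ctr_rec {
	uint64_t t_ns;
	uint64_t ref;
	uint64_t delivered;
};

/*
 * Count the intervals whose delivered-counter rate, measured against the
 * timestamps rather than against the reference counter, falls outside
 * [min_khz, max_khz].
 */
static size_t count_bad_intervals(const struct ctr_rec *r, size_t n,
				  uint64_t min_khz, uint64_t max_khz)
{
	size_t bad = 0;

	for (size_t i = 1; i < n; i++) {
		uint64_t dt = r[i].t_ns - r[i - 1].t_ns;
		uint64_t dd = r[i].delivered - r[i - 1].delivered;
		uint64_t khz;

		assert(dt != 0); /* timestamps must strictly increase */
		khz = dd * 1000000 / dt; /* ticks per ns, scaled to kHz */

		if (khz < min_khz || khz > max_khz)
			bad++;
	}
	return bad;
}
```

A non-zero count on a CPU pinned to one frequency would suggest the counters themselves are not accumulating at a steady pace, while a zero count would point back at the skew between the reads.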

>
> Regards,
> Pierre