Message-ID: <3186fce7-679c-f1da-363b-bbc31cf68f06@nvidia.com>
Date:   Fri, 28 Apr 2023 16:32:19 +0530
From:   Sumit Gupta <sumitg@...dia.com>
To:     Yang Shi <yang@...amperecomputing.com>,
        Ionela Voinescu <ionela.voinescu@....com>
CC:     Pierre Gondois <pierre.gondois@....com>, <viresh.kumar@...aro.org>,
        <scott@...amperecomputing.com>, <linux-pm@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>,
        "Rafael J. Wysocki" <rafael@...nel.org>,
        Sumit Gupta <sumitg@...dia.com>,
        Krishna Sitaraman <ksitaraman@...dia.com>,
        Sanjay Chandrashekara <sanjayc@...dia.com>,
        Bibek Basu <bbasu@...dia.com>
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid
 high error



On 28/04/23 02:10, Yang Shi wrote:
> 
> On 4/26/23 12:01 PM, Ionela Voinescu wrote:
>> Hi,
>>
>> + Sumit
>>
>> On Tuesday 25 Apr 2023 at 18:32:55 (-0700), Yang Shi wrote:
>>>
>>> On 4/24/23 4:44 AM, Ionela Voinescu wrote:
>>>> Hey,
>>>>
>>>> On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
>>>>> On 4/20/23 9:01 AM, Pierre Gondois wrote:
>>>>>>>> You say that the cause of this is a congestion in the
>>>>>>>> interconnect. I don't see a way to check that right now.
>>>>>>>> However, your trace is on CPU0, so maybe all the other cores
>>>>>>>> were shut down in your test. If this is the case, do you think
>>>>>>>> a congestion could happen with only one CPU?
>>>>>>> No, other CPUs were not shut down in my test. I just ran "yes"
>>>>>>> on all cores except CPU 0, then ran the frequency-reading
>>>>>>> script. Since all the other cores are busy, the script should
>>>>>>> always run on CPU 0.
>>>>>>>
>>>>>>> Since the counters, memory, and other devices are on the
>>>>>>> interconnect, the congestion may be caused by plenty of
>>>>>>> factors, IIUC.
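For context, the frequency-reading side of such a test is essentially a
sysfs polling loop like the sketch below. The actual script was not posted,
so the sysfs path, target CPU, and iteration count here are assumptions:

	#include <stdio.h>

	int main(void)
	{
		char buf[32];
		FILE *f;
		int i;

		/* Poll the driver-reported frequency repeatedly; with every
		 * other core kept busy by "yes", this loop effectively runs
		 * on CPU 0. */
		for (i = 0; i < 1000; i++) {
			f = fopen("/sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_cur_freq", "r");
			if (!f)
				return 1;
			if (fgets(buf, sizeof(buf), f))
				printf("%s", buf);	/* compare against the set frequency */
			fclose(f);
		}
		return 0;
	}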
>>>>>> +Ionela
>>>>>>
>>>>>> Ionela pointed me to the following patch-set, which seems related:
>>>>>> https://lore.kernel.org/all/20230418113459.12860-5-sumitg@nvidia.com/
>>>>> Thanks for the information. I think we do have a similar symptom,
>>>>> but I'm not sure how their counters are implemented, so we may not
>>>>> have the same root cause.
>>>> Yes, my bad, I did not get the chance to read this full thread
>>>> before talking with Pierre. In your case you have AMUs accessed via
>>>> MMIO, and in that case they are accessed through FFH (IPIs and
>>>> system registers). The root cause is indeed different.
>>> Yeah, but it seems like using a larger delay could mitigate both issues.
>> Yes, there is a minimum delay required there of 25us due to the way that
>> the counters accumulate, which is not too bad even with interrupts
>> disabled (to cater to cpufreq_quick_get()).
>>
>> [..]
>>>>>>> Yeah, we should be able to find a smaller irq disable section.
>>>>>>>
>>>>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>>>>> index c51d3ccb4cca..105a7e2ffffa 100644
>>>>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>>>>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>>            struct cppc_pcc_data *pcc_ss_data = NULL;
>>>>>>>>            u64 delivered, reference, ref_perf, ctr_wrap_time;
>>>>>>>>            int ret = 0, regs_in_pcc = 0;
>>>>>>>> +       unsigned long flags;
>>>>>>>>
>>>>>>>>            if (!cpc_desc) {
>>>>>>>>                    pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>>>>>>>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>>                    }
>>>>>>>>            }
>>>>>>>>
>>>>>>>> +       local_irq_save(flags);
>>>>>>>> +
>>>>>>>>            cpc_read(cpunum, delivered_reg, &delivered);
>>>>>>>>            cpc_read(cpunum, reference_reg, &reference);
>>>>>>>>            cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>>>>
>>>>>>>> +       local_irq_restore(flags);
>>>>>>>> +
>>>>>>> cpc_read_ffh() would return -EPERM if irqs are disabled.
>>>>>>>
>>>>>>> So the irq disabling must happen for MMIO only, in cpc_read(),
>>>>>>> for example:
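The example itself appears to have been trimmed in quoting. The kind of
change being described might look roughly like the sketch below: guard only
the system-memory (MMIO) access inside cpc_read(), leaving the FFH path
untouched. The helper name and register width are assumptions, not the
patch that was actually posted:

	/*
	 * Sketch only: disable interrupts around the MMIO read so the
	 * read itself cannot be interrupted, while cpc_read_ffh() (which
	 * needs irqs enabled) is unaffected. As Pierre notes below, this
	 * alone still leaves a window between the two cpc_read() calls.
	 */
	static void cpc_read_mmio(void __iomem *vaddr, u64 *val)
	{
		unsigned long flags;

		local_irq_save(flags);
		*val = readq_relaxed(vaddr);	/* assuming a 64-bit register */
		local_irq_restore(flags);
	}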
>>>>>> I thought the issue was that irqs could happen in between the
>>>>>> cpc_read() calls, which the patch below would not cover. If the
>>>>>> frequency is more accurate with this patch, then I must be
>>>>>> misunderstanding something.
>>>>> Yeah, you are correct. The irq disabling window has to cover all
>>>>> the cpc_read() calls. I didn't test with this patch; my test was
>>>>> done conceptually with:
>>>>>
>>>>> disable irq
>>>>> cppc_get_perf_ctrs(t0)
>>>>> udelay(2)
>>>>> cppc_get_perf_ctrs(t1)
>>>>> enable irq
>>>>>
>>>>> But this will break cpc_read_ffh().
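In code, that conceptual test is roughly the sketch below. The frequency
math follows how the cppc_cpufreq driver derives frequency from the
feedback counters; the variable names and the choice of target CPU are
assumptions:

	/*
	 * Rough sketch of the conceptual test above. Note this is the
	 * variant that breaks cpc_read_ffh(): for FFH registers the
	 * reads IPI the target CPU, which fails under disabled irqs.
	 */
	struct cppc_perf_fb_ctrs t0 = {0}, t1 = {0};
	u64 delta_delivered, delta_reference;
	unsigned long flags;
	int cpu = 1;	/* one of the busy CPUs */

	local_irq_save(flags);
	cppc_get_perf_ctrs(cpu, &t0);
	udelay(2);
	cppc_get_perf_ctrs(cpu, &t1);
	local_irq_restore(flags);

	delta_delivered = t1.delivered - t0.delivered;
	delta_reference = t1.reference - t0.reference;
	/* freq ~= reference_freq * delta_delivered / delta_reference */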
>>>> Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
>>>> are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
>>>> register and the reference register, which should have minimal delay in
>>>> between?
>>>>
>>>> As in:
>>>>
>>>> if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
>>>>       CPC_IN_SYSTEM_MEMORY(reference_reg))
>>>>     ...
>>>>
>>>> This will and should not affect FFH - the fix for that would have to be
>>>> different.
>>> It won't work, right? The problem is that cppc_get_perf_ctrs() calls
>>> cpc_read() to read delivered and reference respectively; we can only
>>> tell whether they are CPC_IN_SYSTEM_MEMORY in cpc_read(), not in
>>> cppc_get_perf_ctrs(). So the resulting code would conceptually look
>>> like:
>>>
>>> disable irq
>>> read delivered
>>> enable irq
>>>
>>> disable irq
>>> read reference
>>> enable irq
>>>
>>> But there may still be interrupts between the two reads. We actually
>>> want:
>>>
>>> disable irq
>>> read delivered
>>> read reference
>>> enable irq
>> Yes, this is what I was suggesting above.
>>
>> I've hacked up the following code. It covers the FFH case as well,
>> with a modified version of the solution Sumit proposed on the other
>> thread:
>>
>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>> index 0f17b1c32718..7e828aed3693 100644
>> --- a/drivers/acpi/cppc_acpi.c
>> +++ b/drivers/acpi/cppc_acpi.c
>> @@ -110,6 +110,11 @@ static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
>>                               (cpc)->cpc_entry.reg.space_id ==        \
>>                               ACPI_ADR_SPACE_SYSTEM_IO)
>>
>> +/* Check if a CPC register is in FFH */
>> +#define CPC_IN_FFH(cpc) ((cpc)->type == ACPI_TYPE_BUFFER &&  \
>> +                             (cpc)->cpc_entry.reg.space_id ==        \
>> +                             ACPI_ADR_SPACE_FIXED_HARDWARE)
>> +
>>   /* Evaluates to True if reg is a NULL register descriptor */
>>   #define IS_NULL_REG(reg) ((reg)->space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY && \
>>                               (reg)->address == 0 &&                  \
>> @@ -1292,6 +1297,24 @@ EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
>>    *
>>    * Return: 0 for success with perf_fb_ctrs populated else -ERRNO.
>>    */
>> +
>> +struct cycle_counters {
>> +     int cpunum;
>> +     struct cpc_register_resource *delivered_reg;
>> +     struct cpc_register_resource *reference_reg;
>> +     u64 *delivered;
>> +     u64 *reference;
>> +};
>> +
>> +static int cppc_get_cycle_ctrs(void *cycle_ctrs) {
>> +     struct cycle_counters *ctrs = cycle_ctrs;
>> +
>> +     cpc_read(ctrs->cpunum, ctrs->delivered_reg, ctrs->delivered);
>> +     cpc_read(ctrs->cpunum, ctrs->reference_reg, ctrs->reference);
>> +
>> +     return 0;
>> +}
>> +
>>   int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>   {
>>       struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpunum);
>> @@ -1300,7 +1323,9 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>       int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
>>       struct cppc_pcc_data *pcc_ss_data = NULL;
>>       u64 delivered, reference, ref_perf, ctr_wrap_time;
>> +     struct cycle_counters ctrs = {0};
>>       int ret = 0, regs_in_pcc = 0;
>> +     unsigned long flags;
>>
>>       if (!cpc_desc) {
>>               pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>> @@ -1336,8 +1361,25 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>               }
>>       }
>>
>> -     cpc_read(cpunum, delivered_reg, &delivered);
>> -     cpc_read(cpunum, reference_reg, &reference);
>> +     ctrs.cpunum = cpunum;
>> +     ctrs.delivered_reg = delivered_reg;
>> +     ctrs.reference_reg = reference_reg;
>> +     ctrs.delivered = &delivered;
>> +     ctrs.reference = &reference;
>> +
>> +     if (CPC_IN_FFH(delivered_reg) && CPC_IN_FFH(reference_reg)) {
>> +             ret = smp_call_on_cpu(cpunum, cppc_get_cycle_ctrs, &ctrs, false);
>> +     } else {
>> +             if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
>> +                 CPC_IN_SYSTEM_MEMORY(reference_reg)) {
>> +                     local_irq_save(flags);
>> +                     cppc_get_cycle_ctrs(&ctrs);
>> +                     local_irq_restore(flags);
>> +             } else {
>> +                     cppc_get_cycle_ctrs(&ctrs);
>> +             }
>> +     }
>> +
>>       cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>
>>       /*
>>
>> I've only tested this on a model using FFH, with 10us delay, and it
>> worked well for me. Yang, Sumit, could you give it a try?
> 
> Thanks for the patch. I tested it with a 10us delay and it works well
> for me. There were no high errors in my 3-hour test.
>
With this change, the delta between the set frequency and the read-back
frequency has reduced from ~25% to <10%, occasionally ~15%. I tried
delays of 10, 25, 50, and 75 us.

With the change in [1] and the 25us delay in [2], the delta was <5%.
[1] https://lore.kernel.org/all/20230418113459.12860-7-sumitg@nvidia.com/
[2] https://lore.kernel.org/all/20230418113459.12860-5-sumitg@nvidia.com/

Thank you,
Sumit Gupta

>>
>> Even with a solution like the above (more refined, of course) there is
>> an additional improvement possible: we can implement
>> arch_freq_get_on_cpu() for arm64 systems so that it uses cached (on
>> the tick) AMU cycle counter samples, and have this function called
>> from show_cpuinfo_cur_freq() before/instead of calling the .get()
>> function. But this will only help arm64 systems with AMUs accessible
>> through system registers. We'll try to submit patches on this soon.
>> But as I mentioned to Sumit on the other thread, the returned
>> frequency value will be an average over 4ms (with CONFIG_HZ=250) and
>> could be up to 4ms old (more than that only if the CPU was
>> idle/isolated).
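For the archive, the rough shape of that idea might be the sketch below.
The per-cpu sample cache and field names are assumptions, not the patches
Ionela refers to:

	/*
	 * Sketch: sample the AMU counters from the tick, and let
	 * arch_freq_get_on_cpu() turn the last two samples into an
	 * average frequency over the tick period.
	 */
	struct amu_sample {
		u64 corecnt;	/* AMU core cycle counter */
		u64 constcnt;	/* AMU constant-rate counter */
	};

	static DEFINE_PER_CPU(struct amu_sample, prev_sample);
	static DEFINE_PER_CPU(struct amu_sample, cur_sample);

	unsigned int arch_freq_get_on_cpu(int cpu)
	{
		struct amu_sample *prev = per_cpu_ptr(&prev_sample, cpu);
		struct amu_sample *cur = per_cpu_ptr(&cur_sample, cpu);
		u64 dcore = cur->corecnt - prev->corecnt;
		u64 dconst = cur->constcnt - prev->constcnt;

		/* average kHz over the last tick period, assuming the
		 * constant counter runs at arch_timer_get_rate() */
		return div64_u64(dcore * (arch_timer_get_rate() / 1000), dconst);
	}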
>>
>> Thanks,
>> Ionela.
> 
