Message-ID: <14ab4f0c-4789-a1d2-5b01-c3e0be9a823f@os.amperecomputing.com>
Date: Thu, 27 Apr 2023 13:40:19 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Ionela Voinescu <ionela.voinescu@....com>
Cc: Pierre Gondois <pierre.gondois@....com>, viresh.kumar@...aro.org,
scott@...amperecomputing.com, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org,
"Rafael J. Wysocki" <rafael@...nel.org>,
Sumit Gupta <sumitg@...dia.com>
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid
high error
On 4/26/23 12:01 PM, Ionela Voinescu wrote:
> Hi,
>
> + Sumit
>
> On Tuesday 25 Apr 2023 at 18:32:55 (-0700), Yang Shi wrote:
>>
>> On 4/24/23 4:44 AM, Ionela Voinescu wrote:
>>> Hey,
>>>
>>> On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
>>>> On 4/20/23 9:01 AM, Pierre Gondois wrote:
>>>>>>> You say that the cause of this is congestion on the interconnect. I
>>>>>>> don't see a way to check that right now.
>>>>>>> However, your trace is on CPU0, so maybe all the other cores were
>>>>>>> shut down in your test. If this is the case, do you think congestion
>>>>>>> could happen with only one CPU?
>>>>>> No, other CPUs were not shut down in my test. I just ran "yes" on all
>>>>>> cores except CPU 0, then ran the freq-reading script. Since all the
>>>>>> other cores were busy, the script should always have been running on
>>>>>> CPU 0.
>>>>>>
>>>>>> Since the counters, memory, and other devices are on the interconnect,
>>>>>> the congestion may be caused by plenty of factors, IIUC.
>>>>> +Ionela
>>>>>
>>>>> Ionela pointed me to the following patch set, which seems related:
>>>>> https://lore.kernel.org/all/20230418113459.12860-5-sumitg@nvidia.com/
>>>> Thanks for the information. I think we do see a similar symptom. But I'm
>>>> not sure how their counters are implemented; we may not have the same
>>>> root cause.
>>> Yes, my bad, I did not get the chance to read this full thread before
>>> talking with Pierre. In your case the AMUs are accessed via MMIO, while in
>>> their case they are accessed through FFH (IPIs and system registers). The
>>> root cause is indeed different.
>> Yeah, but it seems like using a larger delay could mitigate both issues.
> Yes, there is a minimum required delay of 25us there, due to the way the
> counters accumulate, which is not too bad even with interrupts disabled
> (to cater to cpufreq_quick_get()).
>
> [..]
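
(For context, the reader that consumes these counter snapshots is
cppc_cpufreq_get_rate(): it samples the feedback counters twice, a delay
apart, and derives the frequency from the two deltas. An interrupt landing
between the reads stretches one delta relative to the other, which is where
the "high error" comes from. Abridged sketch of the logic in
drivers/cpufreq/cppc_cpufreq.c, error handling trimmed:

static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
{
	struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
	struct cppc_cpudata *cpu_data = policy->driver_data;
	u64 delivered_perf;
	int ret;

	cpufreq_cpu_put(policy);

	/* First snapshot of delivered/reference counters. */
	ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t0);
	if (ret)
		return ret;

	udelay(2); /* 2usec delay between sampling */

	/* Second snapshot; the deltas between the two give the rate. */
	ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
	if (ret)
		return ret;

	delivered_perf = cppc_perf_from_fbctrs(cpu_data, &fb_ctrs_t0,
					       &fb_ctrs_t1);

	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
}

With only 2us between samples, even a short interrupt is large relative to
the sampling window, hence the proposal in the subject to enlarge the delay.)
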
>>>>>> Yeah, we should be able to find a smaller irq disable section.
>>>>>>
>>>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>>>> index c51d3ccb4cca..105a7e2ffffa 100644
>>>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>>>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>  	struct cppc_pcc_data *pcc_ss_data = NULL;
>>>>>>>  	u64 delivered, reference, ref_perf, ctr_wrap_time;
>>>>>>>  	int ret = 0, regs_in_pcc = 0;
>>>>>>> +	unsigned long flags;
>>>>>>>
>>>>>>>  	if (!cpc_desc) {
>>>>>>>  		pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>>>>>>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>  		}
>>>>>>>  	}
>>>>>>>
>>>>>>> +	local_irq_save(flags);
>>>>>>> +
>>>>>>>  	cpc_read(cpunum, delivered_reg, &delivered);
>>>>>>>  	cpc_read(cpunum, reference_reg, &reference);
>>>>>>>  	cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>>>
>>>>>>> +	local_irq_restore(flags);
>>>>>>> +
>>>>>> cpc_read_ffh() would return -EPERM if IRQs are disabled.
>>>>>>
>>>>>> So, the IRQ disabling must happen for MMIO only, in cpc_read(), for
>>>>>> example:
>>>>> I thought the issue was that IRQs could happen in between the cpc_read()
>>>>> calls, which the patch below would not cover. If the frequency is more
>>>>> accurate with this patch, then I think I don't understand something.
>>>> Yeah, you are correct. The IRQ-disabled window has to cover all the
>>>> cpc_read() calls. I didn't test with this patch; my test was done
>>>> conceptually with:
>>>>
>>>> disable irq
>>>> cppc_get_perf_ctrs(t0)
>>>> udelay(2)
>>>> cppc_get_perf_ctrs(t1)
>>>> enable irq
>>>>
>>>> But this will break cpc_read_ffh().
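
(For reference, FFH breaks because on arm64 the AMU counters are read on the
target CPU via an IPI, and that path refuses to run with interrupts off.
Roughly, paraphrased from the counters_read_on_cpu() helper in
arch/arm64/kernel/topology.c that backs cpc_read_ffh():

static int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
{
	/*
	 * smp_call_function_single() needs to IPI the target CPU and
	 * wait for the result, so it cannot run with IRQs disabled.
	 */
	if (!cpu_has_amu_feat(cpu))
		return -EOPNOTSUPP;

	if (WARN_ON_ONCE(irqs_disabled()))
		return -EPERM;

	smp_call_function_single(cpu, func, val, 1);

	return 0;
}

So any fix that blindly wraps the reads in local_irq_save()/restore() makes
the FFH path fail outright.)
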
>>> Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
>>> are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
>>> register and the reference register, which should have minimal delay in
>>> between?
>>>
>>> As in:
>>>
>>> if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
>>> CPC_IN_SYSTEM_MEMORY(reference_reg))
>>> ...
>>>
>>> This will not and should not affect FFH - the fix for that would have to
>>> be different.
>> It won't work, right? The problem is that cppc_get_perf_ctrs() calls
>> cpc_read() to read delivered and reference respectively, and we can only
>> tell whether they are CPC_IN_SYSTEM_MEMORY in cpc_read(), not in
>> cppc_get_perf_ctrs(). So the resulting code would conceptually look like:
>>
>> disable irq
>> read delivered
>> enable irq
>>
>> disable irq
>> read reference
>> enable irq
>>
>> But there may still be interrupts between the two reads. What we actually want is:
>>
>> disable irq
>> read delivered
>> read reference
>> enable irq
> Yes, this is what I was suggesting above.
>
> I've hacked up the following code. It covers the FFH case as well, with a
> modified version of the solution that Sumit proposed on the other thread:
>
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index 0f17b1c32718..7e828aed3693 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -110,6 +110,11 @@ static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
>  				(cpc)->cpc_entry.reg.space_id ==	\
>  				ACPI_ADR_SPACE_SYSTEM_IO)
>
> +/* Check if a CPC register is in FFH */
> +#define CPC_IN_FFH(cpc) ((cpc)->type == ACPI_TYPE_BUFFER &&		\
> +				(cpc)->cpc_entry.reg.space_id ==	\
> +				ACPI_ADR_SPACE_FIXED_HARDWARE)
> +
>  /* Evaluates to True if reg is a NULL register descriptor */
>  #define IS_NULL_REG(reg) ((reg)->space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY && \
>  				(reg)->address == 0 &&			\
> @@ -1292,6 +1297,24 @@ EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
>   *
>   * Return: 0 for success with perf_fb_ctrs populated else -ERRNO.
>   */
> +
> +struct cycle_counters {
> +	int cpunum;
> +	struct cpc_register_resource *delivered_reg;
> +	struct cpc_register_resource *reference_reg;
> +	u64 *delivered;
> +	u64 *reference;
> +};
> +
> +static int cppc_get_cycle_ctrs(void *cycle_ctrs) {
> +	struct cycle_counters *ctrs = cycle_ctrs;
> +
> +	cpc_read(ctrs->cpunum, ctrs->delivered_reg, ctrs->delivered);
> +	cpc_read(ctrs->cpunum, ctrs->reference_reg, ctrs->reference);
> +
> +	return 0;
> +}
> +
>  int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>  {
>  	struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpunum);
> @@ -1300,7 +1323,9 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>  	int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
>  	struct cppc_pcc_data *pcc_ss_data = NULL;
>  	u64 delivered, reference, ref_perf, ctr_wrap_time;
> +	struct cycle_counters ctrs = {0};
>  	int ret = 0, regs_in_pcc = 0;
> +	unsigned long flags;
>
>  	if (!cpc_desc) {
>  		pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
> @@ -1336,8 +1361,25 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>  		}
>  	}
>
> -	cpc_read(cpunum, delivered_reg, &delivered);
> -	cpc_read(cpunum, reference_reg, &reference);
> +	ctrs.cpunum = cpunum;
> +	ctrs.delivered_reg = delivered_reg;
> +	ctrs.reference_reg = reference_reg;
> +	ctrs.delivered = &delivered;
> +	ctrs.reference = &reference;
> +
> +	if (CPC_IN_FFH(delivered_reg) && CPC_IN_FFH(reference_reg)) {
> +		ret = smp_call_on_cpu(cpunum, cppc_get_cycle_ctrs, &ctrs, false);
> +	} else {
> +		if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
> +		    CPC_IN_SYSTEM_MEMORY(reference_reg)) {
> +			local_irq_save(flags);
> +			cppc_get_cycle_ctrs(&ctrs);
> +			local_irq_restore(flags);
> +		} else {
> +			cppc_get_cycle_ctrs(&ctrs);
> +		}
> +	}
> +
>  	cpc_read(cpunum, ref_perf_reg, &ref_perf);
>
>  	/*
>
> I've only tested this on a model using FFH, with 10us delay, and it
> worked well for me. Yang, Sumit, could you give it a try?
Thanks for the patch. I tested it with a 10us delay, and it works well for
me. There were 0 high errors in my 3-hour test.
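
For anyone who wants to reproduce this: the test is conceptually just a loop
that reads cpuinfo_cur_freq and counts out-of-tolerance samples. A
hypothetical userspace sketch (the nominal frequency and the 1% threshold
below are made-up illustration values, not my actual script):

/* Hypothetical reproducer: count "high error" cpuinfo_cur_freq reads. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const long expected_khz = 2800000;	/* assumed nominal frequency */
	const double max_err = 0.01;		/* arbitrary 1% tolerance */
	long khz, errors = 0, samples = 0;
	FILE *f;

	for (;;) {
		f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq", "r");
		if (!f)
			break;
		if (fscanf(f, "%ld", &khz) != 1) {
			fclose(f);
			break;
		}
		fclose(f);

		samples++;
		if ((double)labs(khz - expected_khz) / expected_khz > max_err)
			errors++;

		if (samples % 100000 == 0)
			printf("%ld high errors / %ld samples\n", errors, samples);
	}
	return 0;
}

Run pinned to CPU 0 (e.g. with taskset) while "yes" loads the other cores,
as described earlier in the thread.
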
>
> Even with a solution like the above (more refined, of course) there is an
> additional improvement possible: we can implement arch_freq_get_on_cpu()
> for arm64 systems that will use cached (on the tick) AMU cycle counter
> samples and have this function called from show_cpuinfo_cur_freq()
> before/instead of calling the .get() function. But this will only help
> arm64 systems with AMUs accessible though system registers. We'll try to
> submit patches on this soon. But as I mentioned to Sumit on the other
> thread, the returned frequency value from this will be an average over 4ms
> (with CONFIG_HZ=250) and could be up to 4ms old (more than that only if the
> CPU was idle/isolated).
>
> Thanks,
> Ionela.
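
For readers unfamiliar with the hook: arch_freq_get_on_cpu() is an existing
weak function (already implemented on x86 via APERF/MPERF) that sysfs can
prefer over the cpufreq driver's .get() callback. A speculative sketch of
what an arm64 version along the lines Ionela describes could look like --
this is not her actual patches, and struct amu_sample plus the two per-CPU
variables are hypothetical, standing in for AMU snapshots that the tick code
would have to save:

/* Hypothetical per-CPU AMU snapshots, assumed to be saved on each tick. */
struct amu_sample {
	u64 const_cnt;	/* constant-rate cycles (arch timer rate) */
	u64 core_cnt;	/* core cycles */
};

static DEFINE_PER_CPU(struct amu_sample, amu_prev_sample);
static DEFINE_PER_CPU(struct amu_sample, amu_cur_sample);

unsigned int arch_freq_get_on_cpu(int cpu)
{
	struct amu_sample *prev = per_cpu_ptr(&amu_prev_sample, cpu);
	struct amu_sample *cur = per_cpu_ptr(&amu_cur_sample, cpu);
	u64 d_core = cur->core_cnt - prev->core_cnt;
	u64 d_const = cur->const_cnt - prev->const_cnt;

	if (!d_const)
		return 0;

	/* Average frequency over the last tick period, in kHz. */
	return div64_u64(d_core * (arch_timer_get_rate() / 1000), d_const);
}

As Ionela notes, anything built this way reports an average over the last
tick period and can be up to one tick stale, so it trades freshness for
avoiding the racy two-sample read entirely.
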