Message-ID: <14ab4f0c-4789-a1d2-5b01-c3e0be9a823f@os.amperecomputing.com>
Date: Thu, 27 Apr 2023 13:40:19 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Ionela Voinescu <ionela.voinescu@....com>
Cc: Pierre Gondois <pierre.gondois@....com>, viresh.kumar@...aro.org,
scott@...amperecomputing.com, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org,
"Rafael J. Wysocki" <rafael@...nel.org>,
Sumit Gupta <sumitg@...dia.com>
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid
high error
On 4/26/23 12:01 PM, Ionela Voinescu wrote:
> Hi,
>
> + Sumit
>
> On Tuesday 25 Apr 2023 at 18:32:55 (-0700), Yang Shi wrote:
>>
>> On 4/24/23 4:44 AM, Ionela Voinescu wrote:
>>> Hey,
>>>
>>> On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
>>>> On 4/20/23 9:01 AM, Pierre Gondois wrote:
>>>>>>> You say that the cause of this is congestion on the interconnect. I
>>>>>>> don't see a way to check that right now.
>>>>>>> However, your trace is on CPU0, so maybe all the other cores were
>>>>>>> shut down in your test. If this is the case, do you think congestion
>>>>>>> could happen with only one CPU?
>>>>>> No, other CPUs were not shut down in my test. I just ran "yes" on all
>>>>>> cores except CPU 0, then ran the freq-reading script. Since all the
>>>>>> other cores were busy, the script should always have been running on
>>>>>> CPU 0.
>>>>>>
>>>>>> Since the counters, memory, and other devices are on the interconnect,
>>>>>> the congestion may be caused by plenty of factors, IIUC.
>>>>> +Ionela
>>>>>
>>>>> Ionela pointed me to the following patch set, which seems related:
>>>>> https://lore.kernel.org/all/20230418113459.12860-5-sumitg@nvidia.com/
>>>> Thanks for the information. I think we do see a similar symptom. But I'm
>>>> not sure how their counters are implemented; we may not have the same
>>>> root cause.
>>> Yes, my bad, I did not get the chance to read this full thread before
>>> talking with Pierre. In your case the AMUs are accessed via MMIO, while in
>>> their case they are accessed through FFH (IPIs and system registers). The
>>> root cause is indeed different.
>> Yeah, but it seems like using a larger delay could mitigate both issues.
> Yes, there is a minimum required delay of 25us there, due to the way the
> counters accumulate, which is not too bad even with interrupts disabled
> (to cater to cpufreq_quick_get()).
>
> [..]
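
(For context, the reader that consumes these counter snapshots is
cppc_cpufreq_get_rate(): it samples the feedback counters twice, a delay
apart, and derives the frequency from the two deltas. An interrupt landing
between the reads stretches one delta relative to the other, which is where
the "high error" comes from. Abridged sketch of the logic in
drivers/cpufreq/cppc_cpufreq.c, error handling trimmed:

static unsigned int cppc_cpufreq_get_rate(unsigned int cpu)
{
	struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
	struct cppc_cpudata *cpu_data = policy->driver_data;
	u64 delivered_perf;
	int ret;

	cpufreq_cpu_put(policy);

	/* First snapshot of delivered/reference counters. */
	ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t0);
	if (ret)
		return ret;

	udelay(2); /* 2usec delay between sampling */

	/* Second snapshot; the deltas between the two give the rate. */
	ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
	if (ret)
		return ret;

	delivered_perf = cppc_perf_from_fbctrs(cpu_data, &fb_ctrs_t0,
					       &fb_ctrs_t1);

	return cppc_cpufreq_perf_to_khz(cpu_data, delivered_perf);
}

With only 2us between samples, even a short interrupt is large relative to
the sampling window, hence the proposal in the subject to enlarge the delay.)
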
>>>>>> Yeah, we should be able to find a smaller irq disable section.
>>>>>>
>>>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>>>> index c51d3ccb4cca..105a7e2ffffa 100644
>>>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>>>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>  	struct cppc_pcc_data *pcc_ss_data = NULL;
>>>>>>>  	u64 delivered, reference, ref_perf, ctr_wrap_time;
>>>>>>>  	int ret = 0, regs_in_pcc = 0;
>>>>>>> +	unsigned long flags;
>>>>>>>
>>>>>>>  	if (!cpc_desc) {
>>>>>>>  		pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>>>>>>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>>>  		}
>>>>>>>  	}
>>>>>>>
>>>>>>> +	local_irq_save(flags);
>>>>>>> +
>>>>>>>  	cpc_read(cpunum, delivered_reg, &delivered);
>>>>>>>  	cpc_read(cpunum, reference_reg, &reference);
>>>>>>>  	cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>>>
>>>>>>> +	local_irq_restore(flags);
>>>>>>> +
>>>>>> cpc_read_ffh() would return -EPERM if IRQs are disabled.
>>>>>>
>>>>>> So, the IRQ disabling must happen for MMIO only, in cpc_read(), for
>>>>>> example:
>>>>> I thought the issue was that IRQs could happen in between the cpc_read()
>>>>> calls, which the patch below would not cover. If the frequency is more
>>>>> accurate with this patch, then I think I don't understand something.
>>>> Yeah, you are correct. The IRQ-disabled window has to cover all the
>>>> cpc_read() calls. I didn't test with this patch; my test was done
>>>> conceptually with:
>>>>
>>>> disable irq
>>>> cppc_get_perf_ctrs(t0)
>>>> udelay(2)
>>>> cppc_get_perf_ctrs(t1)
>>>> enable irq
>>>>
>>>> But this will break cpc_read_ffh().
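
(For reference, FFH breaks because on arm64 the AMU counters are read on the
target CPU via an IPI, and that path refuses to run with interrupts off.
Roughly, paraphrased from the counters_read_on_cpu() helper in
arch/arm64/kernel/topology.c that backs cpc_read_ffh():

static int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
{
	/*
	 * smp_call_function_single() needs to IPI the target CPU and
	 * wait for the result, so it cannot run with IRQs disabled.
	 */
	if (!cpu_has_amu_feat(cpu))
		return -EOPNOTSUPP;

	if (WARN_ON_ONCE(irqs_disabled()))
		return -EPERM;

	smp_call_function_single(cpu, func, val, 1);

	return 0;
}

So any fix that blindly wraps the reads in local_irq_save()/restore() makes
the FFH path fail outright.)
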
>>> Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
>>> are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
>>> register and the reference register, which should have minimal delay in
>>> between?
>>>
>>> As in:
>>>
>>> if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
>>> CPC_IN_SYSTEM_MEMORY(reference_reg))
>>> ...
>>>
>>> This will not and should not affect FFH - the fix for that would have to
>>> be different.
>> It won't work, right? The problem is that cppc_get_perf_ctrs() calls
>> cpc_read() to read delivered and reference respectively, and we can only
>> tell whether they are CPC_IN_SYSTEM_MEMORY in cpc_read(), not in
>> cppc_get_perf_ctrs(). So the resulting code would conceptually look like:
>>
>> disable irq
>> read delivered
>> enable irq
>>
>> disable irq
>> read reference
>> enable irq
>>
>> But there may still be interrupts between the two reads. What we actually want is:
>>
>> disable irq
>> read delivered
>> read reference
>> enable irq
> Yes, this is what I was suggesting above.
>
> I've hacked up the following code. It covers the FFH case as well, with a
> modified version of the solution that Sumit proposed on the other thread:
>
> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> index 0f17b1c32718..7e828aed3693 100644
> --- a/drivers/acpi/cppc_acpi.c
> +++ b/drivers/acpi/cppc_acpi.c
> @@ -110,6 +110,11 @@ static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
>  				(cpc)->cpc_entry.reg.space_id ==	\
>  				ACPI_ADR_SPACE_SYSTEM_IO)
>
> +/* Check if a CPC register is in FFH */
> +#define CPC_IN_FFH(cpc) ((cpc)->type == ACPI_TYPE_BUFFER &&		\
> +				(cpc)->cpc_entry.reg.space_id ==	\
> +				ACPI_ADR_SPACE_FIXED_HARDWARE)
> +
>  /* Evaluates to True if reg is a NULL register descriptor */
>  #define IS_NULL_REG(reg) ((reg)->space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY && \
>  				(reg)->address == 0 &&			\
> @@ -1292,6 +1297,24 @@ EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
>   *
>   * Return: 0 for success with perf_fb_ctrs populated else -ERRNO.
>   */
> +
> +struct cycle_counters {
> +	int cpunum;
> +	struct cpc_register_resource *delivered_reg;
> +	struct cpc_register_resource *reference_reg;
> +	u64 *delivered;
> +	u64 *reference;
> +};
> +
> +static int cppc_get_cycle_ctrs(void *cycle_ctrs) {
> +	struct cycle_counters *ctrs = cycle_ctrs;
> +
> +	cpc_read(ctrs->cpunum, ctrs->delivered_reg, ctrs->delivered);
> +	cpc_read(ctrs->cpunum, ctrs->reference_reg, ctrs->reference);
> +
> +	return 0;
> +}
> +
>  int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>  {
>  	struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpunum);
> @@ -1300,7 +1323,9 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>  	int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
>  	struct cppc_pcc_data *pcc_ss_data = NULL;
>  	u64 delivered, reference, ref_perf, ctr_wrap_time;
> +	struct cycle_counters ctrs = {0};
>  	int ret = 0, regs_in_pcc = 0;
> +	unsigned long flags;
>
>  	if (!cpc_desc) {
>  		pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
> @@ -1336,8 +1361,25 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
>  		}
>  	}
>
> -	cpc_read(cpunum, delivered_reg, &delivered);
> -	cpc_read(cpunum, reference_reg, &reference);
> +	ctrs.cpunum = cpunum;
> +	ctrs.delivered_reg = delivered_reg;
> +	ctrs.reference_reg = reference_reg;
> +	ctrs.delivered = &delivered;
> +	ctrs.reference = &reference;
> +
> +	if (CPC_IN_FFH(delivered_reg) && CPC_IN_FFH(reference_reg)) {
> +		ret = smp_call_on_cpu(cpunum, cppc_get_cycle_ctrs, &ctrs, false);
> +	} else {
> +		if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
> +		    CPC_IN_SYSTEM_MEMORY(reference_reg)) {
> +			local_irq_save(flags);
> +			cppc_get_cycle_ctrs(&ctrs);
> +			local_irq_restore(flags);
> +		} else {
> +			cppc_get_cycle_ctrs(&ctrs);
> +		}
> +	}
> +
>  	cpc_read(cpunum, ref_perf_reg, &ref_perf);
>
>  	/*
>
> I've only tested this on a model using FFH, with 10us delay, and it
> worked well for me. Yang, Sumit, could you give it a try?
Thanks for the patch. I tested it with a 10us delay, and it works well for
me. There were 0 high errors in my 3-hour test.
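
For anyone who wants to reproduce this: the test is conceptually just a loop
that reads cpuinfo_cur_freq and counts out-of-tolerance samples. A
hypothetical userspace sketch (the nominal frequency and the 1% threshold
below are made-up illustration values, not my actual script):

/* Hypothetical reproducer: count "high error" cpuinfo_cur_freq reads. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const long expected_khz = 2800000;	/* assumed nominal frequency */
	const double max_err = 0.01;		/* arbitrary 1% tolerance */
	long khz, errors = 0, samples = 0;
	FILE *f;

	for (;;) {
		f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq", "r");
		if (!f)
			break;
		if (fscanf(f, "%ld", &khz) != 1) {
			fclose(f);
			break;
		}
		fclose(f);

		samples++;
		if ((double)labs(khz - expected_khz) / expected_khz > max_err)
			errors++;

		if (samples % 100000 == 0)
			printf("%ld high errors / %ld samples\n", errors, samples);
	}
	return 0;
}

Run pinned to CPU 0 (e.g. with taskset) while "yes" loads the other cores,
as described earlier in the thread.
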
>
> Even with a solution like the above (more refined, of course) there is an
> additional improvement possible: we can implement arch_freq_get_on_cpu()
> for arm64 systems that will use cached (on the tick) AMU cycle counter
> samples and have this function called from show_cpuinfo_cur_freq()
> before/instead of calling the .get() function. But this will only help
> arm64 systems with AMUs accessible though system registers. We'll try to
> submit patches on this soon. But as I mentioned to Sumit on the other
> thread, the returned frequency value from this will be an average over 4ms
> (with CONFIG_HZ=250) and could be up to 4ms old (more than that only if the
> CPU was idle/isolated).
>
> Thanks,
> Ionela.
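
For readers unfamiliar with the hook: arch_freq_get_on_cpu() is an existing
weak function (already implemented on x86 via APERF/MPERF) that sysfs can
prefer over the cpufreq driver's .get() callback. A speculative sketch of
what an arm64 version along the lines Ionela describes could look like --
this is not her actual patches, and struct amu_sample plus the two per-CPU
variables are hypothetical, standing in for AMU snapshots that the tick code
would have to save:

/* Hypothetical per-CPU AMU snapshots, assumed to be saved on each tick. */
struct amu_sample {
	u64 const_cnt;	/* constant-rate cycles (arch timer rate) */
	u64 core_cnt;	/* core cycles */
};

static DEFINE_PER_CPU(struct amu_sample, amu_prev_sample);
static DEFINE_PER_CPU(struct amu_sample, amu_cur_sample);

unsigned int arch_freq_get_on_cpu(int cpu)
{
	struct amu_sample *prev = per_cpu_ptr(&amu_prev_sample, cpu);
	struct amu_sample *cur = per_cpu_ptr(&amu_cur_sample, cpu);
	u64 d_core = cur->core_cnt - prev->core_cnt;
	u64 d_const = cur->const_cnt - prev->const_cnt;

	if (!d_const)
		return 0;

	/* Average frequency over the last tick period, in kHz. */
	return div64_u64(d_core * (arch_timer_get_rate() / 1000), d_const);
}

As Ionela notes, anything built this way reports an average over the last
tick period and can be up to one tick stale, so it trades freshness for
avoiding the racy two-sample read entirely.
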