Message-ID: <ZEl1Fms/JmdEZsVn@arm.com>
Date: Wed, 26 Apr 2023 20:01:42 +0100
From: Ionela Voinescu <ionela.voinescu@....com>
To: Yang Shi <yang@...amperecomputing.com>
Cc: Pierre Gondois <pierre.gondois@....com>, viresh.kumar@...aro.org,
scott@...amperecomputing.com, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org,
"Rafael J. Wysocki" <rafael@...nel.org>,
Sumit Gupta <sumitg@...dia.com>
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid
high error
Hi,
+ Sumit
On Tuesday 25 Apr 2023 at 18:32:55 (-0700), Yang Shi wrote:
>
>
> On 4/24/23 4:44 AM, Ionela Voinescu wrote:
> > Hey,
> >
> > On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
> > > On 4/20/23 9:01 AM, Pierre Gondois wrote:
> > > > > > You say that the cause of this is a congestion in the interconnect. I
> > > > > > don't
> > > > > > see a way to check that right now.
> > > > > > However your trace is on the CPU0, so maybe all the other cores were
> > > > > > shutdown
> > > > > > in your test. If this is the case, do you think a congestion could
> > > > > > happen with
> > > > > > only one CPU ?
> > > > > No, other CPUs were not shut down in my test. I just ran "yes" on all
> > > > > cores except CPU 0, then ran the frequency-reading script. Since all
> > > > > the other cores are busy, the script should always run on CPU 0.
> > > > >
> > > > > Since the counters, memory and other devices all sit on the interconnect,
> > > > > the congestion may be caused by plenty of factors IIUC.
> > > > +Ionela
> > > >
> > > > Ionela pointed me to the following patch-set, which seems related:
> > > > https://lore.kernel.org/all/20230418113459.12860-5-sumitg@nvidia.com/
> > > Thanks for the information. I think we do see a similar symptom. But I'm
> > > not sure how their counters are implemented, so we may not share the same
> > > root cause.
> > Yes, my bad, I did not get the chance to read this full thread before
> > talking with Pierre. In your case the counters are accessed via MMIO, while
> > in their case they are accessed through FFH (IPIs and system registers).
> > The root cause is indeed different.
>
> Yeah, but it seems like using a larger delay could mitigate both issues.
Yes, there is a minimum delay of 25us required there, due to the way the
counters accumulate; that is not too bad even with interrupts disabled
(to cater to cpufreq_quick_get()).
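To illustrate why the spacing of those reads matters, here is a rough
sketch (not the exact kernel code) of how two counter snapshots are turned
into a frequency. Anything that delays one read relative to the other
within a single snapshot shows up directly in the computed ratio:

struct fb_ctrs {
	u64 delivered;
	u64 reference;
};

/*
 * Rough sketch only. ref_rate_khz is the fixed rate of the reference
 * counter; the delivered counter ticks at the actual core frequency,
 * so the ratio of the two deltas scales the reference rate.
 * Assumes u64 and div64_u64() from <linux/math64.h>.
 */
static u64 freq_from_fb_ctrs(struct fb_ctrs t0, struct fb_ctrs t1,
			     u64 ref_rate_khz)
{
	u64 delta_delivered = t1.delivered - t0.delivered;
	u64 delta_reference = t1.reference - t0.reference;

	if (!delta_reference)
		return 0;

	return div64_u64(ref_rate_khz * delta_delivered, delta_reference);
}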
>
[..]
> > > > > Yeah, we should be able to find a smaller irq disable section.
> > > > >
> > > > > > diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
> > > > > > index c51d3ccb4cca..105a7e2ffffa 100644
> > > > > > --- a/drivers/acpi/cppc_acpi.c
> > > > > > +++ b/drivers/acpi/cppc_acpi.c
> > > > > > @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
> > > > > > cppc_perf_fb_ctrs *perf_fb_ctrs)
> > > > > > struct cppc_pcc_data *pcc_ss_data = NULL;
> > > > > > u64 delivered, reference, ref_perf, ctr_wrap_time;
> > > > > > int ret = 0, regs_in_pcc = 0;
> > > > > > + unsigned long flags;
> > > > > >
> > > > > > if (!cpc_desc) {
> > > > > > pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
> > > > > > @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
> > > > > > cppc_perf_fb_ctrs *perf_fb_ctrs)
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > + local_irq_save(flags);
> > > > > > +
> > > > > > cpc_read(cpunum, delivered_reg, &delivered);
> > > > > > cpc_read(cpunum, reference_reg, &reference);
> > > > > > cpc_read(cpunum, ref_perf_reg, &ref_perf);
> > > > > >
> > > > > > + local_irq_restore(flags);
> > > > > > +
> > > > > cpc_read_ffh() would return -EPERM if irqs are disabled.
> > > > >
> > > > > So, the irq disabling must happen for mmio only in cpc_read(), for
> > > > > example:
> > > > I thought the issue was that irqs could happen in between the cpc_read()
> > > > calls, so the patch below would not cover it. If the frequency is more
> > > > accurate with this patch, then I must be missing something.
> > > Yeah, you are correct. The irq disabling window has to cover all the
> > > cpc_read() calls. I didn't test with this patch; my test was done
> > > conceptually with:
> > >
> > > disable irq
> > > cppc_get_perf_ctrs(t0)
> > > udelay(2)
> > > cppc_get_perf_ctrs(t1)
> > > enable irq
> > >
> > > But this will break cpc_read_ffh().
> > Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
> > are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
> > register and the reference register, which should have minimal delay in
> > between?
> >
> > As in:
> >
> > if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
> > CPC_IN_SYSTEM_MEMORY(reference_reg))
> > ...
> >
> > This will and should not affect FFH - the fix for that would have to be
> > different.
>
> It won't work, right? The problem is that cppc_get_perf_ctrs() calls
> cpc_read() to read delivered and reference respectively, and we can only
> tell whether they are CPC_IN_SYSTEM_MEMORY in cpc_read(), not in
> cppc_get_perf_ctrs(). So the resulting code would conceptually look like:
>
> disable irq
> read delivered
> enable irq
>
> disable irq
> read reference
> enable irq
>
> But there may still be interrupts between the two reads. What we actually
> want is:
>
> disable irq
> read delivered
> read reference
> enable irq
Yes, this is what I was suggesting above.
I've hacked up the following code. It covers the FFH case as well, using a
modified version of the solution Sumit proposed on the other thread:
diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
index 0f17b1c32718..7e828aed3693 100644
--- a/drivers/acpi/cppc_acpi.c
+++ b/drivers/acpi/cppc_acpi.c
@@ -110,6 +110,11 @@ static DEFINE_PER_CPU(struct cpc_desc *, cpc_desc_ptr);
(cpc)->cpc_entry.reg.space_id == \
ACPI_ADR_SPACE_SYSTEM_IO)
+/* Check if a CPC register is in FFH */
+#define CPC_IN_FFH(cpc) ((cpc)->type == ACPI_TYPE_BUFFER && \
+ (cpc)->cpc_entry.reg.space_id == \
+ ACPI_ADR_SPACE_FIXED_HARDWARE)
+
/* Evaluates to True if reg is a NULL register descriptor */
#define IS_NULL_REG(reg) ((reg)->space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY && \
(reg)->address == 0 && \
@@ -1292,6 +1297,24 @@ EXPORT_SYMBOL_GPL(cppc_perf_ctrs_in_pcc);
*
* Return: 0 for success with perf_fb_ctrs populated else -ERRNO.
*/
+
+struct cycle_counters {
+ int cpunum;
+ struct cpc_register_resource *delivered_reg;
+ struct cpc_register_resource *reference_reg;
+ u64 *delivered;
+ u64 *reference;
+};
+
+static int cppc_get_cycle_ctrs(void *cycle_ctrs) {
+ struct cycle_counters *ctrs = cycle_ctrs;
+
+ cpc_read(ctrs->cpunum, ctrs->delivered_reg, ctrs->delivered);
+ cpc_read(ctrs->cpunum, ctrs->reference_reg, ctrs->reference);
+
+ return 0;
+}
+
int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
{
struct cpc_desc *cpc_desc = per_cpu(cpc_desc_ptr, cpunum);
@@ -1300,7 +1323,9 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpunum);
struct cppc_pcc_data *pcc_ss_data = NULL;
u64 delivered, reference, ref_perf, ctr_wrap_time;
+ struct cycle_counters ctrs = {0};
int ret = 0, regs_in_pcc = 0;
+ unsigned long flags;
if (!cpc_desc) {
pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
@@ -1336,8 +1361,25 @@ int cppc_get_perf_ctrs(int cpunum, struct cppc_perf_fb_ctrs *perf_fb_ctrs)
}
}
- cpc_read(cpunum, delivered_reg, &delivered);
- cpc_read(cpunum, reference_reg, &reference);
+ ctrs.cpunum = cpunum;
+ ctrs.delivered_reg = delivered_reg;
+ ctrs.reference_reg = reference_reg;
+ ctrs.delivered = &delivered;
+ ctrs.reference = &reference;
+
+ if (CPC_IN_FFH(delivered_reg) && CPC_IN_FFH(reference_reg)) {
+ ret = smp_call_on_cpu(cpunum, cppc_get_cycle_ctrs, &ctrs, false);
+ } else {
+ if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
+ CPC_IN_SYSTEM_MEMORY(reference_reg)) {
+ local_irq_save(flags);
+ cppc_get_cycle_ctrs(&ctrs);
+ local_irq_restore(flags);
+ } else {
+ cppc_get_cycle_ctrs(&ctrs);
+ }
+ }
+
cpc_read(cpunum, ref_perf_reg, &ref_perf);
/*
I've only tested this on a model using FFH, with a 10us delay, and it
worked well for me. Yang, Sumit, could you give it a try?
Even with a solution like the above (more refined, of course) there is an
additional improvement possible: we can implement arch_freq_get_on_cpu()
for arm64 systems so that it uses AMU cycle counter samples cached on the
tick, and have that function called from show_cpuinfo_cur_freq()
before/instead of calling the .get() function. But this will only help
arm64 systems with AMUs accessible through system registers. We'll try to
submit patches for this soon. As I mentioned to Sumit on the other thread,
though, the returned frequency value will be an average over 4ms (with
CONFIG_HZ=250) and could be up to 4ms old (more than that only if the CPU
was idle/isolated).
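For illustration only, a very rough sketch of what such an
arch_freq_get_on_cpu() could look like, assuming the tick handler already
stores per-CPU snapshots of the constant and core cycle counters (all names
below are made up for this sketch, not taken from the eventual patches):

/* Hypothetical per-CPU snapshots, assumed to be updated from the tick. */
struct amu_sample {
	u64 const_cycles;	/* constant-rate (reference) counter */
	u64 core_cycles;	/* core (delivered) counter */
};

static DEFINE_PER_CPU(struct amu_sample, tick_prev_sample);
static DEFINE_PER_CPU(struct amu_sample, tick_curr_sample);

unsigned int arch_freq_get_on_cpu(int cpu)
{
	struct amu_sample *prev = per_cpu_ptr(&tick_prev_sample, cpu);
	struct amu_sample *curr = per_cpu_ptr(&tick_curr_sample, cpu);
	u64 delta_core = curr->core_cycles - prev->core_cycles;
	u64 delta_const = curr->const_cycles - prev->const_cycles;

	if (!delta_const)
		return 0;

	/*
	 * The constant counter ticks at the arch timer rate, so the
	 * core/constant ratio scales that rate to the average core
	 * frequency (in kHz) over the last tick period.
	 */
	return div64_u64((arch_timer_get_rate() / HZ_PER_KHZ) * delta_core,
			 delta_const);
}

The staleness mentioned above comes from the snapshots only being refreshed
on the tick, hence the value being an average over, and up to, one tick
period old.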
Thanks,
Ionela.