Message-ID: <06ca8066-fce6-d3cf-db37-584c3666f7df@os.amperecomputing.com>
Date:   Tue, 25 Apr 2023 18:32:55 -0700
From:   Yang Shi <yang@...amperecomputing.com>
To:     Ionela Voinescu <ionela.voinescu@....com>
Cc:     Pierre Gondois <pierre.gondois@....com>, viresh.kumar@...aro.org,
        scott@...amperecomputing.com, linux-pm@...r.kernel.org,
        linux-kernel@...r.kernel.org,
        "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [PATCH] cpufreq: CPPC: use 10ms delay instead of 2us to avoid
 high error



On 4/24/23 4:44 AM, Ionela Voinescu wrote:
> Hey,
>
> On Thursday 20 Apr 2023 at 13:49:24 (-0700), Yang Shi wrote:
>> On 4/20/23 9:01 AM, Pierre Gondois wrote:
>>>>> You say that the cause of this is congestion in the interconnect. I
>>>>> don't see a way to check that right now.
>>>>> However your trace is on CPU0, so maybe all the other cores were
>>>>> shut down in your test. If this is the case, do you think congestion
>>>>> could happen with only one CPU?
>>>> No, other CPUs were not shut down in my test. I just ran "yes" on all
>>>> cores except CPU 0, then ran the freq-reading script. Since all other
>>>> cores are busy, the script should always be running on CPU 0.
>>>>
>>>> Since the counters, memory and other devices are on the interconnect,
>>>> the congestion may be caused by plenty of factors IIUC.
>>> +Ionela
>>>
>>> Ionela pointed me to the following patch-set, which seems related:
>>> https://lore.kernel.org/all/20230418113459.12860-5-sumitg@nvidia.com/
>> Thanks for the information. I think we do see a similar symptom. But I'm
>> not sure how their counters are implemented, so we may not have the same
>> root cause.
> Yes, my bad, I did not get the chance to read this full thread before
> talking with Pierre. In your case you have AMUs accessed via MMIO, while
> in their case they are accessed through FFH (IPIs and system registers).
> The root cause is indeed different.

Yeah, but it seems like using a larger delay could mitigate both issues.
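
For intuition on why a larger delay (sampling window) helps: a perturbation of a given length landing between the two counter snapshots corrupts the measured delta roughly in proportion to perturbation/window. A quick userspace sketch of that arithmetic (the 30us jitter figure below is a made-up illustration, not a measurement from this thread):

```c
/* Back-of-envelope only: if an interruption of length jitter_ns lands
 * between the t0 and t1 counter snapshots, the measured frequency is
 * off by roughly jitter / sampling_window, expressed here in percent. */
static double sampling_error_pct(double jitter_ns, double window_ns)
{
    return 100.0 * jitter_ns / window_ns;
}
```

With a hypothetical 30us IRQ in the window, a 2us window gives ~1500% error while a 10ms window gives ~0.3%, which matches the direction of the proposed change.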

>>> One thing that we didn't check, and that Ionela pointed out, is that
>>> we don't know whether we are accessing the present CPU's or a remote
>>> CPU's AMUs. In the latter case there would be IPIs and possible delays
>>> in waking up/accessing the remote CPU.
>>>
>>>>> Just 2 other comments:
>>>>> a-
>>>>> It might be interesting to change the order in which cpc registers are
>>>>> read
>>>>> just to see if it has an impact, but even if it has, I'm not sure how
>>>>> this could be exploited.
>>>>> Just in case: I mean doing the change below, but I think that b. might
>>>>> be better to try.
>>>>>
>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>> index c51d3ccb4cca..479b55006020 100644
>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>> @@ -1350,8 +1350,8 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>                   }
>>>>>           }
>>>>>
>>>>> -       cpc_read(cpunum, delivered_reg, &delivered);
>>>>>           cpc_read(cpunum, reference_reg, &reference);
>>>>> +       cpc_read(cpunum, delivered_reg, &delivered);
>>>>>           cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>
>>>>>           /*
>>>>>
>>>>> b-
>>>>> In the trace that you shared, the cpc_read() calls in the first
>>>>> cppc_get_perf_ctrs() call seem to always take a bit more time than in
>>>>> the second cppc_get_perf_ctrs() call.
>>>>> Would it be possible to collect traces similar to the above with 3 or
>>>>> 4 calls to cppc_get_perf_ctrs() instead of 2? It would allow us to
>>>>> check whether, in the first call, accessing the cpc registers takes
>>>>> more time than in the following calls, due to cache misses or other
>>>>> reasons.
>>>> Cache miss? The counters should be not cached and reading the counters
>>>> should not hit cache IIUC.
>>> Yes you are correct, what I said is completely wrong.
>>>
>>>>> Ideally, statistics on the results would be best, or a trace.dat you
>>>>> could share containing a trace with multiple cppc_cpufreq_get_rate()
>>>>> calls.
>>>> I tried option b and managed to get histograms:
>>>>
>>>> @hist_first_ns[cat]:
>>>> [4K, 8K)     112321 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [8K, 16K)       212 |                                                    |
>>>> [16K, 32K)       25 |                                                    |
>>>> [32K, 64K)       59 |                                                    |
>>>> [64K, 128K)       6 |                                                    |
>>>> [128K, 256K)      9 |                                                    |
>>>>
>>>> @hist_second_ns[cat]:
>>>> [2K, 4K)     112590 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [4K, 8K)          4 |                                                    |
>>>> [8K, 16K)         0 |                                                    |
>>>> [16K, 32K)       15 |                                                    |
>>>> [32K, 64K)       18 |                                                    |
>>>> [64K, 128K)       1 |                                                    |
>>>> [128K, 256K)      4 |                                                    |
>>>>
>>>> The "first" means the first cppc_get_perf_ctrs() call. But the bpftrace
>>>> script can't tell the second, the third and the fourth, so all them are
>>>> shown as "second". Anyway it seems fine. We can tell the first read took
>>>> longer than the later ones for the most time.
>>>>
>>>> And a typical func_graph trace shows:
>>>>
>>>> # tracer: function_graph
>>>> #
>>>> #     TIME        CPU  DURATION                  FUNCTION CALLS
>>>> #      |          |     |   |                     |   |   | |
>>>>    4447.171333 |     0)               |  cppc_cpufreq_get_rate [cppc_cpufreq]() {
>>>>    4447.171334 |     0)               |    cpufreq_cpu_get() {
>>>>    4447.171334 |     0)   1.060 us    |      _raw_read_lock_irqsave();
>>>>    4447.171336 |     0)   0.560 us    |      _raw_read_unlock_irqrestore();
>>>>    4447.171337 |     0)   3.480 us    |    }
>>>>    4447.171338 |     0)   0.400 us    |    cpufreq_cpu_put();
>>>>    4447.171338 |     0)               |    cppc_get_perf_ctrs() {
>>>>    4447.171339 |     0)   0.720 us    |      cpc_read.isra.0();
>>>>    4447.171341 |     0)   0.700 us    |      cpc_read.isra.0();
>>>>    4447.171342 |     0)   0.380 us    |      cpc_read.isra.0();
>>>>    4447.171342 |     0)   0.600 us    |      cpc_read.isra.0();
>>>>    4447.171343 |     0)   4.900 us    |    }
>>>>    4447.171344 |     0)               |    __delay() {
>>>>    4447.171344 |     0)   0.540 us    |      arch_timer_evtstrm_available();
>>>>    4447.171346 |     0)   2.420 us    |    }
>>>>    4447.171347 |     0)               |    cppc_get_perf_ctrs() {
>>>>    4447.171347 |     0)   0.540 us    |      cpc_read.isra.0();
>>>>    4447.171348 |     0)   0.520 us    |      cpc_read.isra.0();
>>>>    4447.171349 |     0)   0.400 us    |      cpc_read.isra.0();
>>>>    4447.171350 |     0)   0.440 us    |      cpc_read.isra.0();
>>>>    4447.171350 |     0)   3.660 us    |    }
>>>>    4447.171351 |     0)               |    __delay() {
>>>>    4447.171351 |     0)   0.400 us    |      arch_timer_evtstrm_available();
>>>>    4447.171353 |     0)   2.400 us    |    }
>>>>    4447.171353 |     0)               |    cppc_get_perf_ctrs() {
>>>>    4447.171354 |     0)   0.540 us    |      cpc_read.isra.0();
>>>>    4447.171355 |     0)   0.540 us    |      cpc_read.isra.0();
>>>>    4447.171356 |     0)   0.380 us    |      cpc_read.isra.0();
>>>>    4447.171356 |     0)   0.420 us    |      cpc_read.isra.0();
>>>>    4447.171357 |     0)   3.640 us    |    }
>>>>    4447.171357 |     0)               |    __delay() {
>>>>    4447.171358 |     0)   0.380 us    |      arch_timer_evtstrm_available();
>>>>    4447.171360 |     0)   2.380 us    |    }
>>>>    4447.171360 |     0)               |    cppc_get_perf_ctrs() {
>>>>    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>>>>    4447.171361 |     0)   0.520 us    |      cpc_read.isra.0();
>>>>    4447.171362 |     0)   0.400 us    |      cpc_read.isra.0();
>>>>    4447.171363 |     0)   0.440 us    |      cpc_read.isra.0();
>>>>    4447.171364 |     0)   3.640 us    |    }
>>>>    4447.171364 |     0)   0.520 us    |    cppc_cpufreq_perf_to_khz [cppc_cpufreq]();
>>>>    4447.171365 |     0) + 34.240 us   |  }
>>>>
>>>> It also shows the first reading typically took longer than the later
>>>> ones. The second, the third and the fourth actually took almost the
>>>> same time.
>>>>
> Do you happen to know if this is still the case if the delay in between
> the two cppc_get_perf_ctrs() calls is larger (a few ms)?

Do you mean whether the first reading takes longer if a larger delay is 
used? If so, then yes. It should not be related to the delay.

>>>> I also tried reading perf_fb_ctrs_t0 twice (so 3 reads total: 2 for t0,
>>>> 1 for t1, with a 2us delay between each read), but I didn't see
>>>> noticeable improvement. 4 reads (2 for t0, 2 for t1) did show some
>>>> noticeable improvement.
>>>>
> I'm not sure I understand this. t0, t1? Can you expand on the sequence
> of reads here? Are you reading the same counter multiple times instead
> of reading all counters multiple times?

t0 means perf_fb_ctrs_t0, t1 means perf_fb_ctrs_t1. The above experiment 
actually does 4 reads (2 for perf_fb_ctrs_t0, 2 for perf_fb_ctrs_t1), 
but only uses the 2nd result (perf_fb_ctrs_t0) and the 4th result 
(perf_fb_ctrs_t1) for the calculation.
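
To make the experiment concrete: whichever snapshots are kept, the rate ultimately comes from the delivered/reference deltas between the two retained snapshots. A minimal userspace sketch of that math (simplified stand-in names, not the kernel's exact structures; the kernel's cppc_perf_from_fb_ctrs() also handles counter wraparound, omitted here):

```c
#include <stdint.h>

/* Simplified feedback-counter snapshot (stand-in struct). */
struct fb_ctrs {
    uint64_t delivered;
    uint64_t reference;
};

/* perf = ref_perf * d_delivered / d_reference over the sampling window. */
static uint64_t perf_from_ctrs(struct fb_ctrs t0, struct fb_ctrs t1,
                               uint64_t ref_perf)
{
    uint64_t d_delivered = t1.delivered - t0.delivered;
    uint64_t d_reference = t1.reference - t0.reference;

    if (!d_reference)          /* no elapsed reference cycles */
        return 0;
    return ref_perf * d_delivered / d_reference;
}
```

This also shows why a delayed individual read hurts: it inflates one delta but not the other, skewing the ratio.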

>>> Thanks for the new data.
>>>
>>>>> Example of code where we do 4 calls to cppc_get_perf_ctrs():
>>>>>
>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>>>> b/drivers/cpufreq/cppc_cpufreq.c
>>>>> index 022e3555407c..6370f2f0bdad 100644
>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>> @@ -853,6 +853,20 @@ static unsigned int
>>>>> cppc_cpufreq_get_rate(unsigned int cpu)
>>>>>
>>>>>           udelay(2); /* 2usec delay between sampling */
>>>>>
>>>>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>>> +       if (ret)
>>>>> +               return ret;
>>>>> +
>>>>> +       udelay(2); /* 2usec delay between sampling */
>>>>> +
>>>>> +       /* Do a third call. */
>>>>> +       ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>>> +       if (ret)
>>>>> +               return ret;
>>>>> +
>>>>> +       udelay(2); /* 2usec delay between sampling */
>>>>> +
>>>>> +       /* Do a fourth call. */
>>>>>           ret = cppc_get_perf_ctrs(cpu, &fb_ctrs_t1);
>>>>>           if (ret)
>>>>>                   return ret;
>>>>>
>>>>>> The above trace data shows some cpc reading took a little bit longer
>>>>>> than usual. I suspected it was caused by interconnect congestion.
>>>>>>
>>>>>>
>>>>>> So it looks like IRQ is the major contributing factor to the high
>>>>>> errors (4xxxxxx KHz), and interconnect congestion is the major
>>>>>> contributing factor to the low errors (3xxxxxx KHz).
>>>>>>
>>>>>> So I did the below test:
>>>>>> 1. Disable IRQ: the high errors were gone (> 3700000 KHz), but the
>>>>>> low errors were still seen.
>>>>>> 2. 10us delay: the high errors were still seen.
>>>>>> 3. Disable IRQ + 10us delay: all the errors were gone.
>>>>>>
>>>>>> I think the test result also supports the tracing data.
>>>>>>
>>>>>>
>>>>>> I was also confused about calling cppc_cpufreq_get_rate() with irq
>>>>>> disabled. Rafael thought the 10ms delay is too long because the
>>>>>> function may be called with irq disabled. But a deeper look at the
>>>>>> function shows it should *NOT* be called with irq disabled at all.
>>>>>>
>>>>>> First, if the PCC channel is used, a cpc read may take over 100ms,
>>>>>> which is way larger than the proposed 10ms delay.
>>>>>> Second, reading from the PCC channel needs to take a semaphore, so it
>>>>>> may sleep. But sleeping with IRQs disabled is not allowed.
>>>>> Yes right, however the semaphore is not taken in between the
>>>>> sequence of
>>>>> cpc_read() calls in cppc_get_perf_ctrs(). So maybe the change below
>>>>> should
>>>>> be acceptable:
>>>> Yeah, we should be able to find a smaller irq disable section.
>>>>
>>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>>> index c51d3ccb4cca..105a7e2ffffa 100644
>>>>> --- a/drivers/acpi/cppc_acpi.c
>>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>>> @@ -1315,6 +1315,7 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>           struct cppc_pcc_data *pcc_ss_data = NULL;
>>>>>           u64 delivered, reference, ref_perf, ctr_wrap_time;
>>>>>           int ret = 0, regs_in_pcc = 0;
>>>>> +       unsigned long flags;
>>>>>
>>>>>           if (!cpc_desc) {
>>>>>                   pr_debug("No CPC descriptor for CPU:%d\n", cpunum);
>>>>> @@ -1350,10 +1351,14 @@ int cppc_get_perf_ctrs(int cpunum, struct
>>>>> cppc_perf_fb_ctrs *perf_fb_ctrs)
>>>>>                   }
>>>>>           }
>>>>>
>>>>> +       local_irq_save(flags);
>>>>> +
>>>>>           cpc_read(cpunum, delivered_reg, &delivered);
>>>>>           cpc_read(cpunum, reference_reg, &reference);
>>>>>           cpc_read(cpunum, ref_perf_reg, &ref_perf);
>>>>>
>>>>> +       local_irq_restore(flags);
>>>>> +
>>>> cpc_read_ffh() would return -EPERM if irq is disabled.
>>>>
>>>> So, the irq disabling must happen for mmio only in cpc_read(), for
>>>> example:
>>> I thought the issue was that irqs could happen in between cpc_read()
>>> functions,
>>> the patch below would not cover it. If the frequency is more accurate
>>> with this patch, I think I don't understand something.
>> Yeah, you are correct. The irq disabling window has to cover all the
>> cpc_read() calls. I didn't test with this patch. My test was done
>> conceptually with:
>>
>> disable irq
>> cppc_get_perf_ctrs(t0)
>> udelay(2)
>> cppc_get_perf_ctrs(t1)
>> enable irq
>>
>> But this will break cpc_read_ffh().
> Can you not disable IRQs in cppc_get_perf_ctrs() only if the registers
> are CPC_IN_SYSTEM_MEMORY? Only spanning the reads of the delivered
> register and the reference register, which should have minimal delay in
> between?
>
> As in:
>
> if (CPC_IN_SYSTEM_MEMORY(delivered_reg) &&
>      CPC_IN_SYSTEM_MEMORY(reference_reg))
> 	...
>
> This will not, and should not, affect FFH - the fix for that would have
> to be different.

It won't work, right? The problem is that cppc_get_perf_ctrs() calls 
cpc_read() to read delivered and reference respectively; we can only 
tell whether they are CPC_IN_SYSTEM_MEMORY inside cpc_read(), not in 
cppc_get_perf_ctrs(). So the resulting code would conceptually look like:

disable irq
read delivered
enable irq

disable irq
read reference
enable irq

But there still may be interrupts between the two reads. We actually want:

disable irq
read delivered
read reference
enable irq
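
A userspace mock of that desired shape, purely to illustrate both reads sharing one IRQ-off section; irq_save()/irq_restore() and cpc_read_mmio() are stand-ins for the kernel primitives, not real kernel code:

```c
#include <stdint.h>

/* Stand-ins for local_irq_save()/local_irq_restore(). */
static int irqs_off;
static void irq_save(void)    { irqs_off = 1; }
static void irq_restore(void) { irqs_off = 0; }

/* Mock memory-mapped counters (arbitrary values). */
static uint64_t fake_counter[2] = { 111, 222 };

/* Mock MMIO read; clears *all_irqs_off if any read sees IRQs enabled. */
static uint64_t cpc_read_mmio(int idx, int *all_irqs_off)
{
    *all_irqs_off &= irqs_off;
    return fake_counter[idx];
}

/* One critical section spanning both reads, so no interrupt can land
 * between delivered and reference; returns 1 if both reads happened
 * with IRQs off. */
static int read_delivered_reference(uint64_t *delivered, uint64_t *reference)
{
    int ok = 1;

    irq_save();
    *delivered = cpc_read_mmio(0, &ok);
    *reference = cpc_read_mmio(1, &ok);
    irq_restore();

    return ok;
}
```

In the kernel the CPC_IN_SYSTEM_MEMORY check would gate this path so FFH-backed registers never see the IRQ-off section.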

>>> (asking for more information)
>>> Just to check, the core/perf counters are AMUs and the other CPPC
>>> registers are MMIO, right? Is it possible to know which CPPC registers
>>> are implemented on your platform?
>> AFAIK, the perf counters are implemented by AMU and accessed via MMIO;
>> the other CPPC registers are implemented by PCC.
>>
> Usually there is an implementation defined rate of sync between the CPU
> counters and the values available through the memory mapped interface.
> Is it possible that some inaccuracies are introduced by that as well, in
> case there is a sync in between the two counter (delivered and reference)
> reads?

AFAIK, the counter values are available through the memory mapped 
interface immediately, there's no coarser rate of synchronization.

>
> Hope it helps,
> Ionela.
>
>>> Also, is it possible to know which platform you are using?
>> Ampere One
>>
>>>> diff --git a/drivers/acpi/cppc_acpi.c b/drivers/acpi/cppc_acpi.c
>>>> index c51d3ccb4cca..f3c92d844074 100644
>>>> --- a/drivers/acpi/cppc_acpi.c
>>>> +++ b/drivers/acpi/cppc_acpi.c
>>>> @@ -982,6 +982,7 @@ static int cpc_read(int cpu, struct
>>>> cpc_register_resource *reg_res, u64 *val)
>>>>            void __iomem *vaddr = NULL;
>>>>            int pcc_ss_id = per_cpu(cpu_pcc_subspace_idx, cpu);
>>>>            struct cpc_reg *reg = &reg_res->cpc_entry.reg;
>>>> +       unsigned long flags;
>>>>
>>>>            if (reg_res->type == ACPI_TYPE_INTEGER) {
>>>>                    *val = reg_res->cpc_entry.int_value;
>>>> @@ -1015,6 +1016,7 @@ static int cpc_read(int cpu, struct
>>>> cpc_register_resource *reg_res, u64 *val)
>>>>                    return
>>>> acpi_os_read_memory((acpi_physical_address)reg->address,
>>>>                                    val, reg->bit_width);
>>>>
>>>> +       local_irq_save(flags);
>>>>            switch (reg->bit_width) {
>>>>            case 8:
>>>>                    *val = readb_relaxed(vaddr);
>>>> @@ -1029,10 +1031,12 @@ static int cpc_read(int cpu, struct
>>>> cpc_register_resource *reg_res, u64 *val)
>>>>                    *val = readq_relaxed(vaddr);
>>>>                    break;
>>>>            default:
>>>> +               local_irq_restore(flags);
>>>>                    pr_debug("Error: Cannot read %u bit width from PCC for
>>>> ss: %d\n",
>>>>                             reg->bit_width, pcc_ss_id);
>>>>                    return -EFAULT;
>>>>            }
>>>> +       local_irq_restore(flags);
>>>>
>>>>            return 0;
>>>>     }
>>>>
>>>>>           /*
>>>>>            * Per spec, if ctr_wrap_time optional register is
>>>>> unsupported, then the
>>>>>            * performance counters are assumed to never wrap during the
>>>>> lifetime of
>>>>>
>>>>>> Third, if the counters are implemented by AMU, cpc_read_ffh() needs
>>>>>> to send an IPI, so it requires IRQs enabled.
>>>>> If I'm not mistaken, the CPU calling cpc_read_ffh() might have IRQs
>>>>> disabled; that should not prevent it from sending IPIs, no?
>>>> It can't work with irq disabled. The comment in counters_read_on_cpu()
>>>> says "Abort call on counterless CPU or when interrupts are disabled -
>>>> can lead to deadlock in smp sync call."
>>>>
>>>>
>>>> And it just returns -EPERM and raises a warning if irq is disabled.
>>> Ok right,
>>>
>>> Regards,
>>> Pierre
