Message-ID: <aKx4nZWsRPTXK942@arm.com>
Date: Mon, 25 Aug 2025 16:52:13 +0200
From: Beata Michalska <beata.michalska@....com>
To: Prashant Malani <pmalani@...gle.com>
Cc: Yang Shi <yang@...amperecomputing.com>,
	open list <linux-kernel@...r.kernel.org>,
	"open list:CPU FREQUENCY SCALING FRAMEWORK" <linux-pm@...r.kernel.org>,
	"Rafael J. Wysocki" <rafael@...nel.org>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	Catalin Marinas <catalin.marinas@....com>,
	Ionela Voinescu <Ionela.Voinescu@....com>
Subject: Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads

On Wed, Aug 20, 2025 at 02:25:16PM -0700, Prashant Malani wrote:
> Hi Beata,
> 
> On Wed, 20 Aug 2025 at 13:49, Beata Michalska <beata.michalska@....com> wrote:
> >
> > Kinda working on that one.
> 
> OK. I'm eager to see what the solution is!
> 
> > >
> > > Outside of that, I can't think of another mitigation beyond adding delay to make
> > > the time deltas not matter so much.
> > I'm not entirely sure what 'so much' means in this context.
> > How would one quantify whether the added delay is actually mitigating the issue?
> >
> 
> I alluded to it in the commit description, but here is my basic
> numerical analysis:
> The effective timestamps for the 4 readings right now are:
> Timestamp t0: del0
> Timestamp t0 + m: ref0
> (Time delay X us)
> Timestamp t1: del1
> Timestamp t1 + n: ref1
> 
> Timestamp t1 = t0 + m + X
> 
> The perf calculation is:
> Perf = (del1 - del0) / (ref1 - ref0)
>      = Del_counter_diff_over_time(t1 - t0) /
>        Ref_counter_diff_over_time(t1 + n - (t0 + m))
>      = Del_counter_diff_over_time(t0 + m + X - t0) /
>        Ref_counter_diff_over_time(t0 + m + X + n - t0 - m)
>      = Del_counter_diff_over_time(m + X) / Ref_counter_diff_over_time(n + X)
> 
> If X >> m, n this becomes:
>      = Del_counter_diff_over_time(X) / Ref_counter_diff_over_time(X)
> which is what the actual calculation is supposed to be.
> 
> If X ~ m, n (which is the case right now), the calculation
> becomes erratic.
This is still bound by the 'm' and 'n' values, as the difference between them
determines the error factor (for a given, fixed X). If m != n, one counter delta
is stretched more than the other, so the perf ratio no longer represents the
same time interval. And that will vary between platforms/workloads, leading to
over- or under-reporting.
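
To make the sensitivity to X concrete, here is a small standalone C model of the
error term derived above (illustrative only: the counter rates and the m/n read
gaps are made-up numbers, not measurements from any platform):

/*
 * Toy model of the sampling skew discussed above - not kernel code.
 * The delivered counter ticks at f_del and the reference counter at f_ref;
 * 'm' and 'n' are the gaps between the two register reads in each snapshot,
 * and X is the delay inserted between the two snapshots.
 */
#include <stdio.h>

int main(void)
{
	const double f_del = 2.8e9;            /* delivered (core) rate, Hz - assumed */
	const double f_ref = 1.0e9;            /* reference (const) rate, Hz - assumed */
	const double m = 1e-6, n = 3e-6;       /* asymmetric read gaps, seconds - assumed */
	const double delays[] = { 2e-6, 100e-6, 200e-6 };   /* candidate X values */

	for (unsigned int i = 0; i < sizeof(delays) / sizeof(delays[0]); i++) {
		const double X = delays[i];
		/* del1 - del0 spans (m + X); ref1 - ref0 spans (n + X) */
		const double del_delta = f_del * (m + X);
		const double ref_delta = f_ref * (n + X);
		const double perf = del_delta / ref_delta;
		const double ideal = f_del / f_ref;

		printf("X = %6.0f us: perf = %.4f (ideal %.4f, error %+.2f%%)\n",
		       X * 1e6, perf, ideal, 100.0 * (perf - ideal) / ideal);
	}
	return 0;
}

With these (arbitrary) numbers the error shrinks from tens of percent at X = 2us
to roughly 1-2% at X = 100-200us, but it never reaches zero as long as m != n,
which is the point being made above.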

> 
> > >
> > > Perhaps ARM should introduce a "snapshot" feature which takes the snapshot
> > > of the AMU counters on a register write; IAC that's outside the scope
> > What do you mean by register write ?
> 
> I mean that we should have a snapshot register:
> SYS_AMEVCNTR0_SNAPSHOT_EL0
> 
> writing to this will save the current values of SYS_AMEVCNTR0_CORE_EL0 and
> SYS_AMEVCNTR0_CONST_EL0 into two shadow registers (say
> SYS_AMEVCNTR0_CORE_SNAPSHOT_EL0
> and SYS_AMEVCNTR0_CONST_SNAPSHOT_EL0)
> 
> That way, the OS-specific details of how the registers are read don't matter.
> 
> Of course, I'm not a H/W architect, so this proposal is very rough, but
> hopefully it helps illustrate my idea.
> 
Ah, so you meant architectural changes - thanks for the clarification.
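
For what it's worth, a rough sketch of what the OS-side read might look like if
such a (purely hypothetical) snapshot feature existed; none of the
*_SNAPSHOT_EL0 registers below exist in the Arm architecture today, and the
accessors just follow the usual arm64 sysreg style:

/*
 * Hypothetical only: a single write latches both AMU counters at the same
 * instant, so the gap between the subsequent reads (the 'm'/'n' above) no
 * longer skews the ratio.
 */
static void amu_snapshot_sample(u64 *core, u64 *constant)
{
	write_sysreg_s(1, SYS_AMEVCNTR0_SNAPSHOT_EL0);	/* trigger the snapshot */
	isb();

	/* Read order and spacing are now irrelevant - both values are frozen. */
	*core     = read_sysreg_s(SYS_AMEVCNTR0_CORE_SNAPSHOT_EL0);
	*constant = read_sysreg_s(SYS_AMEVCNTR0_CONST_SNAPSHOT_EL0);
}
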
> > > of this discussion.
> > >
> > > Could you kindly explain why adding the udelay here is discouraged?
> >
> > Would you mind clarifying how the specific value of 200 µs was determined?
> > Was it carefully derived from measurements across multiple platforms and
> > workloads, or simply observed to “work” on one relatively stable setup?
> >
> > Understanding the basis for choosing that delay will help put the discussion
> > into the right context.
> 
> TBH, I just experimented with values on our system and observed the readings
> of cores running stress-ng. I tried 100us and that still gave variability
> (values greater than the theoretical max frequency). It's possible that the
> "optimal value" is somewhere between these two.
> 
> There have been other observations on this topic [1], which suggest that even
> 100us improves the error rate significantly from what it is with 2us.
> 
> BR,
Which is exactly why I've mentioned this approach is not really recommended,
being bound to a rather specific setup. There have been similar proposals in
the past, all with different delay values, which should illustrate how fragile
such a solution (if any) is.

---
BR
Beata
> -Prashant
> 
> [1] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
