Message-ID: <aALQhEi609NQAV7S@sultan-box.localdomain>
Date: Sat, 19 Apr 2025 08:21:56 +1000
From: Sultan Alsawaf <sultan@...neltoast.com>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>,
Linux PM <linux-pm@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
Mario Limonciello <mario.limonciello@....com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Christian Loehle <christian.loehle@....com>,
Peter Zijlstra <peterz@...radead.org>,
Valentin Schneider <vschneid@...hat.com>,
Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH v2 5/6] cpufreq: Avoid using inconsistent policy->min and
policy->max
On Fri, Apr 18, 2025 at 09:42:15PM +0200, Rafael J. Wysocki wrote:
> On Fri, Apr 18, 2025 at 12:18 PM Sultan Alsawaf <sultan@...neltoast.com> wrote:
> >
> > On Tue, Apr 15, 2025 at 12:04:21PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> > >
> > > Since cpufreq_driver_resolve_freq() can run in parallel with
> > > cpufreq_set_policy() and there is no synchronization between them,
> > > the former may access policy->min and policy->max while the latter
> > > is updating them and it may see intermediate values of them due
> > > to the way the update is carried out. Also the compiler is free
> > > to apply any optimizations it wants both to the stores in
> > > cpufreq_set_policy() and to the loads in cpufreq_driver_resolve_freq()
> > > which may result in additional inconsistencies.
> > >
> > > To address this, use WRITE_ONCE() when updating policy->min and
> > > policy->max in cpufreq_set_policy() and use READ_ONCE() for reading
> > > them in cpufreq_driver_resolve_freq(). Moreover, rearrange the update
> > > in cpufreq_set_policy() to avoid storing intermediate values in
> > > policy->min and policy->max with the help of the observation that
> > > their new values are expected to be properly ordered upfront.
> > >
> > > Also modify cpufreq_driver_resolve_freq() to take the possible reverse
> > > ordering of policy->min and policy->max, which may happen depending on
> > > the ordering of operations when this function and cpufreq_set_policy()
> > > run concurrently, into account by always honoring the max when it
> > > turns out to be less than the min (in case it comes from thermal
> > > throttling or similar).
> > >
> > > Fixes: 151717690694 ("cpufreq: Make policy min/max hard requirements")
> > > Cc: 5.16+ <stable@...r.kernel.org> # 5.16+
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> > > ---
> > >
> > > v1 -> v2: Minor edit in the subject
> > >
> > > ---
> > > drivers/cpufreq/cpufreq.c | 46 ++++++++++++++++++++++++++++++++++++----------
> > > 1 file changed, 36 insertions(+), 10 deletions(-)
> > >
> > > --- a/drivers/cpufreq/cpufreq.c
> > > +++ b/drivers/cpufreq/cpufreq.c
> > > @@ -490,14 +490,12 @@
> > > }
> > > EXPORT_SYMBOL_GPL(cpufreq_disable_fast_switch);
> > >
> > > -static unsigned int clamp_and_resolve_freq(struct cpufreq_policy *policy,
> > > - unsigned int target_freq,
> > > - unsigned int relation)
> > > +static unsigned int __resolve_freq(struct cpufreq_policy *policy,
> > > + unsigned int target_freq,
> > > + unsigned int relation)
> > > {
> > > unsigned int idx;
> > >
> > > - target_freq = clamp_val(target_freq, policy->min, policy->max);
> > > -
> > > if (!policy->freq_table)
> > > return target_freq;
> > >
> > > @@ -507,6 +505,15 @@
> > > return policy->freq_table[idx].frequency;
> > > }
> > >
> > > +static unsigned int clamp_and_resolve_freq(struct cpufreq_policy *policy,
> > > + unsigned int target_freq,
> > > + unsigned int relation)
> > > +{
> > > + target_freq = clamp_val(target_freq, policy->min, policy->max);
> > > +
> > > + return __resolve_freq(policy, target_freq, relation);
> > > +}
> > > +
> > > /**
> > > * cpufreq_driver_resolve_freq - Map a target frequency to a driver-supported
> > > * one.
> > > @@ -521,7 +528,22 @@
> > > unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> > > unsigned int target_freq)
> > > {
> > > - return clamp_and_resolve_freq(policy, target_freq, CPUFREQ_RELATION_LE);
> > > + unsigned int min = READ_ONCE(policy->min);
> > > + unsigned int max = READ_ONCE(policy->max);
> > > +
> > > + /*
> > > + * If this function runs in parallel with cpufreq_set_policy(), it may
> > > + * read policy->min before the update and policy->max after the update
> > > + * or the other way around, so there is no ordering guarantee.
> > > + *
> > > + * Resolve this by always honoring the max (in case it comes from
> > > + * thermal throttling or similar).
> > > + */
> > > + if (unlikely(min > max))
> > > + min = max;
> > > +
> > > + return __resolve_freq(policy, clamp_val(target_freq, min, max),
> > > + CPUFREQ_RELATION_LE);
> > > }
> > > EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq);
> > >
> > > @@ -2632,11 +2654,15 @@
> > > * Resolve policy min/max to available frequencies. It ensures
> > > * no frequency resolution will neither overshoot the requested maximum
> > > * nor undershoot the requested minimum.
> > > + *
> > > + * Avoid storing intermediate values in policy->max or policy->min and
> > > + * compiler optimizations around them because them may be accessed
> > > + * concurrently by cpufreq_driver_resolve_freq() during the update.
> > > */
> > > - policy->min = new_data.min;
> > > - policy->max = new_data.max;
> > > - policy->min = clamp_and_resolve_freq(policy, policy->min, CPUFREQ_RELATION_L);
> > > - policy->max = clamp_and_resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
> > > + WRITE_ONCE(policy->max, __resolve_freq(policy, new_data.max, CPUFREQ_RELATION_H));
> > > + new_data.min = __resolve_freq(policy, new_data.min, CPUFREQ_RELATION_L);
> > > + WRITE_ONCE(policy->min, new_data.min > policy->max ? policy->max : new_data.min);
> >
> > I don't think this is sufficient, because this still permits an incoherent
> > policy->min and policy->max combination, which makes it possible for schedutil
> > to honor the incoherent limits; i.e., schedutil may observe old policy->min and
> > new policy->max or vice-versa.
>
> Yes, it may, as stated in the new comment in cpufreq_driver_resolve_freq().
Thanks for pointing that out; I had skipped that hunk while reviewing.
I skipped it, though, because schedutil still accesses policy->min/max
unprotected via cpufreq_policy_apply_limits() and __cpufreq_driver_target(),
so the race still affects those calls.
> > We also can't permit a wrong freq to be propagated to the driver and then send
> > the _right_ freq afterwards; IOW, we can't let a bogus freq slip through and
> > just correct it later.
>
> The frequency is neither wrong nor bogus, it is only affected by one
> of the limits that were in effect previously or will be in effect
> going forward. They are valid limits in either case.
I would argue that limits only make sense as a pair, not on their own. Checking
for min > max only covers the case where the new min exceeds the old max; this
means that, when min is raised without exceeding the old max, a thermal throttle
attempt could instead result in a raised frequency floor:
1. policy->min == 100000, policy->max == 2500000
2. Policy limit update request: new min of 400000, new max of 500000
3. schedutil observes policy->min == 400000, policy->max == 2500000
Raising the min freq while lowering the max freq can be a valid thermal throttle
scheme. But it only makes sense if both limits are applied simultaneously.
> > How about using a seqlock?
>
> This would mean extra overhead in the scheduler path pretty much for no gain.
Or there's the slightly cursed approach of using a union to facilitate an atomic
64-bit store of policy->min and max at the same time, since min/max are 32 bits.
Sultan