[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0hkdr2N65p84X0HdVCkqr621=rE4cFmqFMMXTvn6=BCAw@mail.gmail.com>
Date: Wed, 11 May 2022 15:50:11 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Schspa Shi <schspa@...il.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux PM <linux-pm@...r.kernel.org>
Subject: Re: [PATCH v3] cpufreq: fix race on cpufreq online
On Wed, May 11, 2022 at 3:42 PM Schspa Shi <schspa@...il.com> wrote:
>
> "Rafael J. Wysocki" <rafael@...nel.org> writes:
>
> > On Wed, May 11, 2022 at 2:59 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
> >>
> >> On Wed, May 11, 2022 at 2:21 PM Viresh Kumar <viresh.kumar@...aro.org> wrote:
> >> >
> >> > On 11-05-22, 16:10, Schspa Shi wrote:
> >> > > Viresh Kumar <viresh.kumar@...aro.org> writes:
> >> > > > I am not sure, but maybe there were issues in calling init() with rwsem held, as
> >> > > > it may want to call some API from there.
> >> > > >
> >> > >
> >> > > I have checked all the init() implement of the fellowing files, It should be OK.
> >> > > Function find command:
> >> > > ag "init[\s]+=" drivers/cpufreq
> >> > >
> >> > > All the init() implement only initialize policy object without holding this lock
> >> > > and won't call cpufreq APIs need to hold this lock.
> >> >
> >> > Okay, we can see if someone complains later then :)
> >> >
> >> > > > I don't think you can do that safely. offline() or exit() may depend on
> >> > > > policy->cpus being set to all CPUs.
> >> > > OK, I will move this after exit(). and there will be no effect with those
> >> > > two APIs. But policy->cpus must be clear before release policy->rwsem.
> >> >
> >> > Hmm, I don't think depending on the values of policy->cpus is a good idea to be
> >> > honest. This design is inviting bugs to come in at another place. We need a
> >> > clear flag for this, a new flag or something like policy_list.
> >
> > Why?
> >
> >> > Also I see the same bug happening while the policy is removed. The kobject is
> >> > put after the rwsem is dropped.
> >
> > This shouldn't be a problem because of the wait_for_completion() in
> > cpufreq_policy_put_kobj(). It is known that cpufreq_sysfs_release()
> > has run when cpufreq_policy_put_kobj() returns, so it is safe to free
> > the policy then.
> >
> >> > > > static inline bool policy_is_inactive(struct cpufreq_policy *policy)
> >> > > > {
> >> > > > - return cpumask_empty(policy->cpus);
> >> > > > + return unlikely(cpumask_empty(policy->cpus) ||
> >> > > > + list_empty(&policy->policy_list));
> >> > > > }
> >> > > >
> >> > >
> >> > > I don't think this fully solves my problem.
> >> > > 1. There is some case which cpufreq_online failed after the policy is added to
> >> > > cpufreq_policy_list.
> >> >
> >> > And I missed that :(
> >> >
> >> > > 2. policy->policy_list is not protected by &policy->rwsem, and we
> >> > > can't relay on this to
> >> > > indict the policy is fine.
> >> >
> >> > Ahh..
> >> >
> >> > > >From this point of view, we can fix this problem through the state of
> >> > > this linked list.
> >> > > But the above two problems need to be solved first.
> >> >
> >> > I feel overriding policy_list for this is going to make it complex/messy.
> >> >
> >> > Maybe something like this then:
> >>
> >> There are two things.
> >>
> >> One is the possible race with respect to the sysfs access occurring
> >> during failing initialization and the other is that ->offline() or
> >> ->exit() can be called with or without holding the policy rwsem
> >> depending on the code path.
> >>
> >> Namely, cpufreq_offline() calls them under the policy rwsem, but
> >> cpufreq_remove_dev() calls ->exit() outside the rwsem. Also they are
> >> called outside the rwsem in cpufreq_online().
> >>
> >> Moreover, ->offline() and ->exit() cannot expect policy->cpus to be
> >> populated, because they are called when it is empty from
> >> cpufreq_offline().
> >>
> >> So the $subject patch is correct AFAICS even though it doesn't address
> >> all of the above.
> >
> > TBH, I'm not sure why show() doesn't check policy_is_inactive() under the rwsem.
> >
>
> There is a exist bugs, and somebody try to fixed, please see commit
> Fixes: 2f66196208c9 ("cpufreq: check if policy is inactive early in
> __cpufreq_get()")
Well, exactly.
This only addressed one bug out of a category.
> > Moreover, I'm not sure why the locking dance in store() is necessary.
>
> The store interface hold cpu_hotplug_lock via
> cpus_read_trylock();
> , cannot run in parallel with cpufreq_online() & cpufreq_offline().
So the reason why is to prevent store() from running in parallel with
the two functions above. Which generally is because the policy
configuration is in-flight then. However, I'm wondering about what
exactly would break then.
Powered by blists - more mailing lists