Date:	Thu, 21 Jan 2016 09:45:13 +0000
From:	Juri Lelli <Juri.Lelli@....com>
To:	Steve Muckle <steve.muckle@...aro.org>
Cc:	Michael Turquette <mturquette@...libre.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Patrick Bellasi <patrick.bellasi@....com>,
	Morten Rasmussen <morten.rasmussen@....com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	"Rafael J. Wysocki" <rjw@...ysocki.net>,
	Javi Merino <Javi.Merino@....com>,
	Punit Agrawal <punit.agrawal@....com>
Subject: Re: sched-freq locking

On 20/01/16 12:46, Steve Muckle wrote:
> Hi Juri,
> 
> On 01/20/2016 07:58 AM, Juri Lelli wrote:
> > On 20/01/16 06:50, Steve Muckle wrote:
> >> On 01/20/2016 04:18 AM, Juri Lelli wrote:
> >>> I fear that caching could break thermal. If everybody was already using
> >>> the sched-freq interface to control frequency this probably wouldn't be a
> >>> problem, but we are not yet there :(. So, IIUC, by caching policy->max,
> >>> for example, we might affect what thermal expects from cpufreq.
> >>
> >> I think we could probably just access policy->max without rwsem, as long
> >> as we ensure policy is valid. Cpufreq will take care to cap a frequency
> >> request to an upper limit put in by thermal anyway so I don't think it's
> >> a problematic race. But I haven't spent much time thinking about it.
> >>
> > 
> > Mmm, can't the fact that the scheduler might think it can still request
> > a frequency above max be problematic?
> 
> I don't think so because cpufreq is going to clamp the incoming request
> to be within policy->min and policy->max anyway (see
> __cpufreq_driver_target() near the top). So nothing's going to
> break/catch fire.
> 

Right. I wasn't worried about this, but rather about the fact that the
scheduler might think there is enough space in a CPU (and balance tasks
accordingly) when that space has actually been restricted by thermal.
But I also think there are ways to prevent this from happening; we just
need to be aware of the problem.
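
For reference, the clamp Steve mentions near the top of
__cpufreq_driver_target() conceptually boils down to something like the
sketch below (simplified, not the exact kernel source):

#include <linux/cpufreq.h>

/*
 * Simplified sketch of the clamping done near the top of
 * __cpufreq_driver_target(); policy->min/max reflect whatever thermal
 * (or userspace) last wrote.
 */
static unsigned int clamp_target(struct cpufreq_policy *policy,
				 unsigned int target_freq)
{
	if (target_freq > policy->max)
		target_freq = policy->max;
	if (target_freq < policy->min)
		target_freq = policy->min;
	return target_freq;
}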

> The question to me is, if the system gets throttled (or un-throttled)
> just as you are evaluating capacity requests, are you actually better
> off using the new policy->max value to scale? I suspect not, because the
> recent load tracking statistics and hence the capacity requests will
> have been calculated based on the original policy->max rather than the
> one that was just written.
> 
> Perhaps a decayed version of policy->max could be maintained and used to
> scale instead. Not sure if this will be enough of an issue in practice
> though to warrant it...
> 

Yeah, I can't really see a big issue here at the moment.
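If it ever turns out to matter, I guess the decayed max could be as
simple as something like this (purely illustrative, names made up):

/*
 * Hypothetical sketch of a decayed policy->max: move part of the way
 * toward the freshly written limit on each update, so that capacity
 * requests computed against the old limit aren't rescaled abruptly.
 * Assumes the value was seeded with the initial policy->max.
 */
static unsigned long policy_max_decayed;

static unsigned long decay_policy_max(unsigned long new_max)
{
	/* move 1/8 of the way toward the new limit on each update */
	policy_max_decayed +=
		((long)new_max - (long)policy_max_decayed) / 8;
	return policy_max_decayed;
}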

> > 
> ...
> > 
> >> Currently schedfreq has to go through two stages of aggregation. The
> >> first is aggregating the capacity request votes from the different sched
> >> classes within a CPU. The second is aggregating the capacity request
> >> votes from the CPUs within a frequency domain. I'm looking to solve the
> >> locking issues with both of these stages as well as those with calling
> >> into cpufreq in the fast path.
> >>
> >> For the first stage, currently we assume that the rq lock is held. This
> >> is the case for the existing calls into schedfreq (the load balancer
> >> required a minor change). There are a few more hooks that need to go
> >> into migration paths in core.c which will be slightly more painful, but
> >> they are IMO doable.
> >>
> >> For the second stage I believe an internal schedfreq spinlock is
> >> required, for one thing to protect against competing updates within the
> >> same frequency domain from different CPUs, but also so that we can
> >> guarantee the cpufreq policy is valid, which can be done if the
> >> governor_start/stop callbacks take the spinlock as well.
> >>
> > 
> > Does this need to nest within the rq lock?
> 
> I think it will have to because I suspect releasing the rq lock to run
> the second stage, then re-acquiring it afterwards, will cause breakage
> in the scheduler paths from which this is all being called.
> 

Right, that's what I was thinking too. However, we need to be aware that
we might add some overhead caused by contention on this new spinlock.
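
Just to make sure we're talking about the same thing, I'm picturing the
nesting roughly like this (hypothetical schedfreq internals, names made
up, not code from the posted series):

#include <linux/spinlock.h>
#include <linux/threads.h>

struct freq_domain {
	raw_spinlock_t lock;		/* arbitrates CPUs in the domain */
	bool started;			/* set/cleared by governor start/stop */
	unsigned long cap_req[NR_CPUS];	/* per-CPU capacity votes */
};

/* called from scheduler hooks with this_rq()->lock already held */
static void set_capacity_request(struct freq_domain *fd, int cpu,
				 unsigned long req)
{
	raw_spin_lock(&fd->lock);
	if (fd->started) {	/* policy known valid while started */
		fd->cap_req[cpu] = req;
		/* aggregate across fd and kick a frequency change */
	}
	raw_spin_unlock(&fd->lock);
}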

> Otherwise I don't see a requirement within schedfreq for it to be nested.
> 

OTOH, doesn't the second stage happen on the kthread (when we need to
sleep)?

> >
> >> As for accessing various things in cpufreq->policy and trying to take
> >> rwsem in the fast path, we should be able to either cache some of the
> >> items in the governor_start() path, eliminate the references, or access
> >> them without locking rwsem (as long as we know the policy is valid,
> > 
> > If we only need to guarantee existence of the policy, without any direct
> > updates, RCU might be a good fit.
> 
> It must be guaranteed that policy->cpus and policy->freq_table are
> current/synchronized with cpufreq for the duration of the second stage.
> I think this precludes the use of RCU.
> 

Mmm, it seems so, yes.

> We're going to need an internal schedfreq spinlock to arbitrate multiple
> CPUs in a frequency domain trying to go through the hot path
> concurrently anyway, so we're really just extending that critical
> section to the governor start and stop callbacks where we can safely
> cache some of the policy data.
> 
> > 
> >> which we do by taking the spinlock in governor_start()). Here are the
> >> things we currently reference in the schedfreq set capacity hook and my
> >> thoughts on each of them:
> >>
> >> policy->governor: this is currently tested to see if schedfreq is
> >> enabled, but this can be tracked internally to schedfreq and set in the
> >> governor_start/stop callbacks
> >>
> >> policy->governor_data: used to access schedfreq's data for the policy,
> >> could be changed to an internal schedfreq percpu pointer
> >>
> >> policy->cpus, policy->freq_table: these can be cached internally to
> >> schedfreq on governor_start/stop
> >>
> >> policy->max: as mentioned above I *think* we could get away with
> >> accessing this without locking rwsem as discussed above
> >>
> >> policy->transition_ongoing: this isn't strictly required as schedfreq
> >> could track internally whether it has a transition outstanding, but we
> >> should also be okay to look at this without policy->rwsem since
> >> schedfreq is the only one queuing transitions, again assuming we take
> >> care to ensure policy is valid as above
> >>
> >> Finally when actually requesting a frequency change in the fast path, we
> >> can trylock policy->rwsem and fall back to the slow path (kthread) if
> >> that fails.
> >>
> > 
> > OTOH, all the needed aggregation could make the fast path not so fast in
> > the end. So, maybe having a fast and a slow path is always good?
> 
> Sorry I didn't follow... What do you mean by having a fast and a slow path?
> 

Sorry, I wasn't clear enough. What I was trying to say is that having
two different paths might always be desirable, even on systems that can
issue frequency changes without sleeping: a fast one, where we aggregate
locally and fire off a request, and a slower one, where we aggregate per
frequency domain, compute the new request and call into cpufreq.
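
Roughly, I'm thinking of the split along these lines (hypothetical
sketch, names made up, just to illustrate the trylock-or-defer idea):

#include <linux/cpufreq.h>
#include <linux/rwsem.h>
#include <linux/sched.h>

/* All names below are made up for illustration; this is not code from
 * the posted series. */
struct gov_data {
	struct cpufreq_policy *policy;
	struct task_struct *task;	/* governor kthread */
	unsigned int requested_freq;
};

static void request_freq(struct gov_data *gd, unsigned int freq)
{
	if (down_write_trylock(&gd->policy->rwsem)) {
		/* fast path: switch frequency from this context */
		__cpufreq_driver_target(gd->policy, freq,
					CPUFREQ_RELATION_L);
		up_write(&gd->policy->rwsem);
	} else {
		/* slow path: let the kthread do it; it can sleep and
		 * take policy->rwsem normally */
		gd->requested_freq = freq;
		wake_up_process(gd->task);
	}
}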
