linux-kernel - Re: [RFC 08/14] sched/tune: add detailed documentation

Message-ID: <55F935D9.2080607@arm.com>
Date:	Wed, 16 Sep 2015 10:26:49 +0100
From:	Juri Lelli <juri.lelli@....com>
To:	Steve Muckle <steve.muckle@...aro.org>,
	Patrick Bellasi <Patrick.Bellasi@....com>
Cc:	Ricky Liang <jcliang@...omium.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
	Jonathan Corbet <corbet@....net>,
	"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
	Viresh Kumar <viresh.kumar@...aro.org>
Subject: Re: [RFC 08/14] sched/tune: add detailed documentation

Hi Steve,

thanks a lot for this interesting discussion.

On 16/09/15 00:55, Steve Muckle wrote:
> On 09/15/2015 08:00 AM, Patrick Bellasi wrote:
>>> Agreed, though I think those tunable values might also change for a
>>> given set of tasks in different circumstances.
>>
>> Could you provide an example?
>>
>> In my view the per-task support should be exploited just for quite
>> specialized tasks, which are usually not subject to many different
>> phases during their execution.
> 
> The surfaceflinger task in Android is a possible example. It can have
> the same issue as the graphics controller task you mentioned - needing
> to finish quickly so the overall display pipeline can meet its deadline,
> but often not exerting enough CPU demand by itself to raise the
> frequency high enough.
>

SurfaceFlinger's timeliness requirements, and maybe AudioFlinger's and
others' as well, might be better expressed by using other scheduling
classes, IMHO. SCHED_DEADLINE, for example, has built-in awareness of
explicit deadlines and might work better with this kind of activity.
Not to mention that Android has already started using SCHED_FIFO for
some of its time-sensitive tasks. It seems to me that the long-run goal
should be to give the scheduler more information about what is going on
and then use that information to make more informed decisions (scheduling,
OPP selection, etc.).

> Since mobile platforms are so power sensitive though, it won't be
> possible to boost surfaceflinger all the time. Perhaps the
> surfaceflinger boost could be managed by some sort of userspace daemon
> monitoring the sort of usecase running and/or whether display deadlines
> are being missed, and updating a schedtune boost cgroup.
> 

I'd say you would like to "boost" just enough to meet a certain quality
of service in the end.
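
As a rough illustration of "boosting just enough", here is a minimal
sketch of the feedback loop such a userspace daemon might run. Everything
in it is an assumption made for illustration: the function names, the
0..100 boost range, the step sizes, and the "schedtune.boost" attribute
name are hypothetical and not taken from the patches under discussion.

```python
from pathlib import Path

# Assumed boost range, in percent (hypothetical).
BOOST_MIN, BOOST_MAX = 0, 100


def next_boost(boost, missed_deadlines, step_up=10, step_down=2):
    """Raise the boost quickly on missed deadlines, lower it slowly otherwise.

    The asymmetric steps are an assumption: react fast to a QoS violation,
    then back off gradually to save power once deadlines are met again.
    """
    if missed_deadlines > 0:
        boost += step_up
    else:
        boost -= step_down
    # Clamp to the assumed valid range.
    return max(BOOST_MIN, min(BOOST_MAX, boost))


def write_boost(cgroup_dir, boost):
    """Write the boost to a hypothetical 'schedtune.boost' file in cgroup_dir."""
    Path(cgroup_dir, "schedtune.boost").write_text(f"{boost}\n")
```

A daemon would call next_boost() once per monitoring interval with the
number of display deadlines missed in that interval, and then push the
result with write_boost().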

>> For example, in a graphics rendering pipeline usually we have a host
> ...
>> With SchedTune we would like to get a similar result to the one you
>> describe using min_sample_time and above_hispeed_delay by linking
>> somehow the "interpretation" of the PELT signal with the boost value.
>>
>> Right now we have in sched-DVFS an idle % headroom which is hardcoded
>> to be ~20% of the current OPP capacity. When the CPU usage crosses
>> that threshold, we switch straight to the max OPP.
>> If we could figure out a proper mechanism to link the boost signal to
>> both the idle % headroom and the target OPP, I think we could achieve
>> results quite similar to what you can get with the knobs offered by
>> the interactive governor.
>> The more you boost a task, the bigger the idle % headroom and
>> the higher the OPP you will jump to.
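
One way to picture the linkage described above is the sketch below. It is
not the sched-DVFS implementation: the linear mappings, the 60% upper
bound on the headroom, and the function name are all assumptions chosen
only to show a boost value driving both the idle % headroom and the OPP
to jump to.

```python
def select_opp(usage, cur_idx, opps, boost=0, base_headroom=0.20):
    """Return the index of the OPP to run at (hypothetical mapping).

    usage: CPU usage, in the same capacity units as `opps`
    cur_idx: index of the current OPP
    opps: ascending list of OPP capacities
    boost: assumed per-task boost percentage, 0..100
    """
    # Assumption: headroom grows linearly from 20% (boost 0) to 60% (boost 100).
    headroom = base_headroom + (boost / 100.0) * 0.40
    threshold = opps[cur_idx] * (1.0 - headroom)
    if usage <= threshold:
        return cur_idx  # still inside the headroom, keep the current OPP

    # Assumption: the jump target scales with the boost, from the next OPP
    # (boost 0) up to the max OPP (boost 100).
    top = len(opps) - 1
    nxt = min(cur_idx + 1, top)
    return nxt + (boost * (top - nxt)) // 100
```

With boost at 100 this degenerates to "jump straight to the max OPP" as
soon as the (larger) headroom is crossed, while boost at 0 only steps to
the next OPP.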
> 
> Let's say I have a system with one task (to set aside the per-task vs.
> global policy issue temporarily) and I want to define a policy which
> 
>  - quickly goes to 1.2GHz when the current frequency is less than
>    that and demand exceeds capacity
> 
>  - waits at least 40ms (or just "a longer time") before increasing the
>    frequency if the current frequency is 1.2GHz or higher
> 
> This is similar to (though a simplification of) what interactive is
> often configured to do on mobile platforms. AFAIK it's a fairly common
> strategy due to the power-perf curves and OPPs available on CPUs, and at
> the same time striving to maintain decent UI responsiveness.
> 
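
For reference, the two-rule policy above can be written down as the
sketch below. The 1.2GHz and 40ms values come from the example; the
function name, the kHz units, and the choice to step to the next higher
frequency once the delay has expired are assumptions, and the table of
frequencies is assumed to contain 1.2GHz.

```python
HISPEED_KHZ = 1_200_000       # 1.2GHz, from the example policy
ABOVE_HISPEED_DELAY_MS = 40   # minimum time to wait at or above 1.2GHz


def next_freq(cur_khz, demand_exceeds_capacity, ms_at_cur_freq, freqs_khz):
    """Return the next frequency for the simplified interactive-like policy."""
    if not demand_exceeds_capacity:
        return cur_khz
    if cur_khz < HISPEED_KHZ:
        # Rule 1: go quickly to 1.2GHz when below it and demand exceeds capacity.
        return HISPEED_KHZ
    if ms_at_cur_freq < ABOVE_HISPEED_DELAY_MS:
        # Rule 2: at 1.2GHz or higher, wait at least 40ms before going up.
        return cur_khz
    # Assumption: after the delay, step to the next higher available frequency.
    higher = [f for f in freqs_khz if f > cur_khz]
    return min(higher) if higher else cur_khz
```

Note that the decision depends on the current frequency and on elapsed
time, not only on the magnitude of the demand, which is the part a single
linear boost value does not capture.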

Not that this is already in place, but, once we have an energy model
of the platform available to the scheduler (the EAS idea), shouldn't
these kinds of considerations be possible without any explicit
configuration? I mean, it seems to me that you start reasoning about
trade-offs after you have obtained power-perf curves for your platform;
but, once this data is available to the scheduler, don't you think we
could put a bit more intelligence there to make the same kind of
decisions you would configure a governor to make?

> Even with the proposed modification to link boost with idle % and target
> OPP I don't think there'd currently be a way to express this policy,
> which goes beyond the linear scaling of the magnitude of CPU demand
> requested by a task, idle headroom or target OPP.
> 
>>
> ...
>>> The hardcoded values in the
>>> task load tracking algorithm seem concerning though from a tuning
>>> standpoint.
>>
>> I agree, that's why we are thinking about the solution described
>> before. Exploiting the boost value to replace the hardcoded thresholds
>> should give more flexibility while being defined per task.
>> Hopefully, tuning per task can be easier and more effective than
>> selecting a single value fitting all needs.
>>
>>>
>>>>> The interactive functionality would require additional knobs. I
>>> ...
>>>> However, regarding specifically the latency on OPP changes, there are
>>>> a couple of extensions we were thinking about:
>>>> 1. link the SchedTune boost value with the % of idle headroom which
>>>>    triggers an OPP increase
>>>> 2. use the SchedTune boost value to define the high frequency to jump
>>>>    to when a CPU crosses the % of idle headroom
>>>
>>> Hmmm... This may be useful (only testing/profiling would tell) though it
>>> may be nice to be able to tune these values.
>>
>> Again, in my view the tuning should be per task with a single knob.
>> The value of the knob should then be properly mapped onto other internal
>> values to obtain a well-defined behavior driven by information shared
>> with the scheduler, i.e. a PELT signal.
>>
>>>> These are tunables which allow parameterizing the way the PELT
>>>> signal for CPU usage is interpreted by the sched-DVFS governor.
>>>>
>>>> How such tunables should be exposed and tuned is to be discussed.
>>>> Indeed, one of the main goals of sched-DVFS, and of SchedTune
>>>> specifically, is to simplify the tuning of a platform by exposing to
>>>> userspace a reduced number of tunables, preferably just one.
>>>
>>> This last point (the desire for a single tunable) is perhaps at the root
>>> of my main concern. There are users/vendors for whom the current
>>> tunables are insufficient, resulting in their hacking the governors to
>>> add more tunables or features in the policy.
>>
>> We should also consider that we are proposing not only a single
>> tunable but also a completely different standpoint. No longer a "blind"
>> system-wide view of the average system behavior, but instead a more
>> detailed view of task behaviors. A single tunable used to "tag" tasks
>> may not be such a limited solution in this design.
> 
> I think the algorithm is still fairly blind. There still has to be a
> heuristic for future CPU usage, it's now just per-task and in the
> scheduler (PELT), whereas it used to be per-CPU and in the governor.
> 
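
To make the "heuristic" concrete, the sketch below is a simplified
floating-point model of the PELT-style geometric decay, in which the
contribution of a ~1ms period halves every 32 periods. It is only an
illustration: the kernel uses fixed-point arithmetic and handles partial
periods, and the function name and normalization here are assumptions.

```python
HALF_LIFE_PERIODS = 32                  # a contribution halves after 32 periods
Y = 0.5 ** (1.0 / HALF_LIFE_PERIODS)    # per-period decay factor, y^32 == 0.5


def pelt_utilization(running_history):
    """Estimate utilization from a per-period running history.

    running_history: iterable of fractions (0..1) of each ~1ms period the
    task spent running, oldest first. Returns a value in [0, 1).
    """
    total = 0.0
    for ran in running_history:
        # Decay everything seen so far by one period, then add the new one.
        total = total * Y + ran
    # Normalize by the limit of the geometric series 1 + y + y^2 + ...
    return total * (1.0 - Y)
```

A task that has been running flat out for only 32 periods reports 50%
utilization, which shows why the signal alone can lag behind a sudden
change in task behavior.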
> This allows for good features like adjusting frequency right away on
> task migration/creation/exit or per task boosting etc., but I think
> policy will still be important. Tasks change their behavior all the
> time, at least in the mobile usecases I've seen.
> 
>>> Consolidating CPU frequency and idle management in the scheduler will
>>> clean things up and probably make things more effective, but I don't
>>> think it will remove the need for a highly configurable policy.
>>
>> This can be verified only by starting to use sched-DVFS + SchedTune on
>> real/synthetic setups to find out which features are actually missing,
>> or which specific use-cases are not properly managed.
>> If we are able to set up these experiments, perhaps we will be able to
>> identify a better design for a scheduler-driven solution.
> 
> Agree. I hope to be able to run some of these experiments to help.
> 
>>> I'm curious about the drive for one tunable. Is that something there's
> ...
>> We have plenty of experience, collected over the past years, with CPUFreq
>> governors and customer-specific mods.
>> Don't you think we can exploit that experience to reason about a
>> fresh new design that satisfies all requirements while
>> possibly providing a simpler interface?
> 
> Sure. I'm just communicating requirements I've seen :) .
> 

And that's great! :-)

>> I agree with you that all the current scenarios must be supported by
>> the new proposal. We should probably start by listing them and come
>> up with a set of test cases that allow us to verify where we are wrt
>> the state of the art.
> 
> Sounds like a good plan to me... Perhaps we could discuss some mobile
> usecases next week at Linaro Connect?
> 

I'm up for it!

Best,

- Juri
