Message-ID: <20200122124424.GA15976@darkstar>
Date: Wed, 22 Jan 2020 13:44:24 +0100
From: Patrick Bellasi <patrick.bellasi@...bug.net>
To: Qais Yousef <qais.yousef@....com>
Cc: Valentin Schneider <valentin.schneider@....com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Luis Chamberlain <mcgrof@...nel.org>,
Kees Cook <keescook@...omium.org>,
Iurii Zaikin <yzaikin@...gle.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
qperret@...gle.com, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH] sched/rt: Add a new sysctl to control uclamp_util_min
On 22-Jan 11:45, Qais Yousef wrote:
> On 01/22/20 11:19, Patrick Bellasi wrote:
> > On 14-Jan 21:34, Qais Yousef wrote:
> > > On 01/09/20 10:21, Patrick Bellasi wrote:
> > > > That's not entirely true. In that patch we introduce cgroup support
> > > > but, if you look at the code before that patch, for CFS tasks there is
> > > > only:
> > > > - CFS task-specific values (min,max)=(0,1024) by default
> > > > - CFS system-wide tunables (min,max)=(1024,1024) by default
> > > > and a change on the system-wide tunable allows for example to enforce
> > > > a uclamp_max=200 on all tasks.
> > > >
> > > > A similar solution can be implemented for RT tasks, where we have:
> > > > - RT task-specific values (min,max)=(1024,1024) by default
> > > > - RT system-wide tunables (min,max)=(1024,1024) by default
> > > > and a change on the system-wide tunable allows for example to enforce
> > > > a uclamp_min=200 on all tasks.
> > >
> > > I feel I'm already getting lost in the complexity of the interaction here. Do
> > > we really need to go that path?
> > >
> > > So we will end up with a default system wide for all tasks + a CFS system wide
> > > default + an RT system wide default?
> > >
> > > As I understand it, we have a single system wide default now.
> >
> > Right now we have one system wide default and that's both for all
> > CFS/RT tasks, when cgroups are not in use, or for root group and
> > autogroup CFS/RT tasks, when cgroups are in use.
> >
> > What I'm proposing is to limit the usage of the current system wide
> > default to CFS tasks only, while we add a similar new one specifically
> > for RT tasks.
>
> This change of behavior will be user-visible. I doubt anyone depends on it yet,
> so if we want to change it now is the time.
As it will be forcing users to move tasks into a different group.
> The main issue I see here is that the system wide values are *always* enforced. They
> guarantee a global restriction. Which you're proposing to split into CFS and RT
> system wide global restrictions.
For RT tasks, since we don't have a utilization signal, we always
clamp the 1024 value. That's why forcing a default can be obtained by
just setting a constraint.
Do note that, right now, we don't have a concept of a default _requested_
clamp value. Perhaps the name uclamp_default[UCLAMP_CNT] has been
misleading, but what that value does is just set the default
_max constraint_ value. If you don't change them, we allow the min,max
constraints to have any value.
> But what I am proposing here is setting the *default* fork value of RT uclamp
> min. Which is currently hardcoded to uclamp_none(UCLAMP_MAX) = 1024.
Yes, I got what you propose. And that's what I don't like.
The default _requested_ clamp value for an RT task is 1024 by definition,
since we decided that RT tasks should run at max OPP by default.
> I think the 2 concepts are different to a system wide RT control. Or I am still
> confused about your vision of how this should really work. I don't see how what
> you propose could solve my problem.
>
> For example:
>
> sysctl_sched_uclamp_util_min = 1024
>
> means that tasks can set their uclamp_min to [0:1024], but
>
> sysctl_sched_uclamp_util_min = 800
>
> means that tasks can set their uclamp_min to [0:800]
You don't have to play with the min clamp.
To recap, for RT tasks:
- the default _requested_ value is 1024
- we always clamp the requested value considering the system wide
clamps
- the system wide clamps for RT tasks will be, by default, set to:
(util_min,util_max) = (1024,1024)
According to the above we have the default behavior: RT tasks running
at max OPP.
Now, let's say you want your RT tasks to run only up to 200.
Well, what you need to do is clamp the default _requested_ value
(1024) with a clamp range like (*,200):
rt_clamped_util = uclamp_util(1024, (*,200)) = 200
IOW, what you will use to reduce the OPP is something like:
sysctl_sched_uclamp_util_rt_max = 200
You just need to reduce the MAX clamp constraint, which will enforce
that whatever the task requests, and/or its min, is not greater than 200.
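Purely for illustration, the effect boils down to something like this
(a plain C sketch of the idea, not kernel code; only the
sysctl_sched_uclamp_util_rt_max name comes from the proposal above,
the helper is made up):

	/*
	 * Illustrative sketch only: the RT task keeps *requesting* 1024,
	 * while the RT-specific system-wide max constraint caps the
	 * effective value used for OPP selection.
	 */
	unsigned int sysctl_sched_uclamp_util_rt_max = 200;

	static unsigned int rt_effective_util(unsigned int requested)
	{
		/* clamp the requested value with the system-wide RT max */
		if (requested > sysctl_sched_uclamp_util_rt_max)
			return sysctl_sched_uclamp_util_rt_max;
		return requested;
	}

	/* rt_effective_util(1024) == 200, i.e. uclamp_util(1024, (*,200)) == 200 */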
> Which, when splitting this into CFS and RT controls, would still mean
> controlling the possible range the tasks can use.
>
> But what I want here is that
>
> sysctl_sched_rt_default_uclamp_util_min = 800
>
> Will set p->uclamp_min = 800 when a new RT task is forked and will stay like
> this until the user modifies it. And update all current RT tasks that are still
> using the default value to the new one whenever this changes.
Right, that will work, but you are adding design concepts on top of a
design which already allows you to get what you want: just by properly
setting the constraints.
> If this value is outside the system wide allowed range; it'll be clamped
> correctly by uclamp_eff_get().
>
> Note that the default fork values for CFS are (0, 1024) and we don't have a way
> to change this at the moment.
That's not true: the default value at fork is to copy the parent's
clamps, unless you ask for a reset on fork.
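A simplified sketch of that fork-time behaviour (not the verbatim
kernel code, the _sketch suffix marks it as such): the requested clamps
are inherited together with the rest of the task struct, and only a
reset-on-fork request restores the uclamp_none() defaults:

	/*
	 * Simplified sketch of the fork-time behaviour described above:
	 * p->uclamp_req[] is inherited from the parent when the task
	 * struct is duplicated, and only a sched_reset_on_fork request
	 * restores the uclamp_none() defaults.
	 */
	static void uclamp_fork_sketch(struct task_struct *p)
	{
		enum uclamp_id clamp_id;

		if (likely(!p->sched_reset_on_fork))
			return;		/* keep the clamps copied from the parent */

		for_each_clamp_id(clamp_id)
			uclamp_se_set(&p->uclamp_req[clamp_id],
				      uclamp_none(clamp_id), false);
	}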
> >
> > At the end we will have two system wide defaults, not three.
> >
> > > > > (Would we need CONFIG_RT_GROUP_SCHED for this? IIRC there's a few pain points
> > > > > when turning it on, but I think we don't have to if we just want things like
> > > > > uclamp value propagation?)
> > > >
> > > > No, the current design for CFS tasks works also on !CONFIG_CFS_GROUP_SCHED.
> > > > That's because in this case:
> > > > - uclamp_tg_restrict() returns just the task requested value
> > > > - uclamp_eff_get() _always_ restricts the requested value considering
> > > > the system defaults
> > > >
> > > > > It's quite more work than the simple thing Qais is introducing (and on both
> > > > > user and kernel side).
> > > >
> > > > But if in the future we will want to extend CGroups support to RT then
> > > > we will feel the pains because we do the effective computation in two
> > > > different places.
> > >
> > > Hmm what you're suggesting here is that we want to have
> > > cpu.rt.uclamp.{min,max}? I'm not sure I can agree this is a good idea.
> >
> > That's exactly what we already do for other controllers. For example,
> > if you look at the bandwidth controller, we have separate knobs for
> > CFS and RT tasks.
> >
> > > It makes more sense to create a special group for all rt tasks rather than
> > > treat rt tasks in a cgroup differently.
> >
> > Don't see why that should make more sense. This can work of course but
> > it would enforce a more strict configuration and usage of cgroups to
> > userspace.
>
> It makes sense to me because cgroups offer a logical separation, and if tasks
> within the same group need to be treated differently then they're logically
> separated and need to be grouped differently consequently.
And that's why we can have knobs for different scheduling classes in
the same task group.
> The uclamp values are not RT only properties.
In the current implementation, but that was certainly not the original
intent. Ref to:
Message-ID: <20190115101513.2822-10-patrick.bellasi@....com>
https://lore.kernel.org/lkml/20190115101513.2822-10-patrick.bellasi@arm.com/
I did remove the CFS/RT specialization just to keep uclamp simple for
the first merge. But now that we have it in, and people understand the
RT use-case better, it's now possible to complete the design with
specialized CFS and RT default constraints and knobs.
I also remember that Vincent pointed out, at a certain point, that RT
and CFS tasks should be distinguished internally when it comes to
aggregating the constraints. This will require more work, but what I've
been proposing since the beginning is the first step toward this solution.
> I can see the convenience of doing it the way you propose though. Not sure
> I agree the added complexity is worth it.
I keep failing to see the added complexity. We just need to add two new
system-wide knobs and a small filter in uclamp_eff_get().
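To make that concrete, here is a sketch of what such a filter could
look like (the rt_task() branch and the uclamp_default_rt[] array are
hypothetical additions on top of the current uclamp_eff_get()
structure; they would be backed by the sysctl_sched_uclamp_util_{min,max}_rt
knobs mentioned below):

	/*
	 * Sketch of the proposed filter: an RT task gets restricted by
	 * RT-specific system defaults, a CFS task by the existing ones.
	 * The uclamp_default_rt[] array is a hypothetical addition backed
	 * by the new sysctl_sched_uclamp_util_{min,max}_rt knobs.
	 */
	static struct uclamp_se
	uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
	{
		struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
		struct uclamp_se uc_max = rt_task(p) ? uclamp_default_rt[clamp_id]
						     : uclamp_default[clamp_id];

		/* System default restrictions always apply */
		if (unlikely(uc_req.value > uc_max.value))
			return uc_max;

		return uc_req;
	}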
> > I also have some doubts about this approach matching the delegation
> > model principles.
>
> I don't know about that TBH.
Then it's worth a read. I remember I discovered quite a lot of
caveats about how cgroups are supposed to be supported and used.
I suspect that reflecting your default concept into something
playing well with cgroups could be tricky. If nothing else, the
constraints propagation approach we use now is something Tejun has
already given us his blessing on.
> > > > Do note that a proper CGroup support requires that the system default
> > > > values defines the values for the root group and are consistently
> > > > propagated down the hierarchy. Thus we need to add a dedicated pair of
> > > > cgroup attributes, e.g. cpu.util.rt.{min,max}.
> > > >
> > > > To recap, we don't need CGROUP support right now but just to add a new
> > > > default tracking similar to what we do for CFS.
> > > >
> > > > We already proposed such a support in one of the initial versions of
> > > > the uclamp series:
> > > > Message-ID: <20190115101513.2822-10-patrick.bellasi@....com>
> > > > https://lore.kernel.org/lkml/20190115101513.2822-10-patrick.bellasi@arm.com/
> > >
> > > IIUC what you're suggesting is:
> > >
> > > 1. Use the sysctl to specify the default_rt_uclamp_min
> > > 2. Enforce this value in uclamp_eff_get() rather than my sync logic
> > > 3. Remove the current hack to always set
> > > rt_task->uclamp_min = uclamp_none(UCLAMP_MAX)
> >
> > Right, that's the essence...
> >
> > > If I got it correctly I'd be happy to experiment with it if this is what
> > > you're suggesting. Otherwise I'm afraid I'm failing to see the crux of the
> > > problem you're trying to highlight.
> >
> > ... from what you write above I think you got it right.
> >
> > In my previous posting:
> >
> > Message-ID: <20200109092137.GA2811@...kstar>
> > https://lore.kernel.org/lkml/20200109092137.GA2811@darkstar/
> >
> > there is also the code snippet which should be good enough to extend
> > uclamp_eff_get(). Apart from that, what remains is:
>
> I still have my doubts this will achieve what I want, but hopefully will have
> time to experiment with this today.
Right, just remember you need to reduce the max clamp, not the min as
you were proposing in the example above.
> My worry is that this function enforces
> restrictions, but what I want to do is set a default. I still can't see how
> they are the same; which your answers hint they are.
It all boils down to seeing that:
rt_clamped_util = uclamp_util(1024, (*,200)) = 200
> > - to add the two new sysfs knobs for sysctl_sched_uclamp_util_{min,max}_rt
>
> See my answer above.
>
> > - make a call about how rt tasks in cgroups are clamped, a simple
> > proposal is in the second snippet of my message above.
>
> I'm not keen on changing how this works and it is outside of the scope of this
> patch anyway. Or at least I don't see how they are related yet.
Sure, you'll realize it once (if) you add the small uclamp_eff_get()
change I'm proposing.
Best,
Patrick
--
#include <best/regards.h>
Patrick Bellasi