linux-kernel - Re: [PATCH RESEND 2 1/1] sched/syscalls: Allow setting niceness using sched

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e3Nl9UdWoWuPJauA6X3vNj71jDUwHZYS5b5WSmKCHrU7AyivFG5oLkrL-ewb3IjoQyUouDgZO2T-3WEzBIJ9Uru1AcEDTaVsRzHrukUfto8=@pm.me>
Date: Wed, 13 Nov 2024 00:13:13 +0000
From: Michael Pratt <mcpratt@...me>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, Andrew Morton <akpm@...ux-foundation.org>, Thomas Gleixner <tglx@...utronix.de>, Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH RESEND 2 1/1] sched/syscalls: Allow setting niceness using sched_param struct

Hi Steven, thanks for the reply,

On Tuesday, November 12th, 2024 at 10:34, Steven Rostedt <rostedt@...dmis.org> wrote:

> >
> > Cc: stable@...r.kernel.org # Apply to kernel/sched/core.c
>
>
> Why is stable Cc'd?
>

I believe this should be backported, if accepted,
so that the behavior between kernel versions is matching.

> > Signed-off-by: Michael C. Pratt mcpratt@...me
> > ---
> > kernel/sched/syscalls.c | 21 +++++++++++++++++++--
> > 1 file changed, 19 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> > index 24f9f90b6574..43eb283e6281 100644
> > --- a/kernel/sched/syscalls.c
> > +++ b/kernel/sched/syscalls.c
> > @@ -785,6 +785,19 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
> > attr.sched_policy = policy;
> > }
> >
> > + if (attr.sched_priority > MAX_PRIO-1)
> > + return -EINVAL;
> > +
> > + /*
> > + * If priority is set for SCHED_NORMAL or SCHED_BATCH,
> > + * set the niceness instead, but only for user calls.
> > + */
> > + if (check && attr.sched_priority > MAX_RT_PRIO-1 &&
> > + ((policy != SETPARAM_POLICY && fair_policy(policy)) || fair_policy(p->policy))) {
> > + attr.sched_nice = PRIO_TO_NICE(attr.sched_priority);
> > + attr.sched_priority = 0;
> > + }
>
>
> This really looks like a hack.

If you have what you would consider to be a "non-hack" fix
in order for standard function posix_spawnattr_setschedparam()
to be usable instead of always fail for non-RT processes, let me know...

Looking at the larger picture, because Linux translates the sched_param struct
into the local sched_attr struct, and because of changes in history, there are several
caveats to be handled. This internal preparation function _sched_setscheduler()
is essentially only consisting of "hacks", except for the member copying between structs.

> Specifically that we are exposing how the
> kernel records priority to user space. That is the greater than
> MAX_RT_PRIO-1. 120 may be the priority the kernel sees on nice values, but
> it is not something that we should expose to user space system calls.

Is the default static priority value not already exposed, perhaps everywhere???
It's not a secret, but rather common knowledge that when observing a "niceness" of 0
or a "priority" of 20 as seen through common programs like "top", that
these values actually represent 120 as the real priority value, and that
the niceness value is a simple addition to the default for the final effective value.

Also, we have in newer kernel versions, maybe or maybe not dependent on configuration,
the procfs object called "sched" which already shows the actual value.

I can do:

  $ cat /proc/$$/sched

and see the 120 without needing interpretation
due to it being represented in a different way.

> That said, you are worried about the race of spawning a new task and
> setting its nice value because the new task may have exited. What about
> using pidfd? Create a task returning the pidfd and use that to set its nice
> value.

I read a little about pidfd, but I'm not seeing the exact connection here,
perhaps it will reduce the race condition but it cannot eliminate it as far as I see.
For example, I am not finding a function that uses it to adjust niceness.

It's not that the "exit before modify" race condition is the only concern,
it's just one of the less obvious factors making up my rationale for this change.
I'm also concerned with efficiency. Why do we need to call another syscall
if the syscall we are already in can handle it?

Personally, I find it strange that in sched_setscheduler()
the policy can be changed but not the priority,
when there is a standardized function dedicated to just that.

The difference between RT and normal processes
is simply that for normal processes, we use "niceness" instead,
so this patch simply translates the priority to "niceness",
which cannot be expressed separately with the relevant POSIX functions.

--
MCP