[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241112193617.169fefbc@gandalf.local.home>
Date: Tue, 12 Nov 2024 19:36:17 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Michael Pratt <mcpratt@...me>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, Andrew Morton
<akpm@...ux-foundation.org>, Thomas Gleixner <tglx@...utronix.de>, Peter
Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, Juri Lelli
<juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH RESEND 2 1/1] sched/syscalls: Allow setting niceness
using sched_param struct
On Wed, 13 Nov 2024 00:13:13 +0000
Michael Pratt <mcpratt@...me> wrote:
> >
> > Why is stable Cc'd?
> >
>
> I believe this should be backported, if accepted,
> so that the behavior between kernel versions is matching.
That's not the purpose of stable. In fact, I would argue that it's the
opposite of what stable is for. A stable kernel should *not* change
behavior as that can cause regressions. If you want the newest behavior,
then you should use the newest kernels.
> I can do:
>
> $ cat /proc/$$/sched
>
> and see the 120 without needing interpretation
> due to it being represented in a different way.
True it is exposed via files, but wouldn't this be the first change to make
it visible via a system call?
>
>
> > That said, you are worried about the race of spawning a new task and
> > setting its nice value because the new task may have exited. What about
> > using pidfd? Create a task returning the pidfd and use that to set its nice
> > value.
>
> I read a little about pidfd, but I'm not seeing the exact connection here,
> perhaps it will reduce the race condition but it cannot eliminate it as far as I see.
> For example, I am not finding a function that uses it to adjust niceness.
We can always add a system call do to that ;-) In fact, there's a lot of
system calls that need to be converted to use pidfd over pid.
>
> It's not that the "exit before modify" race condition is the only concern,
> it's just one of the less obvious factors making up my rationale for this change.
> I'm also concerned with efficiency. Why do we need to call another syscall
> if the syscall we are already in can handle it?
>
> Personally, I find it strange that in sched_setscheduler()
> the policy can be changed but not the priority,
> when there is a standardized function dedicated to just that.
My concern is the man page that has (in Debian):
$ man sched_setscheduler
[..]
Currently, Linux supports the following "normal" (i.e., non-real-time) scheduling policies as values that may be specified in policy:
SCHED_OTHER the standard round-robin time-sharing policy;
SCHED_BATCH for "batch" style execution of processes; and
SCHED_IDLE for running very low priority background jobs.
For each of the above policies, param->sched_priority must be 0.
Where we already document that the sched_priority "must be 0".
>
> The difference between RT and normal processes
> is simply that for normal processes, we use "niceness" instead,
> so this patch simply translates the priority to "niceness",
> which cannot be expressed separately with the relevant POSIX functions.
I agree that POSIX has never been that great, but instead of modifying an
existing documented system call to do something that is documented not to
do, I believe we should either use other existing system calls or make a
new one.
-- Steve
Powered by blists - more mailing lists