Date:   Mon, 15 Nov 2021 17:02:45 -0800
From:   Douglas Anderson <dianders@...omium.org>
To:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>
Cc:     Joel Fernandes <joelaf@...gle.com>,
        Douglas Anderson <dianders@...omium.org>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org
Subject: [PATCH] sched/rt: Don't reschedule a throttled task even if it's higher priority

While testing RT_GROUP_SCHED, I found that my system would go bonkers
if my test RT tasks ever got throttled (even if my test RT tasks were
set to only get a tiny slice of CPU time). Specifically, I found that
whenever my test RT tasks were throttled, all other RT tasks in the
system were starved (!!). Several important RT tasks in the kernel were
suddenly getting almost no timeslices and my system became unusable.

After some experimentation, I determined that this behavior only
happened when I gave my test RT tasks a high priority. If I gave my
test RT tasks a low priority then they were throttled as expected and
nothing was starved.

I managed to come up with a test case that hopefully anyone can run to
demonstrate the problem. The test case uses shell commands and Python,
but you could certainly reproduce it in other ways:

echo "Allow 20 ms more of RT at system and top cgroup"
old_rt=$(cat /proc/sys/kernel/sched_rt_runtime_us)
echo $((old_rt + 20000)) > /proc/sys/kernel/sched_rt_runtime_us
old_rt=$(cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us)
echo $((old_rt + 20000)) > /sys/fs/cgroup/cpu/cpu.rt_runtime_us

echo "Give 10 ms each to spinny and printy groups"
mkdir /sys/fs/cgroup/cpu/spinny
echo 10000 > /sys/fs/cgroup/cpu/spinny/cpu.rt_runtime_us
mkdir /sys/fs/cgroup/cpu/printy
echo 10000 > /sys/fs/cgroup/cpu/printy/cpu.rt_runtime_us

echo "Fork off a printy thing to be a nice RT citizen"
echo "Prints once per second. Priority only 1."
python -c "import time;
last_time = time.time()
while True:
  time.sleep(1)
  now_time = time.time()
  print('Time flies %f' % (now_time - last_time))
  last_time = now_time" &
pid=$!
echo "Give python a few seconds to get started"
sleep 3
echo $pid >> /sys/fs/cgroup/cpu/printy/tasks
chrt -p -f 1 $pid

echo "Sleep to observe that everything is peachy"
sleep 3

echo "Fork off a bunch of evil spinny things"
echo "Chews CPU time. Priority 99."
for i in $(seq 13); do
  python -c "while True: pass"&
  pid=$!
  echo $pid >> /sys/fs/cgroup/cpu/spinny/tasks
  chrt -p -f 99 $pid
done

echo "Huh? Almost no more prints?"

I believe the problem is an "if" test that's been in push_rt_task()
forever: if the next task to push is higher priority than the current
task, we just reschedule the current task. If I remove that special
case then everything works for me. I tried making it conditional on
just `!rq->rt.rt_throttled`, but for whatever reason that wasn't
enough. The `if` test looks like an unlikely special-case optimization
and it seems like things ought to be fine without it.
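
For reference, the conditional variant I tried first looked roughly
like the sketch below (shown only for discussion; the exact code isn't
part of this patch, and as noted above it wasn't enough to fix the
starvation I was seeing):

	/*
	 * Sketch of the attempted variant: keep the fast-path reschedule,
	 * but skip it while this rt_rq is throttled.
	 */
	if (unlikely(!rq->rt.rt_throttled && next_task->prio < rq->curr->prio)) {
		resched_curr(rq);
		return 0;
	}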

Signed-off-by: Douglas Anderson <dianders@...omium.org>
---
I know less than zero about the scheduler (so if I told you something,
there's a better than 50% chance the opposite is true!). Here I'm
asserting that we don't need this special case at all and the system
will be fine without it, but I don't actually have any data to back
that up. If nothing else, hopefully the test case in the commit message
will let someone else reproduce the problem, see what I'm talking
about, and come up with a better fix.

 kernel/sched/rt.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index b48baaba2fc2..93ea5de0f917 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2048,16 +2048,6 @@ static int push_rt_task(struct rq *rq, bool pull)
 	if (WARN_ON(next_task == rq->curr))
 		return 0;
 
-	/*
-	 * It's possible that the next_task slipped in of
-	 * higher priority than current. If that's the case
-	 * just reschedule current.
-	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
-		resched_curr(rq);
-		return 0;
-	}
-
 	/* We might release rq lock */
 	get_task_struct(next_task);
 
-- 
2.34.0.rc1.387.gb447b232ab-goog
