Date:   Wed, 1 Dec 2021 16:50:44 -0800
From:   Doug Anderson <dianders@...omium.org>
To:     Hillf Danton <hdanton@...a.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Joel Fernandes <joelaf@...gle.com>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Mel Gorman <mgorman@...e.de>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/rt: Don't reschedule a throttled task even if it's
 higher priority

Hi,

On Wed, Dec 1, 2021 at 3:31 AM Hillf Danton <hdanton@...a.com> wrote:
>
> On Mon, 15 Nov 2021 17:02:45 -0800 Douglas Anderson wrote:
> > While testing RT_GROUP_SCHED, I found that my system would go bonkers
> > if my test RT tasks ever got throttled (even if my test RT tasks were
> > set to only get a tiny slice of CPU time). Specifically I found that
> > whenever my test RT tasks were throttled that all other RT tasks in
> > the system were being starved (!!). Several important RT tasks in the
> > kernel were suddenly getting almost no timeslices and my system became
> > unusable.
> >
> > After some experimentation, I determined that this behavior only
> > happened when I gave my test RT tasks a high priority. If I gave my
> > test RT tasks a low priority then they were throttled as expected and
> > nothing was starved.
> >
> > I managed to come up with a test case that hopefully anyone can run to
> > demonstrate the problem. The test case uses shell commands and python
> > but certainly you could reproduce in other ways:
> >
> > echo "Allow 20 ms more of RT at system and top cgroup"
> > old_rt=$(cat /proc/sys/kernel/sched_rt_runtime_us)
> > echo $((old_rt + 20000)) > /proc/sys/kernel/sched_rt_runtime_us
> > old_rt=$(cat /sys/fs/cgroup/cpu/cpu.rt_runtime_us)
> > echo $((old_rt + 20000)) > /sys/fs/cgroup/cpu/cpu.rt_runtime_us
> >
> > echo "Give 10 ms each to spinny and printy groups"
> > mkdir /sys/fs/cgroup/cpu/spinny
> > echo 10000 > /sys/fs/cgroup/cpu/spinny/cpu.rt_runtime_us
> > mkdir /sys/fs/cgroup/cpu/printy
> > echo 10000 > /sys/fs/cgroup/cpu/printy/cpu.rt_runtime_us
> >
> > echo "Fork off a printy thing to be a nice RT citizen"
> > echo "Prints once per second. Priority only 1."
> > python -c "import time;
> > last_time = time.time()
> > while True:
> >   time.sleep(1)
> >   now_time = time.time()
> >   print('Time flies %f' % (now_time - last_time))
> >   last_time = now_time" &
> > pid=$!
> > echo "Give python a few seconds to get started"
> > sleep 3
> > echo $pid >> /sys/fs/cgroup/cpu/printy/tasks
> > chrt -p -f 1 $pid
> >
> > echo "Sleep to observe that everything is peachy"
> > sleep 3
> >
> > echo "Fork off a bunch of evil spinny things"
> > echo "Chews CPU time. Priority 99."
> > for i in $(seq 13); do
> >   python -c "while True: pass"&
> >   pid=$!
> >   echo $pid >> /sys/fs/cgroup/cpu/spinny/tasks
> >   chrt -p -f 99 $pid
> > done
> >
> > echo "Huh? Almost no more prints?"
> >
> > I believe that the problem is an "if" test that's been in
> > push_rt_task() forever where we will just reschedule the current
> > task if the next pushable task is higher priority than it. If I
> > just remove that special case then everything works for me. I tried
> > making it conditional on just `!rq->rt.rt_throttled` but for
> > whatever reason that wasn't enough. The `if` test looks like an
> > unlikely special-case optimization and it seems like things ought
> > to be fine without it.
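> >
> > For reference, the check in question looks something like this in
> > kernel/sched/rt.c (paraphrased; the exact form varies a bit by
> > kernel version):
> >
> >     /*
> >      * It's possible that the next_task slipped in of higher
> >      * priority than current. If that's the case just reschedule
> >      * current.
> >      */
> >     if (unlikely(next_task->prio < rq->curr->prio)) {
> >         resched_curr(rq);
> >         return 0;
> >     }
> >
> > (Remember that a lower ->prio value means a higher priority here.)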
> >
> > Signed-off-by: Douglas Anderson <dianders@...omium.org>
> > ---
> > I know less than zero about the scheduler (so if I told you something,
> > it's better than 50% chance that the opposite is true!). Here I'm
> > asserting that we totally don't need this special case and the system
> > will be fine without it, but I actually don't have any data to back
> > that up. If nothing else, hopefully my test case in the commit message
> > would let someone else reproduce and see what I'm talking about and
> > can come up with a better fix.
>
> Can you try to tune the knob down to somewhere like 1ms?
>
> Hillf
>
> /*
>  * period over which we measure -rt task CPU usage in us.
>  * default: 1s
>  */
> unsigned int sysctl_sched_rt_period = 1000000;

I could give it a shot, but that's a pretty big behavior change and
the Documentation (sched-rt-group.rst) warns me away from such a
thing. The default of 1 second seems crazy conservative, but tweaking
it all the way down to 1 ms seems a bit aggressive. It also feels like
this would only be working around the problem, not necessarily solving
it at its core?
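
For reference, in case anyone else wants to try the experiment: the
period knob sits next to the runtime knob in /proc, and since the
runtime isn't allowed to exceed the period, the runtime would have to
come down first. Something like this (untested; values are just for
illustration):

  echo 950 > /proc/sys/kernel/sched_rt_runtime_us    # 0.95 ms runtime
  echo 1000 > /proc/sys/kernel/sched_rt_period_us    # 1 ms period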

-Doug
