Date:	Tue, 6 Mar 2012 18:19:04 -0800
From:	Roland Dreier <roland@...nel.org>
To:	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	LKML <linux-kernel@...r.kernel.org>
Subject: runnable tasks never making it to a runqueue where they can run?

Hi Peter, scheduler developers,

I'm seeing some strange scheduler behavior on a system with a process that has
a mix of SCHED_FIFO and SCHED_OTHER threads bound to various CPU masks.
The short story is that some SCHED_OTHER threads never get any CPU time, even
though there is one CPU where no SCHED_FIFO threads are allowed to run.

The longer story with details is that I have a system running 2.6.39.4.  (I
looked through the scheduler changes since then and didn't see anything that
looked relevant, but of course I could have missed something -- and
unfortunately rerunning with a newer kernel is not totally trivial for this
system.)

It's a dual-socket Westmere Xeon system (2 threads per core, 6 cores per
package, 2 packages, so 24 CPUs and hence 24 runqueues total).  The program in
question has 20 SCHED_FIFO tasks, 19 of which are each bound to a single CPU,
and one of which is limited to the cpumask 7ff5d7 (CPUs 0-2, 4, 6-8, 10, 12-22).
The SCHED_OTHER threads (of which there are quite a few) are bound
to the cpumask fff5d7 (CPUs 0-2, 4, 6-8, 10, 12-23).  The main point here
is that all SCHED_OTHER threads can run on CPU 23, and none of the
SCHED_FIFO threads can run there.
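
Roughly, the policy/affinity setup has the following shape -- a simplified,
hypothetical sketch (the thread names and structure are mine, not the actual
program), with the masks matching the ones above:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical sketch of the setup described above, not the real code. */

static void *fifo_worker(void *arg)
{
        long cpu = (long) arg;
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 1 };

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);             /* one FIFO thread pinned per CPU */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        /* ... real-time work ... */
        return NULL;
}

static void *other_worker(void *arg)
{
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set);
        for (cpu = 0; cpu < 24; cpu++)
                if ((0xfff5d7 >> cpu) & 1)      /* CPUs 0-2,4,6-8,10,12-23 */
                        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        /* SCHED_OTHER is the default policy; nothing more to set */
        /* ... normal work ... */
        return NULL;
}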

Due to a couple of bugs in the process, the 19 single-CPU realtime threads
went berserk and never went to sleep, so their 19 CPUs are completely
given over to running SCHED_FIFO threads.  (I have the default

# grep . /proc/sys/kernel/sched_rt*
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000

but since 5 CPUs never run RT threads, this limit never kicks in, and I
have no group scheduling or anything like that configured)
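
If I understand the RT runtime borrowing in sched_rt.c correctly, the
arithmetic guarantees the throttle can never fire here: each of the 19
busy CPUs only needs to borrow

  period - runtime = 1000000 - 950000 = 50000 us

per period to run RT tasks 100% of the time, i.e. 19 * 50 ms = 950 ms in
total, while the 5 CPUs that never run RT threads have 5 * 950 ms = 4750 ms
of unused RT runtime available to lend.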

Another bug led to a SCHED_OTHER thread always being runnable, and this
is where things got weird: that thread seemingly became the only thing that
ever ran on CPU 23 -- all the other SCHED_OTHER threads got load-balanced
between runqueues where they have no chance of ever running because of the
spinning RT threads.

For example, I sampled /proc/sched_debug for a while and watched a
particular TID bounce between the runqueues of CPUs 6, 7, 8, 10, 18, 19
and 22, but never land on 23, where it would actually get a chance to run.
(Interestingly, those seven CPUs plus 23 seem to form an 8-CPU domain.)
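
For what it's worth, the same thing can be watched without grepping
sched_debug: field 39 ("processor") of /proc/<pid>/task/<tid>/stat is the
CPU the thread last ran on.  A quick sampler along those lines -- a
hypothetical sketch, not what I actually ran, assuming that field layout:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char path[64], buf[1024];

        if (argc < 3)
                return 1;
        /* argv[1] = pid, argv[2] = tid */
        snprintf(path, sizeof(path), "/proc/%s/task/%s/stat",
                 argv[1], argv[2]);
        for (;;) {
                FILE *f = fopen(path, "r");
                if (!f)
                        break;
                if (!fgets(buf, sizeof(buf), f)) {
                        fclose(f);
                        break;
                }
                fclose(f);

                /* comm (field 2) can contain spaces, so count fields
                 * starting from the last ')' */
                char *p = strrchr(buf, ')');
                int field, cpu = -1;
                for (field = 2; p && field < 39; field++)
                        p = strchr(p + 1, ' ');
                if (p)
                        sscanf(p, "%d", &cpu);
                printf("last ran on CPU %d\n", cpu);
                sleep(1);
        }
        return 0;
}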

It seems something about this situation is confusing select_task_rq_fair()
so that CPU 23 always looks really bad, but I can't see what goes wrong.

Is this expected behavior?  I can believe that we brought this on ourselves by
misconfiguring things, but in that case it would be good to know what we could
do to avoid this problem.

I'm attaching /proc/sched_debug and /proc/schedstat of the system in this state
in case that helps.  Let me know if there's anything else I should gather from
the system.

Thanks!
  Roland

[Attachment: sched_debug (application/octet-stream, 53049 bytes)]

[Attachment: schedstat (application/octet-stream, 16115 bytes)]
