linux-kernel - CFS Performance Issues

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Date:	Thu, 28 May 2009 15:02:22 +0200
From:	Olaf Kirch <okir@...e.de>
To:	linux-kernel@...r.kernel.org
Cc:	mingo@...hat.com, Andreas Gruenbacher <agruen@...e.de>
Subject: CFS Performance Issues

Hi Ingo,

As you probably know, we've been chasing a variety of performance issues
on our SLE11 kernel, and one of the suspects has been CFS for quite a
while. The benchmarks that pointed to CFS include AIM7, dbench, and a few
others, but the picture has been a bit hazy as to what is really the problem here.

Now IBM recently told us they had played around with some scheduler
tunables and found that by turning off NEW_FAIR_SCHEDULERS, they
could make the regression on a compute benchmark go away completely.
We're currently working on rerunning other benchmarks with NEW_FAIR_SLEEPERS
turned off to see whether it has an impact on these as well.

Of course, the first question we asked ourselves was, how can NEW_FAIR_SLEEPERS
affect a benchmark that rarely sleeps, or not at all?

The answer was, it's not affecting the benchmark processes, but some noise
going on in the background. When I was first able to reproduce this on my work
station, it was knotify4 running in the background - using hardly any CPU, but
getting woken up ~1000 times a second. Don't ask me what it's doing :-)

So I sat down and reproduced this; the most recent iteration of the test program
is courtesy of Andreas Gruenbacher (see below).

This program spawns a number of processes that just spin in a loop. It also spawns
a single process that wakes up 1000 times a second. Every second, it computes the
average time slice per process (utime / number of involuntary context switches),
and prints out the overall average time slice and average utime.

While running this program, you can conveniently enable or disable fair sleepers.
When I do this on my test machine (no desktop in the background this time :-)
I see this:

./slice 16
    avg slice:  1.12 utime: 216263.187500
    avg slice:  0.25 utime: 125507.687500
    avg slice:  0.31 utime: 125257.937500
    avg slice:  0.31 utime: 125507.812500
    avg slice:  0.12 utime: 124507.875000
    avg slice:  0.38 utime: 124757.687500
    avg slice:  0.31 utime: 125508.000000
    avg slice:  0.44 utime: 125757.750000
    avg slice:  2.00 utime: 128258.000000
 ------ here I turned off new_fair_sleepers ----
    avg slice: 10.25 utime: 137008.500000
    avg slice:  9.31 utime: 139008.875000
    avg slice: 10.50 utime: 141508.687500
    avg slice:  9.44 utime: 139258.750000
    avg slice: 10.31 utime: 140008.687500
    avg slice:  9.19 utime: 139008.625000
    avg slice: 10.00 utime: 137258.625000
    avg slice: 10.06 utime: 135258.562500
    avg slice:  9.62 utime: 138758.562500

As you can see, the average time slice is *extremely* low with new fair
sleepers enabled. Turning it off, we get ~10ms time slices, and a
performance that is roughly 10% higher. It looks like this kind of
"silly time slice syndrome" is what is really eating performance here.

After staring at place_entity for a while, and by watching the process'
vruntime for a while, I think what's happening is this.

With fair sleepers turned off, a process that just got woken up will
get the vruntime of the process that's leftmost in the rbtree, and will
thus be placed to the right of the current task.

However, with fair_sleepers enabled, a newly woken up process
will retain its old vruntime as long as it's less than sched_latency
in the past, and thus it will be placed to the very left in the rbtree.
Since a task that is mostly sleeping will never accrue vruntime at
the same rate a cpu-bound task does, it will always preempt any
running task almost immediately after it's scheduled.

Does this make sense?

Any insight you can offer here is greatly appreciated!

Thanks,
Olaf
-- 
Neo didn't bring down the Matrix. SOA did.
				--soafacts.com

View attachment "slice.c" of type "text/x-csrc" (2107 bytes)