Date:	Sat, 10 Oct 2009 17:57:59 +0300
From:	Török Edwin <edwin@...mav.net>
To:	Ingo Molnar <mingo@...e.hu>, Peter Zijlstra <peterz@...radead.org>
CC:	Linux Kernel <linux-kernel@...r.kernel.org>, aCaB <acab@...mav.net>
Subject: Mutex vs semaphores scheduler bug

Hi,

If a semaphore (such as mmap_sem) is heavily congested, then guarding the
corresponding syscalls with a userspace mutex makes the program faster.

For example, using a mutex around *anonymous* mmaps speeds them up
significantly (~80% on this microbenchmark, ~15% on real applications).
Such workarounds shouldn't be necessary for userspace applications; the
kernel should use the most efficient lock implementation by default.
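
The workaround amounts to something like this (a minimal sketch, not the
attached scheduler.c; the mmap_lock name and the helpers are just
illustrative):

#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

/* One process-wide mutex serializing anonymous mmap/munmap, so threads
 * queue up on a futex-backed mutex instead of on the kernel's mmap_sem. */
static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;

static void *guarded_mmap(size_t len)
{
        void *p;

        pthread_mutex_lock(&mmap_lock);
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pthread_mutex_unlock(&mmap_lock);
        return p;
}

static void guarded_munmap(void *p, size_t len)
{
        pthread_mutex_lock(&mmap_lock);
        munmap(p, len);
        pthread_mutex_unlock(&mmap_lock);
}

The mutex is only held across the syscall itself; nothing changes except
where the threads wait.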

Run the attached testcase and see that when using the mutex, the elapsed
time (and user+sys too) is smaller than when not using it [1].

So using mmap/munmap by itself (and thus hitting the semaphore directly),
memsetting only the first page, is 80% slower than doing the same guarded
by a mutex. If the entire allocated region is memset (thus faulting in all
pages), the mmap without a mutex is still 7% slower.
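
Roughly, each thread in the testcase does the following, with and without
the guard (a sketch only; MAPSIZE is a placeholder, the real loop is in
scheduler.c):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAPSIZE  (100 * 4096)   /* placeholder size, not the value used in scheduler.c */
#define NLOOPS   100            /* nloops=100, as in the runs below */

/* One benchmark thread: map, touch, unmap in a loop.  In the "mutex
 * workaround" variant, mmap()/munmap() are replaced by the guarded_mmap()/
 * guarded_munmap() helpers from the previous snippet. */
static void *worker(void *arg)
{
        int memset_all = *(int *)arg;   /* 0: touch first page only, 1: fault in every page */

        for (int i = 0; i < NLOOPS; i++) {
                char *p = mmap(NULL, MAPSIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        break;
                memset(p, 0, memset_all ? MAPSIZE : (size_t)getpagesize());
                munmap(p, MAPSIZE);
        }
        return NULL;
}

The "memset firstpage" case spends almost all of its time in mmap/munmap
and fault handling, which is why it shows the mmap_sem contention most
clearly.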

In both cases the maximum time for a single mmap is HIGHER when using the
mutex (8.7 ms > 5.1 ms and 4 ms > 0.2 ms), but the average time is smaller.
Also, when using the mutex the number of context switches is SMALLER by
40-60%.

I think this is a scheduler issue: it schedules the mutex case much better,
maybe because userspace also spins a bit before actually calling futex().
So *maybe* a possible fix would be to also spin a little on some semaphores
before blocking on them.
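
To illustrate what I mean by spinning: a futex-based userspace lock does
roughly this before going to sleep (a very simplified sketch of the
spin-then-wait idea, not the actual glibc code and not a kernel patch;
SPIN_LIMIT is an arbitrary number):

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int lock_word;           /* 0 = free, 1 = taken */

#define SPIN_LIMIT 100          /* arbitrary; real implementations tune or adapt this */

static void spin_then_wait_lock(void)
{
        for (;;) {
                /* Busy-wait for a short while: if the owner releases the lock
                 * soon, we take it without a context switch. */
                for (int i = 0; i < SPIN_LIMIT; i++) {
                        if (__sync_bool_compare_and_swap(&lock_word, 0, 1))
                                return;
                }
                /* Still contended: sleep in the kernel until someone unlocks.
                 * FUTEX_WAIT re-checks that lock_word is still 1 before sleeping. */
                syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
        }
}

static void spin_then_wait_unlock(void)
{
        __sync_lock_release(&lock_word);
        /* Naive: always wakes one waiter, even when nobody is sleeping. */
        syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}

Doing something similar on the kernel side (spin briefly before blocking on
the semaphore) might give the plain mmap path the same behaviour.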

I think it's important to optimize the mmap_sem semaphore, because it is
hit quite often (even when not mmapping files):
 - mmap of anon memory
 - malloc (due to the above)
 - stack increase
 - page faults (of both file-mapped and anon memory)

P.S.: this is not just a microbenchmark; it came up during profiling of
ClamAV, where using the mutex speeds it up by ~15%.
P.P.S.: if needed I can send you my .config
Best regards,
--Edwin

[1]

Timings on Linux 2.6.31 Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
--------------------------------------------------------------------------------------------------------------------------

Running tests with ntimings=100, nloops=100, nthreads=16, number of CPUs=4
Starting test mmap sem congestion (memset all)
Test mmap sem congestion (memset all) ended
Timing mmap sem congestion (memset all): elapsed time: 99.1104 s, 3.055 %user, 38.521 %sys
        max 5175.1 us, average 2477.75 us, stdev 49681.3 us
        resource usage: 3.02818 s user, 38.1784 s sys, 99.1104 s elapsed, (10240001 min + 0 maj) pagefaults, 5844551 + 2924 context switches

Starting test mmap sem congestion with mutex workaround (memset all)
Test mmap sem congestion with mutex workaround (memset all) ended
Timing mmap sem congestion with mutex workaround (memset all): elapsed time: 93.4126 s, 3.357 %user, 31.822 %sys
        max 8710.94 us, average 2335.3 us, stdev 549728 us
        resource usage: 3.13619 s user, 29.7259 s sys, 93.4126 s elapsed, (10240000 min + 0 maj) pagefaults, 3316568 + 1464 context switches

Starting test mmap sem congestion (memset firstpage)
Test mmap sem congestion (memset firstpage) ended
Timing mmap sem congestion (memset firstpage): elapsed time: 6.73074 s, 0.000 %user, 15.571 %sys
        max 229.96 us, average 168.263 us, stdev 541.789 us
        resource usage: 0 s user, 1.04806 s sys, 6.73074 s elapsed, (160000 min + 0 maj) pagefaults, 639971 + 7993 context switches

Starting test mmap sem congestion with mutex workaround (memset firstpage)
Test mmap sem congestion with mutex workaround (memset firstpage) ended
Timing mmap sem congestion with mutex workaround (memset firstpage): elapsed time: 1.23567 s, 2.590 %user, 37.553 %sys
        max 4039.39 us, average 30.887 us, stdev 61860.3 us
        resource usage: 0.032 s user, 0.464025 s sys, 1.23567 s elapsed, (160000 min + 0 maj) pagefaults, 22335 + 53 context switches

Timings on Linux 2.6.30.4 Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
--------------------------------------------------------------------------------------------------------------------------

Running tests with ntimings=100, nloops=100, nthreads=16, number of CPUs=8
Starting test mmap sem congestion (memset all)
Test mmap sem congestion (memset all) ended
Timing mmap sem congestion (memset all): elapsed time: 81.7756 s, 10.527 %user, 317.860 %sys
        max 4893.67 us, average 4088.78 us, stdev 23209.1 us
        resource usage: 8.60853 s user, 259.932 s sys, 81.7756 s elapsed, (10240002 min + 0 maj) pagefaults, 5278414 + 16294 context switches

Starting test mmap sem congestion with mutex workaround (memset all)
Test mmap sem congestion with mutex workaround (memset all) ended
Timing mmap sem congestion with mutex workaround (memset all): elapsed time: 55.312 s, 13.068 %user, 259.070 %sys
        max 5665.63 us, average 2765.6 us, stdev 527012 us
        resource usage: 7.22845 s user, 143.297 s sys, 55.312 s elapsed, (10240000 min + 0 maj) pagefaults, 3351636 + 14707 context switches

Starting test mmap sem congestion (memset firstpage)
Test mmap sem congestion (memset firstpage) ended
Timing mmap sem congestion (memset firstpage): elapsed time: 6.08303 s, 0.526 %user, 51.951 %sys
        max 354.55 us, average 304.151 us, stdev 74.5072 us
        resource usage: 0.032 s user, 3.16019 s sys, 6.08303 s elapsed, (160000 min + 0 maj) pagefaults, 639994 + 3959 context switches

Starting test mmap sem congestion with mutex workaround (memset firstpage)
Test mmap sem congestion with mutex workaround (memset firstpage) ended
Timing mmap sem congestion with mutex workaround (memset firstpage): elapsed time: 1.15893 s, 6.558 %user, 90.088 %sys
        max 4767.38 us, average 57.9458 us, stdev 83782.6 us
        resource usage: 0.076001 s user, 1.04406 s sys, 1.15893 s elapsed, (160000 min + 0 maj) pagefaults, 64700 + 29 context switches



[Attachment: "scheduler.c", text/x-csrc, 8078 bytes]
