lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180330190951.nfcdwuzp42bl2lfy@linux-n805>
Date:   Fri, 30 Mar 2018 12:09:51 -0700
From:   Davidlohr Bueso <dave@...olabs.net>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>,
        manfred@...orfullife.com
Cc:     Linux Containers <containers@...ts.linux-foundation.org>,
        linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
        khlebnikov@...dex-team.ru, prakash.sangappa@...cle.com,
        luto@...nel.org, akpm@...ux-foundation.org, oleg@...hat.com,
        serge.hallyn@...ntu.com, esyr@...hat.com, jannh@...gle.com,
        linux-security-module@...r.kernel.org,
        Pavel Emelyanov <xemul@...nvz.org>,
        Nagarathnam Muthusamy <nagarathnam.muthusamy@...cle.com>
Subject: Re: [REVIEW][PATCH 11/11] ipc/sem: Fix semctl(..., GETPID, ...)
 between pid namespaces

On Wed, 28 Mar 2018, Davidlohr Bueso wrote:

>On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>
>>Today the last process to update a semaphore is remembered and
>>reported in the pid namespace of that process.  If there are processes
>>in any other pid namespace querying that process id with GETPID the
>>result will be unusable nonsense as it does not make any
>>sense in your own pid namespace.
>
>Yeah that sounds pretty wrong.
>
>>
>>Due to ipc_update_pid I don't think you will be able to get System V
>>ipc semaphores into a troublesome cache line ping-pong.  Using struct
>>pids from separate process are not a problem because they do not share
>>a cache line.  Using struct pid from different threads of the same
>>process are unlikely to be a problem as the reference count update
>>can be avoided.
>>
>>Further linux futexes are a much better tool for the job of mutual
>>exclusion between processes than System V semaphores.  So I expect
>>programs that  are performance limited by their interprocess mutual
>>exclusion primitive will be using futexes.
>
>You would be wrong. There are plenty of real workloads out there
>that do not use futexes and are care about performance; in the end
>futexes are only good for the uncontended cases, it can also
>destroy numa boxes if you consider the global hash table. Experience
>as shown me that sysvipc sems are quite still used.
>
>>
>>So while it is possible that enhancing the storage of the last
>>rocess of a System V semaphore from an integer to a struct pid
>>will cause a performance regression because of the effect
>>of frequently updating the pid reference count.  I don't expect
>>that to happen in practice.
>
>How's that? Now thanks to ipc_update_pid() for each semop the user
>passes, perform_atomic_semop() will do two atomic updates for the
>cases where there are multiple processes updating the sem. This is
>not uncommon.
>
>Could you please provide some numbers.

I ran this on a 40-core (no ht) Westmere with two benchmarks. The first
is Manfred's sysvsem lockunlock[1] program which uses _processes_ to,
well, lock and unlock the semaphore. The options are a little
unconventional, to keep the "critical region small" and the lock+unlock
frequency high I added busy_in=busy_out=10. Similarly, to get the
worst case scenario and have everyone update the same semaphore, a single
one is used. Here are the results (pretty low stddev from run to run)
for doing 100,000 lock+unlock.

- 1 proc:
   * vanilla
	total execution time: 0.110638 seconds for 100000 loops
   * dirty
	total execution time: 0.120144 seconds for 100000 loops

- 2 proc:
   * vanilla
	total execution time: 0.379756 seconds for 100000 loops
   * dirty
	total execution time: 0.477778 seconds for 100000 loops

- 4 proc:
   * vanilla
	total execution time: 6.749710 seconds for 100000 loops
   * dirty
	total execution time: 4.651872 seconds for 100000 loops

- 8 proc:
   * vanilla
        total execution time: 5.558404 seconds for 100000 loops
   * dirty
	total execution time: 7.143329 seconds for 100000 loops

- 16 proc:
   * vanilla
	total execution time: 9.016398 seconds for 100000 loops
   * dirty
	total execution time: 9.412055 seconds for 100000 loops

- 32 proc:
   * vanilla
	total execution time: 9.694451 seconds for 100000 loops
   * dirty
	total execution time: 9.990451 seconds for 100000 loops

- 64 proc:
   * vanilla
	total execution time: 9.844984 seconds for 100032 loops
   * dirty
	total execution time: 10.016464 seconds for 100032 loops

Lower task counts show pretty massive performance hits of ~9%, ~25%
and ~30% for single, two and four/eight processes. As more are added
I guess the overhead tends to disappear as for one you have a lot
more locking contention going on.


The second workload I ran this patch on was Chris Mason's sem-scalebench[2]
program which uses _threads_ for the sysvsem option (this benchmark is more
about semaphores as a concept rather than sysvsem specific). Dealing with
a single semaphore and increasing thread counts we get:

sembench-sem
                                            vanill                   dirt
                                           vanilla                  dirty
Hmean     sembench-sem-2      286272.00 (   0.00%)   288232.00 (   0.68%)
Hmean     sembench-sem-8      510966.00 (   0.00%)   494375.00 (  -3.25%)
Hmean     sembench-sem-12     435753.00 (   0.00%)   465328.00 (   6.79%)
Hmean     sembench-sem-21     448144.00 (   0.00%)   462091.00 (   3.11%)
Hmean     sembench-sem-30     479519.00 (   0.00%)   471295.00 (  -1.72%)
Hmean     sembench-sem-48     533270.00 (   0.00%)   542525.00 (   1.74%)
Hmean     sembench-sem-79     510218.00 (   0.00%)   528392.00 (   3.56%)

Unsurprisingly, the thread case shows no overhead -- and yes, even better at
times but still noise). Similarly, when completely abusing the systems and doing
64*NCPUS there is pretty much no difference:

              vanill        dirt
             vanilla       dirty
User         1865.99     1819.75
System      35080.97    35396.34
Elapsed      3602.03     3560.50

So at least for a large box this patch hurts the cases where there is low
to medium cpu usage (no more than ~8 processes on a 40 core box) in a non
trivial way. For more processes it doesn't matter. We can confirm that the
case for threads is irrelevant. While I'm not happy about the 30% regression
I guess we can live with this.

Manfred, any thoughts?

Thanks
Davidlohr

[1] https://github.com/manfred-colorfu/ipcscale/blob/master/sem-lockunlock.c
[2] https://github.com/davidlohr/sembench-ng/blob/master/sembench.c

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ