Message-ID: <20180330190951.nfcdwuzp42bl2lfy@linux-n805>
Date: Fri, 30 Mar 2018 12:09:51 -0700
From: Davidlohr Bueso <dave@...olabs.net>
To: "Eric W. Biederman" <ebiederm@...ssion.com>,
manfred@...orfullife.com
Cc: Linux Containers <containers@...ts.linux-foundation.org>,
linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
khlebnikov@...dex-team.ru, prakash.sangappa@...cle.com,
luto@...nel.org, akpm@...ux-foundation.org, oleg@...hat.com,
serge.hallyn@...ntu.com, esyr@...hat.com, jannh@...gle.com,
linux-security-module@...r.kernel.org,
Pavel Emelyanov <xemul@...nvz.org>,
Nagarathnam Muthusamy <nagarathnam.muthusamy@...cle.com>
Subject: Re: [REVIEW][PATCH 11/11] ipc/sem: Fix semctl(..., GETPID, ...)
between pid namespaces
On Wed, 28 Mar 2018, Davidlohr Bueso wrote:
>On Fri, 23 Mar 2018, Eric W. Biederman wrote:
>
>>Today the last process to update a semaphore is remembered and
>>reported in the pid namespace of that process. If a process in any
>>other pid namespace queries that pid with GETPID, the result is
>>unusable nonsense, as the value has no meaning in the querying
>>process's own pid namespace.
>
>Yeah that sounds pretty wrong.
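(To make the problem concrete, here is a minimal illustration of the call in
question; this is my own example using the standard sysvipc API, error
handling omitted:)

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

int main(void)
{
	/* Create a one-semaphore set and do a trivial op so sempid gets set. */
	int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
	semop(semid, &op, 1);

	/*
	 * GETPID returns the pid of the last process that did a semop()
	 * on semaphore 0. Without this patch, that pid is reported as
	 * seen in the updater's pid namespace, so a reader in a
	 * different pid namespace gets a number it cannot interpret.
	 */
	pid_t last = semctl(semid, 0, GETPID);
	printf("last updater: %d\n", (int)last);

	semctl(semid, 0, IPC_RMID);
	return 0;
}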
>
>>
>>Due to ipc_update_pid I don't think you will be able to get System V
>>ipc semaphores into a troublesome cache line ping-pong. Using struct
>>pids from separate processes is not a problem because they do not share
>>a cache line. Using struct pid from different threads of the same
>>process is unlikely to be a problem as the reference count update
>>can be avoided.
>>
>>Further, Linux futexes are a much better tool for the job of mutual
>>exclusion between processes than System V semaphores. So I expect
>>programs that are performance limited by their interprocess mutual
>>exclusion primitive will be using futexes.
>
>You would be wrong. There are plenty of real workloads out there
>that do not use futexes and do care about performance; in the end
>futexes are only good for the uncontended case, and they can also
>destroy NUMA boxes if you consider the global hash table. Experience
>has shown me that sysvipc sems are still very much in use.
>
>>
>>So while it is possible that enhancing the storage of the last
>>process of a System V semaphore from an integer to a struct pid
>>will cause a performance regression, because of the effect
>>of frequently updating the pid reference count, I don't expect
>>that to happen in practice.
>
>How's that? With ipc_update_pid(), for each semop the user
>passes, perform_atomic_semop() will now do two atomic updates in the
>cases where multiple processes are updating the sem. This is
>not uncommon.
>
>Could you please provide some numbers.
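(For reference, the two atomics come from the pid refcounting when the stored
last updater changes; a rough sketch of the pattern, based on my reading of
the series and not copied verbatim from the patch:)

#include <linux/pid.h>

/*
 * Approximate form of the helper discussed above: when the recorded
 * struct pid changes, take a reference on the new one and drop the
 * reference on the old one -- two atomic refcount updates per change
 * of last updater.
 */
static inline void ipc_update_pid(struct pid **pos, struct pid *pid)
{
	struct pid *old = *pos;

	if (pid != old) {
		*pos = get_pid(pid);	/* atomic refcount increment */
		put_pid(old);		/* atomic refcount decrement */
	}
}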
I ran this on a 40-core (no HT) Westmere with two benchmarks. The first
is Manfred's sysvsem lockunlock[1] program, which uses _processes_ to,
well, lock and unlock the semaphore. The options are a little
unconventional: to keep the critical region small and the lock+unlock
frequency high, I added busy_in=busy_out=10. Similarly, to get the
worst-case scenario and have everyone update the same semaphore, a single
semaphore is used. Here are the results (pretty low stddev from run to
run) for doing 100,000 lock+unlock operations.
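(The core operation being measured is essentially the following sysvsem
lock/unlock pair in a loop; this is my own minimal reconstruction for
illustration, not the actual sem-lockunlock.c code:)

#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * One P()/V() pair per iteration on a single semaphore. Each semop()
 * goes through perform_atomic_semop(), which is where the extra
 * ipc_update_pid() work lands. The caller is assumed to have created
 * semid via semget() and initialized it to 1 with SETVAL.
 */
static void lock_unlock_loop(int semid, long loops)
{
	struct sembuf lock   = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
	struct sembuf unlock = { .sem_num = 0, .sem_op =  1, .sem_flg = 0 };

	for (long i = 0; i < loops; i++) {
		semop(semid, &lock, 1);
		/* critical region kept tiny (busy_in=busy_out=10) */
		semop(semid, &unlock, 1);
	}
}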
- 1 proc:
* vanilla
total execution time: 0.110638 seconds for 100000 loops
* dirty
total execution time: 0.120144 seconds for 100000 loops
- 2 proc:
* vanilla
total execution time: 0.379756 seconds for 100000 loops
* dirty
total execution time: 0.477778 seconds for 100000 loops
- 4 proc:
* vanilla
total execution time: 6.749710 seconds for 100000 loops
* dirty
total execution time: 4.651872 seconds for 100000 loops
- 8 proc:
* vanilla
total execution time: 5.558404 seconds for 100000 loops
* dirty
total execution time: 7.143329 seconds for 100000 loops
- 16 proc:
* vanilla
total execution time: 9.016398 seconds for 100000 loops
* dirty
total execution time: 9.412055 seconds for 100000 loops
- 32 proc:
* vanilla
total execution time: 9.694451 seconds for 100000 loops
* dirty
total execution time: 9.990451 seconds for 100000 loops
- 64 proc:
* vanilla
total execution time: 9.844984 seconds for 100032 loops
* dirty
total execution time: 10.016464 seconds for 100032 loops
Lower task counts show pretty massive performance hits of ~9%, ~25%
and ~30% for single, two and four/eight processes. As more are added,
I suspect the overhead tends to disappear, as there is a lot more
locking contention going on.
The second workload I ran this patch on was Chris Mason's sem-scalebench[2]
program, which uses _threads_ for the sysvsem option (this benchmark is more
about semaphores as a concept than about sysvsem specifically). Dealing with
a single semaphore and increasing thread counts, we get:
sembench-sem
                              vanilla                  dirty
Hmean sembench-sem-2 286272.00 ( 0.00%) 288232.00 ( 0.68%)
Hmean sembench-sem-8 510966.00 ( 0.00%) 494375.00 ( -3.25%)
Hmean sembench-sem-12 435753.00 ( 0.00%) 465328.00 ( 6.79%)
Hmean sembench-sem-21 448144.00 ( 0.00%) 462091.00 ( 3.11%)
Hmean sembench-sem-30 479519.00 ( 0.00%) 471295.00 ( -1.72%)
Hmean sembench-sem-48 533270.00 ( 0.00%) 542525.00 ( 1.74%)
Hmean sembench-sem-79 510218.00 ( 0.00%) 528392.00 ( 3.56%)
Unsurprisingly, the thread case shows no overhead (and yes, it is even better
at times, but that is still noise). Similarly, when completely abusing the
system and running 64*NCPUS threads, there is pretty much no difference:
                              vanilla                  dirty
User 1865.99 1819.75
System 35080.97 35396.34
Elapsed 3602.03 3560.50
So at least for a large box this patch hurts the cases where there is low
to medium cpu usage (no more than ~8 processes on a 40-core box) in a
non-trivial way. For more processes it doesn't matter. We can also confirm
that the threaded case is unaffected. While I'm not happy about the ~30%
regression, I guess we can live with this.
Manfred, any thoughts?
Thanks
Davidlohr
[1] https://github.com/manfred-colorfu/ipcscale/blob/master/sem-lockunlock.c
[2] https://github.com/davidlohr/sembench-ng/blob/master/sembench.c