linux-kernel - Re: [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU


Message-ID: <4BC98C70.50904@colorfullife.com>
Date:	Sat, 17 Apr 2010 12:24:48 +0200
From:	Manfred Spraul <manfred@...orfullife.com>
To:	Chris Mason <chris.mason@...cle.com>
CC:	zach.brown@...cle.com, jens.axboe@...cle.com,
	linux-kernel@...r.kernel.org, Nick Piggin <npiggin@...e.de>
Subject: Re: [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU

Hi Chris,

On 04/12/2010 08:49 PM, Chris Mason wrote:
> @@ -599,6 +622,13 @@ again:
>   		list_splice_init(&new_pending,&work_list);
>   		goto again;
>   	}
> +
> +	list_sort(NULL,&wake_list, list_comp);
> +	while (!list_empty(&wake_list)) {
> +		q = list_entry(wake_list.next, struct sem_queue, list);
> +		list_del_init(&q->list);
> +		wake_up_sem_queue(q, 0);
> +	}
>   }
>    
What about moving this step much later?

There is no need to hold any locks for the actual wake_up_process().

I've updated my patch:
- improved update_queue that guarantees no O(N^2) for your workload.
- move the actual wake-up after dropping all locks
- optimize setting sem_otime
- cacheline align the ipc spinlock.
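
To illustrate the second point, a minimal userspace sketch of the idea, not the attached patch: under the lock, waiters are only unlinked onto a private list, and the wake-ups run after the lock is dropped. The names struct waiter, struct sem_array_model and release_and_wake() are hypothetical, a pthread mutex stands in for the ipc spinlock, and setting a flag stands in for wake_up_process().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sem_queue: one blocked waiter. */
struct waiter {
    struct waiter *next;
    bool woken;              /* stands in for wake_up_process() */
};

/* Hypothetical stand-in for one semaphore array and its lock. */
struct sem_array_model {
    pthread_mutex_t lock;    /* stands in for the ipc spinlock */
    struct waiter *pending;  /* waiters whose operation can now complete */
};

/* Unlink all pending waiters under the lock, wake them after unlocking.
 * Returns the number of waiters woken. */
static int release_and_wake(struct sem_array_model *sma)
{
    struct waiter *wake_list = NULL, *w;
    int n = 0;

    assert(sma != NULL);

    pthread_mutex_lock(&sma->lock);
    /* Critical section: pointer updates only, no wake-ups. */
    while ((w = sma->pending) != NULL) {
        sma->pending = w->next;
        w->next = wake_list;
        wake_list = w;
    }
    pthread_mutex_unlock(&sma->lock);

    /* No lock is held here; the private list is not shared. */
    while ((w = wake_list) != NULL) {
        wake_list = w->next;
        w->next = NULL;
        w->woken = true;
        n++;
    }
    return n;
}
```

The time spent under the lock then depends only on the unlinking, not on how expensive each wake-up is.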

But the odd thing:
It doesn't improve the sembench result at all (AMD Phenom X4).
The only thing that is reduced is the system time:
from ~1 min system time for "sembench -t 250 -w 250 -r 30 -o 0" to ~30 sec.

Binding the sembench threads to CPUs results in an improvement of ~50% - at 
the cost of a significant increase of the system time (from 30 seconds 
to 1 min) and the user time (from 2 seconds to 14 seconds).

Are you sure that the problem is contention on the semaphore array spinlock?
With the above changes, the code that is under the spin_lock is very short.
Especially:
- Why does optimizing ipc/sem.c only reduce the system time [reported by 
time] and not the sembench output?
- Why is there no improvement from cache line aligning the spinlock?
If there were contention, then there should be thrashing from 
accessing the lock and writing sem_otime and reading sem_base.
- Additionally: you wrote that reducing the array size does not help much.
But: The arrays are 100% independent, the ipc code scales linearly.
Spreading the work over multiple spinlocks is, like cache line aligning, 
almost always an improvement if there is contention.
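
What I mean by independent arrays with cache line aligned locks, as a rough userspace sketch: each array carries its own lock, and the alignment guarantees that the locks of two different arrays never sit on the same cache line. The 64 byte line size, struct aligned_sem_array and both helper functions are assumptions for illustration, not code from ipc/sem.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size, not queried from the CPU */

/* Hypothetical model of one semaphore array with its own lock. */
struct aligned_sem_array {
    alignas(CACHE_LINE) pthread_mutex_t lock; /* starts on a line boundary */
    long sem_otime;                           /* written on each operation */
};

/* The size is padded to whole lines, so array elements do not overlap lines. */
static_assert(sizeof(struct aligned_sem_array) % CACHE_LINE == 0,
              "struct must occupy whole cache lines");

/* Index of the cache line that holds address p. */
static uintptr_t cache_line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE;
}

/* True if the locks of two arrays start on the same cache line. */
static bool locks_share_line(const struct aligned_sem_array *a,
                             const struct aligned_sem_array *b)
{
    return cache_line_of(&a->lock) == cache_line_of(&b->lock);
}
```

In this model sem_otime still lives next to the lock of its own array; the sketch only shows that threads working on different arrays do not touch a common line.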

I've attached a modified sembench.c and the proposal for ipc/sem.c
Could you try it?
What do you think?
How many cores do you have in your test system?

--
     Manfred

View attachment "patch-ipc-optimize_bulkwakeup-3" of type "text/plain" (11774 bytes)

View attachment "sembench.c" of type "text/plain" (12583 bytes)