Message-ID: <4BC98C70.50904@colorfullife.com>
Date: Sat, 17 Apr 2010 12:24:48 +0200
From: Manfred Spraul <manfred@...orfullife.com>
To: Chris Mason <chris.mason@...cle.com>
CC: zach.brown@...cle.com, jens.axboe@...cle.com,
linux-kernel@...r.kernel.org, Nick Piggin <npiggin@...e.de>
Subject: Re: [PATCH 2/2] ipc semaphores: order wakeups based on waiter CPU
Hi Chris,
On 04/12/2010 08:49 PM, Chris Mason wrote:
> @@ -599,6 +622,13 @@ again:
> list_splice_init(&new_pending,&work_list);
> goto again;
> }
> +
> + list_sort(NULL,&wake_list, list_comp);
> + while (!list_empty(&wake_list)) {
> + q = list_entry(wake_list.next, struct sem_queue, list);
> + list_del_init(&q->list);
> + wake_up_sem_queue(q, 0);
> + }
> }
>
What about moving this step much later?
There is no need to hold any locks for the actual wake_up_process().
I've updated my patch:
- improve update_queue so that it guarantees no O(N^2) behavior for your workload.
- move the actual wake-up after dropping all locks
- optimize setting sem_otime
- cacheline align the ipc spinlock.
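To illustrate the second point, here is a minimal userspace sketch of the pattern, not the code from the attached patch. All names (struct waiter, struct sem_array_sketch, update_and_wake) are made up for the example, a pthread mutex stands in for the ipc spinlock, and setting an atomic flag stands in for wake_up_process(). Completed waiters are unlinked onto a private list while the lock is held, and the wake-ups happen only after the lock is dropped:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sem_queue: one entry per sleeping task. */
struct waiter {
    struct waiter *next;
    int ready;          /* operation completed; protected by the array lock */
    atomic_int woken;   /* stand-in for the effect of wake_up_process() */
};

/* Hypothetical stand-in for the semaphore array and its spinlock. */
struct sem_array_sketch {
    pthread_mutex_t lock;
    struct waiter *pending;
};

/* Collect completed waiters under the lock, wake them after unlocking.
 * Returns the number of waiters woken. */
static int update_and_wake(struct sem_array_sketch *sma)
{
    struct waiter *wake_head = NULL;
    struct waiter **wake_tail = &wake_head;
    struct waiter **pp;
    int count = 0;

    assert(sma != NULL);

    pthread_mutex_lock(&sma->lock);
    pp = &sma->pending;
    while (*pp != NULL) {
        struct waiter *w = *pp;

        if (w->ready) {
            *pp = w->next;          /* unlink from the pending list */
            w->next = NULL;
            *wake_tail = w;         /* append to the private wake list */
            wake_tail = &w->next;
        } else {
            pp = &w->next;
        }
    }
    pthread_mutex_unlock(&sma->lock);

    /* No lock is held from here on. */
    while (wake_head != NULL) {
        struct waiter *w = wake_head;

        wake_head = w->next;        /* read next first: w may go away once woken */
        atomic_store(&w->woken, 1);
        count++;
    }
    return count;
}
```

The private list keeps the original queue order, so a sort step like the one in your patch could still be applied to it before the final loop.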
But the odd thing:
It doesn't improve the sembench result at all (AMD Phenom X4).
The only thing that is reduced is the system time:
from ~1 min for "sembench -t 250 -w 250 -r 30 -o 0" to ~30 sec.
CPU binding the sembench threads results in an improvement of ~50%, at
the cost of a significant increase in the system time (from 30 seconds
to 1 min) and the user time (from 2 seconds to 14 seconds).
Are you sure that the problem is contention on the semaphore array spinlock?
With the above changes, the code that is under the spin_lock is very short.
Especially:
- Why does optimizing ipc/sem.c only reduce the system time [reported by
time] and not the sembench output?
- Why is there no improvement from the cache line alignment?
If there were contention, then there should be cache line thrashing from
accessing the lock, writing sem_otime and reading sem_base.
- Additionally: you wrote that reducing the array size does not help much.
But: the arrays are 100% independent, the ipc code scales linearly.
Spreading the work over multiple spinlocks is, like cache line aligning,
usually a 100% guaranteed improvement if there is contention.
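To make the cache line argument concrete, here is a small userspace sketch. It is an illustration only, not the layout from the attached patch: struct padded_sem_array, NR_ARRAYS and cache_line_of() are hypothetical names, an int stands in for the spinlock, and a 64 byte cache line is assumed. The lock gets a cache line of its own, so spinning on it does not disturb the line that holds sem_otime and sem_base, and independent arrays never share a lock line:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define CACHE_LINE 64   /* assumed; the real size depends on the CPU */
#define NR_ARRAYS  4    /* hypothetical number of independent arrays */

/* Hypothetical layout: the lock sits alone on its cache line. */
struct padded_sem_array {
    alignas(CACHE_LINE) int lock;           /* stand-in for the ipc spinlock */
    alignas(CACHE_LINE) time_t sem_otime;   /* written after each successful op */
    void *sem_base;                         /* read on every operation */
};

/* The frequently written field must not share a line with the lock. */
static_assert(offsetof(struct padded_sem_array, sem_otime) == CACHE_LINE,
              "sem_otime should start on the second cache line");

/* Independent arrays: each one has its own lock on its own line. */
static struct padded_sem_array arrays[NR_ARRAYS];

/* Index of the cache line that holds the given address. */
static uintptr_t cache_line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE;
}
```

With this layout, cache_line_of(&arrays[0].lock) differs from both cache_line_of(&arrays[0].sem_otime) and cache_line_of(&arrays[1].lock). If the lock were contended, both separations should be visible in the benchmark.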
I've attached a modified sembench.c and the proposal for ipc/sem.c
Could you try it?
What do you think?
How many cores do you have in your test system?
--
Manfred
View attachment "patch-ipc-optimize_bulkwakeup-3" of type "text/plain" (11774 bytes)
View attachment "sembench.c" of type "text/plain" (12583 bytes)