Date: Wed, 04 Apr 2007 12:22:00 +1000
From: Nick Piggin <nickpiggin@...oo.com.au>
To: Eric Dumazet <dada1@...mosbay.com>
CC: Andrew Morton <akpm@...ux-foundation.org>,
Jakub Jelinek <jakub@...hat.com>,
Ulrich Drepper <drepper@...hat.com>,
Andi Kleen <andi@...stfloor.org>,
Rik van Riel <riel@...hat.com>,
Linux Kernel <linux-kernel@...r.kernel.org>,
linux-mm@...ck.org, Hugh Dickins <hugh@...itas.com>
Subject: Re: missing madvise functionality
Eric Dumazet wrote:
> Andrew Morton wrote:
>
>> On Tue, 3 Apr 2007 16:29:37 -0400
>> Jakub Jelinek <jakub@...hat.com> wrote:
>>
>>> On Tue, Apr 03, 2007 at 01:17:09PM -0700, Ulrich Drepper wrote:
>>>
>>>> Andrew Morton wrote:
>>>>
>>>>> Ulrich, could you suggest a little test app which would demonstrate
>>>>> this behaviour?
>>>>
>>>> It's not really reliably possible to demonstrate this with a small
>>>> program using malloc. You'd need something like this mysql test case
>>>> which Rik said is not hard to run by yourself.
>>>>
>>>> If somebody adds a kernel interface I can easily produce a glibc patch
>>>> so that the test can be run in the new environment.
>>>>
>>>> But it's of course easy enough to simulate the specific problem in a
>>>> micro benchmark. If you want that let me know.
>>>
>>> I think something like the following testcase, which simulates what free
>>> and malloc do when trimming/growing a non-main arena, should demonstrate it.
>>>
>>> My guess is that all the page zeroing is pretty expensive as well and
>>> takes significant time, but I haven't profiled it.
>>>
>>> #include <pthread.h>
>>> #include <stdlib.h>
>>> #include <sys/mman.h>
>>> #include <unistd.h>
>>>
>>> void *
>>> tf (void *arg)
>>> {
>>>   (void) arg;
>>>   size_t ps = sysconf (_SC_PAGE_SIZE);
>>>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>>>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>   if (p == MAP_FAILED)
>>>     exit (1);
>>>   int i;
>>>   for (i = 0; i < 100000; i++)
>>>     {
>>>       /* Pretend to use the buffer.  */
>>>       char *q, *r = (char *) p + 128 * ps;
>>>       size_t s;
>>>       for (q = (char *) p; q < r; q += ps)
>>>         *q = 1;
>>>       for (s = 0, q = (char *) p; q < r; q += ps)
>>>         s += *q;
>>>       /* Free it.  Replace this mmap with
>>>          madvise (p, 128 * ps, MADV_THROWAWAY) when implemented.  */
>>>       if (mmap (p, 128 * ps, PROT_NONE,
>>>                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
>>>         exit (2);
>>>       /* And immediately malloc again.  This would then be deleted.  */
>>>       if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
>>>         exit (3);
>>>     }
>>>   return NULL;
>>> }
>>>
>>> int
>>> main (void)
>>> {
>>>   pthread_t th[32];
>>>   int i;
>>>   for (i = 0; i < 32; i++)
>>>     if (pthread_create (&th[i], NULL, tf, NULL))
>>>       exit (4);
>>>   for (i = 0; i < 32; i++)
>>>     pthread_join (th[i], NULL);
>>>   return 0;
>>> }
>>>
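For comparison, here is a variant of the testcase above where the "free" step
uses the existing MADV_DONTNEED instead of the mmap(PROT_NONE) + mprotect pair.
This is only a sketch: MADV_DONTNEED discards the pages immediately and
zero-fills on the next touch, which is not the lazy-reclaim behaviour the
proposed MADV_THROWAWAY call is meant to have, but it shows what the
malloc/free path would reduce to with a single madvise call.

#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void *
tf (void *arg)
{
  (void) arg;
  size_t ps = sysconf (_SC_PAGE_SIZE);
  char *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    exit (1);
  int i;
  for (i = 0; i < 100000; i++)
    {
      size_t off;
      /* Touch every page so there is something to throw away.  */
      for (off = 0; off < 128 * ps; off += ps)
        p[off] = 1;
      /* "Free" it: a single madvise call instead of mmap + mprotect.  */
      if (madvise (p, 128 * ps, MADV_DONTNEED))
        exit (2);
    }
  return NULL;
}

int
main (void)
{
  pthread_t th[32];
  int i;
  for (i = 0; i < 32; i++)
    if (pthread_create (&th[i], NULL, tf, NULL))
      exit (3);
  for (i = 0; i < 32; i++)
    pthread_join (th[i], NULL);
  return 0;
}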
>>
>> whee. 135,000 context switches/sec on a slow 2-way. mmap_sem, most
>> likely. That is ungood.
>>
>> Did anyone monitor the context switch rate with the mysql test?
>>
>> Interestingly, your test app (with s/100000/1000) runs to completion in
>> 13 seconds on the slow 2-way. On a fast 8-way, it took 52 seconds and
>> sustained 40,000 context switches/sec. That's a bit unexpected.
>>
>> Both machines show ~8% idle time, too :(
>
>
> Yes... then add to this some futex work, and you get the picture.
>
> I do think such workloads might benefit from a vma_cache that is not
> shared by all threads but private to each thread. A sequence counter
> could invalidate the per-thread cache(s).
>
> i.e. instead of a mm->mmap_cache, have a mm->sequence, and each thread
> have a current->mmap_cache and a current->mm_sequence
I have a patchset to do exactly this, btw.
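A rough sketch of the idea (the field and function names here are illustrative
only, not taken from the actual patchset): the mm keeps a sequence counter that
is bumped whenever the vma set changes, each task caches the last vma it found
together with the sequence value it saw, and so a writer invalidates every
thread's cache with a single increment.

struct vm_area_struct;

struct mm_struct {
        unsigned long sequence;                 /* bumped on every vma change */
        /* ... rbtree of vmas, mmap_sem, etc. ... */
};

struct task_struct {
        struct mm_struct *mm;
        struct vm_area_struct *mmap_cache;      /* per-thread, not per-mm */
        unsigned long mm_sequence;              /* mm->sequence when cached */
};

/* Stand-ins for the real rbtree walk and the vma bounds check. */
struct vm_area_struct *find_vma_slow(struct mm_struct *mm, unsigned long addr);
int vma_contains(struct vm_area_struct *vma, unsigned long addr);

struct vm_area_struct *find_vma_cached(struct task_struct *tsk,
                                       unsigned long addr)
{
        struct mm_struct *mm = tsk->mm;
        struct vm_area_struct *vma = tsk->mmap_cache;

        /* Trust the per-thread cache only if nobody has changed the vma
           set since it was filled and the address still falls inside. */
        if (vma && tsk->mm_sequence == mm->sequence &&
            vma_contains(vma, addr))
                return vma;

        vma = find_vma_slow(mm, addr);
        if (vma) {
                tsk->mmap_cache = vma;
                tsk->mm_sequence = mm->sequence;
        }
        return vma;
}

/* Called by mmap/munmap/mprotect and friends, with the vma tree already
   locked for writing, so every thread's cached entry goes stale at once. */
void invalidate_vma_caches(struct mm_struct *mm)
{
        mm->sequence++;
}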
Anyway, what is the status of the private futex work? I don't think it
is very intrusive or complicated, so it should get merged ASAP (so that
at least we have the interface there).
--
SUSE Labs, Novell Inc.