linux-kernel - preemption and rwsems (was: Re: missing madvise functionality)

Message-Id: <20070404160006.8d81a533.akpm@linux-foundation.org>
Date:	Wed, 4 Apr 2007 16:00:06 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Jakub Jelinek <jakub@...hat.com>
Cc:	Ulrich Drepper <drepper@...hat.com>,
	Andi Kleen <andi@...stfloor.org>,
	Rik van Riel <riel@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	linux-mm@...ck.org, Hugh Dickins <hugh@...itas.com>,
	Ingo Molnar <mingo@...e.hu>
Subject: preemption and rwsems (was: Re: missing madvise functionality)

On Tue, 3 Apr 2007 16:29:37 -0400
Jakub Jelinek <jakub@...hat.com> wrote:

> #include <pthread.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> void *
> tf (void *arg)
> {
>   (void) arg;
>   size_t ps = sysconf (_SC_PAGE_SIZE);
>   void *p = mmap (NULL, 128 * ps, PROT_READ | PROT_WRITE,
>                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>   if (p == MAP_FAILED)
>     exit (1);
>   int i;
>   for (i = 0; i < 100000; i++)
>     {
>       /* Pretend to use the buffer.  */
>       char *q, *r = (char *) p + 128 * ps;
>       size_t s;
>       for (q = (char *) p; q < r; q += ps)
>         *q = 1;
>       for (s = 0, q = (char *) p; q < r; q += ps)
>         s += *q;
>       /* Free it.  Replace this mmap with
>          madvise (p, 128 * ps, MADV_THROWAWAY) when implemented.  */
>       if (mmap (p, 128 * ps, PROT_NONE,
>                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != p)
>         exit (2);
>       /* And immediately malloc again.  This would then be deleted.  */
>       if (mprotect (p, 128 * ps, PROT_READ | PROT_WRITE))
>         exit (3);
>     }
>   return NULL;
> }
> 
> int
> main (void)
> {
>   pthread_t th[32];
>   int i;
>   for (i = 0; i < 32; i++)
>     if (pthread_create (&th[i], NULL, tf, NULL))
>       exit (4);
>   for (i = 0; i < 32; i++)
>     pthread_join (th[i], NULL);
>   return 0;
> }

This little test app is fun.

I run it all on a single CPU under `taskset -c 0' on the 8-way and it still
causes 160,000 context switches per second and takes 9.5 seconds (after
s/100000/1000/).
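
The message does not say how the context switch rate was measured. As a rough
per-process alternative to a system-wide tool, the test app could sample its
own counters with getrusage(). The helper name context_switches() below is
made up for illustration and is not part of the original program.

```c
#include <sys/resource.h>
#include <sys/time.h>

/*
 * Hypothetical helper: total context switches charged to this process so
 * far (all threads).  ru_nvcsw counts voluntary switches (the thread
 * blocked or yielded), ru_nivcsw counts involuntary ones (it was
 * preempted).  Returns -1 if getrusage() fails.
 */
long context_switches(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru))
		return -1;
	return ru.ru_nvcsw + ru.ru_nivcsw;
}
```

Calling this in main() before the pthread_create() loop and again after the
pthread_join() loop, and dividing the difference by the elapsed time, should
give a figure comparable to the rate quoted above, assuming nothing else in
the process is switching.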

The kernel has

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

and when I switch that to

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set

the context switch rate falls to zilch and total runtime falls to 6.4
seconds.

Presumably the same problem will occur with CONFIG_PREEMPT_VOLUNTARY on
uniprocessor kernels.

<thinks>

What we effectively have is 32 threads on a single CPU all doing

	for (ever) {
		down_write();
		up_write();
		down_read();
		up_read();
	}

and rwsems are "fair".  So

  thread A                                     thread B

  down_write();

  cond_resched()
  ->schedule()

                                               down_read() -> blocks

  up_write()

  down_read()

  up_read()

  down_write() -> there's a reader: block

                                               down_read() -> succeeds

                                               up_read()

                                               down_write() -> there's another down_writer: block

  down_write() -> succeeds

  up_write()

  down_read() -> there's a down_writer: block

                                               down_write() succeeds

                                               up_write()

                                               down_read() -> succeeds

                                               up_read()

                                               down_write() -> there's a down_reader: block

  down_read() succeeds


ad nauseam.


If that cond_resched() was not there, none of this would ever happen - each
thread merrily chugs away doing its ups and downs until it expires its
timeslice.  Interesting, in a sad sort of way.
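
The ping-pong above can be reproduced in a small userspace model. This is
only a sketch under stated assumptions, not the kernel's rwsem code: one CPU,
a strictly FIFO rwsem that hands the lock directly to the waiter(s) at the
head of the queue, threads that otherwise run until they block or finish, and
a "yields" parameter standing in for the number of times cond_resched()
actually reschedules while the write lock is held. All names (simulate, wake,
struct sim) are invented for the illustration.

```c
#include <stdbool.h>

#define MAXT 32			/* the test app uses 32 threads */

enum { DOWN_WRITE, UP_WRITE, DOWN_READ, UP_READ, DONE };

struct fifo { int item[MAXT]; int head, len; };

static void push(struct fifo *f, int v)
{
	f->item[(f->head + f->len++) % MAXT] = v;
}

static int pop(struct fifo *f)
{
	int v = f->item[f->head];

	f->head = (f->head + 1) % MAXT;
	f->len--;
	return v;
}

struct sim {
	int pc[MAXT];		/* next operation of each thread */
	bool writer;		/* rwsem is held for write */
	int readers;		/* number of read holders */
	struct fifo waitq;	/* threads blocked on the rwsem, FIFO */
	struct fifo runq;	/* runnable threads waiting for the CPU */
};

/* Hand the lock straight to the waiter(s) at the head of the queue. */
static void wake(struct sim *s)
{
	while (s->waitq.len && !s->writer) {
		int t = s->waitq.item[s->waitq.head];

		if (s->pc[t] == DOWN_WRITE) {
			if (s->readers)
				break;		/* writer must wait for readers */
			s->writer = true;
		} else {
			s->readers++;
		}
		s->pc[t]++;		/* DOWN_x -> UP_x: it now owns the lock */
		push(&s->runq, pop(&s->waitq));
	}
}

/* Returns how many times a thread blocked on the rwsem, or -1 on error. */
long simulate(int nthreads, long iters, int yields)
{
	struct sim s = {0};
	long left[MAXT], blocks = 0;
	int cur = 0, finished = 0;

	if (nthreads < 1 || nthreads > MAXT || iters < 1)
		return -1;
	for (int t = 0; t < nthreads; t++) {
		left[t] = iters;
		if (t)
			push(&s.runq, t);
	}
	while (finished < nthreads) {
		bool off_cpu = false;	/* current thread gives up the CPU */

		switch (s.pc[cur]) {
		case DOWN_WRITE:
			if (s.writer || s.readers || s.waitq.len) {
				push(&s.waitq, cur); blocks++; off_cpu = true;
			} else {
				s.writer = true; s.pc[cur] = UP_WRITE;
			}
			break;
		case UP_WRITE:
			if (yields > 0 && s.runq.len) {
				/* cond_resched() with the write lock held */
				yields--; push(&s.runq, cur); off_cpu = true;
			} else {
				s.writer = false; s.pc[cur] = DOWN_READ; wake(&s);
			}
			break;
		case DOWN_READ:
			if (s.writer || s.waitq.len) {
				push(&s.waitq, cur); blocks++; off_cpu = true;
			} else {
				s.readers++; s.pc[cur] = UP_READ;
			}
			break;
		case UP_READ:
			s.readers--;
			s.pc[cur] = DOWN_WRITE;
			if (--left[cur] == 0) {
				s.pc[cur] = DONE; finished++; off_cpu = true;
			}
			wake(&s);
			break;
		}
		if (off_cpu) {
			if (!s.runq.len)
				break;	/* everyone finished (or stuck) */
			cur = pop(&s.runq);
		}
	}
	return finished == nthreads ? blocks : -1;
}
```

In this model simulate(32, 100, 0) never blocks at all, while a single
reschedule under the write lock, simulate(32, 100, 1), is enough to leave
the threads blocking on the lock for the rest of the run. It does not
model timeslices or multiple CPUs, so it says nothing about the 8-CPU case.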



Setting CONFIG_PREEMPT_NONE doesn't appear to make any difference to
context switch rate or runtime when all eight CPUs are used, so this
phenomenon is unlikely to be involved in the mysql problem.

I wonder why a similar thing doesn't happen when more than one CPU is used.
