Message-Id: <20091106122344.51118116.kamezawa.hiroyu@jp.fujitsu.com>
Date: Fri, 6 Nov 2009 12:23:44 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Cc: Christoph Lameter <cl@...ux-foundation.org>,
Dave Jones <davej@...hat.com>,
"hugh.dickins@...cali.co.uk" <hugh.dickins@...cali.co.uk>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org, Tejun Heo <tj@...nel.org>
Subject: Re: [MM] Make mm counters per cpu instead of atomic V2
On Fri, 6 Nov 2009 10:11:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com> wrote:
> This is the result of 'top -b -n 1' with 2000 processes (most of them just
> sleeping) on my 8-CPU SMP box.
>
> == [Before]
> Performance counter stats for 'top -b -n 1' (5 runs):
>
> 406.690304 task-clock-msecs # 0.442 CPUs ( +- 3.327% )
> 32 context-switches # 0.000 M/sec ( +- 0.000% )
> 0 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 718 page-faults # 0.002 M/sec ( +- 0.000% )
> 987832447 cycles # 2428.955 M/sec ( +- 2.655% )
> 933831356 instructions # 0.945 IPC ( +- 2.585% )
> 17383990 cache-references # 42.745 M/sec ( +- 1.676% )
> 353620 cache-misses # 0.870 M/sec ( +- 0.614% )
>
> 0.920712639 seconds time elapsed ( +- 1.609% )
>
> == [After]
> Performance counter stats for 'top -b -n 1' (5 runs):
>
> 675.926348 task-clock-msecs # 0.568 CPUs ( +- 0.601% )
> 62 context-switches # 0.000 M/sec ( +- 1.587% )
> 0 CPU-migrations # 0.000 M/sec ( +- 0.000% )
> 1095 page-faults # 0.002 M/sec ( +- 0.000% )
> 1896320818 cycles # 2805.514 M/sec ( +- 1.494% )
> 1790600289 instructions # 0.944 IPC ( +- 1.333% )
> 35406398 cache-references # 52.382 M/sec ( +- 0.876% )
> 722781 cache-misses # 1.069 M/sec ( +- 0.192% )
>
> 1.190605561 seconds time elapsed ( +- 0.417% )
>
> Because I know 'ps'-related workloads are used in various ways, my concern is
> how this will behave on a large SMP machine.
>
> Maybe the usual use of 'ps -elf' does not read the RSS value and is not
> affected by this. If this counter supported a single-thread mode (most
> applications are single-threaded), the impact would not be big.
>
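To restate why the read side is the problem: with per-cpu counters the
write side touches only the local cpu's slot, but every reader has to
sum over all possible cpus, where the old code was a single
atomic_read(). A rough sketch of the read path (my names, not the
actual patch):

long read_rss_percpu(long *counters, int nr_possible_cpus)
{
	long sum = 0;
	int cpu;

	/*
	 * 'top' over 2000 tasks pays nr_tasks * nr_cpus loads here,
	 * so the cost grows with the size of the machine.
	 */
	for (cpu = 0; cpu < nr_possible_cpus; cpu++)
		sum += counters[cpu];
	return sum;
}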
I measured the benefit in an extreme case with the attached program.
Please look at the number of page faults; bigger is better.
Please let me know if my program is buggy.
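(The "(5 runs)" in the headers is perf's repeat mode, so these numbers
were presumably collected with something like:

% perf stat -r 5 ./multi-fault 8

and 'perf stat -r 5 top -b -n 1' for the earlier runs.)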
Excuses:
My .config may not be tuned for an extreme performance challenge, and my host
has only 8 cpus. (memcg is enabled, hahaha...)
The number of page faults is not very stable (it is affected by
task-clock-msecs), but there seems to be some improvement.
I'd like to see the scores of "top" and this program on big servers...
BTW, can't we have a single-thread mode for this counter?
The read side of typical (single-threaded) programs would benefit a lot.
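Roughly what I have in mind (just a sketch with made-up names; the
transition to multi-threaded mode would need proper synchronization,
which I omit here):

#define MAX_CPUS 64	/* illustration only */

struct mm_counter {
	long plain;		/* used while single-threaded */
	long percpu[MAX_CPUS];	/* used once a second thread attaches */
	int multi;		/* set at the first clone() sharing the mm */
};

static void counter_add(struct mm_counter *c, int cpu, long delta)
{
	if (!c->multi)
		c->plain += delta;	/* single writer, no atomics needed */
	else
		c->percpu[cpu] += delta;
}

static long counter_read(struct mm_counter *c, int nr_cpus)
{
	long sum = c->plain;
	int cpu;

	if (c->multi)
		for (cpu = 0; cpu < nr_cpus; cpu++)
			sum += c->percpu[cpu];
	return sum;	/* O(1) for single-threaded tasks */
}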
==[Before]==
Performance counter stats for './multi-fault 8' (5 runs):

474810.516710 task-clock-msecs # 7.912 CPUs ( +- 0.006% )
10713 context-switches # 0.000 M/sec ( +- 2.529% )
8 CPU-migrations # 0.000 M/sec ( +- 0.000% )
16669105 page-faults # 0.035 M/sec ( +- 0.449% )
1487101488902 cycles # 3131.989 M/sec ( +- 0.012% )
307164795479 instructions # 0.207 IPC ( +- 0.177% )
2355518599 cache-references # 4.961 M/sec ( +- 0.420% )
901969818 cache-misses # 1.900 M/sec ( +- 0.824% )

60.008425257 seconds time elapsed ( +- 0.004% )

==[After]==
Performance counter stats for './multi-fault 8' (5 runs):

474212.969563 task-clock-msecs # 7.902 CPUs ( +- 0.007% )
10281 context-switches # 0.000 M/sec ( +- 0.156% )
9 CPU-migrations # 0.000 M/sec ( +- 0.000% )
16795696 page-faults # 0.035 M/sec ( +- 2.218% )
1485411063159 cycles # 3132.371 M/sec ( +- 0.014% )
305810331186 instructions # 0.206 IPC ( +- 0.133% )
2391293765 cache-references # 5.043 M/sec ( +- 0.737% )
890490519 cache-misses # 1.878 M/sec ( +- 0.212% )

60.010631769 seconds time elapsed ( +- 0.004% )

Thanks,
-Kame
==
/*
 * multi-fault.c :: causes 60 secs of parallel page faults in multiple threads.
 * % gcc -O2 -o multi-fault multi-fault.c -lpthread
 * % ./multi-fault <number of cpus>
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NR_THREADS 32
pthread_t threads[NR_THREADS];

/*
 * To avoid contention on the page table lock, the fault area is
 * sparse: each thread maps MMAP_LENGTH bytes but touches only
 * FAULT_LENGTH of them. If FAULT_LENGTH is too large for your cpus,
 * decrease it.
 */
#define MMAP_LENGTH (8 * 1024 * 1024)
#define FAULT_LENGTH (2 * 1024 * 1024)
void *mmap_area[NR_THREADS];
#define PAGE_SIZE 4096

pthread_barrier_t barrier;
int name[NR_THREADS];

void *worker(void *data)
{
	int cpu = *(int *)data;
	cpu_set_t set;

	/* Pin this worker to its own cpu. */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	/* Wait until main() has set up all mmap areas. */
	pthread_barrier_wait(&barrier);
	while (1) {
		char *c;
		char *start = mmap_area[cpu];
		char *end = mmap_area[cpu] + FAULT_LENGTH;

		/* Touch every page to fault it in... */
		for (c = start; c < end; c += PAGE_SIZE)
			*c = 0;
		/* ...then drop the pages so the next pass faults again. */
		madvise(start, FAULT_LENGTH, MADV_DONTNEED);
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	int i, num, ret;

	if (argc < 2)
		return 0;
	num = atoi(argv[1]);
	if (num < 1 || num > NR_THREADS)
		return 0;

	pthread_barrier_init(&barrier, NULL, num + 1);
	for (i = 0; i < num; i++) {
		name[i] = i;
		/* pthread_create() returns an errno value, not -1. */
		ret = pthread_create(&threads[i], NULL, worker, &name[i]);
		if (ret != 0) {
			fprintf(stderr, "pthread create: %s\n", strerror(ret));
			return 0;
		}
		mmap_area[i] = mmap(NULL, MMAP_LENGTH,
				    PROT_WRITE | PROT_READ,
				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (mmap_area[i] == MAP_FAILED) {
			perror("mmap");
			return 0;
		}
	}
	pthread_barrier_wait(&barrier);
	sleep(60);
	return 0;
}