Message-Id: <20091002135531.3b5abf5c.kamezawa.hiroyu@jp.fujitsu.com>
Date: Fri, 2 Oct 2009 13:55:31 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To: "linux-mm@...ck.org" <linux-mm@...ck.org>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"balbir@...ux.vnet.ibm.com" <balbir@...ux.vnet.ibm.com>,
"nishimura@....nes.nec.co.jp" <nishimura@....nes.nec.co.jp>
Subject: [PATCH 0/2] memcg: improving scalability by reducing lock
contention at charge/uncharge
Hi,
This patch set is against mmotm + the softlimit fix patches
(which are now in the -rc git tree).
In the latest -rc series, the kernel avoids accessing res_counter when
the cgroup is the root cgroup. This helps scalability when memcg is not
used. It's also necessary to improve scalability when memcg is used; this
patch set is for that. Balbir's previous work showed that the biggest
obstacle to better scalability is memcg's res_counter. There are two ways
to attack that:
(1) make counter scale well.
(2) avoid accessing core counter as much as possible.
My first direction was (1). But no counter is free from false sharing
when it needs system-wide, fine-grained synchronization, and res_counter
bundles several pieces of functionality, which makes (1) difficult. A
spin_lock (in the slow path) around the counter also means tons of
cache-line invalidations even when we only access the counter without
modifying it.
This patch series takes direction (2). It implements charge/uncharge in a
batched manner, coalescing accesses to res_counter at charge/uncharge by
exploiting the locality of access.
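To illustrate the charge side, here is a minimal userspace sketch of the
batching idea. This is only a model with invented names (batch_charge,
precharge, CHARGE_BATCH), not the kernel code: a task takes a whole
batch of pages' worth from the shared counter in one locked access and
serves subsequent per-page charges from its local surplus, so the
contended cache line is touched once per batch instead of once per page.
==
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE    4096UL
#define CHARGE_BATCH 32UL    /* pages pre-charged in one shared access */

/* model of res_counter: usage/limit under one shared lock */
struct res_counter {
        pthread_mutex_t lock;
        unsigned long usage;
        unsigned long limit;
};

/* per-task surplus already charged to the shared counter, in bytes */
static __thread unsigned long precharge;

static bool res_counter_charge(struct res_counter *rc, unsigned long bytes)
{
        bool ok = false;

        pthread_mutex_lock(&rc->lock);  /* the contended cache line */
        if (rc->usage + bytes <= rc->limit) {
                rc->usage += bytes;
                ok = true;
        }
        pthread_mutex_unlock(&rc->lock);
        return ok;
}

/*
 * Charge one page. Most calls are served from the local surplus, so the
 * shared counter is touched only once per CHARGE_BATCH page faults.
 */
static bool batch_charge(struct res_counter *rc)
{
        if (precharge >= PAGE_SIZE) {
                precharge -= PAGE_SIZE; /* fast path: no shared access */
                return true;
        }
        /* slow path: charge a whole batch, keep the surplus locally */
        if (res_counter_charge(rc, CHARGE_BATCH * PAGE_SIZE)) {
                precharge = (CHARGE_BATCH - 1) * PAGE_SIZE;
                return true;
        }
        /* near the limit: fall back to an exact single-page charge */
        return res_counter_charge(rc, PAGE_SIZE);
}

int main(void)
{
        struct res_counter rc = {
                .lock  = PTHREAD_MUTEX_INITIALIZER,
                .limit = 1024 * PAGE_SIZE,
        };
        unsigned long i, charged = 0;

        for (i = 0; i < 256; i++)
                charged += batch_charge(&rc);
        /* 256 page charges cost only 256/CHARGE_BATCH locked accesses */
        printf("charged %lu pages, counter usage %lu bytes\n",
               charged, rc.usage);
        return 0;
}
==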
This has been tested for a month, and I got good reports from Balbir and
Nishimura, thanks. One concern is that this adds some members to the
bottom of task_struct; better ideas are welcome.
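For reference, a hedged sketch of how such per-task state can coalesce
the uncharge side as well, continuing the userspace model above (struct
res_counter and PAGE_SIZE as defined there). The names
uncharge_start/one_page/end are again invented for the example, not the
patch's API: unmap teardown brackets its per-page uncharges so the shared
counter is written once per munmap rather than once per page.
==
static __thread bool uncharge_batching;         /* inside a bracketed unmap? */
static __thread unsigned long pending_uncharge; /* bytes to give back */

static void uncharge_start(void)
{
        uncharge_batching = true;
}

static void uncharge_one_page(struct res_counter *rc)
{
        if (uncharge_batching) {
                pending_uncharge += PAGE_SIZE;  /* accumulate locally */
                return;
        }
        pthread_mutex_lock(&rc->lock);          /* unbatched fallback */
        rc->usage -= PAGE_SIZE;
        pthread_mutex_unlock(&rc->lock);
}

static void uncharge_end(struct res_counter *rc)
{
        uncharge_batching = false;
        if (!pending_uncharge)
                return;
        pthread_mutex_lock(&rc->lock);  /* one shared access per unmap */
        rc->usage -= pending_uncharge;
        pthread_mutex_unlock(&rc->lock);
        pending_uncharge = 0;
}
==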
The following is the result of a continuous page-fault test on my 8-CPU
x86-64 box. A loop like the one below runs on all CPUs in parallel for 60
seconds.
==
char *x;
size_t off;

while (1) {
        /* map 1MB of anonymous memory (fd must be -1 for MAP_ANONYMOUS) */
        x = mmap(NULL, MEGA, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        /* touch each page to fault it in (one memcg charge per page) */
        for (off = 0; off < MEGA; off += PAGE_SIZE)
                x[off] = 0;
        /* unmap, uncharging all pages from memcg */
        munmap(x, MEGA);
}
==
Please see the number of page faults: 18,425,800 before vs. 33,177,858
after, roughly an 80% improvement.
[Before]
Performance counter stats for './runpause.sh' (5 runs):
474539.756944 task-clock-msecs # 7.890 CPUs ( +- 0.015% )
10284 context-switches # 0.000 M/sec ( +- 0.156% )
12 CPU-migrations # 0.000 M/sec ( +- 0.000% )
18425800 page-faults # 0.039 M/sec ( +- 0.107% )
1486296285360 cycles # 3132.080 M/sec ( +- 0.029% )
380334406216 instructions # 0.256 IPC ( +- 0.058% )
3274206662 cache-references # 6.900 M/sec ( +- 0.453% )
1272947699 cache-misses # 2.682 M/sec ( +- 0.118% )
60.147907341 seconds time elapsed ( +- 0.010% )
[After]
Performance counter stats for './runpause.sh' (5 runs):
474658.997489 task-clock-msecs # 7.891 CPUs ( +- 0.006% )
10250 context-switches # 0.000 M/sec ( +- 0.020% )
11 CPU-migrations # 0.000 M/sec ( +- 0.000% )
33177858 page-faults # 0.070 M/sec ( +- 0.152% )
1485264748476 cycles # 3129.120 M/sec ( +- 0.021% )
409847004519 instructions # 0.276 IPC ( +- 0.123% )
3237478723 cache-references # 6.821 M/sec ( +- 0.574% )
1182572827 cache-misses # 2.491 M/sec ( +- 0.179% )
60.151786309 seconds time elapsed ( +- 0.014% )
Regards,
-Kame