Message-ID: <20160527081119.GB7651@yexl-desktop>
Date:	Fri, 27 May 2016 16:11:19 +0800
From:	Ye Xiaolong <xiaolong.ye@...el.com>
To:	Johannes Weiner <hannes@...xchg.org>
Cc:	Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
	Vladimir Davydov <vdavydov@...tuozzo.com>,
	Michal Hocko <mhocko@...e.cz>,
	David Rientjes <rientjes@...gle.com>,
	LKML <linux-kernel@...r.kernel.org>, lkp@...org
Subject: Re: [LKP] [mm] 23047a96d7: vm-scalability.throughput -23.8%
 regression

On Wed, May 25, 2016 at 02:06:17PM +0800, Ye Xiaolong wrote:
>On Mon, May 23, 2016 at 04:46:05PM -0400, Johannes Weiner wrote:
>>Hi,
>>
>>thanks for your report.
>>
>>On Tue, May 17, 2016 at 12:58:05PM +0800, kernel test robot wrote:
>>> FYI, we noticed a vm-scalability.throughput -23.8% regression due to commit:
>>> 
>>> commit 23047a96d7cfcfca1a6d026ecaec526ea4803e9e ("mm: workingset: per-cgroup cache thrash detection")
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>> 
>>> in testcase: vm-scalability
>>> on test machine: lkp-hsw01: a 56-thread Grantley Haswell-EP with 64G memory
>>> with the following conditions: cpufreq_governor=performance/runtime=300s/test=lru-file-readtwice
>>
>>That test hammers the LRU activation path, to which this patch added
>>the cgroup lookup and pinning code. Does the following patch help?
>>

Hi Johannes,

FYI, I have done more tests with your fix patch.
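
For reference, the hot path the fix targets looked like this before the
patch (reconstructed from the lines removed by the diff quoted at the end
of this message); every LRU activation paid for lock_page_memcg() plus an
out-of-line lookup call into mm/memcontrol.c:

	void workingset_activation(struct page *page)
	{
		struct lruvec *lruvec;

		/* Pins page->mem_cgroup against concurrent charge moving. */
		lock_page_memcg(page);
		/*
		 * Filter non-memcg pages, e.g. mark_page_accessed()
		 * called on VDSO pages during unmap.
		 */
		if (!mem_cgroup_disabled() && !page_memcg(page))
			goto out;
		/* Before the fix: a function call into mm/memcontrol.c. */
		lruvec = mem_cgroup_zone_lruvec(page_zone(page), page_memcg(page));
		atomic_long_inc(&lruvec->inactive_age);
	out:
		unlock_page_memcg(page);
	}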

1) Applied it on top of the latest kernel (head commit: 478a1469 ("Merge tag
'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm"))

The following is the comparison among the first bad commit's parent, the
first bad commit, the head commit of Linus' master branch, and your fix
commit (a7abed95):

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/tbox_group/test/testcase:
  gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/300s/lkp-hsw01/lru-file-readtwice/vm-scalability

commit: 
  612e44939c3c77245ac80843c0c7876c8cf97282
  23047a96d7cfcfca1a6d026ecaec526ea4803e9e
  478a1469a7d27fe6b2f85fc801ecdeb8afc836e6
  a7abed950afdc1186d4eaf442b7eb296ff04c947

612e44939c3c7724 23047a96d7cfcfca1a6d026eca 478a1469a7d27fe6b2f85fc801 a7abed950afdc1186d4eaf442b
---------------- -------------------------- -------------------------- --------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
  28384711 ±  0%     -23.8%   21621405 ±  0%     -12.4%   24865101 ±  4%      -8.1%   26076417 ±  3%  vm-scalability.throughput
   1854112 ±  0%      -7.7%    1711141 ±  0%      +6.4%    1973257 ±  4%      +9.2%    2025214 ±  3%  vm-scalability.time.involuntary_context_switches
      5279 ±  0%      -0.7%       5243 ±  0%      -2.6%       5143 ±  0%      -2.4%       5153 ±  0%  vm-scalability.time.percent_of_cpu_this_job_got
     16267 ±  0%      -0.6%      16173 ±  0%      -2.0%      15934 ±  0%      -1.8%      15978 ±  0%  vm-scalability.time.system_time
    176.03 ±  0%     -22.2%     136.95 ±  1%     -10.4%     157.66 ±  1%     -11.2%     156.32 ±  0%  vm-scalability.time.user_time
    302905 ±  2%     -31.2%     208386 ±  0%      +5.8%     320618 ± 47%     -36.0%     193991 ± 22%  vm-scalability.time.voluntary_context_switches
      0.92 ±  2%     +51.0%       1.38 ±  2%     +96.5%       1.80 ±  0%     +97.3%       1.81 ±  0%  perf-profile.cycles-pp.kswapd
      2585 ±  1%      -1.9%       2536 ±  0%      +9.6%       2834 ±  1%     +10.7%       2862 ±  1%  uptime.idle
    754212 ±  1%     -29.2%     533832 ±  2%     -34.8%     491397 ±  2%     -27.5%     546666 ±  8%  softirqs.RCU
    151918 ±  8%      +5.7%     160522 ±  5%     -17.4%     125419 ± 18%     -22.7%     117409 ±  7%  softirqs.SCHED
    176.03 ±  0%     -22.2%     136.95 ±  1%     -10.4%     157.66 ±  1%     -11.2%     156.32 ±  0%  time.user_time
    302905 ±  2%     -31.2%     208386 ±  0%      +5.8%     320618 ± 47%     -36.0%     193991 ± 22%  time.voluntary_context_switches


2) Applied it on top of v4.6 (head commit: 2dcd0af5 ("Linux 4.6"))

The following is the comparison among the first bad commit's parent, the
first bad commit, v4.6, and your fix commit (c05f8814):

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/tbox_group/test/testcase:
  gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/300s/lkp-hsw01/lru-file-readtwice/vm-scalability

commit: 
  612e44939c3c77245ac80843c0c7876c8cf97282
  23047a96d7cfcfca1a6d026ecaec526ea4803e9e
  v4.6
  c05f8814641ceabbc628cd4edc7f64ff58498d5a

612e44939c3c7724 23047a96d7cfcfca1a6d026eca                       v4.6 c05f8814641ceabbc628cd4edc
---------------- -------------------------- -------------------------- --------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
  28384711 ±  0%     -23.8%   21621405 ±  0%     -18.9%   23013011 ±  0%     -19.2%   22937943 ±  0%  vm-scalability.throughput
   1854112 ±  0%      -7.7%    1711141 ±  0%      -5.2%    1757124 ±  0%      -4.9%    1762398 ±  0%  vm-scalability.time.involuntary_context_switches
     66021 ±  0%      -0.4%      65745 ±  0%      +1.8%      67231 ±  1%      +3.4%      68291 ±  0%  vm-scalability.time.minor_page_faults
      5279 ±  0%      -0.7%       5243 ±  0%      -1.6%       5197 ±  0%      -1.5%       5198 ±  0%  vm-scalability.time.percent_of_cpu_this_job_got
     16267 ±  0%      -0.6%      16173 ±  0%      -1.5%      16030 ±  0%      -1.4%      16032 ±  0%  vm-scalability.time.system_time
    176.03 ±  0%     -22.2%     136.95 ±  1%     -17.8%     144.66 ±  1%     -17.0%     146.15 ±  1%  vm-scalability.time.user_time
    302905 ±  2%     -31.2%     208386 ±  0%     -19.4%     244167 ±  1%     -19.3%     244584 ±  1%  vm-scalability.time.voluntary_context_switches
     23999 ±  2%      -5.9%      22575 ±  2%      -7.1%      22291 ±  3%      -7.0%      22319 ±  1%  meminfo.Mapped
      0.92 ±  2%     +51.0%       1.38 ±  2%     +96.8%       1.81 ±  0%     +95.5%       1.79 ±  0%  perf-profile.cycles-pp.kswapd
    754212 ±  1%     -29.2%     533832 ±  2%     -41.1%     444019 ±  4%     -42.8%     431345 ±  0%  softirqs.RCU
     20518 ±  2%      -8.1%      18866 ±  2%      -2.6%      19980 ±  3%      -2.9%      19926 ±  7%  vmstat.system.cs
     10574 ± 19%     +29.9%      13737 ±  8%      +1.6%      10740 ± 33%     +16.9%      12359 ± 17%  numa-meminfo.node0.Mapped
     13490 ± 13%     -36.6%       8549 ± 17%     -13.3%      11689 ± 29%     -26.5%       9912 ± 22%  numa-meminfo.node1.Mapped
    176.03 ±  0%     -22.2%     136.95 ±  1%     -17.8%     144.66 ±  1%     -17.0%     146.15 ±  1%  time.user_time
    302905 ±  2%     -31.2%     208386 ±  0%     -19.4%     244167 ±  1%     -19.3%     244584 ±  1%  time.voluntary_context_switches


Thanks,
Xiaolong
>
>Hi,
>
>Here is the comparison between the original first bad commit (23047a96d) and
>your new patch (063f6715e). vm-scalability.throughput improved by 11.3%.
>
>
>compiler/cpufreq_governor/kconfig/rootfs/runtime/tbox_group/test/testcase:
>  gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/300s/lkp-hsw01/lru-file-readtwice/vm-scalability
>
>commit: 
>  23047a96d7cfcfca1a6d026ecaec526ea4803e9e
>  063f6715e77a7be5770d6081fe6d7ca2437ac9f2
>
>23047a96d7cfcfca 063f6715e77a7be5770d6081fe
>---------------- --------------------------
>         %stddev     %change         %stddev
>             \          |                \
>  21621405 ±  0%     +11.3%   24069657 ±  2%  vm-scalability.throughput
>   1711141 ±  0%     +40.9%    2411083 ±  2%  vm-scalability.time.involuntary_context_switches
>      2747 ±  0%      +2.4%       2812 ±  0%  vm-scalability.time.maximum_resident_set_size
>      5243 ±  0%      -1.2%       5180 ±  0%  vm-scalability.time.percent_of_cpu_this_job_got
>    136.95 ±  1%     +13.6%     155.55 ±  0%  vm-scalability.time.user_time
>    208386 ±  0%     -71.5%      59394 ± 16%  vm-scalability.time.voluntary_context_switches
>      1.38 ±  2%     +21.7%       1.69 ±  2%  perf-profile.cycles-pp.kswapd
>    160522 ±  5%     -30.0%     112342 ±  2%  softirqs.SCHED
>      2536 ±  0%      +7.3%       2722 ±  2%  uptime.idle
>   1711141 ±  0%     +40.9%    2411083 ±  2%  time.involuntary_context_switches
>    136.95 ±  1%     +13.6%     155.55 ±  0%  time.user_time
>    208386 ±  0%     -71.5%      59394 ± 16%  time.voluntary_context_switches
>      1052 ± 13%   +1453.8%      16346 ± 39%  cpuidle.C1-HSW.usage
>      1045 ± 12%     -54.3%     477.50 ± 25%  cpuidle.C3-HSW.usage
> 5.719e+08 ±  1%     +17.9%  6.743e+08 ±  0%  cpuidle.C6-HSW.time
>  40424411 ±  2%     -97.3%    1076732 ± 99%  cpuidle.POLL.time
>      7179 ±  5%     -99.9%       6.50 ± 53%  cpuidle.POLL.usage
>      0.51 ±  8%     -40.6%       0.30 ± 13%  turbostat.CPU%c1
>      2.83 ±  2%     +30.5%       3.70 ±  0%  turbostat.CPU%c6
>      0.23 ± 79%    +493.4%       1.35 ±  2%  turbostat.Pkg%pc2
>    255.52 ±  0%      +3.3%     263.95 ±  0%  turbostat.PkgWatt
>     53.26 ±  0%     +14.9%      61.22 ±  0%  turbostat.RAMWatt
>   1836104 ±  0%     +13.3%    2079934 ±  4%  vmstat.memory.free
>      5.00 ±  0%     -70.0%       1.50 ± 33%  vmstat.procs.b
>    107.00 ±  0%      +8.4%     116.00 ±  2%  vmstat.procs.r
>     18866 ±  2%     +40.1%      26436 ± 13%  vmstat.system.cs
>     69056 ±  0%     +11.8%      77219 ±  1%  vmstat.system.in
>  31628132 ±  0%     +80.9%   57224963 ±  0%  meminfo.Active
>  31294504 ±  0%     +81.7%   56876042 ±  0%  meminfo.Active(file)
>    142271 ±  6%     +11.2%     158138 ±  5%  meminfo.DirectMap4k
>  30612825 ±  0%     -87.2%    3915695 ±  0%  meminfo.Inactive
>  30562772 ±  0%     -87.4%    3862631 ±  0%  meminfo.Inactive(file)
>     15635 ±  1%     +38.0%      21572 ±  8%  meminfo.KernelStack
>     22575 ±  2%      +7.7%      24316 ±  4%  meminfo.Mapped
>   1762372 ±  3%     +12.2%    1976873 ±  3%  meminfo.MemFree
>    847557 ±  0%    +105.5%    1741958 ±  8%  meminfo.SReclaimable
>    946378 ±  0%     +95.1%    1846370 ±  8%  meminfo.Slab
>
>Thanks,
>Xiaolong
>
>>From b535c630fd8954865b7536c915c3916beb3b4830 Mon Sep 17 00:00:00 2001
>>From: Johannes Weiner <hannes@...xchg.org>
>>Date: Mon, 23 May 2016 16:14:24 -0400
>>Subject: [PATCH] mm: fix vm-scalability regression in workingset_activation()
>>
>>23047a96d7cf ("mm: workingset: per-cgroup cache thrash detection")
>>added cgroup lookup and pinning overhead to the LRU activation path,
>>which the vm-scalability benchmark is particularly sensitive to.
>>
>>Inline the lookup functions to eliminate calls. Furthermore, since
>>activations are not moved when pages are moved between memcgs, we
>>don't need the full page->mem_cgroup locking; holding the RCU lock is
>>enough to prevent the memcg from being freed.
>>
>>Signed-off-by: Johannes Weiner <hannes@...xchg.org>
>>---
>> include/linux/memcontrol.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
>> include/linux/mm.h         |  8 ++++++++
>> mm/memcontrol.c            | 42 ------------------------------------------
>> mm/workingset.c            | 10 ++++++----
>> 4 files changed, 56 insertions(+), 47 deletions(-)
>>
>>diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>index a805474df4ab..0bb36cf89bf6 100644
>>--- a/include/linux/memcontrol.h
>>+++ b/include/linux/memcontrol.h
>>@@ -306,7 +306,48 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
>> 
>> void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
>> 
>>-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>>+static inline struct mem_cgroup_per_zone *
>>+mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
>>+{
>>+	int nid = zone_to_nid(zone);
>>+	int zid = zone_idx(zone);
>>+
>>+	return &memcg->nodeinfo[nid]->zoneinfo[zid];
>>+}
>>+
>>+/**
>>+ * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
>>+ * @zone: zone of the wanted lruvec
>>+ * @memcg: memcg of the wanted lruvec
>>+ *
>>+ * Returns the lru list vector holding pages for the given @zone and
>>+ * @mem.  This can be the global zone lruvec, if the memory controller
>>+ * is disabled.
>>+ */
>>+static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>>+						    struct mem_cgroup *memcg)
>>+{
>>+	struct mem_cgroup_per_zone *mz;
>>+	struct lruvec *lruvec;
>>+
>>+	if (mem_cgroup_disabled()) {
>>+		lruvec = &zone->lruvec;
>>+		goto out;
>>+	}
>>+
>>+	mz = mem_cgroup_zone_zoneinfo(memcg, zone);
>>+	lruvec = &mz->lruvec;
>>+out:
>>+	/*
>>+	 * Since a node can be onlined after the mem_cgroup was created,
>>+	 * we have to be prepared to initialize lruvec->zone here;
>>+	 * and if offlined then reonlined, we need to reinitialize it.
>>+	 */
>>+	if (unlikely(lruvec->zone != zone))
>>+		lruvec->zone = zone;
>>+	return lruvec;
>>+}
>>+
>> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>> 
>> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>>diff --git a/include/linux/mm.h b/include/linux/mm.h
>>index b530c99e8e81..a9dd54e196a7 100644
>>--- a/include/linux/mm.h
>>+++ b/include/linux/mm.h
>>@@ -943,11 +943,19 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
>> {
>> 	return page->mem_cgroup;
>> }
>>+static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>>+{
>>+	return READ_ONCE(page->mem_cgroup);
>>+}
>> #else
>> static inline struct mem_cgroup *page_memcg(struct page *page)
>> {
>> 	return NULL;
>> }
>>+static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>>+{
>>+	return NULL;
>>+}
>> #endif
>> 
>> /*
>>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>index b3f16ab4b431..f65e5e527864 100644
>>--- a/mm/memcontrol.c
>>+++ b/mm/memcontrol.c
>>@@ -323,15 +323,6 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
>> 
>> #endif /* !CONFIG_SLOB */
>> 
>>-static struct mem_cgroup_per_zone *
>>-mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
>>-{
>>-	int nid = zone_to_nid(zone);
>>-	int zid = zone_idx(zone);
>>-
>>-	return &memcg->nodeinfo[nid]->zoneinfo[zid];
>>-}
>>-
>> /**
>>  * mem_cgroup_css_from_page - css of the memcg associated with a page
>>  * @page: page of interest
>>@@ -944,39 +935,6 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>> 	     iter = mem_cgroup_iter(NULL, iter, NULL))
>> 
>> /**
>>- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
>>- * @zone: zone of the wanted lruvec
>>- * @memcg: memcg of the wanted lruvec
>>- *
>>- * Returns the lru list vector holding pages for the given @zone and
>>- * @mem.  This can be the global zone lruvec, if the memory controller
>>- * is disabled.
>>- */
>>-struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>>-				      struct mem_cgroup *memcg)
>>-{
>>-	struct mem_cgroup_per_zone *mz;
>>-	struct lruvec *lruvec;
>>-
>>-	if (mem_cgroup_disabled()) {
>>-		lruvec = &zone->lruvec;
>>-		goto out;
>>-	}
>>-
>>-	mz = mem_cgroup_zone_zoneinfo(memcg, zone);
>>-	lruvec = &mz->lruvec;
>>-out:
>>-	/*
>>-	 * Since a node can be onlined after the mem_cgroup was created,
>>-	 * we have to be prepared to initialize lruvec->zone here;
>>-	 * and if offlined then reonlined, we need to reinitialize it.
>>-	 */
>>-	if (unlikely(lruvec->zone != zone))
>>-		lruvec->zone = zone;
>>-	return lruvec;
>>-}
>>-
>>-/**
>>  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>>  * @page: the page
>>  * @zone: zone of the page
>>diff --git a/mm/workingset.c b/mm/workingset.c
>>index 8a75f8d2916a..8252de4566e9 100644
>>--- a/mm/workingset.c
>>+++ b/mm/workingset.c
>>@@ -305,9 +305,10 @@ bool workingset_refault(void *shadow)
>>  */
>> void workingset_activation(struct page *page)
>> {
>>+	struct mem_cgroup *memcg;
>> 	struct lruvec *lruvec;
>> 
>>-	lock_page_memcg(page);
>>+	rcu_read_lock();
>> 	/*
>> 	 * Filter non-memcg pages here, e.g. unmap can call
>> 	 * mark_page_accessed() on VDSO pages.
>>@@ -315,12 +316,13 @@ void workingset_activation(struct page *page)
>> 	 * XXX: See workingset_refault() - this should return
>> 	 * root_mem_cgroup even for !CONFIG_MEMCG.
>> 	 */
>>-	if (!mem_cgroup_disabled() && !page_memcg(page))
>>+	memcg = page_memcg_rcu(page);
>>+	if (!mem_cgroup_disabled() && !memcg)
>> 		goto out;
>>-	lruvec = mem_cgroup_zone_lruvec(page_zone(page), page_memcg(page));
>>+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
>> 	atomic_long_inc(&lruvec->inactive_age);
>> out:
>>-	unlock_page_memcg(page);
>>+	rcu_read_unlock();
>> }
>> 
>> /*
>>-- 
>>2.8.2
>>
>_______________________________________________
>LKP mailing list
>LKP@...ts.01.org
>https://lists.01.org/mailman/listinfo/lkp
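
The locking switch in the patch above boils down to the following pattern.
This is an annotated sketch of the reasoning, not additional code from the
patch: lock_page_memcg() stabilizes page->mem_cgroup against concurrent
charge moving, but workingset_activation() only needs the weaker guarantee
that the memcg it sampled stays allocated, which rcu_read_lock() provides
because a memcg is freed only after an RCU grace period:

	struct mem_cgroup *memcg;
	struct lruvec *lruvec;

	rcu_read_lock();
	/*
	 * Lockless snapshot of page->mem_cgroup. The page may be moved
	 * to another memcg concurrently, but activations are not
	 * migrated along with pages, so aging the old memcg's lruvec
	 * is acceptable.
	 */
	memcg = page_memcg_rcu(page);
	if (!mem_cgroup_disabled() && !memcg)
		goto out;
	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
	atomic_long_inc(&lruvec->inactive_age);
out:
	/* The memcg may be freed at any time after this point. */
	rcu_read_unlock();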
