Message-Id: <20220518190911.82400-1-hannes@cmpxchg.org>
Date:   Wed, 18 May 2022 15:09:11 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Dave Hansen <dave.hansen@...ux.intel.com>,
        "Huang, Ying" <ying.huang@...el.com>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Andrew Morton <akpm@...ux-foundation.org>
Cc:     linux-mm@...ck.org, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org, kernel-team@...com,
        Zi Yan <ziy@...dia.com>, Michal Hocko <mhocko@...e.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Roman Gushchin <guro@...com>
Subject: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"

This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.

Its premise was that cgroup reclaim cares about freeing memory inside
the cgroup, and demotion just moves pages around within the cgroup
limit. Hence, pages from toptier nodes should be reclaimed directly.

However, with NUMA balancing now doing tier promotions, demotion is
part of the page aging process. Global reclaim demotes the coldest
toptier pages to secondary memory, where their life continues and from
which they have a chance to get promoted back. Essentially, tiered
memory systems have an LRU order that spans multiple nodes.

When cgroups reclaim pages coming off the toptier directly, there can
be colder pages on lower tier nodes that were demoted by global
reclaim. This is an aging inversion, not unlike if cgroups were to
reclaim directly from the active lists while there are inactive pages.

Proactive reclaim is another factor. Its goal is to offload
colder pages from expensive RAM to cheaper storage. When lower tier
memory is available as an intermediate layer, we want offloading to
take advantage of it instead of bypassing to storage.

Revert the patch so that cgroups respect the LRU order spanning the
memory hierarchy.

Of note is a specific undercommit scenario, where all cgroup limits in
the system add up to <= available toptier memory. In that case,
shuffling pages out to lower tiers first to reclaim them from there is
inefficient. This is something that could be optimized/short-circuited
later on (although care must be taken not to accidentally recreate the
aging inversion). Let's ensure correctness first.

Signed-off-by: Johannes Weiner <hannes@...xchg.org>
Cc: Dave Hansen <dave.hansen@...ux.intel.com>
Cc: "Huang, Ying" <ying.huang@...el.com>
Cc: Yang Shi <yang.shi@...ux.alibaba.com>
Cc: Zi Yan <ziy@...dia.com>
Cc: Michal Hocko <mhocko@...e.com>
Cc: Shakeel Butt <shakeelb@...gle.com>
Cc: Roman Gushchin <guro@...com>
---
 mm/vmscan.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)
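
For illustration only, here is a minimal userspace sketch of the
decision can_demote() makes before and after this revert. It is a
model, not kernel code: struct scan_control_model, the *_model helpers
and the two-node topology are hypothetical stand-ins, and the final
return assumes that nothing after the NUMA_NO_NODE check in the real
function can veto demotion.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for the fields of struct scan_control used here. */
struct scan_control_model {
	bool no_demotion;	/* caller explicitly forbids demotion */
	bool cgroup_reclaim;	/* models the result of cgroup_reclaim(sc) */
};

/* Models the numa_demotion_enabled switch; assumed on for this sketch. */
static bool numa_demotion_enabled_model = true;

/*
 * Assumed toy topology: node 0 is toptier and demotes to node 1,
 * node 1 is the lowest tier and has no demotion target.
 */
static int next_demotion_node_model(int nid)
{
	return nid == 0 ? 1 : NUMA_NO_NODE;
}

/* Behavior being reverted: memcg reclaim never demotes. */
static bool can_demote_before(int nid, const struct scan_control_model *sc)
{
	if (!numa_demotion_enabled_model)
		return false;
	if (sc) {
		if (sc->no_demotion)
			return false;
		if (sc->cgroup_reclaim)
			return false;
	}
	return next_demotion_node_model(nid) != NUMA_NO_NODE;
}

/* Behavior after the revert: only an explicit no_demotion blocks it. */
static bool can_demote_after(int nid, const struct scan_control_model *sc)
{
	if (!numa_demotion_enabled_model)
		return false;
	if (sc && sc->no_demotion)
		return false;
	return next_demotion_node_model(nid) != NUMA_NO_NODE;
}
```

In this model, a memcg reclaim pass over node 0 is refused demotion by
can_demote_before() and allowed it by can_demote_after(). Global
reclaim, a set no_demotion flag, and the lowest-tier node behave the
same in both versions.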

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6918fff06e1..7a4090712177 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -528,13 +528,8 @@ static bool can_demote(int nid, struct scan_control *sc)
 {
 	if (!numa_demotion_enabled)
 		return false;
-	if (sc) {
-		if (sc->no_demotion)
-			return false;
-		/* It is pointless to do demotion in memcg reclaim */
-		if (cgroup_reclaim(sc))
-			return false;
-	}
+	if (sc && sc->no_demotion)
+		return false;
 	if (next_demotion_node(nid) == NUMA_NO_NODE)
 		return false;

-- 
2.36.1