Message-Id: <20221222041905.2431096-1-yuzhao@google.com>
Date: Wed, 21 Dec 2022 21:18:58 -0700
From: Yu Zhao <yuzhao@...gle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Johannes Weiner <hannes@...xchg.org>,
Jonathan Corbet <corbet@....net>,
Michael Larabel <michael@...haellarabel.com>,
Michal Hocko <mhocko@...nel.org>,
Mike Rapoport <rppt@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Suren Baghdasaryan <surenb@...gle.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-mm@...gle.com,
Yu Zhao <yuzhao@...gle.com>
Subject: [PATCH mm-unstable v3 0/8] mm: multi-gen LRU: memcg LRU
What's new
==========
1. Rebased to the latest mm-unstable and resolved the conflict with
commit 8032bf1233a7 ("treewide: use get_random_u32_below() instead
of deprecated function").
2. Added two comprehensive benchmarks:
https://lore.kernel.org/r/20221220214923.1229538-1-yuzhao@google.com/
https://lore.kernel.org/r/20221221000748.1374772-1-yuzhao@google.com/
Overview
========
A memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).
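
In code, the shape is roughly the following. This is a simplified
sketch of what patch 6 (linked below) adds to struct pglist_data,
not the exact declaration:

struct lru_gen_memcg {
        /* the per-node memcg generation counter */
        unsigned long seq;
        /* the number of memcgs in each generation */
        unsigned long nr_memcgs[MAX_NR_GENS];
        /* per generation, sharded lists of per-memcg lru_gen_folio */
        struct hlist_nulls_head fifo[MAX_NR_GENS][MEMCG_NR_BINS];
        /* protects the above */
        spinlock_t lock;
};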
Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.
Its memory overhead is one pointer added to each lruvec and a
negligible amount added to each pglist_data. In terms of traversing
memcgs during global reclaim, it improves the best-case complexity
from O(n) to O(1): a reclaimer can stop at the first memcg that
yields enough reclaimed folios, rather than iterating over all of
them. It does not affect the worst-case complexity, which remains
O(n). Therefore, on average, it has sublinear complexity, in contrast
to the current linear complexity.
The basic structure of a memcg LRU can be understood by an analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations), i.e., the counterparts
to the active and the inactive;
2. The increment of max_seq triggers promotion, i.e., the counterpart
to activation;
3. Other events trigger similar operations, e.g., offlining a memcg
   triggers demotion, i.e., the counterpart to deactivation (see the
   sketch below).
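
As a rough illustration of how these events map onto operations on
the memcg LRU (the snippet is a simplified sketch, and the function
name below is hypothetical; the real plumbing is in patch 6):

enum {
        MEMCG_LRU_OLD,          /* demote, e.g., when a memcg is offlined */
        MEMCG_LRU_YOUNG,        /* promote, when max_seq is incremented */
};

/* hypothetical signature; see patch 6 for the actual interface */
void memcg_lru_rotate(struct lruvec *lruvec, int op);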
In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
   and reduces latency without affecting fairness over the long run
   (both features are sketched below).
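
A sketch of both features together, assuming the lru_gen_memcg layout
from the overview. get_random_u32_below() is the helper mentioned in
the rebase note above; walk_old_memcgs() and scan_memcgs() are
illustrative names, not the series' exact code:

static void walk_old_memcgs(struct lru_gen_memcg *mlru)
{
        /* the generation currently being reclaimed, i.e., the old one */
        int gen = READ_ONCE(mlru->seq) % MEMCG_NR_GENS;
        /* sharding: each reclaimer starts at a random bin */
        int first_bin = get_random_u32_below(MEMCG_NR_BINS);
        int bin = first_bin;

        do {
                struct hlist_nulls_head *head = &mlru->fifo[gen][bin];

                /*
                 * Scan the memcgs on this list; each scanned memcg is
                 * moved to the young generation, so a reclaimer that
                 * bails out early leaves the rest of the old generation
                 * to later reclaimers -- hence eventual fairness.
                 */
                scan_memcgs(head);      /* hypothetical; see patch 6 */

                bin = (bin + 1) % MEMCG_NR_BINS;
        } while (bin != first_bin);
}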
The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/
The following is a simple test to quickly verify its effectiveness.
Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.
Desired outcome:
1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
over mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
to 100%.
Actual outcome [1]:
                                    MGLRU off    MGLRU on
  stddev(pgsteal) / mean(pgsteal)      75%          20%
  sum(pgsteal) / sum(requested)       425%          95%
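To compute the first metric from the per-memcg pgsteal counts that
the script below prints, a throwaway helper like the following can be
used. It is not part of the series: build it with "cc -o cv cv.c -lm",
extract the counts from the grep output (e.g., with awk '{print $2}'),
and pipe them in, one per line.

/*
 * Throwaway helper, not part of the series: reads one pgsteal count
 * per line from stdin and prints stddev/mean as a percentage.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
        double x, sum = 0, sumsq = 0;
        long n = 0;

        while (scanf("%lf", &x) == 1) {
                sum += x;
                sumsq += x * x;
                n++;
        }
        if (!n)
                return 1;

        double mean = sum / n;
        /* population variance; clamp tiny negative rounding errors */
        double var = sumsq / n - mean * mean;

        printf("stddev/mean: %.0f%%\n", 100 * sqrt(var > 0 ? var : 0) / mean);
        return 0;
}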
####################################################################
MEMCGS=128

# create one memcg per fio job
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done

start() {
    # move this shell into its memcg, then run fio from within it
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

    fio --name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done

# let the jobs warm up for 10 minutes
sleep 600

# request 256MB of reclaim from the root memcg every 6 seconds for 1 hour
for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done

# collect the per-memcg pgsteal counts
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################
[1]: This was obtained by running the above script, which touches
     less than 256GB of memory, on an EPYC 7B13 with 512GB of DRAM
     for over an hour.
Yu Zhao (8):
mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
mm: multi-gen LRU: remove eviction fairness safeguard
mm: multi-gen LRU: remove aging fairness safeguard
mm: multi-gen LRU: shuffle should_run_aging()
mm: multi-gen LRU: per-node lru_gen_folio lists
mm: multi-gen LRU: clarify scan_control flags
mm: multi-gen LRU: simplify arch_has_hw_pte_young() check
Documentation/mm/multigen_lru.rst | 8 +-
include/linux/memcontrol.h | 10 +
include/linux/mm_inline.h | 25 +-
include/linux/mmzone.h | 131 ++++-
mm/memcontrol.c | 16 +
mm/page_alloc.c | 1 +
mm/vmscan.c | 769 ++++++++++++++++++++----------
mm/workingset.c | 4 +-
8 files changed, 693 insertions(+), 271 deletions(-)
--
2.39.0.314.g84b9a713c41-goog