linux-kernel - [PATCH mm-unstable v1 0/8] mm: multi-gen LRU: memcg LRU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-Id: <20221201223923.873696-1-yuzhao@google.com>
Date:   Thu,  1 Dec 2022 15:39:16 -0700
From:   Yu Zhao <yuzhao@...gle.com>
To:     Andrew Morton <akpm@...ux-foundation.org>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Michael Larabel <Michael@...haellarabel.com>,
        Michal Hocko <mhocko@...nel.org>,
        Mike Rapoport <rppt@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Suren Baghdasaryan <surenb@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, linux-mm@...gle.com,
        Yu Zhao <yuzhao@...gle.com>
Subject: [PATCH mm-unstable v1 0/8] mm: multi-gen LRU: memcg LRU

An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).

Its goal is to improve the scalability of global reclaim, which is
critical to systemwide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.

Its memory bloat is a pointer to each LRU vector and negligible to
each node. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not
affect the worst-case complexity O(n). Therefore, on average, it has
a sublinear complexity in contrast to the current linear complexity.

The basic structure of an memcg LRU can be understood by an analogy
to the active/inactive LRU (of folios):
1. It has the young and the old (generations);
2. Its linked lists have the head and the tail;
3. The increment of max_seq triggers promotion;
4. Other events, e.g., offlining an memcg, triggers similar
   operations.

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out and
   reduces latency without affecting fairness over some time.

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221201223923.873696-7-yuzhao@google.com/

The following is a simple test to quickly verify its effectiveness.
More benchmarks are coming soon.

  Test design:
  1. Create multiple memcgs.
  2. Each memcg contains a job (fio).
  3. All jobs access the same amount of memory randomly.
  4. The system does not experience global memory pressure.
  5. Periodically write to the root memory.reclaim.

  Desired outcome:
  1. All memcgs have similar pgsteal, i.e.,
     stddev(pgsteal)/mean(pgsteal) is close to 0%.
  2. The total pgsteal is close to the total requested through
     memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to
     100%.

  Actual outcome [1]:
             stddev(pgsteal)/mean(pgsteal) sum(pgsteal)/sum(requested)
  MGLRU off  75%                           425%
  MGLRU on   20%                           95%

  ####################################################################
  MEMCGS=128

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      mkdir /sys/fs/cgroup/memcg$memcg
  done

  start() {
      echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

      fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
          --filename=/dev/zero --size=1920M --rw=randrw \
          --rate=64m,64m --random_distribution=random \
          --fadvise_hint=0 --time_based --runtime=10h \
          --group_reporting --minimal
  }

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      start &
  done

  sleep 600

  for ((i = 0; i < 600; i++)); do
      echo 256m >/sys/fs/cgroup/memory.reclaim
      sleep 6
  done

  for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
      grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
  done
  ####################################################################

[1]: This was obtained from running the above script (touches less
     than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
     hour.

Yu Zhao (8):
  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
  mm: multi-gen LRU: remove eviction fairness safeguard
  mm: multi-gen LRU: remove aging fairness safeguard
  mm: multi-gen LRU: shuffle should_run_aging()
  mm: multi-gen LRU: per-node lru_gen_folio lists
  mm: multi-gen LRU: clarify scan_control flags
  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

 Documentation/mm/multigen_lru.rst |   8 +-
 include/linux/memcontrol.h        |  10 +
 include/linux/mm_inline.h         |  25 +-
 include/linux/mmzone.h            | 127 ++++-
 mm/memcontrol.c                   |  16 +
 mm/page_alloc.c                   |   1 +
 mm/vmscan.c                       | 765 ++++++++++++++++++++----------
 mm/workingset.c                   |   4 +-
 8 files changed, 687 insertions(+), 269 deletions(-)

-- 
2.39.0.rc0.267.gcb52ba06e7-goog