linux-kernel - [RFC PATCH v3 0/6] Refault distance update with MGLRU support

Message-ID: <20230920190244.16839-1-ryncsn@gmail.com>
Date:   Thu, 21 Sep 2023 03:02:38 +0800
From:   Kairui Song <ryncsn@...il.com>
To:     linux-mm@...ck.org
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Yu Zhao <yuzhao@...gle.com>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...e.com>,
        Hugh Dickins <hughd@...gle.com>, Nhat Pham <nphamcs@...il.com>,
        Yuanchu Xie <yuanchu@...gle.com>,
        Kalesh Singh <kaleshsingh@...gle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        "T . J . Mercier" <tjmercier@...gle.com>,
        linux-kernel@...r.kernel.org, Kairui Song <kasong@...cent.com>
Subject: [RFC PATCH v3 0/6] Refault distance update with MGLRU support

From: Kairui Song <kasong@...cent.com>

I noticed that MGLRU does not work well on some workloads under heavy
memory stress.

After some debugging, I found this is related to refault distance
detection. It happens when the file page workingset size exceeds total
memory and the access distance of file pages (the left-shift time of a
page before it gets activated or promoted, considering the LRU starts
from the right) is larger than total memory. All file pages are then
stuck in the oldest generation, getting read in and evicted repeatedly;
few get activated and stay in memory.
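
To make the failure mode concrete, here is a toy user-space model, not
kernel code. It assumes a single plain LRU list, a cyclic scan over a
working set larger than memory, and a refault distance defined as the
number of evictions between a page's eviction and its refault. The
function name and the shadow bookkeeping are made up for illustration.

```python
from collections import OrderedDict


def simulate_cyclic_lru(capacity, working_set, rounds):
    """Scan pages 0..working_set-1 cyclically through a plain LRU.

    Returns (hits, misses, refault_distances).
    """
    resident = OrderedDict()  # page -> None, least recently used first
    shadows = {}              # evicted page -> eviction counter snapshot
    evictions = 0             # toy stand-in for a non-resident age counter
    hits = 0
    misses = 0
    distances = []

    for _round in range(rounds):
        for page in range(working_set):
            if page in resident:
                hits += 1
                resident.move_to_end(page)  # mark as most recently used
                continue

            misses += 1
            if page in shadows:
                # Refault: count evictions since this page was evicted.
                distances.append(evictions - shadows.pop(page))

            if len(resident) >= capacity:
                victim = resident.popitem(last=False)[0]
                shadows[victim] = evictions  # snapshot before aging
                evictions += 1
            resident[page] = None

    return hits, misses, distances


if __name__ == "__main__":
    hits, misses, distances = simulate_cyclic_lru(4, 6, 3)
    print("hits:", hits, "misses:", misses)
    print("refault distances:", sorted(set(distances)))
```

In this model, once the working set exceeds the capacity every access
is a miss, and every refault distance equals the working set size minus
the capacity. No page ever gets a second access while resident, so a
policy that relies on that second access never protects any of them.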

This series tries to fix the problem by reworking the refault distance
based activation to better fit MGLRU, and also tries to use a unified
algorithm for both MGLRU and the Inactive/Active LRU. Performance
improved several times over in the benchmark below for the workloads
that did not work well previously.

Patch 1/6 reworks the refault distance detection model for the
Inactive/Active LRU and updates the comments.

Patch 2/6 splits the code logic into a helper, in preparation for MGLRU.

Patches 3/6, 4/6 and 5/6 are code simplifications and updates for MGLRU.

Patch 6/6 applies the modified refault distance algorithm to MGLRU.

The following benchmark shows a roughly 4.4x improvement in executed
transactions.
To simulate the workload, I set up a 3-replica MongoDB cluster using
Docker, each replica in a standalone cgroup and configured with 5GB of
WiredTiger cache and 10GB of oplog, on a 32G VM. The benchmark is run with
https://github.com/apavlo/py-tpcc.git, modified to run only the
STOCK_LEVEL query, to simulate slow queries and get a stable result.

Before (with ZRAM enabled; the result is the same whether or not
any kind of swap is on):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 919 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     577             27584645283.7   0.02 txn/s
------------------------------------------------------------------
  TOTAL           577             27584645283.7   0.02 txn/s

Patched (with ZRAM enabled):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 905 seconds
------------------------------------------------------------------
                  Executed        Time (µs)       Rate
  STOCK_LEVEL     2542            27121571486.2   0.09 txn/s
------------------------------------------------------------------
  TOTAL           2542            27121571486.2   0.09 txn/s

Throughput is roughly 4.4x that of the unpatched kernel (2542 vs. 577
transactions). Testing with lower stress and with some other benchmarks
also shows a slight improvement or equivalent performance (e.g. fio
tests show an observable performance gain).
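
The reported figures can be cross-checked with a few lines of Python.
This is only a sanity check on the numbers in the two result tables. It
assumes the Rate column is Executed divided by Time converted to
seconds, which is consistent with the printed values; the variable and
function names are made up for illustration.

```python
# Figures copied from the two tpcc.py result tables above.
before = {"executed": 577, "time_us": 27584645283.7}
after = {"executed": 2542, "time_us": 27121571486.2}


def txn_rate(result):
    """Transactions per second, assuming Rate = Executed / Time."""
    return result["executed"] / (result["time_us"] / 1e6)


count_ratio = after["executed"] / before["executed"]
rate_ratio = txn_rate(after) / txn_rate(before)

print(f"before: {txn_rate(before):.2f} txn/s, after: {txn_rate(after):.2f} txn/s")
print(f"executed ratio: {count_ratio:.2f}x, rate ratio: {rate_ratio:.2f}x")
```

Both ratios come out between 4x and 5x, whichever of the two columns is
used.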

Sending this out as an RFC. I'm still doing more testing, since this
changes a frequently used algorithm and I'm not sure whether there is any
performance regression in the long term. It should improve performance
for file pages in general even under low memory pressure, since it
saves some cgroup iterations and atomic operations.

Update from V2:
- Rebased to latest mm-stable and redid some tests.
- Split the algorithm change into a standalone patch, as
  suggested by Johannes Weiner.

Update from V1:
- Removed the fls operations previously used in patch 1 for
  protecting active pages by an exponential ratio; simply comparing with
  the number of inactive pages seems good enough.
- Updated some benchmark results; test results that are basically
  identical to before are not updated.

Kairui Song (6):
  workingset: simplify and use a more intuitive model
  workingset: move refault distance checking into to a helper
  workignset: simplify the initilization code
  workingset: simplify lru_gen_test_recent
  mm, lru_gen: convert avg_total and avg_refaulted to atomic
  workingset, lru_gen: apply refault-distance based re-activation

 include/linux/mmzone.h |   4 +-
 include/linux/swap.h   |   2 -
 mm/swap.c              |   1 -
 mm/vmscan.c            |  30 +--
 mm/workingset.c        | 416 +++++++++++++++++++++--------------------
 5 files changed, 236 insertions(+), 217 deletions(-)

-- 
2.41.0