Message-ID: <Z1PSn79GPcCxeI_g@google.com>
Date: Fri, 6 Dec 2024 21:44:15 -0700
From: Yu Zhao <yuzhao@...gle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Kairui Song <kasong@...cent.com>,
Kalesh Singh <kaleshsingh@...gle.com>
Subject: Re: [PATCH mm-unstable v2 6/6] mm/mglru: rework workingset protection
On Thu, Dec 05, 2024 at 05:31:26PM -0700, Yu Zhao wrote:
> With the aging feedback no longer considering the distribution of
> folios in each generation, rework workingset protection to better
> distribute folios across MAX_NR_GENS. This is achieved by reusing
> PG_workingset and PG_referenced/LRU_REFS_FLAGS in a slightly different
> way.
>
> For folios accessed multiple times through file descriptors, make
> lru_gen_inc_refs() set additional bits of LRU_REFS_WIDTH in
> folio->flags after PG_referenced, then PG_workingset after
> LRU_REFS_WIDTH. After all its bits are set, i.e.,
> LRU_REFS_FLAGS|BIT(PG_workingset), a folio is lazily promoted into the
> second oldest generation in the eviction path. And when
> folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
> lru_gen_inc_refs() can start over. For this case, LRU_REFS_MASK is
> only valid when PG_referenced is set.
>
> For folios accessed multiple times through page tables,
> folio_update_gen() from a page table walk or lru_gen_set_refs() from a
> rmap walk sets PG_referenced after the accessed bit is cleared for the
> first time. Thereafter, those two paths set PG_workingset and promote
> folios to the youngest generation. Like folio_inc_gen(), when
> folio_update_gen() does that, it also clears PG_referenced. For this
> case, LRU_REFS_MASK is not used.
>
> For both of the cases, after PG_workingset is set on a folio, it
> remains until this folio is either reclaimed, or "deactivated" by
> lru_gen_clear_refs(). It can be set again if lru_gen_test_recent()
> returns true upon a refault.
>
> When adding folios to the LRU lists, lru_gen_distance() distributes
> them as follows:
> +---------------------------------+---------------------------------+
> | Accessed thru page tables | Accessed thru file descriptors |
> +---------------------------------+---------------------------------+
> | PG_active (set while isolated) | |
> +----------------+----------------+----------------+----------------+
> | PG_workingset | PG_referenced | PG_workingset | LRU_REFS_FLAGS |
> +---------------------------------+---------------------------------+
> |<--------- MIN_NR_GENS --------->| |
> |<-------------------------- MAX_NR_GENS -------------------------->|
>
> After this patch, some typical client and server workloads showed
> improvements under heavy memory pressure. For example, Python TPC-C,
> which was used to benchmark a different approach [1] to better detect
> refault distances, showed a significant decrease in total refaults:
> Before After Change
> Time (seconds) 10801 10801 0%
> Executed (transactions) 41472 43663 +5%
> workingset_nodes 109070 120244 +10%
> workingset_refault_anon 5019627 7281831 +45%
> workingset_refault_file 1294678786 554855564 -57%
> workingset_refault_total 1299698413 562137395 -57%
>
> [1] https://lore.kernel.org/20230920190244.16839-1-ryncsn@gmail.com/
>
> Reported-by: Kairui Song <kasong@...cent.com>
> Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
> Signed-off-by: Yu Zhao <yuzhao@...gle.com>
> Tested-by: Kalesh Singh <kaleshsingh@...gle.com>
> ---
> include/linux/mm_inline.h | 94 +++++++++++++------------
> include/linux/mmzone.h | 82 +++++++++++++---------
> mm/swap.c | 23 +++---
> mm/vmscan.c | 142 +++++++++++++++++++++++---------------
> mm/workingset.c | 29 ++++----
> 5 files changed, 209 insertions(+), 161 deletions(-)
Some outlier results from LULESH (Livermore Unstructured Lagrangian
Explicit Shock Hydrodynamics) [1] caught my eye. The following fix
made the benchmark a lot happier (128GB DRAM + Optane swap):
Before After Change
Average (z/s) 6894 7574 +10%
Deviation (10 samples) 12.96% 1.76% -86%
[1] https://asc.llnl.gov/codes/proxy-apps/lulesh
Andrew, can you please fold it in? Thanks!
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 90bbc2b3be8b..5e03a61c894f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -916,8 +916,7 @@ static enum folio_references folio_check_references(struct folio *folio,
if (!referenced_ptes)
return FOLIOREF_RECLAIM;
- lru_gen_set_refs(folio);
- return FOLIOREF_ACTIVATE;
+ return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;
}
referenced_folio = folio_test_clear_referenced(folio);
@@ -4173,11 +4172,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
old_gen = folio_update_gen(folio, new_gen);
if (old_gen >= 0 && old_gen != new_gen)
update_batch_size(walk, folio, old_gen, new_gen);
-
- continue;
- }
-
- if (lru_gen_set_refs(folio)) {
+ } else if (lru_gen_set_refs(folio)) {
old_gen = folio_lru_gen(folio);
if (old_gen >= 0 && old_gen != new_gen)
folio_activate(folio);