lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAMgjq7D5b8d8o+imkLyOwOj0dZyYOA2ZaoY9_XF13Nf0OgYjbQ@mail.gmail.com>
Date: Wed, 8 Jan 2025 01:14:58 +0800
From: Kairui Song <ryncsn@...il.com>
To: Yu Zhao <yuzhao@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, David Stevens <stevensd@...omium.org>, 
	Kalesh Singh <kaleshsingh@...gle.com>
Subject: Re: [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback

On Tue, Dec 31, 2024 at 12:36 PM Yu Zhao <yuzhao@...gle.com> wrote:

Hi Yu,

>
> The aging feedback is based on both the number of generations and the
> distribution of folios in each generation. The number of generations
> is currently the distance between max_seq and anon min_seq. This is
> because anon min_seq is not allowed to move past file min_seq. The
> rationale for that is that file is always evictable whereas anon is
> not. However, for use cases where anon is a lot cheaper than file:
> 1. Anon in the second oldest generation can be a better choice than
>    file in the oldest generation.
> 2. A large amount of file in the oldest generation can skew the
>    distribution, making should_run_aging() return false negative.
>
> Allow anon and file min_seq to move independently, and use solely the
> number of generations as the feedback for aging. Specifically, when
> both anon and file are evictable, anon min_seq can now be greater than
> file min_seq, and therefore the number of generations becomes the
> distance between max_seq and min(min_seq[0],min_seq[1]). And
> should_run_aging() returns true if and only if the number of
> generations is less than MAX_NR_GENS.
>
> As the first step to the final optimization, this change by itself
> should not have userspace-visiable effects beyond performance. The
> next twos patch will take advantage of this change; the last patch in
> this series will better distribute folios across MAX_NR_GENS.
>
> Reported-by: David Stevens <stevensd@...omium.org>
> Signed-off-by: Yu Zhao <yuzhao@...gle.com>
> Tested-by: Kalesh Singh <kaleshsingh@...gle.com>
> ---
>  include/linux/mmzone.h |  17 ++--
>  mm/vmscan.c            | 200 ++++++++++++++++++-----------------------
>  2 files changed, 96 insertions(+), 121 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b36124145a16..8245ecb0400b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -421,12 +421,11 @@ enum {
>  /*
>   * The youngest generation number is stored in max_seq for both anon and file
>   * types as they are aged on an equal footing. The oldest generation numbers are
> - * stored in min_seq[] separately for anon and file types as clean file pages
> - * can be evicted regardless of swap constraints.
> - *
> - * Normally anon and file min_seq are in sync. But if swapping is constrained,
> - * e.g., out of swap space, file min_seq is allowed to advance and leave anon
> - * min_seq behind.
> + * stored in min_seq[] separately for anon and file types so that they can be
> + * incremented independently. Ideally min_seq[] are kept in sync when both anon
> + * and file types are evictable. However, to adapt to situations like extreme
> + * swappiness, they are allowed to be out of sync by at most
> + * MAX_NR_GENS-MIN_NR_GENS-1.
>   *
>   * The number of pages in each generation is eventually consistent and therefore
>   * can be transiently negative when reset_batch_size() is pending.
> @@ -446,8 +445,8 @@ struct lru_gen_folio {
>         unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
>         /* the exponential moving average of evicted+protected */
>         unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
> -       /* the first tier doesn't need protection, hence the minus one */
> -       unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
> +       /* can only be modified under the LRU lock */
> +       unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>         /* can be modified without holding the LRU lock */
>         atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> @@ -498,7 +497,7 @@ struct lru_gen_mm_walk {
>         int mm_stats[NR_MM_STATS];
>         /* total batched items */
>         int batched;
> -       bool can_swap;
> +       int swappiness;
>         bool force_scan;
>  };
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f236db86de8a..f767e3d34e73 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2627,11 +2627,17 @@ static bool should_clear_pmd_young(void)
>                 READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]),      \
>         }
>
> +#define evictable_min_seq(min_seq, swappiness)                         \
> +       min((min_seq)[!(swappiness)], (min_seq)[(swappiness) != MAX_SWAPPINESS])
> +
>  #define for_each_gen_type_zone(gen, type, zone)                                \
>         for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)                   \
>                 for ((type) = 0; (type) < ANON_AND_FILE; (type)++)      \
>                         for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
>
> +#define for_each_evictable_type(type, swappiness)                      \
> +       for ((type) = !(swappiness); (type) <= ((swappiness) != MAX_SWAPPINESS); (type)++)
> +
>  #define get_memcg_gen(seq)     ((seq) % MEMCG_NR_GENS)
>  #define get_memcg_bin(bin)     ((bin) % MEMCG_NR_BINS)
>
> @@ -2677,10 +2683,16 @@ static int get_nr_gens(struct lruvec *lruvec, int type)
>
>  static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
>  {
> -       /* see the comment on lru_gen_folio */
> -       return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
> -              get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
> -              get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
> +       int type;
> +
> +       for (type = 0; type < ANON_AND_FILE; type++) {
> +               int n = get_nr_gens(lruvec, type);
> +
> +               if (n < MIN_NR_GENS || n > MAX_NR_GENS)
> +                       return false;
> +       }
> +
> +       return true;
>  }
>
>  /******************************************************************************
> @@ -3087,9 +3099,8 @@ static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
>         pos->refaulted = lrugen->avg_refaulted[type][tier] +
>                          atomic_long_read(&lrugen->refaulted[hist][type][tier]);
>         pos->total = lrugen->avg_total[type][tier] +
> +                    lrugen->protected[hist][type][tier] +
>                      atomic_long_read(&lrugen->evicted[hist][type][tier]);
> -       if (tier)
> -               pos->total += lrugen->protected[hist][type][tier - 1];
>         pos->gain = gain;
>  }
>
> @@ -3116,17 +3127,15 @@ static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
>                         WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
>
>                         sum = lrugen->avg_total[type][tier] +
> +                             lrugen->protected[hist][type][tier] +
>                               atomic_long_read(&lrugen->evicted[hist][type][tier]);
> -                       if (tier)
> -                               sum += lrugen->protected[hist][type][tier - 1];
>                         WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
>                 }
>
>                 if (clear) {
>                         atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
>                         atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
> -                       if (tier)
> -                               WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
> +                       WRITE_ONCE(lrugen->protected[hist][type][tier], 0);
>                 }
>         }
>  }
> @@ -3261,7 +3270,7 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
>                 return true;
>
>         if (vma_is_anonymous(vma))
> -               return !walk->can_swap;
> +               return !walk->swappiness;
>
>         if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
>                 return true;
> @@ -3271,7 +3280,10 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
>                 return true;
>
>         if (shmem_mapping(mapping))
> -               return !walk->can_swap;
> +               return !walk->swappiness;
> +
> +       if (walk->swappiness == MAX_SWAPPINESS)
> +               return true;
>
>         /* to exclude special mappings like dax, etc. */
>         return !mapping->a_ops->read_folio;
> @@ -3359,7 +3371,7 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
>  }
>
>  static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
> -                                  struct pglist_data *pgdat, bool can_swap)
> +                                  struct pglist_data *pgdat)
>  {
>         struct folio *folio;
>
> @@ -3370,10 +3382,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
>         if (folio_memcg(folio) != memcg)
>                 return NULL;
>
> -       /* file VMAs can contain anon pages from COW */
> -       if (!folio_is_file_lru(folio) && !can_swap)
> -               return NULL;
> -
>         return folio;
>  }
>
> @@ -3429,7 +3437,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>                 if (pfn == -1)
>                         continue;
>
> -               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> +               folio = get_pfn_folio(pfn, memcg, pgdat);
>                 if (!folio)
>                         continue;
>
> @@ -3514,7 +3522,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
>                 if (pfn == -1)
>                         goto next;
>
> -               folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> +               folio = get_pfn_folio(pfn, memcg, pgdat);
>                 if (!folio)
>                         goto next;
>
> @@ -3726,22 +3734,26 @@ static void clear_mm_walk(void)
>                 kfree(walk);
>  }
>
> -static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> +static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
>  {
>         int zone;
>         int remaining = MAX_LRU_BATCH;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> +       int hist = lru_hist_from_seq(lrugen->min_seq[type]);
>         int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
>
> -       if (type == LRU_GEN_ANON && !can_swap)
> +       if (type ? swappiness == MAX_SWAPPINESS : !swappiness)
>                 goto done;
>
> -       /* prevent cold/hot inversion if force_scan is true */
> +       /* prevent cold/hot inversion if the type is evictable */
>         for (zone = 0; zone < MAX_NR_ZONES; zone++) {
>                 struct list_head *head = &lrugen->folios[old_gen][type][zone];
>
>                 while (!list_empty(head)) {
>                         struct folio *folio = lru_to_folio(head);
> +                       int refs = folio_lru_refs(folio);
> +                       int tier = lru_tier_from_refs(refs);
> +                       int delta = folio_nr_pages(folio);
>
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
>                         VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> @@ -3751,6 +3763,9 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
>                         new_gen = folio_inc_gen(lruvec, folio, false);
>                         list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
>
> +                       WRITE_ONCE(lrugen->protected[hist][type][tier],
> +                                  lrugen->protected[hist][type][tier] + delta);
> +
>                         if (!--remaining)
>                                 return false;
>                 }
> @@ -3762,7 +3777,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
>         return true;
>  }
>
> -static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> +static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
>  {
>         int gen, type, zone;
>         bool success = false;
> @@ -3772,7 +3787,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
>         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
>
>         /* find the oldest populated generation */
> -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +       for_each_evictable_type(type, swappiness) {
>                 while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
>                         gen = lru_gen_from_seq(min_seq[type]);
>
> @@ -3788,13 +3803,17 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
>         }
>
>         /* see the comment on lru_gen_folio */
> -       if (can_swap) {
> -               min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
> -               min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
> +       if (swappiness && swappiness != MAX_SWAPPINESS) {
> +               unsigned long seq = lrugen->max_seq - MIN_NR_GENS;
> +
> +               if (min_seq[LRU_GEN_ANON] > seq && min_seq[LRU_GEN_FILE] < seq)
> +                       min_seq[LRU_GEN_ANON] = seq;
> +               else if (min_seq[LRU_GEN_FILE] > seq && min_seq[LRU_GEN_ANON] < seq)
> +                       min_seq[LRU_GEN_FILE] = seq;
>         }
>
> -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> -               if (min_seq[type] == lrugen->min_seq[type])
> +       for_each_evictable_type(type, swappiness) {
> +               if (min_seq[type] <= lrugen->min_seq[type])
>                         continue;
>
>                 reset_ctrl_pos(lruvec, type, true);
> @@ -3805,8 +3824,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
>         return success;
>  }
>
> -static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> -                       bool can_swap, bool force_scan)
> +static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
>  {
>         bool success;
>         int prev, next;
> @@ -3824,13 +3842,11 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>         if (!success)
>                 goto unlock;
>
> -       for (type = ANON_AND_FILE - 1; type >= 0; type--) {
> +       for (type = 0; type < ANON_AND_FILE; type++) {
>                 if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
>                         continue;
>
> -               VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
> -
> -               if (inc_min_seq(lruvec, type, can_swap))
> +               if (inc_min_seq(lruvec, type, swappiness))
>                         continue;
>
>                 spin_unlock_irq(&lruvec->lru_lock);
> @@ -3874,7 +3890,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>  }
>
>  static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> -                              bool can_swap, bool force_scan)
> +                              int swappiness, bool force_scan)
>  {
>         bool success;
>         struct lru_gen_mm_walk *walk;
> @@ -3885,7 +3901,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>         VM_WARN_ON_ONCE(seq > READ_ONCE(lrugen->max_seq));
>
>         if (!mm_state)
> -               return inc_max_seq(lruvec, seq, can_swap, force_scan);
> +               return inc_max_seq(lruvec, seq, swappiness);
>
>         /* see the comment in iterate_mm_list() */
>         if (seq <= READ_ONCE(mm_state->seq))
> @@ -3910,7 +3926,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>
>         walk->lruvec = lruvec;
>         walk->seq = seq;
> -       walk->can_swap = can_swap;
> +       walk->swappiness = swappiness;
>         walk->force_scan = force_scan;
>
>         do {
> @@ -3920,7 +3936,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
>         } while (mm);
>  done:
>         if (success) {
> -               success = inc_max_seq(lruvec, seq, can_swap, force_scan);
> +               success = inc_max_seq(lruvec, seq, swappiness);
>                 WARN_ON_ONCE(!success);
>         }
>
> @@ -3961,13 +3977,13 @@ static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
>  {
>         int gen, type, zone;
>         unsigned long total = 0;
> -       bool can_swap = get_swappiness(lruvec, sc);
> +       int swappiness = get_swappiness(lruvec, sc);
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         DEFINE_MAX_SEQ(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
> -       for (type = !can_swap; type < ANON_AND_FILE; type++) {
> +       for_each_evictable_type(type, swappiness) {
>                 unsigned long seq;
>
>                 for (seq = min_seq[type]; seq <= max_seq; seq++) {
> @@ -3987,6 +4003,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
>  {
>         int gen;
>         unsigned long birth;
> +       int swappiness = get_swappiness(lruvec, sc);
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
> @@ -3996,8 +4013,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
>         if (!lruvec_is_sizable(lruvec, sc))
>                 return false;
>
> -       /* see the comment on lru_gen_folio */
> -       gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
> +       gen = lru_gen_from_seq(evictable_min_seq(min_seq, swappiness));
>         birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
>
>         return time_is_before_jiffies(birth + min_ttl);
> @@ -4064,7 +4080,6 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>         unsigned long addr = pvmw->address;
>         struct vm_area_struct *vma = pvmw->vma;
>         struct folio *folio = pfn_folio(pvmw->pfn);
> -       bool can_swap = !folio_is_file_lru(folio);
>         struct mem_cgroup *memcg = folio_memcg(folio);
>         struct pglist_data *pgdat = folio_pgdat(folio);
>         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> @@ -4117,7 +4132,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>                 if (pfn == -1)
>                         continue;
>
> -               folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
> +               folio = get_pfn_folio(pfn, memcg, pgdat);
>                 if (!folio)
>                         continue;
>
> @@ -4333,8 +4348,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                 gen = folio_inc_gen(lruvec, folio, false);
>                 list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
>
> -               WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
> -                          lrugen->protected[hist][type][tier - 1] + delta);
> +               WRITE_ONCE(lrugen->protected[hist][type][tier],
> +                          lrugen->protected[hist][type][tier] + delta);
>                 return true;
>         }
>
> @@ -4533,7 +4548,6 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
>  {
>         int i;
>         int type;
> -       int scanned;
>         int tier = -1;
>         DEFINE_MIN_SEQ(lruvec);
>
> @@ -4558,21 +4572,23 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
>         else
>                 type = get_type_to_scan(lruvec, swappiness, &tier);
>
> -       for (i = !swappiness; i < ANON_AND_FILE; i++) {
> +       for_each_evictable_type(i, swappiness) {

Thanks for working on solving the reported issues, but one concern
about this for_each_evictable_type  macro and its usage here.

It basically forbids eviction of file pages with "swappiness == 200"
even for global pressure, this is a quite a change.

For both active / inactive or MGLRU, max swappiness used to make
kernel try reclaim anon as much as possible, but still fall back to
file eviction. Forbidding file eviction may cause unsolvable OOM,
unlike anon pages, killing process won't necessarily release file
pages, so the system could hung easily.

For existing systems with swappiness == 200 which were running fine
before, may also hit OOM very quickly.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ