Message-ID: <CAOUHufavCOqwkm4BJJzHY+RUOafFBLH7t0O+KRbw=ns-RdYwdA@mail.gmail.com>
Date: Fri, 15 Dec 2023 00:23:52 -0700
From: Yu Zhao <yuzhao@...gle.com>
To: Henry Huang <henry.hj@...group.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
谈鉴锋 <henry.tjf@...group.com>,
朱辉(茶水) <teawater@...group.com>,
akpm@...ux-foundation.org
Subject: Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap
On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@...group.com> wrote:
>
> The Multi-Gen LRU page-table walker clears the pte young flag, but it
> doesn't clear the page idle flag. When we use /sys/kernel/mm/page_idle/bitmap
> to check whether a page has been accessed, it reports the page as idle
> even though the page has actually been accessed.
>
> For unmapped filecache pages, the page idle flag is not cleared in
> folio_mark_accessed() if Multi-Gen LRU is enabled. So we can't use
> /sys/kernel/mm/page_idle/bitmap to check whether a filecache page has
> been read or written.
>
> What's more, /sys/kernel/mm/page_idle/bitmap also clears the pte young
> flag; when a page has been accessed, it records that by setting the page
> young flag. The Multi-Gen LRU page-table walker should therefore check
> both the page and the pte young flags.
>
> how-to-reproduce-problem
>
> idle_page_track
> a tool to track the memory a process accesses during a specific time window
> usage
> idle_page_track $pid $time
> how-it-works
> 1. scan the process's VMAs from /proc/$pid/maps
> 2. translate vfn --> pfn via /proc/$pid/pagemap
> 3. write /sys/kernel/mm/page_idle/bitmap to set the physical page's
> idle flag and clear the pte young flag
> 4. sleep $time
> 5. read /sys/kernel/mm/page_idle/bitmap, which test_and_clears the
> pte young flag and reports whether the physical page was accessed
> (see the sketch after this list)
>
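For readers unfamiliar with the two interfaces used in steps 2, 3 and 5, here is a
minimal sketch (assuming root, 4 KiB pages, and with error handling and the
surrounding per-VMA loop omitted; idle_page_track itself is not included in this
mail, so the helper names below are made up):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SHIFT 12			/* assumes 4 KiB pages */
#define PFN_MASK   ((1ULL << 55) - 1)	/* pagemap bits 0-54 hold the pfn */

/* step 2: virtual address -> pfn via /proc/$pid/pagemap */
static uint64_t vaddr_to_pfn(int pagemap_fd, unsigned long vaddr)
{
	uint64_t ent = 0;
	off_t off = (vaddr >> PAGE_SHIFT) * sizeof(ent);

	if (pread(pagemap_fd, &ent, sizeof(ent), off) != (ssize_t)sizeof(ent))
		return 0;
	return (ent & (1ULL << 63)) ? (ent & PFN_MASK) : 0;	/* bit 63: present */
}

/* step 3: set the idle bit for a pfn; each u64 in the bitmap covers 64 pfns */
static void mark_idle(int bitmap_fd, uint64_t pfn)
{
	uint64_t word = 1ULL << (pfn % 64);

	pwrite(bitmap_fd, &word, sizeof(word), (pfn / 64) * sizeof(word));
}

/* step 5: the bit is still 1 only if the page was not accessed meanwhile */
static int still_idle(int bitmap_fd, uint64_t pfn)
{
	uint64_t word = 0;

	pread(bitmap_fd, &word, sizeof(word), (pfn / 64) * sizeof(word));
	return !!(word & (1ULL << (pfn % 64)));
}

The pagemap and bitmap fds here would be opened from /proc/$pid/pagemap and
/sys/kernel/mm/page_idle/bitmap; presumably a page counts toward Ref(MB) when
still_idle() returns 0 after the sleep.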
> test ---- test program
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, const char *argv[])
> {
> 	char *buf = NULL;
> 	char pipe_info[4096];
> 	int n;
> 	int fd = -1;
>
> 	/* allocate and touch a 1 GiB buffer */
> 	buf = malloc(1024*1024*1024UL);
> 	memset(buf, 0, 1024*1024*1024UL);
>
> 	fd = open("access.pipe", O_RDONLY);
> 	if (fd < 0)
> 		goto out;
>
> 	while (1) {
> 		n = read(fd, pipe_info, sizeof(pipe_info));
> 		if (!n) {
> 			sleep(1);
> 			continue;
> 		} else if (n < 0) {
> 			break;
> 		}
> 		/* every message on the pipe re-touches the whole buffer */
> 		memset(buf, 0, 1024*1024*1024UL);
> 		puts("finish access");
> 	}
> out:
> 	if (fd >= 0)
> 		close(fd);
> 	if (buf)
> 		free(buf);
>
> 	return 0;
> }
>
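(Side note: every message written to access.pipe makes the program touch the
whole 1 GiB buffer again, so a tracker that works correctly should report on
the order of 1024 MB referenced per trigger; that is the expected Ref(MB)
value in the cases below.)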
> prepare:
> mkfifo access.pipe
> ./test
> ps -ef | grep test
> root 4106 3148 8 06:47 pts/0 00:00:01 ./test
>
> We use /sys/kernel/debug/lru_gen to drive the mglru page-table scan manually.
>
> case 1: the mglru walker breaks page_idle
> ./idle_page_track 4106 60 &
> sleep 5; echo 1 > access.pipe
> sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen
>
> the output of idle_page_track is:
> Est(s) Ref(MB)
> 64.822 1.00
> Only 1 MB was reported as accessed during the 64.822 s window, but
> 1024 MB was actually accessed.
>
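(For anyone following along: if I remember the MGLRU debugfs interface
correctly, the write format is "+ memcg_id node_id max_gen_nr [can_swap
[force_scan]]", so '+ 8 0 6 1 1' ages memcg 8 on node 0 up to generation 6
with can_swap and force_scan enabled.)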
> case 2: page_idle breaks the mglru walker
> echo 1 > access.pipe
> ./idle_page_track 4106 10
> echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen
> lru gen status:
> memcg 8 /user.slice
> node 0
> 5 772458 1065 9735
> 6 737435 262244 72
> 7 538053 1184 632
> 8 59404 6422 0
> Almost all pages should be in the max_seq-1 generation, but actually they are not.
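(If I read the lru_gen debugfs output correctly, each generation line shows
the generation number followed by its age in milliseconds, the number of
anonymous pages, and the number of file pages.)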
>
> Signed-off-by: Henry Huang <henry.hj@...group.com>
Regarding the change itself, it'd cause a slight regression to other
use cases (details below).
> @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>  		unsigned long pfn;
>  		struct folio *folio;
>  		pte_t ptent = ptep_get(pte + i);
> +		bool is_pte_young;
>
>  		total++;
>  		walk->mm_stats[MM_LEAF_TOTAL]++;
> @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>  		if (pfn == -1)
>  			continue;
>
> -		if (!pte_young(ptent)) {
> -			walk->mm_stats[MM_LEAF_OLD]++;
Most overhead from page table scanning normally comes from
get_pfn_folio() because it almost always causes a cache miss. This is
like a pointer dereference, whereas scanning PTEs is like streaming an
array (bad vs good cache performance).
pte_young() is here to avoid an unnecessary cache miss from
get_pfn_folio(). Also see the first comment in get_pfn_folio(). It
should be easy to verify the regression -- FlameGraph from the
memcached benchmark in the original commit message should do it.
Would a tracepoint here work for you?
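To make the ordering concern above concrete, here is a standalone mock of the
two loop bodies; the pte/folio types and the lookup helper below are
illustrative stand-ins, not the kernel's:

#include <stdbool.h>
#include <stddef.h>

struct folio { bool young; };
struct pte   { bool young; struct folio *folio; };

/* stand-in for get_pfn_folio(): the pointer chase that tends to miss cache */
static struct folio *lookup_folio(const struct pte *pte)
{
	return pte->folio;
}

/* current order: an old PTE fails the cheap flag test and never pays for
 * the folio lookup */
static int scan_current(struct pte *ptes, size_t n)
{
	int young = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!ptes[i].young)
			continue;		/* no lookup_folio() at all */
		if (lookup_folio(&ptes[i]))
			young++;
	}
	return young;
}

/* proposed order: every present PTE pays for the folio lookup so the folio
 * young flag can be consulted too -- one extra potential cache miss per
 * old PTE */
static int scan_proposed(struct pte *ptes, size_t n)
{
	int young = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		struct folio *folio = lookup_folio(&ptes[i]);

		if (!folio)
			continue;
		if (folio->young || ptes[i].young) {
			folio->young = false;	/* like folio_test_clear_young() */
			young++;
		}
	}
	return young;
}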
> +		is_pte_young = !!pte_young(ptent);
> +		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young);
> +		if (!folio) {
> +			if (!is_pte_young)
> +				walk->mm_stats[MM_LEAF_OLD]++;
>  			continue;
>  		}
>
> -		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> -		if (!folio)
> +		if (!folio_test_clear_young(folio) && !is_pte_young) {
> +			walk->mm_stats[MM_LEAF_OLD]++;
>  			continue;
> +		}
>
> -		if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> +		if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i))
>  			VM_WARN_ON_ONCE(true);
>
>  		young++;