Message-ID: <CAOUHufavCOqwkm4BJJzHY+RUOafFBLH7t0O+KRbw=ns-RdYwdA@mail.gmail.com>
Date: Fri, 15 Dec 2023 00:23:52 -0700
From: Yu Zhao <yuzhao@...gle.com>
To: Henry Huang <henry.hj@...group.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
谈鉴锋 <henry.tjf@...group.com>,
朱辉(茶水) <teawater@...group.com>,
akpm@...ux-foundation.org
Subject: Re: [RFC v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap
On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@...group.com> wrote:
>
> The Multi-Gen LRU page-table walker clears the pte young flag, but it
> doesn't clear the page idle flag. When we use /sys/kernel/mm/page_idle/bitmap
> to check whether a page has been accessed, it reports the page as idle
> even though the page has actually been accessed.
>
> For unmapped filecache pages, the page idle flag is not cleared in
> folio_mark_accessed() if Multi-Gen LRU is enabled. So we can't use
> /sys/kernel/mm/page_idle/bitmap to check whether a filecache page has
> been read or written.
>
> What's more, /sys/kernel/mm/page_idle/bitmap also clears the pte young
> flag; when a page has been accessed, it records that by setting the page
> young flag. The Multi-Gen LRU page-table walker should therefore check
> both the page and the pte young flags.
>
> how-to-reproduce-problem
>
> idle_page_track
> a tool to track the memory a process accesses during a specific time window
> usage
> idle_page_track $pid $time
> how-it-works
> 1. scan the process's VMAs from /proc/$pid/maps
> 2. translate vfn --> pfn via /proc/$pid/pagemap
> 3. write /sys/kernel/mm/page_idle/bitmap to set the physical page's
> idle flag and clear the pte young flag
> 4. sleep $time
> 5. read /sys/kernel/mm/page_idle/bitmap, which test_and_clears the
> pte young flag and reports whether the physical page was accessed
> (see the sketch after this list)
>
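For readers unfamiliar with the two interfaces used in steps 2, 3 and 5, here is a
minimal sketch (assuming root, 4 KiB pages, and with error handling and the
surrounding per-VMA loop omitted; idle_page_track itself is not included in this
mail, so the helper names below are made up):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SHIFT 12			/* assumes 4 KiB pages */
#define PFN_MASK   ((1ULL << 55) - 1)	/* pagemap bits 0-54 hold the pfn */

/* step 2: virtual address -> pfn via /proc/$pid/pagemap */
static uint64_t vaddr_to_pfn(int pagemap_fd, unsigned long vaddr)
{
	uint64_t ent = 0;
	off_t off = (vaddr >> PAGE_SHIFT) * sizeof(ent);

	if (pread(pagemap_fd, &ent, sizeof(ent), off) != (ssize_t)sizeof(ent))
		return 0;
	return (ent & (1ULL << 63)) ? (ent & PFN_MASK) : 0;	/* bit 63: present */
}

/* step 3: set the idle bit for a pfn; each u64 in the bitmap covers 64 pfns */
static void mark_idle(int bitmap_fd, uint64_t pfn)
{
	uint64_t word = 1ULL << (pfn % 64);

	pwrite(bitmap_fd, &word, sizeof(word), (pfn / 64) * sizeof(word));
}

/* step 5: the bit is still 1 only if the page was not accessed meanwhile */
static int still_idle(int bitmap_fd, uint64_t pfn)
{
	uint64_t word = 0;

	pread(bitmap_fd, &word, sizeof(word), (pfn / 64) * sizeof(word));
	return !!(word & (1ULL << (pfn % 64)));
}

The pagemap and bitmap fds here would be opened from /proc/$pid/pagemap and
/sys/kernel/mm/page_idle/bitmap; presumably a page counts toward Ref(MB) when
still_idle() returns 0 after the sleep.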
> test ---- test program
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, const char *argv[])
> {
> 	char *buf = NULL;
> 	char pipe_info[4096];
> 	int n;
> 	int fd = -1;
>
> 	/* allocate and touch a 1 GiB buffer */
> 	buf = malloc(1024*1024*1024UL);
> 	memset(buf, 0, 1024*1024*1024UL);
>
> 	fd = open("access.pipe", O_RDONLY);
> 	if (fd < 0)
> 		goto out;
>
> 	while (1) {
> 		n = read(fd, pipe_info, sizeof(pipe_info));
> 		if (!n) {
> 			sleep(1);
> 			continue;
> 		} else if (n < 0) {
> 			break;
> 		}
> 		/* every message on the pipe re-touches the whole buffer */
> 		memset(buf, 0, 1024*1024*1024UL);
> 		puts("finish access");
> 	}
> out:
> 	if (fd >= 0)
> 		close(fd);
> 	if (buf)
> 		free(buf);
>
> 	return 0;
> }
>
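(Side note: every message written to access.pipe makes the program touch the
whole 1 GiB buffer again, so a tracker that works correctly should report on
the order of 1024 MB referenced per trigger; that is the expected Ref(MB)
value in the cases below.)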
> prepare:
> mkfifo access.pipe
> ./test
> ps -ef | grep test
> root 4106 3148 8 06:47 pts/0 00:00:01 ./test
>
> We use /sys/kernel/debug/lru_gen to drive the mglru page-table scan manually.
>
> case 1: the mglru walker breaks page_idle
> ./idle_page_track 4106 60 &
> sleep 5; echo 1 > access.pipe
> sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen
>
> the output of idle_page_track is:
> Est(s) Ref(MB)
> 64.822 1.00
> Only 1 MB was reported as accessed during the 64.822 s window, but
> 1024 MB was actually accessed.
>
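(For anyone following along: if I remember the MGLRU debugfs interface
correctly, the write format is "+ memcg_id node_id max_gen_nr [can_swap
[force_scan]]", so '+ 8 0 6 1 1' ages memcg 8 on node 0 up to generation 6
with can_swap and force_scan enabled.)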
> case 2: page_idle breaks the mglru walker
> echo 1 > access.pipe
> ./idle_page_track 4106 10
> echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen
> lru gen status:
> memcg 8 /user.slice
> node 0
> 5 772458 1065 9735
> 6 737435 262244 72
> 7 538053 1184 632
> 8 59404 6422 0
> Almost all pages should be in the max_seq-1 generation, but actually they are not.
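(If I read the lru_gen debugfs output correctly, each generation line shows
the generation number followed by its age in milliseconds, the number of
anonymous pages, and the number of file pages.)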
>
> Signed-off-by: Henry Huang <henry.hj@...group.com>
Regarding the change itself, it'd cause a slight regression to other
use cases (details below).
> @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>  		unsigned long pfn;
>  		struct folio *folio;
>  		pte_t ptent = ptep_get(pte + i);
> +		bool is_pte_young;
>
>  		total++;
>  		walk->mm_stats[MM_LEAF_TOTAL]++;
> @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>  		if (pfn == -1)
>  			continue;
>
> -		if (!pte_young(ptent)) {
> -			walk->mm_stats[MM_LEAF_OLD]++;
Most overhead from page table scanning normally comes from
get_pfn_folio() because it almost always causes a cache miss. This is
like a pointer dereference, whereas scanning PTEs is like streaming an
array (bad vs good cache performance).
pte_young() is here to avoid an unnecessary cache miss from
get_pfn_folio(). Also see the first comment in get_pfn_folio(). It
should be easy to verify the regression -- FlameGraph from the
memcached benchmark in the original commit message should do it.
Would a tracepoint here work for you?
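To make the ordering concern above concrete, here is a standalone mock of the
two loop bodies; the pte/folio types and the lookup helper below are
illustrative stand-ins, not the kernel's:

#include <stdbool.h>
#include <stddef.h>

struct folio { bool young; };
struct pte   { bool young; struct folio *folio; };

/* stand-in for get_pfn_folio(): the pointer chase that tends to miss cache */
static struct folio *lookup_folio(const struct pte *pte)
{
	return pte->folio;
}

/* current order: an old PTE fails the cheap flag test and never pays for
 * the folio lookup */
static int scan_current(struct pte *ptes, size_t n)
{
	int young = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!ptes[i].young)
			continue;		/* no lookup_folio() at all */
		if (lookup_folio(&ptes[i]))
			young++;
	}
	return young;
}

/* proposed order: every present PTE pays for the folio lookup so the folio
 * young flag can be consulted too -- one extra potential cache miss per
 * old PTE */
static int scan_proposed(struct pte *ptes, size_t n)
{
	int young = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		struct folio *folio = lookup_folio(&ptes[i]);

		if (!folio)
			continue;
		if (folio->young || ptes[i].young) {
			folio->young = false;	/* like folio_test_clear_young() */
			young++;
		}
	}
	return young;
}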
> +		is_pte_young = !!pte_young(ptent);
> +		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young);
> +		if (!folio) {
> +			if (!is_pte_young)
> +				walk->mm_stats[MM_LEAF_OLD]++;
>  			continue;
>  		}
>
> -		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> -		if (!folio)
> +		if (!folio_test_clear_young(folio) && !is_pte_young) {
> +			walk->mm_stats[MM_LEAF_OLD]++;
>  			continue;
> +		}
>
> -		if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
> +		if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i))
>  			VM_WARN_ON_ONCE(true);
>
>  		young++;