Message-ID: <dd620dbd-6d71-7553-d1e9-95676ff12c82@nutanix.com>
Date: Fri, 4 Mar 2022 17:55:58 +0000
From: Ivan Teterevkov <ivan.teterevkov@...anix.com>
To: minchan@...nel.org, akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
mhocko@...e.com, hannes@...xchg.org, timmurray@...gle.com,
joel@...lfernandes.org, surenb@...gle.com, dancol@...gle.com,
shakeelb@...gle.com, sonnyrao@...gle.com, oleksandr@...hat.com,
hdanton@...a.com, lizeb@...gle.com, dave.hansen@...el.com,
kirill.shutemov@...ux.intel.com
Subject: Regression of madvise(MADV_COLD) on shmem?
Hi folks,
I want to check whether there's a regression in madvise(MADV_COLD)
behaviour with shared memory, or whether my understanding of how it
works is inaccurate.
The MADV_COLD advice was introduced in Linux 5.4 and allows users to
mark selected memory ranges as more "inactive" than others, overriding
the default LRU accounting. This helps preserve the working set of an
application. With more recent kernels, at least 5.17.0-rc6 and 5.10.42,
MADV_COLD has stopped working as expected.
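(Side note: on kernels older than 5.4, madvise() rejects the unknown
advice value with EINVAL, so MADV_COLD support can be probed at run
time. Below is a rough, untested sketch of such a probe; the helper
name supports_madv_cold() is just illustrative.)

/*
 * Rough sketch: probe whether the running kernel accepts MADV_COLD.
 * Pre-5.4 kernels are expected to fail with EINVAL for the unknown
 * advice value.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20
#endif

static int supports_madv_cold(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int ok;

        if (p == MAP_FAILED)
                return -1;
        ok = madvise(p, page_size, MADV_COLD) == 0;
        munmap(p, page_size);
        return ok;
}

int main(void)
{
        printf("MADV_COLD supported: %d\n", supports_madv_cold());
        return 0;
}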
Please take a look at a short program that demonstrates it:
/*
 * madvise(MADV_COLD) demo.
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Requires kernel 5.4 or newer. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

#define GIB(x) ((size_t)(x) << 30)

int main(void)
{
        char *shmem, *zeroes;
        int page_size = getpagesize();
        size_t i;

        /* Allocate 8 GiB of shared memory. */
        shmem = mmap(/* addr */ NULL,
                     /* length */ GIB(8),
                     /* prot */ PROT_READ | PROT_WRITE,
                     /* flags */ MAP_SHARED | MAP_ANONYMOUS,
                     /* fd */ -1,
                     /* offset */ 0);
        assert(shmem != MAP_FAILED);

        /* Allocate a zero page for future use. */
        zeroes = calloc(1, page_size);
        assert(zeroes != NULL);

        /* Put a 1 GiB blob at the beginning of the shared memory range. */
        memset(shmem, 0xaa, GIB(1));

        /* Read memory adjacent to the blob. */
        for (i = GIB(1); i < GIB(8); i += page_size) {
                int res = memcmp(shmem + i, zeroes, page_size);
                assert(res == 0);

                /*
                 * Cool down a zero page and make it "less active" than
                 * the blob. Under memory pressure, it'll likely become a
                 * reclaim target and thus will help to preserve the blob
                 * in memory.
                 */
                res = madvise(shmem + i, page_size, MADV_COLD);
                assert(res == 0);
        }

        /* Let the user check smaps. */
        printf("done\n");
        pause();

        free(zeroes);
        munmap(shmem, GIB(8));
        return 0;
}
How to run this program:
1. Create a "test" cgroup with a memory limit of 3 GiB.
1.1. cgroup v1:
# mkdir /sys/fs/cgroup/memory/test
# echo 3G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
1.2. cgroup v2:
# mkdir /sys/fs/cgroup/test
# echo 3G > /sys/fs/cgroup/test/memory.max
2. Enable a swap device of at least 1 GiB.
3. Run the program in the "test" cgroup:
# cgexec -g memory:test ./a.out
4. Wait until the program prints "done" (it then pauses so that smaps can be inspected).
5. Check the shared memory VMA stats.
5.1. In 5.17.0-rc6 and 5.10.42:
# cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
7f8ed4648000-7f90d4648000 rw-s 00000000 00:01 2055
/dev/zero (deleted)
Size: 8388608 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 3119556 kB
Pss: 3119556 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 3119556 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 1048576 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 0
VmFlags: rd wr sh mr mw me ms sd
5.2. In 5.4.109:
# cat /proc/$(pidof a.out)/smaps | grep -A 21 -B 1 8388608
7fca5f78b000-7fcc5f78b000 rw-s 00000000 00:01 173051
/dev/zero (deleted)
Size: 8388608 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 3121504 kB
Pss: 3121504 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 2072928 kB
Private_Dirty: 1048576 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 0
VmFlags: rd wr sh mr mw me ms
There's a noticeable difference in the "Swap:" figures: the older
kernel doesn't swap out the blob, while the newer ones do.
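For reference, here is a rough sketch of how that "Swap:" figure could
be read programmatically from /proc/<pid>/smaps instead of grepping by
hand (the helper name and the match-by-Size approach are only
illustrative):

/*
 * Rough sketch: find the mapping whose "Size:" line contains the given
 * size (in kB) and report its "Swap:" value from /proc/<pid>/smaps.
 */
#include <stdio.h>
#include <string.h>

static long swap_kb_for_size(const char *smaps_path, const char *size_kb)
{
        FILE *f = fopen(smaps_path, "r");
        char line[512];
        int in_target = 0;
        long swap_kb = -1;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "Size:", 5))
                        in_target = strstr(line, size_kb) != NULL;
                else if (in_target && !strncmp(line, "Swap:", 5))
                        sscanf(line, "Swap: %ld", &swap_kb);
        }
        fclose(f);
        return swap_kb;
}

int main(int argc, char **argv)
{
        /* E.g.: ./swapcheck /proc/$(pidof a.out)/smaps 8388608 */
        if (argc == 3)
                printf("Swap: %ld kB\n",
                       swap_kb_for_size(argv[1], argv[2]));
        return 0;
}

On the newer kernels this reports roughly the 1048576 kB seen above,
while on 5.4.109 it reports 0 kB.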
According to ftrace, the newer kernels still call deactivate_page() in
madvise_cold():
# trace-cmd record -p function_graph -g madvise_cold
# trace-cmd report | less
a.out-4877 [000] 1485.266106: funcgraph_entry:            |  madvise_cold() {
a.out-4877 [000] 1485.266115: funcgraph_entry:            |    walk_page_range() {
a.out-4877 [000] 1485.266116: funcgraph_entry:            |      __walk_page_range() {
a.out-4877 [000] 1485.266117: funcgraph_entry:            |        madvise_cold_or_pageout_pte_range() {
a.out-4877 [000] 1485.266118: funcgraph_entry:  0.179 us  |          deactivate_page();
(The irrelevant bits are removed for brevity.)
This makes me think there may be a regression in MADV_COLD. Please let
me know what you reckon.
Thanks,
Ivan