Message-ID: <87y29nbtji.fsf@disp2133>
Date: Fri, 30 Jul 2021 12:28:17 -0500
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Tiberiu A Georgescu <tiberiu.georgescu@...anix.com>
Cc: akpm@...ux-foundation.org, viro@...iv.linux.org.uk,
peterx@...hat.com, david@...hat.com, christian.brauner@...ntu.com,
adobriyan@...il.com, songmuchun@...edance.com, axboe@...nel.dk,
vincenzo.frascino@....com, catalin.marinas@....com,
peterz@...radead.org, chinwen.chang@...iatek.com,
linmiaohe@...wei.com, jannh@...gle.com, apopple@...dia.com,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, ivan.teterevkov@...anix.com,
florian.schmidt@...anix.com, carl.waldspurger@...anix.com,
jonathan.davies@...anix.com
Subject: Re: [PATCH 0/1] pagemap: swap location for shared pages
Tiberiu A Georgescu <tiberiu.georgescu@...anix.com> writes:
> This patch follows up on a previous RFC:
> 20210714152426.216217-1-tiberiu.georgescu@...anix.com
>
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap
> interface.
What is the point?

You say a shared swapped-out page is the same as a clean shared page,
and you are exactly correct. What is the point in knowing a shared
page was swapped out? What is the gain?
I tried to understand the point by looking at your numbers below
and everything I could see looked worse post patch.
Eric
> Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
> #define NPAGES (256)
> /* map 1MiB shared memory */
> size_t pagesize = getpagesize();
> char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
>                MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> /* Dirty new pages. */
> for (size_t i = 0; i < NPAGES; i++)
>         p[i * pagesize] = i;
>
> Run the above program in a small cgroup, which causes swapping:
> /* Initialise cgroup & run a program */
> $ echo 512K > foo/memory.limit_in_bytes
> $ echo 60 > foo/memory.swappiness
> $ cgexec -g memory:foo ./pagemap-test
>
> Check the pagemap report. Example of the current expected output:
> $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00000710 e1 6b 06 00 00 00 00 a1 9e eb 06 00 00 00 00 a1 |.k..............|
> 00000720 6b ee 06 00 00 00 00 a1 a5 a4 05 00 00 00 00 a1 |k...............|
> 00000730 5c bf 06 00 00 00 00 a1 90 b6 06 00 00 00 00 a1 |\...............|
>
> The first pagemap entries are reported as zeroes, as if the pages had never
> been allocated, even though they have actually been swapped out.
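>
> For reference, a minimal userspace sketch of how a single pagemap entry can
> be decoded (bit layout as documented in
> Documentation/admin-guide/mm/pagemap.rst; the helper name and the use of
> /proc/self/pagemap are illustrative only, not part of this patch):
>
> #include <stdint.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
>
> /* Decode the 64-bit pagemap entry for the page containing vaddr.
>  * Note: reading the PFN bits needs CAP_SYS_ADMIN; the flag bits do not. */
> static void print_pagemap_entry(uintptr_t vaddr)
> {
>         long pagesize = sysconf(_SC_PAGESIZE);
>         int fd = open("/proc/self/pagemap", O_RDONLY);
>         uint64_t pme = 0;
>
>         if (fd < 0)
>                 return;
>         pread(fd, &pme, sizeof(pme), (vaddr / pagesize) * sizeof(pme));
>         close(fd);
>
>         if (pme & (1ULL << 63))                 /* bit 63: present */
>                 printf("present, pfn=%llu\n",
>                        (unsigned long long)(pme & ((1ULL << 55) - 1)));
>         else if (pme & (1ULL << 62))            /* bit 62: swapped */
>                 printf("swapped, type=%llu, offset=%llu\n",
>                        (unsigned long long)(pme & 0x1f),     /* bits 0-4 */
>                        (unsigned long long)((pme >> 5) & ((1ULL << 50) - 1)));
>         else
>                 printf("zero entry: looks never-allocated\n");
> }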
>
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages in
> the page cache, swap cache or on disk. If they are in either of the caches,
> then the original implementation still works. If not, then the missing
> information will be retrieved from the XArray.
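>
> For illustration only, the lookup is conceptually along these lines (a
> hypothetical helper, not the actual diff; it assumes a shmem-backed VMA
> whose mapping keeps swap entries as XArray values once the page leaves the
> caches):
>
> /* Sketch: report the swap location of a shmem page that is neither in the
>  * page cache nor in the swap cache.  PM_SWAP is the pagemap swap flag
>  * (bit 62).  Needs linux/mm.h, linux/pagemap.h and linux/swapops.h. */
> static u64 shmem_swap_pme(struct vm_area_struct *vma, unsigned long addr)
> {
>         struct address_space *mapping = vma->vm_file->f_mapping;
>         pgoff_t pgoff = linear_page_index(vma, addr);
>         void *entry = xa_load(&mapping->i_pages, pgoff);
>
>         if (xa_is_value(entry)) {       /* XArray "value" == swap entry */
>                 swp_entry_t swp = radix_to_swp_entry(entry);
>
>                 return PM_SWAP | swp_type(swp) |
>                        ((u64)swp_offset(swp) << MAX_SWAPFILES_SHIFT);
>         }
>         return 0;                       /* genuinely never allocated */
> }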
>
> Performance
> ============
> I measured the performance of the patch on a single socket Xeon E5-2620
> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
> steps taken:
>
> 1. Run example pagemap-test code on a cgroup
> a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
> b. allocate 16GiB (about 4 million pages);
> c. dirty 0, 50 or 100% of pages;
> d. do this for both private and shared memory.
> 2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
> for each possible configuration above
> a. 3 times for warm up;
> b. 10 times to measure performance.
> Use `time` or another performance measuring tool.
>
> Results (averaged over 10 iterations):
>                +--------+------------+------------+
>                | dirty% |  pre patch | post patch |
>                +--------+------------+------------+
>   private|anon |   0%   |      8.15s |      8.40s |
>                |  50%   |     11.83s |     12.19s |
>                |  100%  |     12.37s |     12.20s |
>                +--------+------------+------------+
>   shared|anon  |   0%   |      8.17s |      8.18s |
>                |  50%   | (*) 10.43s |     37.43s |
>                |  100%  | (*) 10.20s |     38.59s |
>                +--------+------------+------------+
>
> (*): reminder that pre-patch produces incorrect pagemap entries for swapped
> out pages.
>
> From run to run the above results are stable (mostly <1% stderr).
>
> The amount of time it takes for a full read of the pagemap depends on the
> granularity used by dd to read the pagemap file. Even though the access is
> sequential, the script only reads 8 bytes at a time, running pagemap_read()
> COUNT times (one time for each page in a 16GiB area).
>
> To reduce overhead, we can use batching for large amounts of sequential
> access. We can make dd read multiple page entries at a time,
> allowing the kernel to make optimisations and yield more throughput.
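>
> The same batching can of course be done from C directly: a single pread()
> of BATCH entries replaces BATCH separate 8-byte reads, so pagemap_read()
> is entered once per batch rather than once per page (sketch only; the
> helper below is illustrative):
>
> #include <stdint.h>
> #include <stddef.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #define BATCH 512
>
> /* Read npages pagemap entries starting at vaddr, BATCH at a time. */
> static void read_pagemap_batched(int fd, uintptr_t vaddr, size_t npages)
> {
>         long pagesize = sysconf(_SC_PAGESIZE);
>         uint64_t buf[BATCH];
>         off_t off = (off_t)(vaddr / pagesize) * sizeof(uint64_t);
>
>         for (size_t done = 0; done < npages; done += BATCH) {
>                 size_t n = npages - done < BATCH ? npages - done : BATCH;
>
>                 if (pread(fd, buf, n * sizeof(uint64_t), off) < 0)
>                         break;
>                 off += n * sizeof(uint64_t);
>                 /* ... consume the n entries in buf here ... */
>         }
> }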
>
> Performance in real time (seconds) of
> `dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
> count=$((4194304 / $BATCH))`:
> +---------------------------------+ +---------------------------------+
> |     Shared, Anon, 50% dirty     | |    Shared, Anon, 100% dirty     |
> +-------+------------+------------+ +-------+------------+------------+
> | Batch |  Pre-patch | Post-patch | | Batch |  Pre-patch | Post-patch |
> +-------+------------+------------+ +-------+------------+------------+
> |    1  | (*) 10.43s |     37.43s | |    1  | (*) 10.20s |     38.59s |
> |    2  | (*)  5.25s |     18.77s | |    2  | (*)  5.15s |     19.37s |
> |    4  | (*)  2.63s |      9.42s | |    4  | (*)  2.63s |      9.74s |
> |    8  | (*)  1.38s |      4.80s | |    8  | (*)  1.35s |      4.94s |
> |   16  | (*)  0.73s |      2.46s | |   16  | (*)  0.72s |      2.54s |
> |   32  | (*)  0.40s |      1.31s | |   32  | (*)  0.41s |      1.34s |
> |   64  | (*)  0.25s |      0.72s | |   64  | (*)  0.24s |      0.74s |
> |  128  | (*)  0.16s |      0.43s | |  128  | (*)  0.16s |      0.44s |
> |  256  | (*)  0.12s |      0.28s | |  256  | (*)  0.12s |      0.29s |
> |  512  | (*)  0.10s |      0.21s | |  512  | (*)  0.10s |      0.22s |
> | 1024  | (*)  0.10s |      0.20s | | 1024  | (*)  0.10s |      0.21s |
> +-------+------------+------------+ +-------+------------+------------+
>
> To conclude, in order to make the most of the underlying mechanisms of
> pagemap and the XArray, one should use batching to achieve better
> performance.
>
> Future Work
> ============
>
> Note: there are PTE flags which currently do not survive swap-out when the
> page is shmem-backed: SOFT_DIRTY and UFFD_WP.
>
> A solution for saving the state of the UFFD_WP flag has been proposed by
> Peter Xu in the patch linked below. The concept and mechanism proposed
> could be extended to include the SOFT_DIRTY bit as well:
> 20210715201422.211004-1-peterx@...hat.com
> Our patches are mostly orthogonal.
>
> Kind regards,
> Tibi
>
> Tiberiu A Georgescu (1):
> pagemap: report swap location for shared pages
>
> fs/proc/task_mmu.c | 38 ++++++++++++++++++++++++++++++--------
> 1 file changed, 30 insertions(+), 8 deletions(-)