linux-kernel - Re: [PATCH v4] proc/ksm: add ksm stats to /proc/pid/smaps

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b380a853-4466-d060-1084-dcdd65a5ce13@redhat.com>
Date:   Wed, 23 Aug 2023 12:09:40 +0200
From:   David Hildenbrand <david@...hat.com>
To:     Stefan Roesch <shr@...kernel.io>, kernel-team@...com
Cc:     akpm@...ux-foundation.org, linux-fsdevel@...r.kernel.org,
        hannes@...xchg.org, riel@...riel.com, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v4] proc/ksm: add ksm stats to /proc/pid/smaps

On 22.08.23 20:05, Stefan Roesch wrote:
> With madvise and prctl KSM can be enabled for different VMA's. Once it
> is enabled we can query how effective KSM is overall. However we cannot
> easily query if an individual VMA benefits from KSM.
> 
> This commit adds a KSM section to the /prod/<pid>/smaps file. It reports
> how many of the pages are KSM pages. The returned value for KSM is
> independent of the use of the shared zeropage.

Maybe phrase that to something like "The returned value for KSM includes
KSM-placed zeropages, so we can observe the actual KSM benefit independent
of the usage of the shared zeropage for KSM.".

But thinking about it (see below), maybe we really just let any user figure that out by
temporarily disabling the shared zeropage.

So this would be

"It reports how many of the pages are KSM pages. Note that KSM-placed zeropages
are not included, only actual KSM pages."

> 
> Here is a typical output:
> 
> 7f420a000000-7f421a000000 rw-p 00000000 00:00 0
> Size:             262144 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Rss:               51212 kB
> Pss:                8276 kB
> Shared_Clean:        172 kB
> Shared_Dirty:      42996 kB
> Private_Clean:       196 kB
> Private_Dirty:      7848 kB
> Referenced:        15388 kB
> Anonymous:         51212 kB
> KSM:               41376 kB
> LazyFree:              0 kB
> AnonHugePages:         0 kB
> ShmemPmdMapped:        0 kB
> FilePmdMapped:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:             202016 kB
> SwapPss:            3882 kB
> Locked:                0 kB
> THPeligible:    0
> ProtectionKey:         0
> ksm_state:          0
> ksm_skip_base:      0
> ksm_skip_count:     0
> VmFlags: rd wr mr mw me nr mg anon
> 
> This information also helps with the following workflow:
> - First enable KSM for all the VMA's of a process with prctl.
> - Then analyze with the above smaps report which VMA's benefit the most
> - Change the application (if possible) to add the corresponding madvise
> calls for the VMA's that benefit the most
> 
> Signed-off-by: Stefan Roesch <shr@...kernel.io>
> ---
>   Documentation/filesystems/proc.rst |  4 ++++
>   fs/proc/task_mmu.c                 | 16 +++++++++++-----
>   2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 7897a7dafcbc..d5bdfd59f5b0 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -461,6 +461,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
>       Private_Dirty:         0 kB
>       Referenced:          892 kB
>       Anonymous:             0 kB
> +    KSM:                   0 kB
>       LazyFree:              0 kB
>       AnonHugePages:         0 kB
>       ShmemPmdMapped:        0 kB
> @@ -501,6 +502,9 @@ accessed.
>   a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>   and a page is modified, the file page is replaced by a private anonymous copy.
>   
> +"KSM" shows the amount of anonymous memory that has been de-duplicated. The
> +value is independent of the use of shared zeropage.

Maybe here as well.

> +
>   "LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
>   The memory isn't freed immediately with madvise(). It's freed in memory
>   pressure if the memory is clean. Please note that the printed value might
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 51315133cdc2..4532caa8011c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -4,6 +4,7 @@
>   #include <linux/hugetlb.h>
>   #include <linux/huge_mm.h>
>   #include <linux/mount.h>
> +#include <linux/ksm.h>
>   #include <linux/seq_file.h>
>   #include <linux/highmem.h>
>   #include <linux/ptrace.h>
> @@ -396,6 +397,7 @@ struct mem_size_stats {
>   	unsigned long swap;
>   	unsigned long shared_hugetlb;
>   	unsigned long private_hugetlb;
> +	unsigned long ksm;
>   	u64 pss;
>   	u64 pss_anon;
>   	u64 pss_file;
> @@ -435,9 +437,9 @@ static void smaps_page_accumulate(struct mem_size_stats *mss,
>   	}
>   }
>   
> -static void smaps_account(struct mem_size_stats *mss, struct page *page,
> -		bool compound, bool young, bool dirty, bool locked,
> -		bool migration)
> +static void smaps_account(struct mem_size_stats *mss, pte_t *pte,
> +		struct page *page, bool compound, bool young, bool dirty,
> +		bool locked, bool migration)
>   {
>   	int i, nr = compound ? compound_nr(page) : 1;
>   	unsigned long size = nr * PAGE_SIZE;
> @@ -452,6 +454,9 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>   			mss->lazyfree += size;
>   	}
>   
> +	if (PageKsm(page) && (!pte || !is_ksm_zero_pte(*pte)))

I think this won't work either way, because smaps_pte_entry() never ends up calling
this function with !page. And the shared zeropage here always gives us !page.

What would work is:


diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 15ddf4653a19..ef6f39d7c5a2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -528,6 +528,9 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
                 page = vm_normal_page(vma, addr, ptent);
                 young = pte_young(ptent);
                 dirty = pte_dirty(ptent);
+
+               if (!page && is_ksm_zero_pte(ptent))
+                       mss->ksm += size;
         } else if (is_swap_pte(ptent)) {
                 swp_entry_t swpent = pte_to_swp_entry(ptent);
  

That means that "KSM" can be bigger than "Anonymous" and "RSS" when the shared
zeropage is used.

Interestingly, right now we account each KSM page individually towards
"Anonymous" and "RSS".

So if we have 100 times the same KSM page in a VMA, we will have 100 times anon
and 100 times rss.

Thinking about it, I guess considering the KSM-placed zeropage indeed adds more
confusion to that. Eventually, we might just want separate "Shared-zeropages" count.


So maybe v3 is better, clarifying the documentation a bit, that the
KSM-placed zeropage is not considered.

Sorry for changing my mind :D Thoughts?

-- 
Cheers,

David / dhildenb