Message-ID: <c04f5611-162c-441c-b94c-79032f09e015@redhat.com>
Date: Fri, 26 Apr 2024 08:57:36 +0200
From: David Hildenbrand <david@...hat.com>
To: John Hubbard <jhubbard@...dia.com>, Matthew Wilcox <willy@...radead.org>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 linux-doc@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
 Jonathan Corbet <corbet@....net>,
 "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
 Zi Yan <ziy@...dia.com>, Yang Shi <yang.shi@...ux.alibaba.com>,
 Ryan Roberts <ryan.roberts@....com>
Subject: Re: [PATCH v1] mm/khugepaged: replace page_mapcount() check by
 folio_likely_mapped_shared()

On 26.04.24 03:23, John Hubbard wrote:
> On 4/25/24 1:06 AM, David Hildenbrand wrote:
>> On 25.04.24 07:40, John Hubbard wrote:
>>> On 4/24/24 9:17 PM, Matthew Wilcox wrote:
>>>> On Wed, Apr 24, 2024 at 09:00:50PM -0700, John Hubbard wrote:
> ...
>> We'll talk more about all that at LSF/MM in the mapcount session. A spoiler:
> 
> Looking forward to it. And as an aside, this year it feels like the mm
> code is changing relatively fast. So many large and small improvements
> have happened or are in progress.

Yes, it's happening at a very fast pace (and it's hard for me to get 
reasonable work done while still keeping up with that much reviewing ...).

I'll note that replacing a page-based interface with a folio-based 
interface should not be shocking news in 2024, and that the issues with 
mapcounts for large folios have been a recurring topic at LSF/MM and on 
the mailing list.

> 
> 
>>
>> page_mapcount() in the context of large folios:
>> * Is a misunderstood function (e.g., page_mapcount() vs page_count()
>>     checks, mapped = !page_mapcount() checks).
>> * Is a misleading function (e.g., page_mapped() == folio_mapped() but
>>     page_mapcount() != folio_mapcount())
>>
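
(To make that second point concrete: a sketch, not actual kernel code, of
how the page-level and folio-level views can disagree for a partially
PTE-mapped large folio.)

	/*
	 * page_mapped(page) effectively answers "is the folio mapped
	 * anywhere?", so it matches folio_mapped(folio); page_mapcount(page)
	 * only counts mappings of this one subpage and can be 0 at the same
	 * time.
	 */
	if (page_mapped(page) && !page_mapcount(page)) {
		/*
		 * Possible for a subpage of a partially PTE-mapped large
		 * folio: any code assuming "mapped means page_mapcount() != 0"
		 * gets this case wrong.
		 */
	}
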
>> We could just rename it to "folio_precise_page_mapcount()", but then, once we tackle the subpage mapcount optimizations (initially using a separate kernel config toggle), we'll have to teach each caller about an alternative that gets the job done, and it's not that easy to prevent further reuse around the kernel.
>>
>> If you look at linux-next, we're down to 5 page_mapcount() calls in fs/proc/, so I'll relocate it to fs/proc/internal.h to prevent any further use - once the s390x change lands in the next merge window.
>>
>> Regarding the subpage mapcount optimizations, I can further add:
>> * (un)map performance improvements for PTE-mapped THP
>> * Preparation for folio_order() > PMD_ORDER, where the current scheme
>>     won't scale and needs further adjustments/complexity to even keep it
>>     working
>> * Preparation for hugetlb-like vmemmap optimizations until we have
>>     memdescs / dynamically allocated folios
>> * (Paving the way for partially mapping hugetlb folios that faced
>>      similar issues? Not sure if that ever gets real, though)
>>
>> Is this patch ahead of its time? LSF/MM is just around the corner, and I'm planning on posting the other relevant patches in the next months.
> 
> I think so, yes. There is a lot of context required to understand the
> motivation, and more required in order to figure out if it is safe,
> and if it still provides "good" behavior.

I think the motivation for removing page_mapcount() should be very clear 
at this point: every remaining user in mm/ needs to be justified, and 
the sooner it is gone the better.

[case in point: I even have another potential user [1] in my mailbox 
that should be using a folio interface, well, or PG_anon_exclusive :) ]

[1] https://lore.kernel.org/all/Zirw0uINbP6GxFiK@bender.morinfr.org/T/#u

Regarding removing subpage mapcounts, I agree: I added too many details, 
which made it look harder to understand than it is. :P

> 
> I still think it's mostly harmless, though, so being ahead of its time
> is not necessarily an indictment. :)

I didn't express my thought clearly: LSF/MM is just around the corner 
and the discussion we are having here is the perfect preparation for 
that session! :)

I don't particularly care if we merge this patch now or after the next 
merge window along with the remaining page_mapcount() removal.

Discussing the impact of this change is the important piece. :)
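
For anyone reading along without the patch in front of them, the change
boils down to something of this shape in the khugepaged scan path (a rough
sketch of the idea, not the literal hunk):

	struct folio *folio = page_folio(page);

	/* Old check: per-subpage mapcount, precise for this particular page. */
	/* if (page_mapcount(page) > 1) */

	/* New check: per-folio heuristic; it may under-detect sharing. */
	if (folio_likely_mapped_shared(folio))
		++shared;	/* counted against khugepaged_max_ptes_shared */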

[...]

>> Thanks for having a look!
>>
>> I'm only a bit concerned about folio_likely_mapped_shared() "false negatives" (detecting exclusive although shared), until we have a more precise folio_likely_mapped_shared() variant to not unexpectedly waste memory.
>>
>> Imagine someone sets "khugepaged_max_ptes_shared=0", and we then have an area where (I think this is the extreme case):
>>
>> * We map 256 subpages of a 2M folio that are shared 256 times with a
>>     child process.
>> * We don't map the first subpage.
>> * One PTE maps another page and is pte_write().
>> * 255 PTEs are pte_none().
>>
>> folio_likely_mapped_shared() would return "false".
>>
>> But then my thinking is:
>> * We are already wasting 256 subpages that are free in the 2M folio.
>>     Sure, we might be able to reclaim them via deferred splitting.
>> * Why would someone set khugepaged_max_ptes_shared=0 but leave
>>     khugepaged_max_ptes_none set so high that we would allow 255
>>     pte_none entries?
>> * If the child is a short-lived subprocess, we don't really care.
>> * Any further writes to unbacked/R/O PTEs in that PMD area would COW and
>>     consume memory.
>>
>> So I had to use more and more "ifs" to construct a scenario where we might end up wasting 1M of memory, at which point I decided "this is really a corner case" and likely not worth the worry.
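
(Spelling that corner case out, assuming 4 KiB base pages and one possible
concrete layout, purely as an illustration:)

	/*
	 * PMD range: 512 PTEs; 2M folio: 512 subpages.
	 *
	 *   PTE 0         -> some other page, pte_write()
	 *   PTEs 1..255   -> pte_none()
	 *   PTEs 256..511 -> subpages 256..511 of the folio, each also
	 *                    mapped by the child (shared)
	 *
	 * Subpages 0..255 of the folio are mapped nowhere, so
	 * 256 * 4 KiB = 1 MiB of the folio already sits unused before
	 * khugepaged even looks at the range.
	 */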
>>
>> If we run into real issues, though, it will be easy to just inline page_mapcount() here to resolve it; but the less special-casing the better.
>>
> 
> OK. I'll need to think through some more of these cases. And meanwhile, I
> was poking around from the other direction: just injection test it by
> pasting in "true" or "false", in place of calling folio_likely_mapped_shared().
> And see what changes.

Highly appreciated!
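
For the archives, the kind of injection you describe would be a throwaway
hack along these lines (testing aid only, obviously not meant to be merged):

	/*
	 * Hard-wire the heuristic's answer and watch which khugepaged
	 * selftests and /proc/vmstat counters react; use "false" for the
	 * opposite direction.
	 */
	/* if (folio_likely_mapped_shared(folio)) */
	if (true)
		++shared;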

> 
> The "true" test lets me fail several khugepaged selftests, while the "false"
> test just increases the counter in /proc/vmstat.
> 
> That's more of a black box way of poking at it, just to have another facet
> of testing. Because it is good to ensure that we really do have test
> coverage if we're changing the code. Anyway, just ideas.

Yes, all makes sense.

I'm very interested in whether there are valid concerns that the "false 
negatives" are unacceptable: that would be another argument for why we 
really want to make folio_likely_mapped_shared() precise. For me it's 
clear that we want to make it precise, but so far I am not convinced that 
it is absolutely required in the khugepaged context.

Thanks!

-- 
Cheers,

David / dhildenb

