Message-ID: <c04f5611-162c-441c-b94c-79032f09e015@redhat.com>
Date: Fri, 26 Apr 2024 08:57:36 +0200
From: David Hildenbrand <david@...hat.com>
To: John Hubbard <jhubbard@...dia.com>, Matthew Wilcox <willy@...radead.org>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-doc@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
Jonathan Corbet <corbet@....net>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Zi Yan <ziy@...dia.com>, Yang Shi <yang.shi@...ux.alibaba.com>,
Ryan Roberts <ryan.roberts@....com>
Subject: Re: [PATCH v1] mm/khugepaged: replace page_mapcount() check by
folio_likely_mapped_shared()
On 26.04.24 03:23, John Hubbard wrote:
> On 4/25/24 1:06 AM, David Hildenbrand wrote:
>> On 25.04.24 07:40, John Hubbard wrote:
>>> On 4/24/24 9:17 PM, Matthew Wilcox wrote:
>>>> On Wed, Apr 24, 2024 at 09:00:50PM -0700, John Hubbard wrote:
> ...
>> We'll talk more about all that at LSF/MM in the mapcount session. A spoiler:
>
> Looking forward to it. And as an aside, this year it feels like the mm
> code is changing relatively fast. So many large and small improvements
> have happened or are in progress.
Yes, it's happening at a very fast pace (and it's hard for me to get
reasonable work done while still reviewing that much ...).
I'll note that replacing a page-based interface with a folio-based
interface should not be shocking news in 2024, and that the issues with
mapcounts for large folios have been a recurring topic at LSF/MM and on
the mailing list.
>
>
>>
>> page_mapcount() in the context of large folios:
>> * Is a misunderstood function (e.g., page_mapcount() vs page_count()
>> checks, mapped = !page_mapcount() checks).
>> * Is a misleading function (e.g., page_mapped() == folio_mapped() but
>> page_mapcount() != folio_mapcount())
>>
>> We could just rename it to "folio_precise_page_mapcount()", but then, once we tackle the subpage mapcount optimizations (initially using a separate kernel config toggle), we'll have to teach each caller about an alternative that gets the job done, and it's not that easy to prevent further reuse around the kernel.
>>
>> If you look at linux-next, we're down to 5 page_mapcount() calls in fs/proc/, so I'll relocate it to fs/proc/internal.h to prevent any further use - once the s390x change lands in the next merge window.
>>
>> Regarding the subpage mapcount optimizations, I can further add:
>> * (un)map performance improvements for PTE-mapped THP
>> * Preparation for folio_order() > PMD_ORDER, where the current scheme
>> won't scale and needs further adjustments/complexity to even keep it
>> working
>> * Preparation for hugetlb-like vmemmap optimizations until we have
>> memdescs / dynamically allocated folios
>> * (Paving the way for partially mapping hugetlb folios that faced
>> similar issues? Not sure if that ever gets real, though)
>>
>> Is this patch ahead of its time? LSF/MM is just around the corner, and I'm planning on posting the other relevant patches in the next months.
>
> I think so, yes. There is a lot of context required to understand the
> motivation, and more required in order to figure out if it is safe,
> and if it still provides "good" behavior.
I think the motivation for removing page_mapcount() should be very clear
at this point: a single remaining user in mm/ should be warranted, and
the faster it is gone the better.
[case in point: I even have another potential user [1] in my mailbox
that should be using a folio interface, well, or PG_anon_exclusive :) ]
[1] https://lore.kernel.org/all/Zirw0uINbP6GxFiK@bender.morinfr.org/T/#u
Regarding removing subpage mapcounts, I agree: I added too many details
that made it look harder to understand :P
>
> I still think it's mostly harmless, though, so being ahead of its time
> is not necessarily an indictment. :)
I didn't express my thought clearly: LSF/MM is just around the corner
and the discussion we are having here is the perfect preparation for
that session! :)
I don't particularly care if we merge this patch now or after the next
merge window along with the remaining page_mapcount() removal.
Discussing the impact of this change is the important piece. :)
[...]
>> Thanks for having a look!
>>
>> I'm only a bit concerned about folio_likely_mapped_shared() "false negatives" (detecting exclusive although shared), until we have a more precise folio_likely_mapped_shared() variant to not unexpectedly waste memory.
>>
>> Imagine someone would be setting "khugepaged_max_ptes_shared=0", and then we have an area where (I think this is the extreme case):
>>
>> * We map 256 subpages of a 2M folio that are shared 256 times with a
>> child process.
>> * We don't map the first subpage.
>> * One PTE maps another page and is pte_write().
>> * 255 PTEs are pte_none().
>>
>> folio_likely_mapped_shared() would return "false".
>>
>> But then my thinking is:
>> * We are already wasting 256 subpages that are free in the 2M folio.
>> Sure, we might be able to reclaim it via deferred splitting.
>> * Why would someone set khugepaged_max_ptes_shared=0 but leave
>> khugepaged_max_ptes_none set so high that we would allow 255
>> pte_none entries?
>> * If the child is a short-living subprocess, we don't really care
>> * Any further writes to unbacked/R/O PTEs in that PMD area would COW
>> and consume memory.
>>
>> So I had to use more and more "ifs" to construct a scenario where we might end up wasting 1M of memory, at which point I decided "this is really a corner case" and likely not worth the worry.
>>
>> If we run into real issues, though, it will be easy to just inline page_mapcount() here to resolve it; but the less special-casing the better.
>>
>
> OK. I'll need to think through some more of these cases. And meanwhile, I
> was poking around from the other direction: just injection test it by
> pasting in "true" or "false", in place of calling folio_likely_mapped_shared().
> And see what changes.
Highly appreciated!
>
> The "true" test lets me fail several khugepaged selftests, while the "false"
> test just increases the counter in /proc/vmstat.
>
> That's more of a black box way of poking at it, just to have another facet
> of testing. Because it is good to ensure that we really do have test
> coverage if we're changing the code. Anyway, just ideas.
Yes, all makes sense.
I'm very interested in whether there are valid concerns that the "false
negatives" are unacceptable: that would be another argument for why we
really want to make folio_likely_mapped_shared() precise. For me it's
clear that we want to make it precise, but so far I am not convinced
that it is absolutely required in the khugepaged context.
Thanks!
--
Cheers,
David / dhildenb