Date: Wed, 24 Apr 2024 18:36:32 +0200
From: David Hildenbrand <david@...hat.com>
To: Yang Shi <shy828301@...il.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 linux-doc@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
 Jonathan Corbet <corbet@....net>,
 "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
 Zi Yan <ziy@...dia.com>, Yang Shi <yang.shi@...ux.alibaba.com>,
 John Hubbard <jhubbard@...dia.com>, Ryan Roberts <ryan.roberts@....com>
Subject: Re: [PATCH v1] mm/khugepaged: replace page_mapcount() check by
 folio_likely_mapped_shared()

On 24.04.24 18:28, Yang Shi wrote:
> On Wed, Apr 24, 2024 at 5:26 AM David Hildenbrand <david@...hat.com> wrote:
>>
>> We want to limit the use of page_mapcount() to places where absolutely
>> required, to prepare for kernel configs where we won't keep track of
>> per-page mapcounts in large folios.
>>
>> khugepaged is one of the remaining "more challenging" page_mapcount()
>> users, but we might be able to move away from page_mapcount() without
>> resulting in a significant behavior change that would warrant
>> special-casing based on kernel configs.
>>
>> In 2020, we first added support to khugepaged for collapsing COW-shared
>> pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared
>> across fork"), followed by support for collapsing PTE-mapped THP in commit
>> 5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages")
>> and limiting the memory waste via the "page_count() > 1" check in commit
>> 71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable").
>>
>> By default, khugepaged will allow up to half of the PTEs to map shared
>> pages, that is, pages where page_mapcount() > 1. MADV_COLLAPSE ignores the
>> khugepaged setting.
>>
>> khugepaged currently does not care about swapcache page references, and
>> does not check under folio lock: so in some corner cases the "shared vs.
>> exclusive" detection might be a bit off, making us detect "exclusive" when
>> it's actually "shared".
>>
>> Most of our anonymous folios in the system are usually exclusive. We
>> frequently see sharing of anonymous folios for a short period of time,
>> after which our short-lived subprocesses either quit or exec().
>>
>> There are some famous examples, though, where child processes exist for a
>> long time, and where memory is COW-shared with a lot of processes
>> (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for
>> reducing the memory footprint. We don't want to suddenly change the
>> behavior to result in a significant increase in memory waste.
>>
>> Interestingly, khugepaged will only collapse an anonymous THP if at least
>> one PTE is writable. After fork(), that means that something (usually a
>> page fault) populated at least a single exclusive anonymous THP in that PMD
>> range.
>>
>> So ... what happens when we switch to "is this folio mapped shared"
>> instead of "is this page mapped shared" by using
>> folio_likely_mapped_shared()?
>>
>> For "not-COW-shared" folios, small folios and for THPs (large
>> folios) that are completely mapped into at least one process,
>> switching to folio_likely_mapped_shared() will not result in a change.
>>
>> We'll only see a change for COW-shared PTE-mapped THPs that are
>> partially mapped into all involved processes.
>>
>> There are two cases to consider:
>>
>> (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP
>>
>>    If the folio is detected as exclusive, and it actually is exclusive,
>>    there is no change: page_mapcount() == 1. This is the common case
>>    without fork() or with short-lived child processes.
>>
>>    folio_likely_mapped_shared() might currently still detect a folio as
>>    exclusive although it is shared (false negatives): if the first page is
>>    not mapped multiple times and if the average per-page mapcount is smaller
>>    than 1. That implies that (1) the folio is partially mapped, (2) if we
>>    are responsible for many of the mapcounts because we map many pages,
>>    others can't be (the folio is "mostly exclusive"), and (3) if we are
>>    responsible for few of the mapcounts because we map few pages (the
>>    folio is "mostly shared"), it won't have a big impact on the end result.
>>
>>    So while we might now detect a page as "exclusive" although it isn't,
>>    it's not expected to make a big difference in common cases.
>>
>> (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP
>>
>>    folio_likely_mapped_shared() will never detect a large anonymous folio
>>    as shared although it is exclusive: there are no false positives.
>>
>>    If we detect a THP as shared, at least one page of the THP is mapped by
>>    another process. It could well be that some pages are actually exclusive.
>>    For example, our child processes could have unmapped/COW'ed some pages
>>    such that they would now be exclusive to our process, which we now
>>    would treat as still-shared.
> 
> IIUC, case A may under-count shared PTEs, whereas case B may over-count
> shared PTEs, right? So the impact may depend on the value of the
> max_ptes_shared tunable. It may have a more noticeable impact on a very
> conservative setting (e.g. max_ptes_shared == 1) if it is under-counted,
> or on a very aggressive setting (e.g. max_ptes_shared == 510) if it is
> over-counted.

Thanks for reading all of that!

Right, and mostly affecting corner cases. I'm not concerned about (B) 
really. I was more concerned about (A) before I optimized 
folio_likely_mapped_shared() using the large mapcount.
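
To make the under-count (A) and over-count (B) concrete, here is a minimal user-space sketch. It is a toy model that loosely follows the description quoted above, not the kernel implementation: struct toy_folio, every toy_* function and the TOY_PMD_NR value of 512 are hypothetical names and assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 512 PTEs per PMD range (4 KiB base pages, 2 MiB THP). */
#define TOY_PMD_NR 512

/* Hypothetical stand-in for a large folio; not a kernel structure. */
struct toy_folio {
	size_t nr_pages;           /* number of pages in the folio */
	int mapcount[TOY_PMD_NR];  /* PTEs mapping each page, over all processes */
};

/* Sum of the per-page mapcounts, playing the role of a folio-wide mapcount. */
static int toy_total_mapcount(const struct toy_folio *f)
{
	int total = 0;

	assert(f->nr_pages >= 1 && f->nr_pages <= TOY_PMD_NR);
	for (size_t i = 0; i < f->nr_pages; i++)
		total += f->mapcount[i];
	return total;
}

/* Per-folio guess: no false positives, but false negatives are possible. */
static bool toy_folio_likely_shared(const struct toy_folio *f)
{
	int total = toy_total_mapcount(f);

	if (total <= 1)
		return false;  /* a single mapping is exclusive */
	if ((size_t)total > f->nr_pages)
		return true;   /* some page must be mapped more than once */
	return f->mapcount[0] > 1;  /* otherwise guess from the first page */
}

/* Old behavior: a PTE counts as shared if that page is mapped more than once. */
static int toy_count_shared_per_page(const struct toy_folio *f,
				     const bool *mapped_by_us)
{
	int shared = 0;

	for (size_t i = 0; i < f->nr_pages; i++)
		if (mapped_by_us[i] && f->mapcount[i] > 1)
			shared++;
	return shared;
}

/* New behavior: every PTE we have on a "likely shared" folio counts. */
static int toy_count_shared_per_folio(const struct toy_folio *f,
				      const bool *mapped_by_us)
{
	bool folio_shared = toy_folio_likely_shared(f);
	int shared = 0;

	for (size_t i = 0; i < f->nr_pages; i++)
		if (mapped_by_us[i] && folio_shared)
			shared++;
	return shared;
}

/* Collapse is refused once the shared count exceeds the tunable. */
static bool toy_collapse_allowed(int shared, int max_ptes_shared)
{
	return shared <= max_ptes_shared;
}
```

In this model, take an 8-page folio of which we map the first four pages. With per-page mapcounts {1, 1, 1, 2} the per-page check counts 1 shared PTE and the per-folio check counts 0 (case A, under-count). With {2, 1, 1, 1} the per-page check still counts 1, while the per-folio check counts 4 (case B, over-count). Whether that changes the collapse decision depends on the limit passed to toy_collapse_allowed().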

> 
> So I agree it should not matter much for common cases. AFAIK, the
> use case for an aggressive setting should be very rare, but a
> conservative setting may be more common, so improving the under-count
> for conservative settings may be worth it.

Yes, sorting out A completely is what I'm working on, but it will 
likely not be available on all kernel configs, at least initially.

And unless there is a good reason, I want to avoid having 
config-dependent stuff all over the kernel -- and just move 
page_mapcount() to task_mmu.c where it can no longer be (ab)used.

Thanks!

-- 
Cheers,

David / dhildenb

