Message-ID: <5ef6df9b-f4a7-43d1-9de7-19ad2d82bbff@arm.com>
Date: Mon, 1 Sep 2025 14:20:07 +0530
From: Dev Jain <dev.jain@....com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
 kas@...nel.org, willy@...radead.org, hughd@...gle.com
Cc: ziy@...dia.com, baolin.wang@...ux.alibaba.com,
 lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, npache@...hat.com,
 ryan.roberts@....com, baohua@...nel.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs


On 01/09/25 2:02 pm, David Hildenbrand wrote:
> On 01.09.25 09:48, Dev Jain wrote:
>> Currently khugepaged does not collapse a region which does not have a
>> single writable page. This is wasteful since, apart from any non-writable
>> memory mapped by the application, there are a lot of non-writable VMAs
>> which will benefit from collapsing - the VMAs of the executable, those
>> of the glibc, vvar and vdso, which won't be unmapped during the lifetime
>> of the process, as opposed to other VMAs which may be unmapped.
>
> Are these anonymous folios? ("VMAs of the executable"), or is your
> description misleading?

Oops. I dropped writable everywhere and failed to notice that the
callsites are all for anon collapse. So I'll only mention vvar, vdso and
other non-writable VMAs in the description.

>
>> Therefore,
>> remove this restriction and allow khugepaged to collapse a VMA with
>> arbitrary protections.
>>
>> Along with this, currently MADV_COLLAPSE does not perform a collapse on a
>> non-writable VMA, and this restriction is nowhere to be found on the
>> manpage - the restriction itself sounds wrong to me since the user knows
>> the protection of the memory it has mapped, so collapsing read-only
>> memory via madvise() should be a choice of the user which shouldn't
>> be overridden by the kernel.
>>
>> I dug into the history of this and couldn't find any concrete reason for
>> the current behaviour - [1] is the v1 of the original khugepaged patch
>> which required all ptes to be writable. [2] is the v1 of the patch which
>> changed this behaviour to require at least one pte to be writable. The
>> closest thing I could find was: in response to [2], Kirill says in [3] -
>> "As a side effect it will effectively allow collapse in PROT_READ vmas,
>> right? I'm not convinced it's a good idea." (Although Kirill realizes in
>> [4] that this was not the intention of the patch).
>>
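The three writable-PTE policies mentioned above (all PTEs writable, at least one PTE writable, and no writability requirement as proposed here) can be sketched as a small userspace model. This is only an illustration and not the kernel code: the names model_scan_result, writable_policy and model_writable_check are hypothetical, and the real scan in mm/khugepaged.c checks many other conditions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's enum scan_result. */
enum model_scan_result {
	MODEL_SCAN_SUCCEED,
	MODEL_SCAN_PAGE_RO,	/* collapse refused because of read-only PTEs */
};

/* The three policies described in the patch description. */
enum writable_policy {
	POLICY_ALL_WRITABLE,	/* every PTE in the range must be writable */
	POLICY_ANY_WRITABLE,	/* at least one PTE must be writable */
	POLICY_IGNORE_WRITABLE,	/* writability is not a collapse condition */
};

/*
 * Model of the writable check only: pte_writable[i] says whether the i-th
 * PTE of the candidate range is writable. Everything else the real scan
 * looks at (presence, references, uffd-wp, ...) is deliberately ignored.
 */
static enum model_scan_result
model_writable_check(const bool *pte_writable, size_t nr_ptes,
		     enum writable_policy policy)
{
	size_t nr_writable = 0;

	assert(pte_writable != NULL || nr_ptes == 0);

	for (size_t i = 0; i < nr_ptes; i++)
		if (pte_writable[i])
			nr_writable++;

	switch (policy) {
	case POLICY_ALL_WRITABLE:
		return nr_writable == nr_ptes ? MODEL_SCAN_SUCCEED
					      : MODEL_SCAN_PAGE_RO;
	case POLICY_ANY_WRITABLE:
		return nr_writable > 0 ? MODEL_SCAN_SUCCEED
				       : MODEL_SCAN_PAGE_RO;
	case POLICY_IGNORE_WRITABLE:
		return MODEL_SCAN_SUCCEED;
	}
	return MODEL_SCAN_PAGE_RO;
}
```

In this model a fully read-only range is rejected under the first two policies and accepted under the third, which is the only behavioural difference the description above argues for.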
>> I can see performance improvements on mmtests run on an arm64 machine
>> compared with 6.17-rc2. (I) denotes statistically significant improvement,
>> (R) denotes statistically significant regression (Please ignore the
>> numbers in the middle column):
>
> I once dug into that myself as well as part of
>
> commit 1bafe96e89f056cb6e25d47451fb16aee2c7c4d0
> Author: David Hildenbrand <david@...hat.com>
> Date:   Wed Apr 24 14:26:30 2024 +0200
>
>     mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()
>
> where I noted:
>
>     Interestingly, khugepaged will only collapse an anonymous THP if at least
>     one PTE is writable.  After fork(), that means that something (usually a
>     page fault) populated at least a single exclusive anonymous THP in that
>     PMD range.
>     The problem I was concerned with (also documented in that patch) should no
> longer apply ever since we changed how folio_maybe_mapped_shared() operates.
>
> So yes, I don't see a good reason to fail on R/O PTEs
>
>>
>> +------------------------+-------------------------------+---------------------+-------------+
>> | mmtests/hackbench      | process-pipes-1 (seconds)     |               0.145 |      -0.06% |
>> |                        | process-pipes-4 (seconds)     |              0.4335 |      -0.27% |
>> |                        | process-pipes-7 (seconds)     |               0.823 | (I) -12.13% |
>> |                        | process-pipes-12 (seconds)    |  1.3538333333333334 |  (I) -5.32% |
>> |                        | process-pipes-21 (seconds)    |  1.8971666666666664 |  (I) -2.87% |
>> |                        | process-pipes-30 (seconds)    |  2.5023333333333335 |  (I) -3.39% |
>> |                        | process-pipes-48 (seconds)    |              3.4305 |  (I) -5.65% |
>> |                        | process-pipes-79 (seconds)    |   4.245833333333334 |  (I) -6.74% |
>> |                        | process-pipes-110 (seconds)   |   5.114833333333333 |  (I) -6.26% |
>> |                        | process-pipes-141 (seconds)   |              6.1885 |  (I) -4.99% |
>> |                        | process-pipes-172 (seconds)   |   7.231833333333334 |  (I) -4.45% |
>> |                        | process-pipes-203 (seconds)   |   8.393166666666668 |  (I) -3.65% |
>> |                        | process-pipes-234 (seconds)   |   9.487499999999999 |  (I) -3.45% |
>> |                        | process-pipes-256 (seconds)   |  10.316166666666666 |  (I) -3.47% |
>> |                        | process-sockets-1 (seconds)   |               0.289 |       2.13% |
>> |                        | process-sockets-4 (seconds)   |  0.7596666666666666 |       1.02% |
>> |                        | process-sockets-7 (seconds)   |  1.1663333333333334 |      -0.26% |
>> |                        | process-sockets-12 (seconds)  |  1.8641666666666665 |      -1.24% |
>> |                        | process-sockets-21 (seconds)  |  3.0773333333333333 |       0.01% |
>> |                        | process-sockets-30 (seconds)  |              4.2405 |      -0.15% |
>> |                        | process-sockets-48 (seconds)  |   6.459666666666666 |       0.15% |
>> |                        | process-sockets-79 (seconds)  |  10.156833333333333 |       1.45% |
>> |                        | process-sockets-110 (seconds) |  14.317833333333333 |      -1.64% |
>> |                        | process-sockets-141 (seconds) |             20.8735 |  (I) -4.27% |
>> |                        | process-sockets-172 (seconds) |  26.205333333333332 |       0.30% |
>> |                        | process-sockets-203 (seconds) |  31.298000000000002 |      -1.71% |
>> |                        | process-sockets-234 (seconds) |  36.104000000000006 |      -1.94% |
>> |                        | process-sockets-256 (seconds) |   39.44016666666667 |      -0.71% |
>> |                        | thread-pipes-1 (seconds)      | 0.17550000000000002 |       0.66% |
>> |                        | thread-pipes-4 (seconds)      | 0.44716666666666666 |       1.66% |
>> |                        | thread-pipes-7 (seconds)      |              0.7345 |      -0.17% |
>> |                        | thread-pipes-12 (seconds)     |   1.405833333333333 |  (I) -4.12% |
>> |                        | thread-pipes-21 (seconds)     |  2.0113333333333334 |  (I) -2.13% |
>> |                        | thread-pipes-30 (seconds)     |  2.6648333333333336 |  (I) -3.78% |
>> |                        | thread-pipes-48 (seconds)     |  3.6341666666666668 |  (I) -5.77% |
>> |                        | thread-pipes-79 (seconds)     |              4.4085 |  (I) -5.31% |
>> |                        | thread-pipes-110 (seconds)    |   5.374666666666666 |  (I) -6.12% |
>> |                        | thread-pipes-141 (seconds)    |   6.385666666666666 |  (I) -4.00% |
>> |                        | thread-pipes-172 (seconds)    |   7.403000000000001 |  (I) -3.01% |
>> |                        | thread-pipes-203 (seconds)    |   8.570333333333332 |  (I) -2.62% |
>> |                        | thread-pipes-234 (seconds)    |   9.719166666666666 |  (I) -2.00% |
>> |                        | thread-pipes-256 (seconds)    |  10.552833333333334 |  (I) -2.30% |
>> |                        | thread-sockets-1 (seconds)    |              0.3065 |   (R) 2.39% |
>> +------------------------+-------------------------------+---------------------+-------------+
>>
>>
>> +------------------------+-------------------------------+---------------------+-------------+
>> | mmtests/sysbench-mutex | sysbenchmutex-1 (usec)        |  194.38333333333333 |      -0.02% |
>> |                        | sysbenchmutex-4 (usec)        |             200.875 |      -0.02% |
>> |                        | sysbenchmutex-7 (usec)        |  201.23000000000002 |       0.00% |
>> |                        | sysbenchmutex-12 (usec)       |  201.77666666666664 |       0.12% |
>> |                        | sysbenchmutex-21 (usec)       |              203.03 |      -0.40% |
>> |                        | sysbenchmutex-30 (usec)       |             203.285 |       0.08% |
>> |                        | sysbenchmutex-48 (usec)       |  231.30000000000004 |       2.59% |
>> |                        | sysbenchmutex-79 (usec)       |             362.075 |      -0.80% |
>> |                        | sysbenchmutex-110 (usec)      |   516.8233333333334 |      -3.87% |
>> |                        | sysbenchmutex-128 (usec)      |   593.3533333333334 |  (I) -4.46% |
>> +------------------------+-------------------------------+---------------------+-------------+
>>
>>
>> No regressions were observed with mm-selftests.
>>
>> [1] https://lore.kernel.org/all/679861e2e81b32a0ae08.1264054854@v2.random/
>> [2] https://lore.kernel.org/all/1421999256-3881-1-git-send-email-ebru.akagunduz@gmail.com/
>> [3] https://lore.kernel.org/all/20150123113701.GB5975@node.dhcp.inet.fi/
>> [4] https://lore.kernel.org/all/20150123155802.GA7011@node.dhcp.inet.fi/
>>
>> Signed-off-by: Dev Jain <dev.jain@....com>
>> ---
>> Based on mm-new.
>>
>> Not very sure of the tracing parts which this patch changes. I have kept
>> the writable portion for the tracing to maintain backward compat, just
>> dropped it as a collapse condition.
>>
>>   include/trace/events/huge_memory.h |  2 +-
>>   mm/khugepaged.c                    | 11 +++--------
>>   2 files changed, 4 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>> index 2305df6cb485..f2472c1c132a 100644
>> --- a/include/trace/events/huge_memory.h
>> +++ b/include/trace/events/huge_memory.h
>> @@ -19,7 +19,7 @@
>>       EM( SCAN_PTE_NON_PRESENT,    "pte_non_present")        \
>>       EM( SCAN_PTE_UFFD_WP,        "pte_uffd_wp")            \
>>       EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage")        \
>> -    EM( SCAN_PAGE_RO,        "no_writable_page")        \
>> +    EM( SCAN_PAGE_RO,        "no_writable_page") /* deprecated */    \
>>       EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page")        \
>>       EM( SCAN_PAGE_NULL,        "page_null")            \
>>       EM( SCAN_SCAN_ABORT,        "scan_aborted")            \
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 4ec324a4c1fe..5ef8482597a9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -39,7 +39,7 @@ enum scan_result {
>>       SCAN_PTE_NON_PRESENT,
>>       SCAN_PTE_UFFD_WP,
>>       SCAN_PTE_MAPPED_HUGEPAGE,
>> -    SCAN_PAGE_RO,
>> +    SCAN_PAGE_RO,    /* deprecated */
>
> Why can't we remove that completely?
>
>
