linux-kernel - Re: [PATCH v5 0/7] Optimize mprotect() for large folios

Message-ID: <7d21fff7-bf2b-4362-b2cf-0cd92fe0cf7c@lucifer.local>
Date: Fri, 18 Jul 2025 19:53:11 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Dev Jain <dev.jain@....com>
Cc: akpm@...ux-foundation.org, ryan.roberts@....com, david@...hat.com,
        willy@...radead.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        catalin.marinas@....com, will@...nel.org, Liam.Howlett@...cle.com,
        vbabka@...e.cz, jannh@...gle.com, anshuman.khandual@....com,
        peterx@...hat.com, joey.gouly@....com, ioworker0@...il.com,
        baohua@...nel.org, kevin.brodsky@....com, quic_zhenhuah@...cinc.com,
        christophe.leroy@...roup.eu, yangyicong@...ilicon.com,
        linux-arm-kernel@...ts.infradead.org, hughd@...gle.com,
        yang@...amperecomputing.com, ziy@...dia.com
Subject: Re: [PATCH v5 0/7] Optimize mprotect() for large folios

On Fri, Jul 18, 2025 at 03:20:16PM +0530, Dev Jain wrote:
>
> On 18/07/25 2:32 pm, Dev Jain wrote:
> > Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
> > are painted with the contig bit, then ptep_get() will iterate through all
> > 16 entries to collect a/d bits. Hence this optimization will result in
> > a 16x reduction in the number of ptep_get() calls. Next,
> > ptep_modify_prot_start() will eventually call contpte_try_unfold() on
> > every contig block, thus flushing the TLB for the complete large folio
> > range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on
> > each contig block, and only do them on the starting and ending
> > contig block.
> >
> > For split folios, there will be no pte batching; the batch size returned
> > by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
> > still point to the same large folio; for arm64, this results in the
> > optimization described above, and for other arches, a minor improvement
> > is expected due to a reduction in the number of function calls.
> >
> > mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
> > no new tests fail as a result of this patchset.
> >
> > We use the following test cases to measure performance, mprotect()'ing
> > the mapped memory to read-only then read-write 40 times:
> >
> > Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> > pte-mapping those THPs
> > Test case 2: Mapping 1G of memory with 64K mTHPs
> > Test case 3: Mapping 1G of memory with 4K pages
> >
> > Average execution time on arm64, Apple M3:
> > Before the patchset:
> > T1: 2.1 seconds   T2: 2 seconds   T3: 1 second
> >
> > After the patchset:
> > T1: 0.65 seconds   T2: 0.7 seconds   T3: 1.1 seconds
> >
>
> For the record: the numbers differ from those in the previous versions.
> I must have run the test for more iterations back then and pasted the
> test program with 40 iterations here, hence the mismatch.
>

Thanks for this clarification!
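
For readers without the original test program at hand, the measurement
described in the quoted cover letter (mprotect()'ing a mapping to read-only
and back to read-write a fixed number of times) could be sketched roughly as
below. This is not the program from the patch series: the helper name
time_mprotect_toggles() and its parameters are assumptions made for
illustration. Whether the mapping ends up backed by 4K pages, mTHPs or
PMD-THPs depends on the system's THP configuration, so the extra setup for
test cases 1 and 2 is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Hypothetical helper (not taken from the patch series): map `len` bytes
 * of anonymous memory, fault it in, then flip it to read-only and back to
 * read-write `iters` times. Returns the elapsed time of the mprotect()
 * loop in seconds, or -1.0 on failure.
 */
double time_mprotect_toggles(size_t len, int iters)
{
    struct timespec start, end;
    double elapsed = -1.0;

    assert(len > 0 && iters > 0);

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1.0;

    /* Touch the whole range so page tables are populated before timing. */
    memset(buf, 1, len);

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        goto out;

    for (int i = 0; i < iters; i++) {
        if (mprotect(buf, len, PROT_READ) != 0)
            goto out;
        if (mprotect(buf, len, PROT_READ | PROT_WRITE) != 0)
            goto out;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        goto out;

    elapsed = (double)(end.tv_sec - start.tv_sec) +
              (double)(end.tv_nsec - start.tv_nsec) / 1e9;
out:
    munmap(buf, len);
    return elapsed;
}
```

Calling time_mprotect_toggles((size_t)1 << 30, 40) from a main() would
approximate the 1G, 40-iteration setup; absolute timings will of course
differ by machine and kernel.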