linux-kernel - Re: [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <40112a27-eddb-4c1a-a859-a34e202e6564@arm.com>
Date: Tue, 23 Jan 2024 12:28:25 +0000
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>, linux-kernel@...r.kernel.org
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
 Matthew Wilcox <willy@...radead.org>, Russell King <linux@...linux.org.uk>,
 Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
 Dinh Nguyen <dinguyen@...nel.org>, Michael Ellerman <mpe@...erman.id.au>,
 Nicholas Piggin <npiggin@...il.com>,
 Christophe Leroy <christophe.leroy@...roup.eu>,
 "Aneesh Kumar K.V" <aneesh.kumar@...nel.org>,
 "Naveen N. Rao" <naveen.n.rao@...ux.ibm.com>,
 Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt
 <palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>,
 Alexander Gordeev <agordeev@...ux.ibm.com>,
 Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
 Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>,
 Sven Schnelle <svens@...ux.ibm.com>, "David S. Miller"
 <davem@...emloft.net>, linux-arm-kernel@...ts.infradead.org,
 linuxppc-dev@...ts.ozlabs.org, linux-riscv@...ts.infradead.org,
 linux-s390@...r.kernel.org, sparclinux@...r.kernel.org
Subject: Re: [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP

On 23/01/2024 12:19, David Hildenbrand wrote:
> [...]
> 
>>
>> I wrote some documentation for this (based on Matthew's docs for set_ptes() in
>> my version. Perhaps it makes sense to add it here, given this is overridable by
>> the arch.
>>
>> /**
>>   * wrprotect_ptes - Write protect a consecutive set of pages.
>>   * @mm: Address space that the pages are mapped into.
>>   * @addr: Address of first page to write protect.
>>   * @ptep: Page table pointer for the first entry.
>>   * @nr: Number of pages to write protect.
>>   *
>>   * May be overridden by the architecture, else implemented as a loop over
>>   * ptep_set_wrprotect().
>>   *
>>   * Context: The caller holds the page table lock. The PTEs are all in the same
>>   * PMD.
>>   */
>>
> 
> I could have sworn I had a documentation at some point. Let me add some, thanks.
> 
> [...]
> 
>>> +
>>> +    /*
>>> +     * If we likely have to copy, just don't bother with batching. Make
>>> +     * sure that the common "small folio" case stays as fast as possible
>>> +     * by keeping the batching logic separate.
>>> +     */
>>> +    if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
>>> +        nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
>>> +        if (folio_test_anon(folio)) {
>>> +            folio_ref_add(folio, nr);
>>> +            if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
>>> +                                  nr, src_vma))) {
>>
>> What happens if its not the first page of the batch that fails here? Aren't you
>> signalling that you need a prealloc'ed page for the wrong pte? Shouldn't you
>> still batch copy all the way up to the failing page first? Perhaps it all comes
>> out in the wash and these events are so infrequent that we don't care about the
>> lost batching opportunity?
> 
> I assume you mean the weird corner case that some folio pages in the range have
> PAE set, others don't -- and the folio maybe pinned.
> 
> In that case, we fallback to individual pages, and might have preallocated a
> page although we wouldn't have to preallocate one for processing the next page
> (that doesn't have PAE set).
> 
> It should all work, although not optimized to the extreme, and as it's a corner
> case, we don't particularly care. Hopefully, in the future we'll only have a
> single PAE flag per folio.
> 
> Or am I missing something?

No, your explanation makes sense. Just wanted to check this all definitely
worked, because the flow is slightly different to my previous version that was
doing try_dup_rmap page-by-page.

> 
>>
>>> +                folio_ref_sub(folio, nr);
>>> +                return -EAGAIN;
>>> +            }
>>> +            rss[MM_ANONPAGES] += nr;
>>> +            VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
>>> +        } else {
>>> +            folio_ref_add(folio, nr);
>>
>> Perhaps hoist this out to immediately after folio_pte_batch() since you're
>> calling it on both branches?
> 
> Makes sense, thanks.
>