linux-kernel - Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0b48135a-a44b-4b5a-a33b-abd3a3b47ff8@arm.com>
Date:   Tue, 5 Dec 2023 14:16:53 +0000
From:   Ryan Roberts <ryan.roberts@....com>
To:     David Hildenbrand <david@...hat.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Ard Biesheuvel <ardb@...nel.org>,
        Marc Zyngier <maz@...nel.org>,
        Oliver Upton <oliver.upton@...ux.dev>,
        James Morse <james.morse@....com>,
        Suzuki K Poulose <suzuki.poulose@....com>,
        Zenghui Yu <yuzenghui@...wei.com>,
        Andrey Ryabinin <ryabinin.a.a@...il.com>,
        Alexander Potapenko <glider@...gle.com>,
        Andrey Konovalov <andreyknvl@...il.com>,
        Dmitry Vyukov <dvyukov@...gle.com>,
        Vincenzo Frascino <vincenzo.frascino@....com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Matthew Wilcox <willy@...radead.org>,
        Yu Zhao <yuzhao@...gle.com>,
        Mark Rutland <mark.rutland@....com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        John Hubbard <jhubbard@...dia.com>, Zi Yan <ziy@...dia.com>,
        Barry Song <21cnbao@...il.com>,
        Alistair Popple <apopple@...dia.com>,
        Yang Shi <shy828301@...il.com>
Cc:     linux-arm-kernel@...ts.infradead.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()

On 05/12/2023 12:04, David Hildenbrand wrote:
> On 05.12.23 12:30, Ryan Roberts wrote:
>> On 04/12/2023 17:27, David Hildenbrand wrote:
>>>>
>>>> With rmap batching from [1] -- rebased+changed on top of that -- we could turn
>>>> that into an effective (untested):
>>>>
>>>>            if (page && folio_test_anon(folio)) {
>>>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>> end,
>>>> +                                               pte, enforce_uffd_wp,
>>>> &nr_dirty,
>>>> +                                               &nr_writable);
>>>>                    /*
>>>>                     * If this page may have been pinned by the parent process,
>>>>                     * copy the page immediately for the child so that we'll
>>>> always
>>>>                     * guarantee the pinned page won't be randomly replaced
>>>> in the
>>>>                     * future.
>>>>                     */
>>>> -               folio_get(folio);
>>>> -               if (unlikely(folio_try_dup_anon_rmap_pte(folio, page,
>>>> src_vma))) {
>>>> +               folio_ref_add(folio, nr);
>>>> +               if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, nr,
>>>> src_vma))) {
>>>>                            /* Page may be pinned, we have to copy. */
>>>> -                       folio_put(folio);
>>>> -                       return copy_present_page(dst_vma, src_vma, dst_pte,
>>>> src_pte,
>>>> -                                                addr, rss, prealloc, page);
>>>> +                       folio_ref_sub(folio, nr);
>>>> +                       ret = copy_present_page(dst_vma, src_vma, dst_pte,
>>>> +                                               src_pte, addr, rss, prealloc,
>>>> +                                               page);
>>>> +                       return ret == 0 ? 1 : ret;
>>>>                    }
>>>> -               rss[MM_ANONPAGES]++;
>>>> +               rss[MM_ANONPAGES] += nr;
>>>>            } else if (page) {
>>>> -               folio_get(folio);
>>>> -               folio_dup_file_rmap_pte(folio, page);
>>>> -               rss[mm_counter_file(page)]++;
>>>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>> end,
>>>> +                                               pte, enforce_uffd_wp,
>>>> &nr_dirty,
>>>> +                                               &nr_writable);
>>>> +               folio_ref_add(folio, nr);
>>>> +               folio_dup_file_rmap_ptes(folio, page, nr);
>>>> +               rss[mm_counter_file(page)] += nr;
>>>>            }
>>>>
>>>>
>>>> We'll have to test performance, but it could be that we want to specialize
>>>> more on !folio_test_large(). That code is very performance-sensitive.
>>>>
>>>>
>>>> [1] https://lkml.kernel.org/r/20231204142146.91437-1-david@redhat.com
>>>
>>> So, on top of [1] without rmap batching but with a slightly modified version of
>>
>> Can you clarify what you mean by "without rmap batching"? I thought [1]
>> implicitly adds rmap batching? (e.g. folio_dup_file_rmap_ptes(), which you've
>> added in the code snippet above).
> 
> Not calling the batched variants but essentially doing what your code does (with
> some minor improvements, like updating the rss counters only once).
> 
> The snipped above is only linked below. I had the performance numbers for [1]
> ready, so I gave it a test on top of that.
> 
> To keep it simple, you might just benchmark w and w/o your patches.
> 
>>
>>> yours (that keeps the existing code structure as pointed out and e.g., updates
>>> counter updates), running my fork() microbenchmark with a 1 GiB of memory:
>>>
>>> Compared to [1], with all order-0 pages it gets 13--14% _slower_ and with all
>>> PTE-mapped THP (order-9) it gets ~29--30% _faster_.
>>
>> What test are you running - I'd like to reproduce if possible, since it sounds
>> like I've got some work to do to remove the order-0 regression.
> 
> Essentially just allocating 1 GiB of memory an measuring how long it takes to
> call fork().
> 
> order-0 benchmarks:
> 
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/order-0-benchmarks.c?ref_type=heads
> 
> e.g.,: $ ./order-0-benchmarks fork 100
> 
> 
> pte-mapped-thp benchmarks:
> 
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-thp-benchmarks.c?ref_type=heads
> 
> e.g.,: $ ./pte-mapped-thp-benchmarks fork 100
> 
> 
> Ideally, pin to one CPU and get stable performance numbers by disabling
> SMT+turbo etc.

This is great - thanks! I'll get to work...

> 
>>
>>>
>>> So looks like we really want to have a completely seprate code path for
>>> "!folio_test_large()" to keep that case as fast as possible. And "Likely" we
>>> want to use "likely(!folio_test_large()". ;)
>>
>> Yuk, but fair enough. If I can repro the perf numbers, I'll have a go a
>> reworking this.
>>
>> I think you're also implicitly suggesting that this change needs to depend on
>> [1]? Which is a shame...
> 
> Not necessarily. It certainly cleans up the code, but we can do that in any
> order reasonable.
> 
>>
>> I guess I should also go through a similar exercise for patch 2 in this series.
> 
> 
> Yes. There are "unmap" and "pte-dontneed" benchmarks contained in both files above.
>