linux-kernel - Re: [PATCH RFC] mm: entirely reuse the whole anon mTHP in do_wp

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <a0e654ac-dd69-4186-b1bd-3470a32e06d9@redhat.com>
Date: Sat, 31 Aug 2024 13:02:51 +0200
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>,
 Chuanhua Han <hanchuanhua@...o.com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>,
 Ryan Roberts <ryan.roberts@....com>, Zi Yan <ziy@...dia.com>,
 Chris Li <chrisl@...nel.org>, Kairui Song <kasong@...cent.com>,
 Kalesh Singh <kaleshsingh@...gle.com>, Suren Baghdasaryan <surenb@...gle.com>
Subject: Re: [PATCH RFC] mm: entirely reuse the whole anon mTHP in do_wp_page

On 31.08.24 12:49, Barry Song wrote:
> On Sat, Aug 31, 2024 at 10:29 PM David Hildenbrand <david@...hat.com> wrote:
>>
>> On 31.08.24 12:21, Barry Song wrote:
>>> On Sat, Aug 31, 2024 at 10:07 PM David Hildenbrand <david@...hat.com> wrote:
>>>>
>>>> On 31.08.24 11:55, Barry Song wrote:
>>>>> On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@...hat.com> wrote:
>>>>>>
>>>>>> On 31.08.24 11:23, Barry Song wrote:
>>>>>>> From: Barry Song <v-songbaohua@...o.com>
>>>>>>>
>>>>>>> On a physical phone, it's sometimes observed that deferred_split
>>>>>>> mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
>>>>>>> indicates that the majority of these originate from the typical fork
>>>>>>> scenario.
>>>>>>> When the child process either execs or exits, the parent process should
>>>>>>> ideally be able to reuse the entire mTHP. However, the current kernel
>>>>>>> lacks this capability and instead places the mTHP into split_deferred,
>>>>>>> performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
>>>>>>>
>>>>>>>      main()
>>>>>>>      {
>>>>>>>      #define SIZE 1024 * 1024UL
>>>>>>>              void *p = malloc(SIZE);
>>>>>>>              memset(p, 0x11, SIZE);
>>>>>>>              if (fork() == 0)
>>>>>>>                      exec(....);
>>>>>>>             /*
>>>>>>>           * this will trigger cow one subpage from
>>>>>>>           * mTHP and put mTHP into split_deferred
>>>>>>>           * list
>>>>>>>           */
>>>>>>>          *(int *)(p + 10) = 10;
>>>>>>>          printf("done\n");
>>>>>>>          while(1);
>>>>>>>      }
>>>>>>>
>>>>>>> This leads to two significant issues:
>>>>>>>
>>>>>>> * Memory Waste: Before the mTHP is fully split by the shrinker,
>>>>>>> it wastes memory. In extreme cases, such as with a 64KB mTHP,
>>>>>>> the memory usage could be 64KB + 60KB until the last subpage
>>>>>>> is written, at which point the mTHP is freed.
>>>>>>>
>>>>>>> * Fragmentation and Performance Loss: It destroys large folios
>>>>>>> (negating the performance benefits of CONT-PTE) and fragments memory.
>>>>>>>
>>>>>>> To address this, we should aim to reuse the entire mTHP in such cases.
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
>>>>>>> entirely_reuse argument because I’m not sure if there are still cases
>>>>>>> where we reuse a subpage within an mTHP. For now, I’m setting
>>>>>>> entirely_reuse to true only for the newly supported case, while all
>>>>>>> other cases still get false. Please let me know if this is incorrect—if
>>>>>>> we don’t reuse subpages at all, we could remove the argument.
>>>>>>
>>>>>> See [1] I sent out this week, that is able to reuse even without
>>>>>> scanning page tables. If we find the the folio is exclusive we could try
>>>>>> processing surrounding PTEs that map the same folio.
>>>>>>
>>>>>> [1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
>>>>>
>>>>> Great! It looks like I missed your patch again. Since you've implemented this
>>>>> in a better way, I’d prefer to use your patchset.
>>>>
>>>> I wouldn't say better, just more universally. And while taking care of
>>>> properly sync'ing the mapcount vs. refcount :P
>>>>
>>>>>
>>>>> I’m curious about how you're handling ptep_set_access_flags_nr() or similar
>>>>> things because I couldn’t find the related code in your patch 10/17:
>>>>>
>>>>> [PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID
>>>>>
>>>>> Am I missing something?
>>>>
>>>> The idea is to keep individual write faults as fast as possible. So the
>>>> patch set keeps it simple and only reuses a single PTE at a time,
>>>> setting that one PAE and mapping it writable.
>>>
>>> I got your point, thanks! as anyway the mTHP has been exclusive,
>>> so the following nr-1 minor page faults will set their particular PTE
>>> to writable one by one.
>>
>> Yes, assuming you would get these page faults, and assuming you would
>> get them in the near future.
>>
>>>
>>>>
>>>> As the patch states, it might be reasonable to optimize some cases,
>>>> maybe also only on some architectures. For example to fault-around and
>>>> map the other ones writable as well. It might not always be desirable
>>>> though, especially not for larger folios.
>>>
>>> as anyway, the mTHP has been entirely exclusive, setting all PTEs
>>> directly to writable should help reduce nr - 1 minor page faults and
>>> ideally help reduce CONTPTE unfold and fold?
>>
>> Yes, doing that on CONTPTE granularity would very likely make sense. For
>> anything bigger than that, I am not sure.
>>
>> Assuming we have a 1M folio mapped by PTEs. Trying to fault-around in
>> aligned CONTPTE granularity likely makes sense. Bigger than that, I am
>> not convinced.
>>
> 
> I see. maybe we can have something like:
> 
> static bool pte_fault_around_estimate(int nr)
> {
>         if (nr / arch_batched_ptes_nr() < 16)
>               return true;
> 
>         return false;
> }
> 
> if (pte_fault_around_estimate(folio_nr_pages(folio)))
>         set all ptes;
> 
> for arm64, arch_batched_ptes_nr()  == 16. for
> arch without cont-pte or similar things,
> arch_batched_ptes_nr()  == 1.

Yes, something like that would be my take.

After we know that we can reuse the large folio, we'll try scanning 
starting from the aligned PTE. If we find that we can batch, we'll batch 
that part. Otherwise we'll simply fallback to a single one.

Handling batching across VMAs is a bit harder. We might be able to 
batch, or might not ... We could have the CONT_PTE bit set across VMAs, 
but might not necessarily be able to batch (e.g., some VMAs are read-only).

> 
> Just some rough ideas; all the naming might be quite messy.
> 
> at least, we won't lose the benefit of reduced TLB miss
> before all nr_pages are written for aarch64 :-)
> 
>>>
>>> What is the downside to doing that? I also don't think mapping them
>>> all together will waste memory?
>>
>> No, it's all about increasing the latency of individual write faults.
>>
> 
> i see, i assume it won't be worse than the current case where we have to
> allocate small folios and copy? and folio allocation can even further incur
> direct reclamation?

Yes, it would certainly better than what we currently have. Almost 
everything would likely be better than what we currently have. :)

-- 
Cheers,

David / dhildenb