[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d7c54bbf-ed99-43c7-99a7-a9be70071b4a@redhat.com>
Date: Mon, 16 Jun 2025 18:13:04 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Dev Jain <dev.jain@....com>, akpm@...ux-foundation.org,
Liam.Howlett@...cle.com, vbabka@...e.cz, jannh@...gle.com, pfalcato@...e.de,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, peterx@...hat.com,
ryan.roberts@....com, mingo@...nel.org, libang.li@...group.com,
maobibo@...ngson.cn, zhengqi.arch@...edance.com, baohua@...nel.org,
anshuman.khandual@....com, willy@...radead.org, ioworker0@...il.com,
yang@...amperecomputing.com, baolin.wang@...ux.alibaba.com, ziy@...dia.com,
hughd@...gle.com
Subject: Re: [PATCH v4 2/2] mm: Optimize mremap() by PTE batching
[...]
>>> }
>>> - set_pte_at(mm, new_addr, new_ptep, pte);
>>> + set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>
>>
>> What I dislike is that some paths work on a single PTE, and we implicitly have to know
>> that they don't apply for !pte_present.
>
> I hate any kind of implicit knowledge like this.
>
>>
>> Like
>> if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>
> I also despise [with words I cannot use on-list] how uffd is implemented.
>
> It's _nothing but_ ad-hoc stuff like this spawned all around the place.
>
> It's hateful.
>
>>
>> Will not get batched yet. And that is hidden inside the pte_marker_uffd_wp check ...
>>
>> Should we properly separate both paths (present vs. !present), and while at it, do
>> some more cleanups? I'm thinking of the following on top (only compile-tested)
>
> I'd like to see that, but I think maybe better as a follow up series?
I'd do the split as a cleanup patch upfront.
Then add the batching for the pte_present() case.
>
> On the other hand, this does improve this quite a bit. Could also be another
> patch in the series.
>
>>
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 18b215521adae..b88abf02b34e0 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -155,21 +155,6 @@ static void drop_rmap_locks(struct vm_area_struct *vma)
>> i_mmap_unlock_write(vma->vm_file->f_mapping);
>> }
>> -static pte_t move_soft_dirty_pte(pte_t pte)
>> -{
>> - /*
>> - * Set soft dirty bit so we can notice
>> - * in userspace the ptes were moved.
>> - */
>> -#ifdef CONFIG_MEM_SOFT_DIRTY
>> - if (pte_present(pte))
>> - pte = pte_mksoft_dirty(pte);
>> - else if (is_swap_pte(pte))
>> - pte = pte_swp_mksoft_dirty(pte);
>> -#endif
>> - return pte;
>> -}
>> -
>> static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> pte_t *ptep, pte_t pte, int max_nr)
>> {
>> @@ -260,7 +245,6 @@ static int move_ptes(struct pagetable_move_control *pmc,
>> VM_WARN_ON_ONCE(!pte_none(*new_ptep));
>> nr_ptes = 1;
>> - max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> old_pte = ptep_get(old_ptep);
>> if (pte_none(old_pte))
>> continue;
>> @@ -277,24 +261,34 @@ static int move_ptes(struct pagetable_move_control *pmc,
>> * flushed.
>> */
>> if (pte_present(old_pte)) {
>> + max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> old_pte, max_nr_ptes);
>> force_flush = true;
>> - }
>> - pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>> - pte = move_pte(pte, old_addr, new_addr);
>> - pte = move_soft_dirty_pte(pte);
>> -
>> - if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> - pte_clear(mm, new_addr, new_ptep);
>> - else {
>> - if (need_clear_uffd_wp) {
>> - if (pte_present(pte))
>> - pte = pte_clear_uffd_wp(pte);
>> - else if (is_swap_pte(pte))
>> +
>> + pte = get_and_clear_full_ptes(mm, old_addr, old_ptep,
>> + nr_ptes, 0);
>> + /*
>> + * Moving present PTEs requires special care on some
>> + * archs.
>> + */
>> + pte = move_pte(pte, old_addr, new_addr);
>
> I guess we're good with only doing this in pte_present() case because the only
> arch that implements this, sparc, does a present check anyway.
Yes, we could then remove the call from pte_present(), see below.
>
>> + /* make userspace aware that this pte moved. */
>> + pte = pte_mksoft_dirty(pte);
>> + if (need_clear_uffd_wp)
>> + pte = pte_clear_uffd_wp(pte);
>> + set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> + } else if (need_clear_uffd_wp && pte_marker_uffd_wp(pte)) {
>> + pte_clear(mm, old_addr, old_ptep);
>
> Same comment as below re: pte_clear().
>
> I see you've dropped pte_clear(mm, new_addr, new_ptep) which I guess is
> purposefully?
>
> I do think that it is pointless yes.
Yeah, that's what I raised below.
>
>> + } else {
>> + pte_clear(mm, old_addr, old_ptep);
>
> I guess this is intended to replace ptep_get_and_clear_full_ptes() above in the
> single PTE case... no? Is this sufficient?
>
> In the original code we'd always do ptep_get_and_clear().
>
> I think the key difference is page_table_check_pte_clear().
>
> I notice, hilariously, that there is a ptep_clear() that _does_ call this. What
> a mess.
ptep_get_and_clear() is only relevant when something is present and
could change concurrently (especially A/D bits managed by HW).
We already obtained the pte, it's not present, and now just want to
clear it.
>
>
>> + if (is_swap_pte(pte)) {
>> + if (need_clear_uffd_wp)
>> pte = pte_swp_clear_uffd_wp(pte);
>> + /* make userspace aware that this pte moved. */
>> + pte = pte_swp_mksoft_dirty(pte);
>> }
>> - set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>> + set_pte_at(mm, new_addr, new_ptep, pte);
>> }
>> }
>>
>>
>> Note that I don't know why we had the existing
>>
>> - if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
>> - pte_clear(mm, new_addr, new_ptep);
>>
>>
>> I thought we would always expect that the destination pte is already pte_none() ?
>
> I think this is because we already did the move_pte() call in the original code
> before checking this:
Oh, maybe that's why.
>
> pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> pte = move_pte(pte, old_addr, new_addr);
> pte = move_soft_dirty_pte(pte);
>
> if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
> pte_clear(mm, new_addr, new_ptep);
>
> But maybe it's because there was a presumption move_pte() would like you know,
> move a PTE entry? Which it doesn't, it - only on SPARC - does a hook to flush
> the dcache.
I wish we could remove move_pte() completely. Or just notify about the
move of a present pte ... because it doesn't ever modify the pte val.
So renaming it to "arch_notify_move_pte()" or sth. like that that is a
void function might even be better.
... I think we could go one step further if we already have the folio:
we could call it arch_notify_move_folio_pte(), and simplify the sparc
implementation ...
Anyhow, the move_pte() cleanup can be done separately. Splitting the
present from !present case here should probably be done as a cleanup
upfront.
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists