[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ea2b584d-ce66-dcb0-a667-bdf259bdf194@redhat.com>
Date: Thu, 3 Nov 2022 11:51:26 +0100
From: David Hildenbrand <david@...hat.com>
To: Nadav Amit <namit@...are.com>
Cc: kernel list <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>,
linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mel Gorman <mgorman@...hsingularity.net>,
Dave Chinner <david@...morbit.com>,
Peter Xu <peterx@...hat.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Hugh Dickins <hughd@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Michael Ellerman <mpe@...erman.id.au>,
Nicholas Piggin <npiggin@...il.com>,
Mike Rapoport <rppt@...nel.org>,
Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable()
to replace savedwrite
On 03.11.22 11:45, David Hildenbrand wrote:
> On 02.11.22 22:22, Nadav Amit wrote:
>> On Nov 2, 2022, at 12:12 PM, David Hildenbrand <david@...hat.com> wrote:
>>
>>> !! External Email
>>>
>>> commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
>>> NUMA hinting fault") added remembering write permissions using ordinary
>>> pte_write() for PROT_NONE mapped pages to avoid write faults when
>>> remapping the page !PROT_NONE on NUMA hinting faults.
>>>
>>
>> [ snip ]
>>
>> Here’s a very shallow reviewed with some minor points...
>
> Appreciated.
>
>>
>>> ---
>>> include/linux/mm.h | 2 ++
>>> mm/huge_memory.c | 28 +++++++++++++++++-----------
>>> mm/ksm.c | 9 ++++-----
>>> mm/memory.c | 19 ++++++++++++++++---
>>> mm/mprotect.c | 7 ++-----
>>> 5 files changed, 41 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 25ff9a14a777..a0deeece5e87 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
>>> MM_CP_UFFD_WP_RESOLVE)
>>>
>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> + pte_t pte);
>>
>> It might not be customary, but how about marking it as __pure?
>
> Right, there is no a single use of __pure in the mm domain.
>
>>
>>> extern unsigned long change_protection(struct mmu_gather *tlb,
>>> struct vm_area_struct *vma, unsigned long start,
>>> unsigned long end, pgprot_t newprot,
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 2ad68e91896a..45abd27d75a0 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>> unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>> int page_nid = NUMA_NO_NODE;
>>> int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>> - bool migrated = false;
>>> - bool was_writable = pmd_savedwrite(oldpmd);
>>> + bool try_change_writable, migrated = false;
>>> int flags = 0;
>>>
>>> vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>> @@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>> goto out;
>>> }
>>>
>>> + /* See mprotect_fixup(). */
>>> + if (vma->vm_flags & VM_SHARED)
>>> + try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>> + else
>>> + try_change_writable = !!(vma->vm_flags & VM_WRITE);
>>
>> Do you find it better to copy the code instead of extracting it to a
>> separate function?
>
> Yeah, you're right ;) usually the issue is coming up with a suitable name. Let me try.
>
> vma_wants_manual_writability_change() hmm ...
>
>>
>>> +
>>> pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>> page = vm_normal_page_pmd(vma, haddr, pmd);
>>> if (!page)
>>> goto out_map;
>>>
>>> /* See similar comment in do_numa_page for explanation */
>>> - if (!was_writable)
>>> + if (try_change_writable && !pmd_write(pmd) &&
>>> + can_change_pmd_writable(vma, vmf->address, pmd))
>>> + pmd = pmd_mkwrite(pmd);
>>> + if (!pmd_write(pmd))
>>> flags |= TNF_NO_GROUP;
>>>
>>> page_nid = page_to_nid(page);
>>> @@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>> /* Restore the PMD */
>>> pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>> pmd = pmd_mkyoung(pmd);
>>> - if (was_writable)
>>> +
>>> + /* Similar to mprotect() protection updates, avoid write faults. */
>>> + if (try_change_writable && !pmd_write(pmd) &&
>>> + can_change_pmd_writable(vma, vmf->address, pmd))
>>
>> Why do I have a deja-vu? :)
>>
>> There must be a way to avoid the redundant code and specifically the call to
>> can_change_pmd_writable(), no?
>
> The issue is that as soon as we drop the page table lock, that information is stale.
> Especially, after we fail migration.
>
> So the following should work, however, if we fail migration we wouldn't map the
> page writable and would have to re-calculate:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c5599a9279b1..a997625641e4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4674,10 +4674,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> struct vm_area_struct *vma = vmf->vma;
> struct page *page = NULL;
> int page_nid = NUMA_NO_NODE;
> + bool writable = false;
> int last_cpupid;
> int target_nid;
> pte_t pte, old_pte;
> - bool was_writable = pte_savedwrite(vmf->orig_pte);
> int flags = 0;
>
> /*
> @@ -4696,6 +4696,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> old_pte = ptep_get(vmf->pte);
> pte = pte_modify(old_pte, vma->vm_page_prot);
>
> + /*
> + * Detect now whether the PTE is or can be writable. Note that this
> + * information is valid as long as we're holding the PT lock, so also on
> + * the remap path below.
> + */
> + writable = pte_write(pte);
> + if (!writable && vma_wants_manual_writability_change(vma) &&
> + can_change_pte_writable(vma, vmf->address, pte);
> + writable = true;
> + }
> +
> page = vm_normal_page(vma, vmf->address, pte);
> if (!page || is_zone_device_page(page))
> goto out_map;
> @@ -4712,7 +4723,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> * pte_dirty has unpredictable behaviour between PTE scan updates,
> * background writeback, dirty balancing and application behaviour.
> */
> - if (!was_writable)
> + if (!writable)
> flags |= TNF_NO_GROUP;
>
> /*
> @@ -4738,6 +4749,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> put_page(page);
> goto out_map;
> }
> + writable = false;
> pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> /* Migrate to the requested node */
> @@ -4767,7 +4779,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> pte = pte_modify(old_pte, vma->vm_page_prot);
> pte = pte_mkyoung(pte);
> - if (was_writable)
> + if (writable)
> pte = pte_mkwrite(pte);
> ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
> update_mmu_cache(vma, vmf->address, vmf->pte);
>
>
> To me, the less error-prone approach is to re-calculate.
Hmm, thinking again, the "if (unlikely(!pte_same(*vmf->pte,
vmf->orig_pte))) {" check might actually not require us to recalculate.
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists