linux-kernel - Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ea2b584d-ce66-dcb0-a667-bdf259bdf194@redhat.com>
Date:   Thu, 3 Nov 2022 11:51:26 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Nadav Amit <namit@...are.com>
Cc:     kernel list <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        linuxppc-dev <linuxppc-dev@...ts.ozlabs.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Dave Chinner <david@...morbit.com>,
        Peter Xu <peterx@...hat.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Hugh Dickins <hughd@...gle.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Michael Ellerman <mpe@...erman.id.au>,
        Nicholas Piggin <npiggin@...il.com>,
        Mike Rapoport <rppt@...nel.org>,
        Anshuman Khandual <anshuman.khandual@....com>
Subject: Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable()
 to replace savedwrite

On 03.11.22 11:45, David Hildenbrand wrote:
> On 02.11.22 22:22, Nadav Amit wrote:
>> On Nov 2, 2022, at 12:12 PM, David Hildenbrand <david@...hat.com> wrote:
>>
>>> !! External Email
>>>
>>> commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
>>> NUMA hinting fault") added remembering write permissions using ordinary
>>> pte_write() for PROT_NONE mapped pages to avoid write faults when
>>> remapping the page !PROT_NONE on NUMA hinting faults.
>>>
>>
>> [ snip ]
>>
>> Here’s a very shallow reviewed with some minor points...
> 
> Appreciated.
> 
>>
>>> ---
>>> include/linux/mm.h |  2 ++
>>> mm/huge_memory.c   | 28 +++++++++++++++++-----------
>>> mm/ksm.c           |  9 ++++-----
>>> mm/memory.c        | 19 ++++++++++++++++---
>>> mm/mprotect.c      |  7 ++-----
>>> 5 files changed, 41 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 25ff9a14a777..a0deeece5e87 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>> #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>                                              MM_CP_UFFD_WP_RESOLVE)
>>>
>>> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>> +                            pte_t pte);
>>
>> It might not be customary, but how about marking it as __pure?
> 
> Right, there is no a single use of __pure in the mm domain.
> 
>>
>>> extern unsigned long change_protection(struct mmu_gather *tlb,
>>>                                struct vm_area_struct *vma, unsigned long start,
>>>                                unsigned long end, pgprot_t newprot,
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 2ad68e91896a..45abd27d75a0 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>          unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>>>          int page_nid = NUMA_NO_NODE;
>>>          int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
>>> -       bool migrated = false;
>>> -       bool was_writable = pmd_savedwrite(oldpmd);
>>> +       bool try_change_writable, migrated = false;
>>>          int flags = 0;
>>>
>>>          vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
>>> @@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>                  goto out;
>>>          }
>>>
>>> +       /* See mprotect_fixup(). */
>>> +       if (vma->vm_flags & VM_SHARED)
>>> +               try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>> +       else
>>> +               try_change_writable = !!(vma->vm_flags & VM_WRITE);
>>
>> Do you find it better to copy the code instead of extracting it to a
>> separate function?
> 
> Yeah, you're right ;) usually the issue is coming up with a suitable name. Let me try.
> 
> vma_wants_manual_writability_change() hmm ...
> 
>>
>>> +
>>>          pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>>          page = vm_normal_page_pmd(vma, haddr, pmd);
>>>          if (!page)
>>>                  goto out_map;
>>>
>>>          /* See similar comment in do_numa_page for explanation */
>>> -       if (!was_writable)
>>> +       if (try_change_writable && !pmd_write(pmd) &&
>>> +            can_change_pmd_writable(vma, vmf->address, pmd))
>>> +               pmd = pmd_mkwrite(pmd);
>>> +       if (!pmd_write(pmd))
>>>                  flags |= TNF_NO_GROUP;
>>>
>>>          page_nid = page_to_nid(page);
>>> @@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>>>          /* Restore the PMD */
>>>          pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>>>          pmd = pmd_mkyoung(pmd);
>>> -       if (was_writable)
>>> +
>>> +       /* Similar to mprotect() protection updates, avoid write faults. */
>>> +       if (try_change_writable && !pmd_write(pmd) &&
>>> +            can_change_pmd_writable(vma, vmf->address, pmd))
>>
>> Why do I have a deja-vu? :)
>>
>> There must be a way to avoid the redundant code and specifically the call to
>> can_change_pmd_writable(), no?
> 
> The issue is that as soon as we drop the page table lock, that information is stale.
> Especially, after we fail migration.
> 
> So the following should work, however, if we fail migration we wouldn't map the
> page writable and would have to re-calculate:
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index c5599a9279b1..a997625641e4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4674,10 +4674,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>           struct vm_area_struct *vma = vmf->vma;
>           struct page *page = NULL;
>           int page_nid = NUMA_NO_NODE;
> +       bool writable = false;
>           int last_cpupid;
>           int target_nid;
>           pte_t pte, old_pte;
> -       bool was_writable = pte_savedwrite(vmf->orig_pte);
>           int flags = 0;
>    
>           /*
> @@ -4696,6 +4696,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>           old_pte = ptep_get(vmf->pte);
>           pte = pte_modify(old_pte, vma->vm_page_prot);
>    
> +       /*
> +        * Detect now whether the PTE is or can be writable. Note that this
> +        * information is valid as long as we're holding the PT lock, so also on
> +        * the remap path below.
> +        */
> +       writable = pte_write(pte);
> +       if (!writable && vma_wants_manual_writability_change(vma) &&
> +           can_change_pte_writable(vma, vmf->address, pte);
> +           writable = true;
> +       }
> +
>           page = vm_normal_page(vma, vmf->address, pte);
>           if (!page || is_zone_device_page(page))
>                   goto out_map;
> @@ -4712,7 +4723,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>            * pte_dirty has unpredictable behaviour between PTE scan updates,
>            * background writeback, dirty balancing and application behaviour.
>            */
> -       if (!was_writable)
> +       if (!writable)
>                   flags |= TNF_NO_GROUP;
>    
>           /*
> @@ -4738,6 +4749,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>                   put_page(page);
>                   goto out_map;
>           }
> +       writable = false;
>           pte_unmap_unlock(vmf->pte, vmf->ptl);
>    
>           /* Migrate to the requested node */
> @@ -4767,7 +4779,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>           old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
>           pte = pte_modify(old_pte, vma->vm_page_prot);
>           pte = pte_mkyoung(pte);
> -       if (was_writable)
> +       if (writable)
>                   pte = pte_mkwrite(pte);
>           ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
>           update_mmu_cache(vma, vmf->address, vmf->pte);
> 
> 
> To me, the less error-prone approach is to re-calculate.

Hmm, thinking again, the "if (unlikely(!pte_same(*vmf->pte, 
vmf->orig_pte))) {" check might actually not require us to recalculate.

-- 
Thanks,

David / dhildenb