lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <709c0648-c9c5-4197-83f2-64d36293b99e@arm.com> Date: Fri, 6 Sep 2024 14:30:07 +0530 From: Dev Jain <dev.jain@....com> To: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org, david@...hat.com, willy@...radead.org, kirill.shutemov@...ux.intel.com Cc: anshuman.khandual@....com, catalin.marinas@....com, cl@...two.org, vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com, dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org, jack@...e.cz, mark.rutland@....com, hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com, peterx@...hat.com, ioworker0@...il.com, jglisse@...gle.com, linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org Subject: Re: [PATCH v2 2/2] mm: Allocate THP on hugezeropage wp-fault On 9/6/24 14:13, Ryan Roberts wrote: > On 06/09/2024 08:05, Dev Jain wrote: >> On 9/5/24 18:44, Ryan Roberts wrote: >>> On 04/09/2024 11:09, Dev Jain wrote: >>>> Introduce do_huge_zero_wp_pmd() to handle wp-fault on a hugezeropage and >>>> replace it with a PMD-mapped THP. Change the helpers introduced in the >>>> previous patch to flush TLB entry corresponding to the hugezeropage, >>>> and preserve PMD uffd-wp marker. In case of failure, fallback to >>>> splitting the PMD. >>>> >>>> Signed-off-by: Dev Jain <dev.jain@....com> >>>> --- >>>> include/linux/huge_mm.h | 6 ++++ >>>> mm/huge_memory.c | 79 +++++++++++++++++++++++++++++++++++------ >>>> mm/memory.c | 5 +-- >>>> 3 files changed, 78 insertions(+), 12 deletions(-) >>>> >>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >>>> index e25d9ebfdf89..fdd2cf473a3c 100644 >>>> --- a/include/linux/huge_mm.h >>>> +++ b/include/linux/huge_mm.h >>>> @@ -9,6 +9,12 @@ >>>> #include <linux/kobject.h> >>>> vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); >>>> +vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct vm_area_struct *vma, >>>> + unsigned long haddr, struct folio **foliop, >>>> + unsigned long addr); >>>> +void map_pmd_thp(struct folio *folio, struct vm_fault *vmf, >>>> + struct vm_area_struct *vma, unsigned long haddr, >>>> + pgtable_t pgtable); >>> I don't think you are using either of these outside of huge_memory.c, so not >>> sure you need to declare them here or make them non-static? >> As pointed out by Kirill, you are right, I forgot to drop these from my previous >> approach. >> >>>> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, >>>> pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, >>>> struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>>> index 58125fbcc532..150163ad77d3 100644 >>>> --- a/mm/huge_memory.c >>>> +++ b/mm/huge_memory.c >>>> @@ -943,9 +943,9 @@ unsigned long thp_get_unmapped_area(struct file *filp, >>>> unsigned long addr, >>>> } >>>> EXPORT_SYMBOL_GPL(thp_get_unmapped_area); >>>> -static vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct >>>> vm_area_struct *vma, >>>> - unsigned long haddr, struct folio **foliop, >>>> - unsigned long addr) >>>> +vm_fault_t thp_fault_alloc(gfp_t gfp, int order, struct vm_area_struct *vma, >>>> + unsigned long haddr, struct folio **foliop, >>>> + unsigned long addr) >>>> { >>>> struct folio *folio = vma_alloc_folio(gfp, order, vma, haddr, true); >>>> @@ -984,21 +984,29 @@ static void __thp_fault_success_stats(struct >>>> vm_area_struct *vma, int order) >>>> count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC); >>>> } >>>> -static void map_pmd_thp(struct folio *folio, struct vm_fault *vmf, >>>> - struct vm_area_struct *vma, unsigned long haddr, >>>> - pgtable_t pgtable) >>>> +void map_pmd_thp(struct folio *folio, struct vm_fault *vmf, >>>> + struct vm_area_struct *vma, unsigned long haddr, >>>> + pgtable_t pgtable) >>>> { >>>> - pmd_t entry; >>>> + pmd_t entry, old_pmd; >>>> + bool is_pmd_none = pmd_none(*vmf->pmd); >>>> entry = mk_huge_pmd(&folio->page, vma->vm_page_prot); >>>> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); >>>> folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE); >>>> folio_add_lru_vma(folio, vma); >>>> - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); >>>> + if (!is_pmd_none) { >>>> + old_pmd = pmdp_huge_clear_flush(vma, haddr, vmf->pmd); >>>> + if (pmd_uffd_wp(old_pmd)) >>>> + entry = pmd_mkuffd_wp(entry); >>> I don't really get this; entry is writable, so I wouldn't expect to also be >>> setting uffd-wp here? That combination is not allowed and is checked for in >>> page_table_check_pte_flags(). >>> >>> It looks like you expect to get here in the uffd-wp-async case, which used to >>> cause the pmd to be split to ptes. I'm guessing (but don't know for sure) that >>> would cause the uffd-wp bit to be set in each of the new ptes, then during >>> fallback to handling the wp fault on the pte, uffd would handle it? >> I guess you are correct; I missed the WARN_ON() in page_table_check_pmd_flags(), >> but I did see, if I read the uffd code correctly, that mfill_atomic() will just >> return in case of pmd_trans_huge(*dst_pmd) while doing a uffd_copy to the >> destination >> location. So preserving pmd_uffd_wp is useless in case a THP is mapped, but I >> did not >> know that in fact it is supposed to be an invalid combination. So, I will drop it, >> unless someone has some other objection. > So what's the correct way to handle uffd-wp-async in wp_huge_pmd()? Just split > it? If so, you can revert your changes in memory.c. I think so. > >>>> + } >>>> + if (pgtable) >>>> + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); >>> Should this call be moved outside of here? It doesn't really feel like it >>> belongs. Could it be called before calling map_pmd_thp() for the site that has a >>> pgtable? >> Every other place I checked, they are doing this: make the entry -> deposit >> pgtable -> >> set_pmd_at(). I guess the general flow is to do the deposit based on the old >> pmd, before >> setting the new one. Which implies: I should at least move this check before I call >> pmdp_huge_clear_flush(). And, since vmf->pmd and creating the new entry has no >> relation, >> I am inclined to do what you are saying. > pgtable_trans_huge_deposit() is just adding the pgtable to a list so that if the > THP needs to be split in future, then we have preallocated the pte pgtable so > the operation can't fail. Yes. > And enqueing it is just under the protection of the > PTL as far as I can tell. So I think the only ordering requirement is that you > both set the pmd and deposit the pgtable under the lock (without dropping it in > between). So you can deposit either before or after map_pmd_thp(). Yes I'll do that before. > And > pmdp_huge_clear_flush() is irrelavent, I think? You mean, in this context? Everywhere, pgtable deposit uses the old pmd value to be replaced as its input, that is, it is called before set_pmd_at(). So calling pgtable deposit after clear_flush() will violate this ordering. I do not think this ordering is really required but I'd rather be safe :) > >>>> set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); >>>> update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); >>>> add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); >>>> - mm_inc_nr_ptes(vma->vm_mm); >>>> + if (is_pmd_none) >>>> + mm_inc_nr_ptes(vma->vm_mm); >>>> } >>>> static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf) >>>> @@ -1576,6 +1584,50 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) >>>> spin_unlock(vmf->ptl); >>>> } >>>> +static vm_fault_t do_huge_zero_wp_pmd_locked(struct vm_fault *vmf, >>>> + unsigned long haddr, >>>> + struct folio *folio) >>>> +{ >>>> + struct vm_area_struct *vma = vmf->vma; >>>> + vm_fault_t ret = 0; >>>> + >>>> + ret = check_stable_address_space(vma->vm_mm); >>>> + if (ret) >>>> + goto out; >>>> + map_pmd_thp(folio, vmf, vma, haddr, NULL); >>>> +out: >>>> + return ret; >>>> +} >>>> + >>>> +static vm_fault_t do_huge_zero_wp_pmd(struct vm_fault *vmf, unsigned long >>>> haddr) >>>> +{ >>>> + struct vm_area_struct *vma = vmf->vma; >>>> + gfp_t gfp = vma_thp_gfp_mask(vma); >>>> + struct mmu_notifier_range range; >>>> + struct folio *folio = NULL; >>>> + vm_fault_t ret = 0; >>>> + >>>> + ret = thp_fault_alloc(gfp, HPAGE_PMD_ORDER, vma, haddr, &folio, >>>> + vmf->address); >>> Just checking: the PTE table was already allocated during the read fault, right? >>> So we don't have to allocate it here. >> Correct, that happens in set_huge_zero_folio(). Thanks for checking. >> >>>> + if (ret) >>>> + goto out; >>>> + >>>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, haddr, >>>> + haddr + HPAGE_PMD_SIZE); >>>> + mmu_notifier_invalidate_range_start(&range); >>>> + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); >>>> + if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) >>>> + goto unlock; >>>> + ret = do_huge_zero_wp_pmd_locked(vmf, haddr, folio); >>>> + if (!ret) >>>> + __thp_fault_success_stats(vma, HPAGE_PMD_ORDER); >>>> +unlock: >>>> + spin_unlock(vmf->ptl); >>>> + mmu_notifier_invalidate_range_end(&range); >>> I'll confess I don't understand all the mmu notifier rules. >> I confess the same :) >> >>> But the doc at >>> Documentation/mm/mmu_notifier.rst implies that the notification must be done >>> while holding the PTL. Although that's not how wp_page_copy(). Are you confident >>> what you have done is correct? >> Everywhere else, invalidate_range_end() is getting called after dropping the lock, >> one reason is that it has a might_sleep(), and therefore we cannot call it while >> holding a spinlock. I still don't know what exactly these calls mean...but I think >> what I have done is correct. > High level; they are notifying secondary MMUs (e.g. IOMMU) of a change so the > tables of those secondary MMUs can be kept in sync. I don't understand all the > ordering requirement details though. > > I think what you have is probably correct, would be good for someone that knows > what they are talking about to confirm though :) Exactly. > >>> Thanks, >>> Ryan >>> >>>> +out: >>>> + return ret; >>>> +} >>>> + >>>> vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) >>>> { >>>> const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE; >>>> @@ -1588,8 +1640,15 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) >>>> vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); >>>> VM_BUG_ON_VMA(!vma->anon_vma, vma); >>>> - if (is_huge_zero_pmd(orig_pmd)) >>>> + if (is_huge_zero_pmd(orig_pmd)) { >>>> + vm_fault_t ret = do_huge_zero_wp_pmd(vmf, haddr); >>>> + >>>> + if (!(ret & VM_FAULT_FALLBACK)) >>>> + return ret; >>>> + >>>> + /* Fallback to splitting PMD if THP cannot be allocated */ >>>> goto fallback; >>>> + } >>>> spin_lock(vmf->ptl); >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 3c01d68065be..c081a25f5173 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -5409,9 +5409,10 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault >>>> *vmf) >>>> if (vma_is_anonymous(vma)) { >>>> if (likely(!unshare) && >>>> userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd)) { >>>> - if (userfaultfd_wp_async(vmf->vma)) >>>> + if (!userfaultfd_wp_async(vmf->vma)) >>>> + return handle_userfault(vmf, VM_UFFD_WP); >>>> + if (!is_huge_zero_pmd(vmf->orig_pmd)) >>>> goto split; >>>> - return handle_userfault(vmf, VM_UFFD_WP); >>>> } >>>> return do_huge_pmd_wp_page(vmf); >>>> }
Powered by blists - more mailing lists