[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZuD9l6D6XuAUb4tP@yzhao56-desk.sh.intel.com>
Date: Wed, 11 Sep 2024 10:16:55 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Peter Xu <peterx@...hat.com>
CC: Andrew Morton <akpm@...ux-foundation.org>, <linux-kernel@...r.kernel.org>,
<linux-mm@...ck.org>, Gavin Shan <gshan@...hat.com>, Catalin Marinas
<catalin.marinas@....com>, <x86@...nel.org>, Ingo Molnar <mingo@...hat.com>,
Paolo Bonzini <pbonzini@...hat.com>, Dave Hansen
<dave.hansen@...ux.intel.com>, Thomas Gleixner <tglx@...utronix.de>,
"Alistair Popple" <apopple@...dia.com>, <kvm@...r.kernel.org>,
<linux-arm-kernel@...ts.infradead.org>, Sean Christopherson
<seanjc@...gle.com>, Oscar Salvador <osalvador@...e.de>, Jason Gunthorpe
<jgg@...dia.com>, Borislav Petkov <bp@...en8.de>, Zi Yan <ziy@...dia.com>,
Axel Rasmussen <axelrasmussen@...gle.com>, David Hildenbrand
<david@...hat.com>, Will Deacon <will@...nel.org>, Kefeng Wang
<wangkefeng.wang@...wei.com>, Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
On Tue, Sep 10, 2024 at 08:16:10AM -0400, Peter Xu wrote:
> On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote:
> > Hi Peter,
>
> Hi, Yan,
>
> >
> > Not sure if I missed anything.
> >
> > It looks that before this patch, pmd/pud are alawys write protected without
> > checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)". pud_wrprotect()
> > clears dirty bit by moving the dirty value to the software bit.
> >
> > And I have a question that why previously pmd/pud are always write protected.
>
> IIUC this is a separate question - the move of dirty bit in pud_wrprotect()
> is to avoid wrongly creating shadow stack mappings. In our discussion I
> think that's an extra complexity and can be put aside; the dirty bit will
> get recovered in pud_clear_saveddirty() later, so it's not the same as
> pud_mkclean().
But pud_clear_saveddirty() will only set dirty bit when write bit is 1.
>
> AFAIU pmd/pud paths don't consider is_cow_mapping() because normally we
> will not duplicate pgtables in fork() for most of shared file mappings
> (!CoW). Please refer to vma_needs_copy(), and the comment before returning
> false at last. I think it's not strictly is_cow_mapping(), as we're
> checking anon_vma there, however it's mostly it, just to also cover
> MAP_PRIVATE on file mappings too when there's no CoW happened (as if CoW
> happened then anon_vma will appear already).
>
> There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps.
> Userfault & mixedmap are not involved in this series at all, so let's
> discuss pfnmaps.
>
> It means, fork() can still copy pgtable for pfnmap vmas, and it's relevant
> to this series, because before this series pfnmap only exists in pte level,
> hence IMO the is_cow_mapping() must exist for pte level as you described,
> because it needs to properly take care of those. Note that in the pte
> processing it also checks pte_write() to make sure it's a COWed page, not a
> RO page cache / pfnmap / ..., for example.
>
> Meanwhile, since pfnmap won't appear in pmd/pud, I think it's fair that
> pmd/pud assumes when seeing a huge mapping it must be MAP_PRIVATE otherwise
> the whole copy_page_range() could be already skipped. IOW I think they
> only need to process COWed pages here, and those pages require write bit
> removed in both parent and child when fork().
Is it also based on that there's no MAP_SHARED huge DEVMAP pages up to now?
>
> After this series, pfnmaps can appear in the form of pmd/pud, then the
> previous assumption will stop holding true, as we'll still copy pfnmaps
> during fork() always. My guessing of the reason is because most of the
> drivers map pfnmap vmas only during mmap(), it means there can normally
> have no fault() handler at all for those pfns.
>
> In this case, we'll need to also identify whether the page is COWed, using
> the newly added "is_cow_mapping() && pxx_write()" in this series (added
> to pud path, while for pmd path I used a WARN_ON_ONCE instead).
>
> If we don't do that, it means e.g. for a VM_SHARED pfnmap vma, after fork()
> we'll wrongly observe write protected entries. Here the change will make
> sure VM_SHARED can properly persist the write bits on pmds/puds.
>
> Hope that explains.
>
Thanks a lot for such detailed explanation!
>
Powered by blists - more mailing lists