Message-ID: <20171004001559.GD20644@redhat.com>
Date: Tue, 3 Oct 2017 20:15:59 -0400
From: Jerome Glisse <jglisse@...hat.com>
To: Andrea Arcangeli <aarcange@...hat.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Nadav Amit <nadav.amit@...il.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Joerg Roedel <jroedel@...e.de>,
Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
David Woodhouse <dwmw2@...radead.org>,
Alistair Popple <alistair@...ple.id.au>,
Michael Ellerman <mpe@...erman.id.au>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Stephen Rothwell <sfr@...b.auug.org.au>,
iommu@...ts.linux-foundation.org, linuxppc-dev@...ts.ozlabs.org,
linux-next@...r.kernel.org
Subject: Re: [PATCH] mm/mmu_notifier: avoid double notification when it is
useless
On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
>
> On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> > +Case A is obvious you do not want to take the risk for the device to write to
> > +a page that might now be use by some completely different task.
>
> used
>
> > +is true ven if the thread doing the page table update is preempted right after
>
> even
>
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 90731e3b7e58..5706252b828a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> > goto out_free_pages;
> > VM_BUG_ON_PAGE(!PageHead(page), page);
> >
> > + /*
> > + * Leave pmd empty until pte is filled note we must notify here as
> > + * concurrent CPU thread might write to new page before the call to
> > + * mmu_notifier_invalidate_range_end() happen which can lead to a
>
> happens
>
> > + * device seeing memory write in different order than CPU.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > - /* leave pmd empty until pte is filled */
> >
>
> Here we can change the following mmu_notifier_invalidate_range_end to
> skip calling ->invalidate_range. It could be called
> mmu_notifier_invalidate_range_only_end, or other suggestions
> welcome. Otherwise we'll repeat the call for nothing.
>
> We need it inside the PT lock for the ordering issue, but we don't
> need to run it twice.
>
> Same in do_huge_pmd_wp_page, wp_page_copy and
> migrate_vma_insert_page. Every time *clear_flush_notify is used
> mmu_notifier_invalidate_range_only_end should be called after it,
> instead of mmu_notifier_invalidate_range_end.
>
> I think optimizing that part too, fits in the context of this patchset
> (if not in the same patch), because the objective is still the same:
> to remove unnecessary ->invalidate_range calls.
Yes, you are right, good idea. I will respin with that too (and with the
various typos you noted, thank you for those). I can do 2 patches or 1, I
don't mind either way. I will probably do 2 at first, and they can be folded
into 1 if people prefer just one.
>
> > + * No need to notify as we downgrading page
>
> are
>
> > + * table protection not changing it to point
> > + * to a new page.
> > + *
>
> > + * No need to notify as we downgrading page table to read only
>
> are
>
> > + * No need to notify as we replacing a read only page with another
>
> are
>
> > @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > if (pte_soft_dirty(pteval))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > set_pte_at(mm, address, pvmw.pte, swp_pte);
> > - } else
> > + } else {
> > + /*
> > + * We should not need to notify here as we reach this
> > + * case only from freeze_page() itself only call from
> > + * split_huge_page_to_list() so everything below must
> > + * be true:
> > + * - page is not anonymous
> > + * - page is locked
> > + *
> > + * So as it is a shared page and it is locked, it can
> > + * not be remove from the page cache and replace by
> > + * a new page before mmu_notifier_invalidate_range_end
> > + * so no concurrent thread might update its page table
> > + * to point at new page while a device still is using
> > + * this page.
> > + *
> > + * But we can not assume that new user of try_to_unmap
> > + * will have that in mind so just to be safe here call
> > + * mmu_notifier_invalidate_range()
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > dec_mm_counter(mm, mm_counter_file(page));
> > + mmu_notifier_invalidate_range(mm, address,
> > + address + PAGE_SIZE);
> > + }
> > discard:
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() as we are
> > + * either replacing a present pte with non present one (either
> > + * a swap or special one). We handling the clearing pte case
> > + * above.
> > + *
> > + * See Documentation/vm/mmu_notifier.txt
> > + */
> > page_remove_rmap(subpage, PageHuge(page));
> > put_page(page);
> > - mmu_notifier_invalidate_range(mm, address,
> > - address + PAGE_SIZE);
> > }
> >
> > mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>
> That is the path that unmaps filebacked pages (btw, not necessarily
> shared unlike comment says, they can be private but still filebacked).
I was more referring to the fact that they are in the page cache and thus,
given the current conditions, they cannot be migrated to a new page behind
our back. But yes, it can be a private mapping; I will fix the comment.
> I'd like some more explanation about the inner working of "that new
> user" as per comment above.
>
> It would be enough to drop mmu_notifier_invalidate_range from above
> without adding it to the filebacked case. The above gives higher prio
> to the hypothetical and uncertain future case, than to the current
> real filebacked case that doesn't need ->invalidate_range inside the
> PT lock, or do you see something that might already need such
> ->invalidate_range?
No, I don't see any new user today that might need such an invalidate, but
I was trying to be extra cautious, as I have a tendency to assume that
someone might write a patch that uses try_to_unmap() without going through
all the comments in the function, and thus possibly use it in an
unexpected way from the mmu_notifier callback point of view. I am fine
with putting the burden on new users to get it right, and with adding an
extra warning in the function description to try to warn people in a
sensible way.
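Such a warning in the function description might look like this (wording
entirely hypothetical, not final text):

```c
/*
 * try_to_unmap - try to remove all page table mappings to a page
 *
 * NOTE for mmu_notifier users: this function does not call
 * mmu_notifier_invalidate_range() itself after clearing a pte. It
 * relies on the caller holding off freeing or replacing the page
 * until mmu_notifier_invalidate_range_end() has run. New callers
 * must make sure no secondary TLB (IOMMU, device) can keep using
 * the old mapping past that point, or issue an explicit
 * mmu_notifier_invalidate_range() themselves.
 */
```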
> I certainly like the patch. I expect ->invalidate_range users will
> like the slight higher complexity in order to eliminate unnecessary
> invalidates that are slowing them down unnecessarily. At the same time
> this is zero risk (because a noop) for all other MMU notifier users
> (those that don't share the primary MMU pagetables, like KVM).
I have another patchset to restore the change_pte optimization for KVM
that I want to post too. I will need to do some benchmarking first to make
sure it actually helps, even in a small way.
Thank you for looking into this patch; I will repost with your suggestions
soon.
Jérôme