Message-ID: <YRaXZgBuoYX0sPhS@xz-m1.local>
Date: Fri, 13 Aug 2021 12:01:42 -0400
From: Peter Xu <peterx@...hat.com>
To: Tiberiu Georgescu <tiberiu.georgescu@...anix.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
Alistair Popple <apopple@...dia.com>,
Ivan Teterevkov <ivan.teterevkov@...anix.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Hugh Dickins <hughd@...gle.com>,
Matthew Wilcox <willy@...radead.org>,
Andrea Arcangeli <aarcange@...hat.com>,
David Hildenbrand <david@...hat.com>,
"Kirill A . Shutemov" <kirill@...temov.name>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Kravetz <mike.kravetz@...cle.com>
Subject: Re: [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages
On Fri, Aug 13, 2021 at 03:18:22PM +0000, Tiberiu Georgescu wrote:
> Hello Peter,
Hi, Tiberiu,
>
> Sorry for commenting so late.
No worries on that. I very much appreciate your follow-up on this problem.
>
> > On 7 Aug 2021, at 04:25, Peter Xu <peterx@...hat.com> wrote:
> >
> > When shmem pages are swapped out, instead of clearing the pte entry, we leave a
> > marker pte showing that this page is swapped out as a hint for pagemap. A new
> > TTU flag is introduced to identify this case.
> >
> > This can be useful for detecting swapped out cold shmem pages. Then after some
> > memory background scanning work (which will fault in the shmem page and
> > confuse page reclaim), we can do MADV_PAGEOUT explicitly on this page to swap
> > it out again as we know it was cold.
> >
> > For pagemap, we don't need to explicitly set the PM_SWAP bit, because by nature
> > SWP_PTE_MARKER ptes are already counted as PM_SWAP due to their format as swap entries.
> >
> > Signed-off-by: Peter Xu <peterx@...hat.com>
> > ---
> > fs/proc/task_mmu.c | 1 +
> > include/linux/rmap.h | 1 +
> > mm/rmap.c | 19 +++++++++++++++++++
> > mm/vmscan.c | 2 +-
> > 4 files changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index eb97468dfe4c..21b8594abc1d 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1384,6 +1384,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> > if (pm->show_pfn)
> > frame = swp_type(entry) |
> > (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> > + /* NOTE: this covers PTE_MARKER_PAGEOUT too */
> > flags |= PM_SWAP;
> > if (is_pfn_swap_entry(entry))
> > page = pfn_swap_entry_to_page(entry);
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index c976cc6de257..318a0e95c7fb 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -95,6 +95,7 @@ enum ttu_flags {
> > * do a final flush if necessary */
> > TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
> > * caller holds it */
> > + TTU_HINT_PAGEOUT = 0x100, /* Hint for pageout operation */
> > };
> >
> > #ifdef CONFIG_MMU
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index b9eb5c12f3fe..24a70b36b6da 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1384,6 +1384,22 @@ void page_remove_rmap(struct page *page, bool compound)
> > unlock_page_memcg(page);
> > }
> >
> > +static inline void
> > +pte_marker_install(struct vm_area_struct *vma, pte_t *pte,
> > + struct page *page, unsigned long address)
> > +{
> > +#ifdef CONFIG_PTE_MARKER_PAGEOUT
> > + swp_entry_t entry;
> > + pte_t pteval;
> > +
> > + if (vma_is_shmem(vma) && !PageAnon(page) && pte_none(*pte)) {
> > + entry = make_pte_marker_entry(PTE_MARKER_PAGEOUT);
> > + pteval = swp_entry_to_pte(entry);
> > + set_pte_at(vma->vm_mm, address, pte, pteval);
> > + }
> > +#endif
> > +}
> > +
> > /*
> > * @arg: enum ttu_flags will be passed to this argument
> > */
> > @@ -1628,6 +1644,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > */
> > dec_mm_counter(mm, mm_counter_file(page));
> > }
> > +
> > + if (flags & TTU_HINT_PAGEOUT)
> > + pte_marker_install(vma, pvmw.pte, page, address);
> > discard:
> > /*
> > * No need to call mmu_notifier_invalidate_range() it has be
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4620df62f0ff..4754af6fa24b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1493,7 +1493,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> > * processes. Try to unmap it here.
> > */
> > if (page_mapped(page)) {
> > - enum ttu_flags flags = TTU_BATCH_FLUSH;
> > + enum ttu_flags flags = TTU_BATCH_FLUSH | TTU_HINT_PAGEOUT;
> > bool was_swapbacked = PageSwapBacked(page);
> >
> > if (unlikely(PageTransHuge(page)))
> > --
> > 2.32.0
> >
> The RFC looks good to me. It makes it much cleaner to verify whether the page
> is in swap or not, and there is a great performance benefit compared to my patch.
>
> Currently, your patch does not have a way of putting the swap offset into the
> pagemap entry when a shared page is swapped, so it does not cover all use cases
> IMO.
Could you remind me why you need the swap offset? I was trying to
understand your scenario, and I hope I summarized it right: what we want is to
be able to MADV_PAGEOUT pages that were swapped out. Then IIUC the
PM_SWAP bit should be enough for it. Did I miss something important?
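
For the record, the flow I had in mind is roughly the below. It's only a
rough, untested userspace sketch (the helper name is made up), assuming the
usual pagemap bit layout (bit 63 = present, bit 62 = swapped) and
MADV_PAGEOUT from 5.4+:

/*
 * Batch-read the pagemap entries covering a shmem mapping with one pread(),
 * then MADV_PAGEOUT every page that reports PM_SWAP, i.e. every page we
 * know is (or recently was) cold.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT  21
#endif

#define PM_SWAP       (1ULL << 62)

static void pageout_cold(char *base, size_t len)
{
        size_t psize = sysconf(_SC_PAGESIZE);
        size_t npages = len / psize;
        uint64_t *ents = calloc(npages, sizeof(*ents));
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd >= 0 && ents &&
            /* one batched read, one u64 entry per page in the range */
            pread(fd, ents, npages * sizeof(*ents),
                  (uintptr_t)base / psize * sizeof(uint64_t)) ==
            (ssize_t)(npages * sizeof(*ents))) {
                for (size_t i = 0; i < npages; i++) {
                        if (!(ents[i] & PM_SWAP))
                                continue;
                        /*
                         * The page is in swap, so it is cold: if later
                         * scanning faults it back in, explicitly ask the
                         * kernel to page it out again.
                         */
                        madvise(base + i * psize, psize, MADV_PAGEOUT);
                }
        }
        free(ents);
        if (fd >= 0)
                close(fd);
}

IOW, with the marker ptes the swapped-out shmem pages show up as PM_SWAP
directly, so this loop never needs the swap offset itself.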
>
> To counter that, I added my patch on top of yours. By making use of your
> mechanism, we don't need to read the page cache xarray when the pages
> are definitely none. This removes a lot of unnecessary overhead.
>
> Main diff in fs/proc/task_mmu.c:pte_to_pagemap_entry():
>     } else if (is_swap_pte(pte)) {
>             swp_entry_t entry;
>             ...
>             entry = pte_to_swp_entry(pte);
> +           if (is_pte_marker_entry(entry)) {
> +                   void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>             ...
>
> For reference: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/.
>
> After some preliminary performance tests on VMs, I noticed a significant
> performance improvement in my patch when yours is added. Here is the
> most interesting result:
>
> Program I tested it on: Same as https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
>
> Specs:
> Memory: 32GB memory, 64GB swap
> Params: 16GB allocated shmem, 50% dirty, 4GB cgroup, no batching
>
> In short, the pagemap read deals with
> * ~4GB of physical memory (present pages),
> * ~4GB in swap (swapped pages),
> * 8GB non-dirty (none pages).
>
> Results:
> +------------+-------+-------+
> | Peter\Tibi | Pre | Post | (in seconds)
> +------------+-------+-------+
> | Pre | 11.82 | 38.44 |
> | Post | 13.28 | 22.34 |
> +------------+-------+-------+
>
> With your patch applied, mine yields the results in twice the baseline time, compared to 4x without it.
> I am aware it is not ideal, but until it is decided whether a whole different
> system is preferred to PTE markers, our solutions together currently deliver
> the best performance for correctness. Thank you for finding this solution.
Thanks for trying that out! It's great to know these test results.
>
> I am happy to introduce a configuration variable to enable my pagemap
> patch only if necessary.
Right. We can use a config for that, but I didn't mention it when replying to
your previous patchset because I thought that, with or without the config, it
should still be better to use the pte markers, since they should be more
efficient.

However, I think I need to double check that I didn't miss anything you're
looking for, e.g. I need to understand why the swp offset matters here, as I
asked above. I must confess that cannot be trivially done with pte markers
yet: keeping a swap hint is definitely not the same story as keeping a swp
entry.
> Either way, if performance is a must, batching is still the best way to
> access multiple pagemap entries.
I agree, especially since with pmd pgtable locks things can happen
concurrently.
It's just a pity that the major overhead compared to the old way is the page
cache lookup, especially since, as you pointed out, the capability to tell
used ptes apart from empty ptes matters. That's kind of orthogonal to batching
to me.
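
To be concrete about that overhead: whenever pagemap hits a marker pte it
would need to go back to the shmem page cache to recover the real swap entry.
A rough, untested sketch (not from either of our patches as-is; your
get_xa_entry_at_vma_addr() open-coded, locking/truncation races ignored)
could look like:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/*
 * For a marker pte, look up the shmem page cache xarray to recover the
 * swap entry, so the pagemap frame can still carry type + offset.
 * xa_load() handles RCU internally.
 */
static u64 pte_marker_to_pagemap_frame(struct vm_area_struct *vma,
                                       unsigned long addr)
{
        struct address_space *mapping = vma->vm_file->f_mapping;
        pgoff_t pgoff = linear_page_index(vma, addr);
        void *xa_entry = xa_load(&mapping->i_pages, pgoff);

        if (xa_is_value(xa_entry)) {
                swp_entry_t entry = radix_to_swp_entry(xa_entry);

                /* same encoding pte_to_pagemap_entry() uses today */
                return swp_type(entry) |
                       (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
        }

        return 0;
}

That extra lookup per marker pte is where the page cache overhead comes from,
and it's also why the offset can't come from the marker alone.
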
Thanks,
--
Peter Xu