[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YuqmKmVtVYnkDF7J@xz-m1.local>
Date: Wed, 3 Aug 2022 12:45:30 -0400
From: Peter Xu <peterx@...hat.com>
To: Nadav Amit <nadav.amit@...il.com>
Cc: LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org,
Andrea Arcangeli <aarcange@...hat.com>,
Andi Kleen <andi.kleen@...el.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>,
Hugh Dickins <hughd@...gle.com>,
Huang Ying <ying.huang@...el.com>,
"Kirill A . Shutemov" <kirill@...temov.name>,
Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [PATCH 2/2] mm: Remember young bit for page migrations
On Wed, Aug 03, 2022 at 12:42:54AM -0700, Nadav Amit wrote:
> On Aug 2, 2022, at 6:21 PM, Peter Xu <peterx@...hat.com> wrote:
>
> > When page migration happens, we always ignore the young bit settings in the
> > old pgtable, and marking the page as old in the new page table using either
> > pte_mkold() or pmd_mkold().
> >
> > That's fine from functional-wise, but that's not friendly to page reclaim
> > because the moving page can be actively accessed within the procedure. Not
> > to mention hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> >
> > Actually we can easily remember the young bit configuration and recover the
> > information after the page is migrated. To achieve it, define a new bit in
> > the migration swap offset field to show whether the old pte has young bit
> > set or not. Then when removing/recovering the migration entry, we can
> > recover the young bit even if the page changed.
> >
> > One thing to mention is that here we used max_swapfile_size() to detect how
> > many swp offset bits we have, and we'll only enable this feature if we know
> > the swp offset can be big enough to store both the PFN value and the young
> > bit. Otherwise the young bit is dropped like before.
>
> I gave it some more thought and I am less confident whether this is the best
> solution. Not sure it is not either, so I am raising an alternative with
> pros and cons.
>
> An alternative would be to propagate the access bit into the page (i.e.,
> using folio_set_young()) and then set it back into the PTE later (i.e.,
> based on folio_test_young()). It might even seem that in general it is
> better to always set the page access bit if folio_test_young().
That's indeed an option. It's just that the Young bit (along with Idle
bit) is only defined with PAGE_IDLE feature enabled, or they're all no-op.
Another thing is even though using page flags looks clean, it'll lose the
granule of virtual address spaces when the page can be mapped in multiple
mm/vmas. I don't think there's a major difference here since page reclaim
will collect either pte young or page young (as long as page idle defined)
so it'll be the same. But there'll be other side effects as long as related
to the virtual address space. E.g. extra TLB flush needed as you said even
if the pages were not accessed by some mapping at all, so they become false
positive "young" pages after migration entries recovered.
In short, it'll be slightly less accurate than storing it in pgtables.
>
> This can be simpler and more performant. Setting the access-bit would not
> impact reclaim decisions (as the page is already considered young), would
> not induce overheads on clearing the access-bit (no TLB flush is needed at
> least on x86), and would save the time the CPU takes to set the access bit
> if the page is ever accessed (on x86).
Agreed. These benefits should be shared between both the pgtable approach
or the PageYoung approach you mentioned.
>
> It may also improve the preciseness of page-idle mechanisms and the
> interaction with it.
IIUC this may need extra work for page idle.
Currently when doing rmap walks we either only look at migration entries or
never look at them, according to the PVMW_MIGRATION flag. We'll need to
teach the page idle code and rmap walker to be able to walk with both
present ptes and migration entries at the same time to achieve that
preciseness.
> IIUC, page-idle does not consider migration entries, so
> the user would not get indication that pages under migration are not idle.
> When page-idle is reset, migrated pages might be later reinstated as
> “accessed”, giving wrong indication that the pages are not-idle, when in
> fact they are.
Before this patchset, we'll constantly loosing that young bit, hence it'll
be a false negative after migration entries recovered.
After this patchset, we'll have a possible race iff the page idle was
triggered during migrating some pages, but it's a race condition only and
the young bit will be correctly collected after migration completed.
So I agree it's a problem as you mentioned, but probably it's still better
with current patch than without it.
If ultimately we decided to go with page flags approach as you proposed,
just to mention that setting PageYoung is not enough - that bit is only
used to reserve page reclaim logic and not being interrupted by page idle.
IMO what we really will need is clearing PageIdle instead.
>
> On the negative side, I am not sure whether other archs, that might require
> a TLB flush for resetting the access-bit, and the overhead of doing atomic
> operation to clear the access-bit, would not induce more overhead than they
> would save.
I think your proposal makes sense and looks clean, maybe even cleaner than
the new max_swapfile_size() approach (and definitely nicer than the old one
of mine). It's just that I still want this to happen even without page idle
enabled - at least Fedora doesn't have page idle enabled by default. I'm
not sure whether it'll be worth it to define Young bit just for this (note:
iiuc we don't need Idle bit in this case, but only the Young bit).
The other thing is whether there's other side effect of losing pte level
granularity of young bit, since right after we merge them into the page
flags, then that granule is lost. So far I don't worry a lot on the tlb
flush overhead, but hopefully nothing else we missed.
>
> One more unrelated point - note that remove_migration_pte() would always set
> a clean PTE even when the old one was dirty…
Correct. Say it in another way, at least initial writes perf will still
suffer after migration on x86.
Dirty bit is kind of different in this case so I didn't yet try to cover
it. E.g., we won't lose it even without this patchset but consolidates it
into PageDirty already or it'll be a bug.
I think PageDirty could be cleared during migration procedure, if so we
could be wrongly applying the dirty bit to the recovered pte. I thought
about this before posting this series, but I hesitated on adding dirty bit
altogether with it at least in these initial versions since dirty bit may
need some more justifications.
Please feel free to share any further thoughts on the dirty bit.
Thanks,
--
Peter Xu
Powered by blists - more mailing lists