Message-ID: <CAHbLzkqMTpja+RuePjy4Oz=0Eq2j7LP8hwApGZWu9v6MkKs+Ag@mail.gmail.com>
Date: Mon, 5 Feb 2024 11:41:02 -0800
From: Yang Shi <shy828301@...il.com>
To: Lance Yang <ioworker0@...il.com>
Cc: Michal Hocko <mhocko@...e.com>, David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
zokeefe@...gle.com, songmuchun@...edance.com, peterx@...hat.com,
minchan@...nel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] mm/khugepaged: skip copying lazyfree pages on collapse
On Fri, Feb 2, 2024 at 8:17 PM Lance Yang <ioworker0@...il.com> wrote:
>
> Hey Michal, David, Yang,
>
> I sincerely appreciate your time!
>
> I still have two questions that are perplexing me.
>
> First question:
> Given that khugepaged doesn't treat MADV_FREE
> pages as pte_none, why skip the 2M block when all
> the pages within the range are old and unreferenced,
> but won't skip if the partial range is MADV_FREE,
> even if it's not redirtied? Why make this distinction?
> Would it not be more straightforward to maintain
> if either all were skipped or not?
It is just a heuristic in the code and may be a somewhat arbitrary
choice. It could be controlled in a more fine-grained way if we really
see some workloads benefit from it.
>
> Second question:
> Does copying lazyfree pages (not redirtied) to the
> new huge page during khugepaged collapse
> undermine the semantics of MADV_FREE?
> Users mark pages as lazyfree with MADV_FREE,
> expecting these pages to be eventually reclaimed.
> Even without subsequent writes, these pages will
> no longer be reclaimed, even if memory pressure
> occurs.
Yeah, it just means khugepaged wins the race against page reclaim. I
suppose delayed free is one of the design goals of MADV_FREE, and the
risk is that the pages may never be freed. If you want immediate free
or more deterministic behavior, you should use MADV_DONTNEED or munmap
IIUC.
>
> BR,
> Lance
>
> On Sat, Feb 3, 2024 at 1:42 AM Yang Shi <shy828301@...il.com> wrote:
> >
> > On Fri, Feb 2, 2024 at 6:53 AM Lance Yang <ioworker0@...il.com> wrote:
> > >
> > > How about blocking khugepaged from
> > > collapsing lazyfree pages? This way,
> > > is it not better to keep the semantics
> > > of MADV_FREE?
> > >
> > > What do you think?
> >
> > First of all, khugepaged doesn't treat MADV_FREE pages as pte_none
> > IIUC. khugepaged does skip the 2M block if all the pages in the range
> > are old and unreferenced in hpage_collapse_scan_pmd(), then repeats
> > the check in collapse_huge_page().
> >
> > And MADV_FREE pages are just old and unreferenced. This is actually
> > what your first test case does. The whole 2M range is MADV_FREE range,
> > so they are skipped by khugepaged.
> >
> > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > This is what your second test case does.
> >
> > Secondly, I think it depends on the semantics of MADV_FREE,
> > particularly how to treat the redirtied pages. TBH I'm always confused
> > by the semantics. For example, say the page contained "abcd", then it
> > was MADV_FREE'ed, then "1234" was written after "abcd". Should the
> > user expect to see "abcd1234" or "00001234"?
> >
> > I suppose it should be "abcd1234" since MADV_FREE pages are still
> > valid and available; if I'm wrong please feel free to correct me. If
> > so, we should always copy MADV_FREE pages in khugepaged regardless of
> > whether they are redirtied or not, otherwise we may incur data
> > corruption. If we don't copy, then a follow-up redirty after collapse
> > to the hugepage may return "00001234", right?
> >
> > The current behavior is copying the page.
> >
> > >
> > > Thanks,
> > > Lance
> > >
> > > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@...e.com> wrote:
> > > >
> > > > On Fri 02-02-24 21:46:45, Lance Yang wrote:
> > > > > Here is a part from the man page explaining
> > > > > the MADV_FREE semantics:
> > > > >
> > > > > > The kernel can thus free these pages, but the
> > > > > freeing could be delayed until memory pressure
> > > > > occurs. For each of the pages that has been
> > > > > marked to be freed but has not yet been freed,
> > > > > the free operation will be canceled if the caller
> > > > > writes into the page. If there is no subsequent
> > > > > write, the kernel can free the pages at any time.
> > > > >
> > > > > IIUC, if there is no subsequent write, lazyfree
> > > > > pages will eventually be reclaimed.
> > > >
> > > > If there is no memory pressure then this might not
> > > > ever happen. The user cannot make any assumption about
> > > > their content once the madvise call has been made. The
> > > > content has to be considered lost. Sure the userspace
> > > > might have means to tell those pages from zero pages
> > > > and recheck after the write but that is about it.
> > > >
> > > > > khugepaged
> > > > > treats lazyfree pages the same as pte_none,
> > > > > avoiding copying them to the new huge page
> > > > > during collapse. It seems that lazyfree pages
> > > > > are reclaimed before khugepaged collapses them.
> > > > > This aligns with user expectations.
> > > > >
> > > > > However, IMO, if the content of MADV_FREE pages
> > > > > remains valid during collapse, then khugepaged
> > > > > treating lazyfree pages the same as pte_none
> > > > > might not be suitable.
> > > >
> > > > Why?
> > > >
> > > > Unless I am missing something (which is possible of
> > > > course) I do not really see why dropping the content
> > > > of those pages and replacing them with a THP is any
> > > > different from reclaiming those pages and then faulting
> > > > in a non-THP zero page.
> > > >
> > > > Now, if khugepaged reused the original content of MADV_FREE
> > > > pages that would be a slightly different story. I can
> > > > see why users would expect zero pages to back madvised
> > > > area.
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs