Message-ID: <CAK1f24nhchLX98so=jmbdm4jF21FfvNbtxNaXxx059e_Da_uOg@mail.gmail.com>
Date: Tue, 20 Feb 2024 18:15:48 +0800
From: Lance Yang <ioworker0@...il.com>
To: "Zach O'Keefe" <zokeefe@...gle.com>, Yang Shi <shy828301@...il.com>,
Michal Hocko <mhocko@...e.com>, David Hildenbrand <david@...hat.com>
Cc: akpm@...ux-foundation.org, songmuchun@...edance.com, peterx@...hat.com,
minchan@...nel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] mm/khugepaged: skip copying lazyfree pages on collapse

Hey Zach, Yang, Michal, and David,

Please accept my sincerest apologies for the delayed
response.

Thanks for the replies; they've been very helpful to me! I also
appreciate the valuable information you've shared!

I agree that it's not a good idea to let khugepaged avoid
any pages marked with MADV_FREE.
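
For reference, here is a minimal userspace sketch of Yang's "abcd"/"1234" example from the quoted discussion below. It assumes Linux and Python 3.8+ (for mmap.madvise); the function name is made up for illustration. It only shows which outcomes the MADV_FREE semantics allow, not what khugepaged does.

```python
import mmap


def madv_free_then_rewrite():
    """Write "abcd", mark the page MADV_FREE, then write "1234" after it.

    Returns the first 8 bytes of the page as seen after the second write.
    (Hypothetical helper, not taken from the patch or the kernel tree.)
    """
    # MADV_FREE applies to private anonymous memory only.
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[0:4] = b"abcd"

        # The constant may be missing on platforms that lack MADV_FREE.
        advice = getattr(mmap, "MADV_FREE", None)
        if advice is not None:
            try:
                buf.madvise(advice)
            except OSError:
                pass  # hint rejected; the page simply stays as it is

        # Redirty the page. If the kernel already reclaimed it, this
        # write faults in a fresh zero-filled page instead.
        buf[4:8] = b"1234"
        return bytes(buf[0:8])
    finally:
        buf.close()


if __name__ == "__main__":
    # Either b"abcd1234" (page still present when redirtied) or
    # four zero bytes followed by b"1234" (page was reclaimed first).
    print(madv_free_then_rewrite())
```

Both results are valid from the user's point of view; the second one should only show up when the page was actually reclaimed before the second write.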

Thanks again for your time!

Best,
Lance

On Tue, Feb 6, 2024 at 4:27 AM Zach O'Keefe <zokeefe@...gle.com> wrote:
>
> On Mon, Feb 5, 2024 at 11:43 AM Yang Shi <shy828301@...il.com> wrote:
> >
> > On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@...e.com> wrote:
> > >
> > > On Fri 02-02-24 09:42:27, Yang Shi wrote:
> > > > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > > > This is what your second test case does.
> > > >
> > > > Secondly, I think it depends on the semantics of MADV_FREE,
> > > > particularly how to treat the redirtied pages. TBH I'm always confused
> > > > by the semantics. For example, the page contained "abcd", then it was
> > > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> > > > the user should expect to see "abcd1234" or "00001234".
> > >
> > > Correct. You cannot assume the content of the first page as it could
> > > have been reclaimed at any time.
> > >
> > > > > I suppose it should be "abcd1234" since MADV_FREE pages are still
> > > > valid and available, if I'm wrong please feel free to correct me. If
> > > > so we should always copy MADV_FREE pages in khugepaged regardless of
> > > > whether it is redirtied or not otherwise it may incur data corruption.
> > > > If we don't copy, then the follow up redirty after collapse to the
> > > > hugepage may return "00001234", right?
> > >
> > > Right. As pointed above this is a valid outcome if the page has been
> > > dropped. User has means to tell that from /proc/vmstat though. Not in a
> > > great precision but I think it would be really surprising to not see any
> > > pglazyfreed yet the content is gone. I think it would be legit to call
> > > it a bug. One could argue the bug would be in the accounting rather than
> > > the khugepaged implementation because madvised pages could be dropped at
> > > any time. But I think it makes more sense to copy the existing content.
>
> +1. I agree that the content should be dropped iff pglazyfreed is
> incremented. Of course, we could do that here, but I don't think there
> is a good reason to, and we should just copy the contents.
>
> > Yeah, as long as khugepaged sees the MADV_FREE pages, it means they
> > have "NOT" been dropped yet. It may be dropped later if memory
> > pressure occurs, but anyway khugepaged wins the race and khugepaged
> > can't assume the pages will be dropped before they get redirtied. So
> > copying the content does make sense.
>
> Per Lance, I kinda get that this "undermines" MADV_FREE, insofar that,
> from the user's perspective, that memory which was intended as a
> buffer against OOM kill scenarios, is no longer there to reclaim trivially. I
> don't have a real world example where this is an issue, however. Also,
> not copying the contents doesn't change that fact.
>
> The proper alternative, if you want to make the "undermining"
> argument, is for khugepaged to stay away from hugepage regions with
> any MADV_FREE pages. I think it's fair to assume MADV_FREE'd memory is
> likely cold memory, and therefore not a good hugepage target anyways.
> However, it'd be unfortunate if there were a couple MADV_FREE pages in
> the middle of an otherwise hot / highly-utilized hugepage region that
> would prevent it from being pmd-mapped via khugepaged. But.. this is
> exactly-ish what you get when hugepage-aware system/runtime allocators
> split THPs to free up internal caches.
>
> Best,
> Zach
>
>
> > > --
> > > Michal Hocko
> > > SUSE Labs