Message-ID: <CAAa6QmRjqob=HQ1K4c+vP5iydM_VA-wd5NcoDLVuX=13NwedSQ@mail.gmail.com>
Date: Mon, 5 Feb 2024 12:26:31 -0800
From: "Zach O'Keefe" <zokeefe@...gle.com>
To: Yang Shi <shy828301@...il.com>
Cc: Michal Hocko <mhocko@...e.com>, Lance Yang <ioworker0@...il.com>, akpm@...ux-foundation.org, 
	david@...hat.com, songmuchun@...edance.com, peterx@...hat.com, 
	minchan@...nel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] mm/khugepaged: skip copying lazyfree pages on collapse

On Mon, Feb 5, 2024 at 11:43 AM Yang Shi <shy828301@...il.com> wrote:
>
> On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Fri 02-02-24 09:42:27, Yang Shi wrote:
> > > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > > This is what your second test case does.
> > >
> > > Secondly, I think it depends on the semantics of MADV_FREE,
> > > particularly how to treat the redirtied pages. TBH I'm always confused
> > > by the semantics. For example, the page contained "abcd", then it was
> > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> > > the user should expect to see "abcd1234" or "00001234".
> >
> > Correct. You cannot assume the content of the first page as it could
> > have been reclaimed at any time.
> >
> > > I suppose it should be "abcd1234" since MADV_FREE pages are still
> > > valid and available, if I'm wrong please feel free to correct me. If
> > > so we should always copy MADV_FREE pages in khugepaged regardless of
> > > whether it is redirtied or not otherwise it may incur data corruption.
> > > If we don't copy, then the follow up redirty after collapse to the
> > > hugepage may return "00001234", right?
> >
> > Right. As pointed above this is a valid outcome if the page has been
> > dropped. User has means to tell that from /proc/vmstat though. Not in a
> > great precision but I think it would be really surprising to not see any
> > pglazyfreed yet the content is gone. I think it would be legit to call
> > it a bug. One could argue the bug would be in the accounting rather than
> > the khugepaged implementation because madvised pages could be dropped at
> > any time. But I think it makes more sense to copy the existing content.

+1. I agree that the content should be dropped iff pglazyfreed is
incremented. Of course, we could drop the pages here and bump the
counter, but I don't think there is a good reason to, and we should
just copy the contents.
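
To make the /proc/vmstat check mentioned above concrete, here is a minimal
userspace sketch that looks up the pglazyfreed counter. The helper names
(vmstat_parse, vmstat_read) and the 16 KiB buffer size are assumptions made
for illustration; nothing here comes from the patch under discussion.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find "<name> <value>" in vmstat-style text (one counter per line).
 * Returns the value, or -1 if the counter is not present.
 * Hypothetical helper, not a kernel or libc API.
 */
long long vmstat_parse(const char *text, const char *name)
{
    assert(text != NULL && name != NULL);

    size_t len = strlen(name);
    const char *line = text;

    while (*line != '\0') {
        /* Require the full name followed by the separating space, so
         * that "pglazyfree" does not match the "pglazyfreed" line. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return strtoll(line + len + 1, NULL, 10);

        const char *nl = strchr(line, '\n');
        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return -1;
}

/* Read /proc/vmstat and return the named counter, or -1 on failure. */
long long vmstat_read(const char *name)
{
    char buf[16384]; /* assumption: the whole file fits in this buffer */
    FILE *f = fopen("/proc/vmstat", "r");

    if (f == NULL)
        return -1;

    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    return vmstat_parse(buf, name);
}
```

Calling vmstat_read("pglazyfreed") before and after a test run and comparing
the two values gives the coarse, system-wide signal described above; it
cannot attribute a reclaimed page to a particular process.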

> Yeah, as long as khugepaged sees the MADV_FREE pages, it means they
> have "NOT" been dropped yet. It may be dropped later if memory
> pressure occurs, but anyway khugepaged wins the race and khugepaged
> can't assume the pages will be dropped before they get redirtied. So
> copying the content does make sense.
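
The "abcd" / "1234" scenario from earlier in the thread can be sketched in
userspace as follows. This is only an illustration of the two outcomes the
thread calls valid, not a test for the patch; the names classify and
redirty_demo are hypothetical, and "0000" in the thread is taken to mean
four zero bytes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* Linux value; fallback for older libc headers */
#endif

enum redirty_result { KEPT_OLD, ZERO_FILLED, UNEXPECTED };

/* Classify the first 8 bytes of the page after the second write. */
enum redirty_result classify(const unsigned char *p)
{
    assert(p != NULL);

    if (memcmp(p + 4, "1234", 4) != 0)
        return UNEXPECTED;  /* our own write must always be visible */
    if (memcmp(p, "abcd", 4) == 0)
        return KEPT_OLD;    /* page was not reclaimed: "abcd1234" */
    if (memcmp(p, "\0\0\0\0", 4) == 0)
        return ZERO_FILLED; /* page was reclaimed first: "00001234" */
    return UNEXPECTED;
}

/* Write "abcd", MADV_FREE the page, write "1234" after it, classify. */
enum redirty_result redirty_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return UNEXPECTED;

    memcpy(p, "abcd", 4);
    if (madvise(p, len, MADV_FREE) != 0) {
        munmap(p, len);
        return UNEXPECTED;  /* e.g. kernel without MADV_FREE support */
    }
    memcpy(p + 4, "1234", 4); /* redirties the page */

    enum redirty_result r = classify(p);
    munmap(p, len);
    return r;
}
```

Without memory pressure between the madvise() call and the second write,
redirty_demo() would normally be expected to report KEPT_OLD; ZERO_FILLED
is the other outcome the thread treats as legitimate.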

Per Lance, I kinda get that this "undermines" MADV_FREE, insofar as,
from the user's perspective, memory that was intended as a buffer
against OOM-kill scenarios is no longer there to reclaim trivially. I
don't have a real-world example where this is an issue, however. Also,
not copying the contents doesn't change that fact.

The proper alternative, if you want to make the "undermining"
argument, is for khugepaged to stay away from hugepage regions with
any MADV_FREE pages. I think it's fair to assume MADV_FREE'd memory is
likely cold memory, and therefore not a good hugepage target anyways.
However, it'd be unfortunate if there were a couple MADV_FREE pages in
the middle of an otherwise hot / highly-utilized hugepage region that
would prevent it from being pmd-mapped via khugepaged. But... this is
exactly-ish what you get when hugepage-aware system/runtime allocators
split THPs to free up internal caches.
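
The three options discussed in this thread can be compared with a toy
userspace model. It is a sketch under heavy assumptions: the enums and
pages_copied() are hypothetical names, and the model ignores everything
khugepaged really checks (max_ptes_none, page references, locking, and so
on). It only counts how many small pages would have their contents copied
into the new hugepage under each policy.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified per-page state within one hugepage-sized region. */
enum page_state { PAGE_NONE, PAGE_ACTIVE, PAGE_LAZYFREE };

enum policy {
    COPY_ALL,           /* copy MADV_FREE pages like any other page */
    SKIP_LAZYFREE_COPY, /* collapse, but do not copy MADV_FREE pages */
    AVOID_REGION        /* do not collapse if any MADV_FREE page exists */
};

/*
 * Return how many pages would be copied into the hugepage, or -1 if the
 * policy leaves the region alone. Toy model only, not kernel code.
 */
int pages_copied(const enum page_state *pages, size_t n, enum policy pol)
{
    assert(pages != NULL);

    int copied = 0;

    for (size_t i = 0; i < n; i++) {
        switch (pages[i]) {
        case PAGE_NONE:
            break;              /* nothing mapped, nothing to copy */
        case PAGE_ACTIVE:
            copied++;
            break;
        case PAGE_LAZYFREE:
            if (pol == AVOID_REGION)
                return -1;      /* one lazyfree page blocks collapse */
            if (pol == COPY_ALL)
                copied++;
            break;
        }
    }
    return copied;
}
```

In this model, AVOID_REGION is the only policy that leaves the lazyfree
pages reclaimable as before, at the cost of never pmd-mapping a region that
contains even one of them.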

Best,
Zach


> > --
> > Michal Hocko
> > SUSE Labs
