
Message-ID: <CAJHvVcg1pFfFSggGjDCNKt6ZzS07HYNHHQMcmguZACECpBGf=Q@mail.gmail.com>
Date:   Fri, 11 Feb 2022 11:08:14 -0800
From:   Axel Rasmussen <axelrasmussen@...gle.com>
To:     Peter Xu <peterx@...hat.com>
Cc:     Mike Kravetz <mike.kravetz@...cle.com>,
        Linux MM <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Naoya Horiguchi <naoya.horiguchi@...ux.dev>,
        David Hildenbrand <david@...hat.com>,
        Mina Almasry <almasrymina@...gle.com>,
        Michal Hocko <mhocko@...e.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Shuah Khan <shuah@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings

On Thu, Feb 10, 2022 at 6:29 PM Peter Xu <peterx@...hat.com> wrote:
>
> On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
> > > Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> > > this is the only api that can force strip the hugetlb mapped pgtable without
> > > losing pagecache data.
> >
> > Correct.  However, I do not know if uffd-minor users would ever want to
> > do this.  Perhaps?

I talked with some colleagues, and I didn't come up with any
production *requirement* for it, but it may be a convenience in some
cases (it could make certain code cleaner, e.g. by not having to
unmap-and-remap to tear down page tables, as Peter mentioned). I think
Peter's assessment below is right.
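
To make the semantics under discussion concrete, here is a minimal C
sketch; it is not taken from the patch. It uses an ordinary memfd (shmem)
shared mapping rather than hugetlbfs, since the hugetlb case needs
reserved huge pages and a kernel carrying this series. The function name
dontneed_keeps_pagecache is made up for illustration. The sketch shows
MADV_DONTNEED dropping the mapping's page tables while the file's page
cache contents survive, so the next access faults the same data back in:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if data written through a MAP_SHARED
 * file mapping is still readable after MADV_DONTNEED, 0 if it is not,
 * and -1 if setup or the madvise() call fails.
 */
int dontneed_keeps_pagecache(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* Anonymous in-memory file; stands in for a hugetlbfs file here. */
    int fd = memfd_create("dontneed-demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Populate the page cache (and the page table entry) via a write. */
    strcpy(p, "hello");

    int ret = -1;
    /* Zap the page table entries; the page cache page is left alone. */
    if (madvise(p, (size_t)page, MADV_DONTNEED) == 0) {
        /* This read faults and maps the existing page cache page again. */
        ret = (strcmp(p, "hello") == 0);
    }

    munmap(p, (size_t)page);
    close(fd);
    return ret;
}
```

With the series applied, the same madvise() call on a MAP_SHARED
hugetlbfs mapping (over a huge-page-aligned range) is what would let a
uffd-minor user get by with a single mapping instead of two.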

>
> My understanding is that before this patch, uffd-minor upon hugetlbfs requires
> the huge file to be mapped twice: once to populate the content, so that we can
> then trap MINOR faults via the other mapping.  Or we could munmap() the range
> and remap it again at the same file offset to drop the pgtables, I think. But
> that sounds tricky.  MINOR faults only work with the pgtables dropped.
>
> With DONTNEED upon hugetlbfs we can rely on one single mapping of the file,
> because we can explicitly drop the pgtables of hugetlbfs files without any
> other tricks.
>
> However, I have no real use case for it.  Initially I thought it could be
> useful for QEMU, because QEMU's migration routine runs in the same mm context
> as the hypervisor, so by default it doesn't have two mappings of the same
> guest memory.  If QEMU wants to leverage minor faults, DONTNEED could help.
>
> However, when I measured bitmap transfer (assuming that's what minor faults
> could help with in QEMU's postcopy) some months ago, I found it wasn't nearly
> as slow as I had thought.  Either I missed something, or we're facing
> different problems from the ones uffd-minor was first proposed by Axel to
> solve.

Re: the bitmap, that matters most on machines with lots of RAM. For
example, GCE offers some VMs with up to 12 *TB* of RAM
(https://cloud.google.com/compute/docs/memory-optimized-machines). I
think at this size we see a significant benefit, as the bitmap may
take a long time to arrive over the network.
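
As a rough back-of-the-envelope check on the size involved, the C sketch
below computes how large such a bitmap would be. It assumes one bit per
tracked page and reads "12 TB" as 12 TiB; neither the bitmap's actual
granularity nor its wire format is stated in this thread, and
bitmap_bytes is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes needed for a bitmap holding one bit per
 * page (an assumed layout), rounding up to whole pages and whole bytes.
 */
uint64_t bitmap_bytes(uint64_t ram_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    uint64_t pages = (ram_bytes + page_bytes - 1) / page_bytes;
    return (pages + 7) / 8;
}
```

Under those assumptions, bitmap_bytes(12ULL << 40, 4096) is 402653184
bytes (384 MiB) when tracking at 4k granularity, versus 1536 bytes if the
same memory were tracked at 1G granularity.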

But I think that's a bit of an edge case; most machines are not that
big. :) I think the benefit is more often seen just in avoiding
copies. E.g. if we find a page is already up-to-date after precopy, we
just install PTEs, with no copying or page allocation needed. And even
when we have to go fetch a page over the network, one can imagine an
RDMA setup where we avoid any copies/allocations at all, even in that
case. I suppose this also has a bigger effect on larger machines, e.g.
ones that are backed by 1G pages instead of 4k.

>
> This is probably too off topic, though..  Let me go back..
>
> That said, one thing I'm not sure about with DONTNEED on hugetlb is whether
> this further abuses DONTNEED, as the original POSIX definition is as simple as:
>
>   The application expects that it will not access the specified address range
>   in the near future.
>
> Linux implements it by tearing down the pgtables, which looks okay so far.
> Applying it to hugetlbfs could be a bit weirder, because by definition it's a
> hint to page reclaim, yet hugetlbfs is not a target of page reclaim, nor is it
> LRU-aware.  This moves it further toward a MADV_ZAP-style syscall.
>
> I think it could still be fine, as POSIX doesn't define that behavior
> specifically for hugetlb, so it can be defined by Linux, but I'm not sure
> whether there are other implications.
>
> Thanks,
>
> --
> Peter Xu
>