linux-kernel - Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YWYWyUMcgoAJqi3V@t490s>
Date:   Wed, 13 Oct 2021 07:14:17 +0800
From:   Peter Xu <peterx@...hat.com>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     Michal Hocko <mhocko@...e.com>,
        David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Minchan Kim <minchan@...nel.org>,
        Colin Cross <ccross@...gle.com>,
        Suren Baghdasarya <surenb@...gle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>
Subject: Re: [RFC PATCH 0/8] mm/madvise: support
 process_madvise(MADV_DONTNEED)

On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
> 
> 
> > On Sep 29, 2021, at 12:52 AM, Michal Hocko <mhocko@...e.com> wrote:
> > 
> > On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> >> 
> >>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@...e.com> wrote:
> >>> 
> >>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> >>> [...]
> >>>> The manager is notified on memory regions that it should monitor
> >>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >>>> using the remote-userfaultfd that you saw on the second thread. When it wants
> >>>> to reclaim (anonymous) memory, it:
> >>>> 
> >>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> >>>>  UFFD-WP to do so efficiently, a patch which I did not send yet).
> >>>> 2. Calls process_vm_readv() to read that memory of that process.
> >>>> 3. Write it back to “swap”.
> >>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> >>> 
> >>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> >> 
> >> Providing hints to the kernel takes you so far to a certain extent.
> >> The kernel does not want to (for a good reason) to be completely
> >> configurable when it comes to reclaim and prefetch policies. Doing
> >> so from userspace allows you to be fully configurable.
> > 
> > I am sorry but I do not follow. Your scenario is describing a user
> > space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> > designed for. What are you missing in the existing functionality?
> 
> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
> many aspects of paging out memory:
> 
> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job
>    on non-contiguous memory).
> 3. No guarantee the page is actually reclaimed (e.g., writeback)
>    and the time it takes place.
> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE
>    as non-performant as it is cannot be used for swap AFAIK).
> 5. Other operations (e.g., locking, working set tracking) that
>    might not be necessary or interfere.
> 
> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
> of userfaultfd to trap page-faults and react accordingly, so you
> are also prevented from:
> 
> 6. Having your own custom prefetching policy in response to #PF.
> 
> There are additional use-cases I can try to formalize in which
> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
> is pretty clear, I think: one is a hint that only applied to
> page reclamation. The other enables the direct control of
> userspace over (almost) all aspects of paging.
> 
> As I suggested before, if it is preferred, this can be a UFFD
> IOCTL instead of process_madvise() behavior, thereby lowering
> the risk of a misuse.

(Sorry to join so late..)

Yeah I'm wondering whether that could add one extra layer of security.  But as
you mentioned, we've already have process_vm_writev(), then it's indeed not
strong reason to reject process_madvise(DONTNEED) too, it seems.

Not sure whether you're aware of the umap project from LLNL:

  https://github.com/LLNL/umap

>From what I can tell, that's really doing very similar thing as what you
proposed here, but it's just a local version of things.  IOW in umap the
DONTNEED can be done locally with madvise() already in the umap maintained
threads.  That close the need to introduce the new process_madvise() interface
and it's definitely safer as it's per-mm and per-task.

I think you mentioned above that the tracee program will need to cooperate in
this case, I'm wondering whether some solution like umap would be fine too as
that also requires cooperation of the tracee program, it's just that the
cooperation may be slightly more than your solution but frankly I think that's
still trivial and before I understand the details of your solution I can't
really tell..

E.g. for a program to use umap, I think it needs to replace mmap() to umap()
where we want the buffers to be managed by umap library rather than the kernel,
then link against the umap library should work.  If the remote solution you're
proposing requires similar (or even more complicated) cooperation, then it'll
be controversial whether that can be done per-mm just like how umap designed
and used.  So IMHO it'll be great to share more details on those parts if umap
cannot satisfy the current need - IMHO it satisfies all the features you
described on fully customized pageout and page faulting in, it's just done in a
single mm.

Thanks,

-- 
Peter Xu