linux-kernel - Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV


Message-ID: <YVQbMREcRaCbUaUv@dhcp22.suse.cz>
Date:   Wed, 29 Sep 2021 09:52:17 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Nadav Amit <nadav.amit@...il.com>
Cc:     David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Peter Xu <peterx@...hat.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Minchan Kim <minchan@...nel.org>,
        Colin Cross <ccross@...gle.com>,
        Suren Baghdasarya <surenb@...gle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>
Subject: Re: [RFC PATCH 0/8] mm/madvise: support
 process_madvise(MADV_DONTNEED)

On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> 
> > On Sep 27, 2021, at 5:16 AM, Michal Hocko <mhocko@...e.com> wrote:
> > 
> > On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> > [...]
> >> The manager is notified about memory regions that it should monitor
> >> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >> using the remote-userfaultfd that you saw on the second thread. When it wants
> >> to reclaim (anonymous) memory, it:
> >> 
> >> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> >>   UFFD-WP to do so efficiently, a patch which I did not send yet).
> >> 2. Calls process_vm_readv() to read that memory of that process.
> >> 3. Write it back to “swap”.
> >> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> > 
> > Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> 
> Providing hints to the kernel only takes you so far.
> The kernel does not want (for a good reason) to be completely
> configurable when it comes to reclaim and prefetch policies. Doing
> so from userspace allows you to be fully configurable.

I am sorry but I do not follow. Your scenario describes user space
driven reclaim, which is what MADV_{COLD,PAGEOUT} have been designed
for. What are you missing in the existing functionality?
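
For reference, a minimal sketch of what user space driven reclaim through
the existing interface can look like: process_madvise() with MADV_PAGEOUT
(or MADV_COLD) applied to a range in another process via a pidfd. The
helper name remote_reclaim_hint() is hypothetical, the fallback constants
are the generic values and may differ on some architectures, and the
privileges required over the target depend on the kernel version (see
process_madvise(2)).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers (generic values; some architectures differ). */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/*
 * Hypothetical helper: hint that [addr, addr + len) in process `pid`
 * should be reclaimed (MADV_PAGEOUT) or deactivated (MADV_COLD).
 * Returns the number of bytes advised, or -1 with errno set.
 */
static ssize_t remote_reclaim_hint(pid_t pid, void *addr, size_t len, int advice)
{
	struct iovec vec = { .iov_base = addr, .iov_len = len };
	ssize_t ret;
	int saved_errno;
	int pidfd;

	/* Only the non-destructive reclaim hints are accepted here. */
	if (advice != MADV_COLD && advice != MADV_PAGEOUT) {
		errno = EINVAL;
		return -1;
	}

	pidfd = (int)syscall(SYS_pidfd_open, (long)pid, 0L);
	if (pidfd < 0)
		return -1;

	/* Raw syscall so the sketch does not depend on a libc wrapper. */
	ret = syscall(SYS_process_madvise, (long)pidfd, &vec, 1UL,
		      (long)advice, 0L);

	/* Preserve the syscall's errno across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

Unlike MADV_DONTNEED, neither hint discards the contents: the target
process still sees the same data when it touches the range again.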

> > MADV_DONTNEED on a remote process has been proposed several times in
> > the past and it has always been rejected because it is a free ticket
> > to all sorts of hard to debug problems: it effectively allows remote
> > memory corruption. An additional capability requirement might reduce the
> > risk to some degree but I still do not think this is a good idea.
> 
> I would argue that there is nothing bad that remote MADV_DONTNEED can do
> that process_vm_writev() cannot do as well (putting aside ptrace).

I am not arguing this would be the first syscall to allow tricky and
hard to debug corruptions if used without care.
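
To illustrate the comparison made above, here is a minimal sketch of how
process_vm_writev() already lets a sufficiently privileged caller change
another process's memory behind its back. The helper name
remote_overwrite() is hypothetical; the raw syscall is used only to keep
the sketch independent of the libc wrapper.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes from `src` in the caller into
 * `dst` in the address space of process `pid`.
 * Returns the number of bytes written, or -1 with errno set.
 */
static ssize_t remote_overwrite(pid_t pid, void *dst, const void *src, size_t len)
{
	struct iovec local = { .iov_base = (void *)src, .iov_len = len };
	struct iovec remote = { .iov_base = dst, .iov_len = len };

	/* One local and one remote segment; flags must be 0. */
	return syscall(SYS_process_vm_writev, (long)pid, &local, 1UL,
		       &remote, 1UL, 0UL);
}
```

The target gets no notification of the write, which is the sense in which
such an interface can produce corruptions that are hard to debug.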

> process_vm_writev() is checking:
> 
> 	mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS)
> 
> Wouldn't adding such a condition suffice?

This would be a minimum requirement. Another one is a sensible usecase
that is not covered by existing functionality.

-- 
Michal Hocko
SUSE Labs