linux-kernel - Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire memory range


Message-ID: <X9P6XuRG1l1Q6zdR@google.com>
Date:   Fri, 11 Dec 2020 15:01:50 -0800
From:   Minchan Kim <minchan@...nel.org>
To:     Jann Horn <jannh@...gle.com>
Cc:     Christoph Hellwig <hch@....de>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...nel.org>,
        Michal Hocko <mhocko@...e.com>,
        David Rientjes <rientjes@...gle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Roman Gushchin <guro@...com>, Rik van Riel <riel@...riel.com>,
        Christian Brauner <christian@...uner.io>,
        Oleg Nesterov <oleg@...hat.com>,
        Tim Murray <timmurray@...gle.com>,
        Linux API <linux-api@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        kernel-team <kernel-team@...roid.com>
Subject: Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on
 entire memory range

On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> +CC Christoph Hellwig for opinions on compat
> 
> On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim <minchan@...nel.org> wrote:
> > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > process_madvise requires a vector of address ranges to be provided for
> > > its operations. When an advice should be applied to the entire process,
> > > the caller process has to obtain the list of VMAs of the target process
> > > by reading the /proc/pid/maps or some other way. The cost of this
> > > operation grows linearly with increasing number of VMAs in the target
> > > process. Even constructing the input vector can be non-trivial when
> > > target process has several thousands of VMAs and the syscall is being
> > > issued during high memory pressure period when new allocations for such
> > > a vector would only worsen the situation.
> > > In the case when advice is being applied to the entire memory space of
> > > the target process, this creates an extra overhead.
> > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > advise a memory range of the target process. For now, to keep it simple,
> > > only the entire process memory range is supported, vec and vlen inputs
> > > in this mode are ignored and can be NULL and 0.
> > > Instead of returning the number of bytes that advice was successfully
> > > applied to, the syscall in this mode returns 0 on success. This is due
> > > to the fact that the number of bytes would not be useful for the caller
> > > that does not know the amount of memory the call is supposed to affect.
> > > Besides, the ssize_t return type can be too small to hold the number of
> > > bytes affected when the operation is applied to a large memory range.
> >
> > Can we just use one element in iovec to indicate entire address rather
> > than using up the reserved flags?
> >
> >         struct iovec vec = {
> >                 .iov_base = NULL,
> >                 .iov_len = (~(size_t)0),
> >         };
> 
> In addition to Suren's objections, I think it's also worth considering
> how this looks in terms of compat API. If a compat process does
> process_madvise() on another compat process, it would be specifying
> the maximum 32-bit number, rather than the maximum 64-bit number, so
> you'd need special code to catch that case, which would be ugly.
> 
> And when a compat process uses this API on a non-compat process, it
> semantically gets really weird: The actual address range covered would
> be larger than the address range specified.
> 
> And if we want different access checks for the two flavors in the
> future, gating that different behavior on special values in the iovec
> would feel too magical to me.
> 
> And the length value SIZE_MAX doesn't really make sense anyway because
> the length of the whole address space would be SIZE_MAX+1, which you
> can't express.
> 
> So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> a magic number here.

Can't we simply pass NULL as the iovec, as a special identifier, then?
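
The compat and SIZE_MAX objections quoted above can be made concrete with a small user-space sketch. It is not kernel code: sentinel_len(), is_whole_range_naive(), is_whole_range() and full_space_len() are hypothetical names invented for illustration, and the 32/64-bit widths are assumptions standing in for a compat and a native caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: the value of ~(size_t)0 as seen by a caller whose size_t is
 * `bits` wide (32 for a compat task, 64 for a native one). */
static uint64_t sentinel_len(unsigned bits)
{
    assert(bits == 32 || bits == 64);
    return bits == 32 ? (uint64_t)UINT32_MAX : UINT64_MAX;
}

/* Naive check on a 64-bit kernel: only recognises the native sentinel. */
static bool is_whole_range_naive(uint64_t len)
{
    return len == UINT64_MAX;
}

/* The "special code" needed to also catch the compat case: the check has to
 * know which flavour of caller supplied the length. */
static bool is_whole_range(uint64_t len, bool caller_is_compat)
{
    return len == sentinel_len(caller_is_compat ? 32 : 64);
}

/* The true length of a whole address space is SIZE_MAX + 1, which wraps to 0
 * in size_t and therefore cannot be passed as iov_len. */
static size_t full_space_len(void)
{
    return (size_t)(SIZE_MAX + (size_t)1);
}
```

Under these assumptions, a compat caller's sentinel (0xffffffff) fails the naive check, so the kernel would need the caller-aware variant, and full_space_len() evaluates to 0, which is why SIZE_MAX can only ever approximate "everything".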