[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <X9P6XuRG1l1Q6zdR@google.com>
Date: Fri, 11 Dec 2020 15:01:50 -0800
From: Minchan Kim <minchan@...nel.org>
To: Jann Horn <jannh@...gle.com>
Cc: Christoph Hellwig <hch@....de>,
Suren Baghdasaryan <surenb@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Michal Hocko <mhocko@...nel.org>,
Michal Hocko <mhocko@...e.com>,
David Rientjes <rientjes@...gle.com>,
Matthew Wilcox <willy@...radead.org>,
Johannes Weiner <hannes@...xchg.org>,
Roman Gushchin <guro@...com>, Rik van Riel <riel@...riel.com>,
Christian Brauner <christian@...uner.io>,
Oleg Nesterov <oleg@...hat.com>,
Tim Murray <timmurray@...gle.com>,
Linux API <linux-api@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>,
kernel list <linux-kernel@...r.kernel.org>,
kernel-team <kernel-team@...roid.com>
Subject: Re: [PATCH 1/2] mm/madvise: allow process_madvise operations on
entire memory range
On Fri, Dec 11, 2020 at 09:27:46PM +0100, Jann Horn wrote:
> +CC Christoph Hellwig for opinions on compat
>
> On Thu, Nov 26, 2020 at 12:22 AM Minchan Kim <minchan@...nel.org> wrote:
> > On Mon, Nov 23, 2020 at 09:39:42PM -0800, Suren Baghdasaryan wrote:
> > > process_madvise requires a vector of address ranges to be provided for
> > > its operations. When an advice should be applied to the entire process,
> > > the caller process has to obtain the list of VMAs of the target process
> > > by reading the /proc/pid/maps or some other way. The cost of this
> > > operation grows linearly with increasing number of VMAs in the target
> > > process. Even constructing the input vector can be non-trivial when
> > > target process has several thousands of VMAs and the syscall is being
> > > issued during high memory pressure period when new allocations for such
> > > a vector would only worsen the situation.
> > > In the case when advice is being applied to the entire memory space of
> > > the target process, this creates an extra overhead.
> > > Add PMADV_FLAG_RANGE flag for process_madvise enabling the caller to
> > > advise a memory range of the target process. For now, to keep it simple,
> > > only the entire process memory range is supported, vec and vlen inputs
> > > in this mode are ignored and can be NULL and 0.
> > > Instead of returning the number of bytes that advice was successfully
> > > applied to, the syscall in this mode returns 0 on success. This is due
> > > to the fact that the number of bytes would not be useful for the caller
> > > that does not know the amount of memory the call is supposed to affect.
> > > Besides, the ssize_t return type can be too small to hold the number of
> > > bytes affected when the operation is applied to a large memory range.
> >
> > Can we just use one element in iovec to indicate entire address rather
> > than using up the reserved flags?
> >
> > struct iovec {
> > .iov_base = NULL,
> > .iov_len = (~(size_t)0),
> > };
>
> In addition to Suren's objections, I think it's also worth considering
> how this looks in terms of compat API. If a compat process does
> process_madvise() on another compat process, it would be specifying
> the maximum 32-bit number, rather than the maximum 64-bit number, so
> you'd need special code to catch that case, which would be ugly.
>
> And when a compat process uses this API on a non-compat process, it
> semantically gets really weird: The actual address range covered would
> be larger than the address range specified.
>
> And if we want different access checks for the two flavors in the
> future, gating that different behavior on special values in the iovec
> would feel too magical to me.
>
> And the length value SIZE_MAX doesn't really make sense anyway because
> the length of the whole address space would be SIZE_MAX+1, which you
> can't express.
>
> So I'm in favor of a new flag, and strongly against using SIZE_MAX as
> a magic number here.
Can't we simply pass NULL as iovec as special id, then?
Powered by blists - more mailing lists