[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YjSnTu9QZiiZQE7A@google.com>
Date: Fri, 18 Mar 2022 08:37:50 -0700
From: Minchan Kim <minchan@...nel.org>
To: Charan Teja Kalla <quic_charante@...cinc.com>
Cc: Nadav Amit <nadav.amit@...il.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
David Rientjes <rientjes@...gle.com>,
Stephen Rothwell <sfr@...b.auug.org.au>,
Edgar Arriaga García
<edgararriaga@...gle.com>, Michal Hocko <mhocko@...e.com>,
linux-mm <linux-mm@...ck.org>,
LKML <linux-kernel@...r.kernel.org>,
"# 5 . 10+" <stable@...r.kernel.org>
Subject: Re: [PATCH V2,2/2] mm: madvise: skip unmapped vma holes passed to
process_madvise
On Fri, Mar 18, 2022 at 07:35:41PM +0530, Charan Teja Kalla wrote:
> Thank you for valuable inputs.
>
> On 3/18/2022 2:08 AM, Nadav Amit wrote:
> >>>>>> IMO, it's worth to note in man page.
> >>>>>>
> >>>>> Or the current patch for just ENOMEM is sufficient here and we just have
> >>>>> to update the man page?
> >>>> I think the "On success, process_madvise() returns the number of bytes
> >>>> advised" behaviour sounds useful. But madvise() doesn't do that.
> >>>>
> >>>> RETURN VALUE
> >>>> On success, madvise() returns zero. On error, it returns -1 and errno
> >>>> is set to indicate the error.
> >>>>
> >>>> So why is it desirable in the case of process_madvise()?
> >>> Since process_madvise deal with multiple ranges and could fail at one of
> >>> them in the middle or pocessing, people could decide where the call
> >>> failed and then make a strategy whether they will abort at the point or
> >>> continue to hint next addresses. Here, problem of the strategy is API
> >>> doesn't return any error vaule if it has processed any bytes so they
> >>> would have limitation to decide a policy. That's the limitation for
> >>> every vector IO syscalls, unfortunately.
> >>>
> >>>>
> >>>>
> >>>> And why was process_madvise() designed this way? Or was it
> >>>> always simply an error in the manpage?
> >> Taking a closer look, indeed manpage seems to be wrong.
> >> https://elixir.bootlin.com/linux/v5.17-rc8/source/mm/madvise.c#L1154
> >> indicates that in the presence of unmapped holes madvise will skip
> >> them but will return ENOMEM and that's what process_madvise is
> >> ultimately returning in this case. So, the manpage claim of "This
> >> return value may be less than the total number of requested bytes, if
> >> an error occurred after some iovec elements were already processed."
> >> does not reflect the reality in our case because the return value will
> >> be -ENOMEM. After the desired behavior is finalized I'll modify the
> >> manpage accordingly.
> > Since process_madvise() might be used in sort of non-cooperative mode,
> > I think that the caller cannot guarantee that it knows exactly the
> > memory layout of the process whose memory it madvise’s. I know that
> > MADV_DONTNEED for instance is not supported (at least today) by
> > process_madvise(), but if it were, the caller may want which exact
> > memory was madvise'd even if the target process ran some other
> > memory layout changing syscalls (e.g., munmap()).
> >
> > IOW, skipping holes and just returning the total number of madvise’d
> > bytes might not be enough.
>
> Then does the advised bytes range by default including holes is a
> correct design?
> Say the [start, len) range passed in the iovec by the user contains the
> layout like, vma1 -- hole-- vma2 -- hole -- vma3.
>
> Under ideal case, where all vma's are eligible for advise, the total
> bytes processed returning should be vma3->end - vma1->start. This is
> success case.
>
> Now, say that vma1 is succeeded but vma2(say VM_LOCKED) is failed at
> advise. In such case processed bytes will be
> vma2->start-vma1->start(still consider hole as bytes processed), so that
> user may restart/skip at vma2, then continue. This return type will be
> partially processed bytes.
>
> If the system doesn't found any VMA in the passed range by user, it
> returns ENOMEM as not a single advisable vma is found in the range.
As I mentioned in other reply, let's do not make any exception(i.e.,
skipping hole) for vectored memory syscall but exact processed bytes
on the exact ranges.
Powered by blists - more mailing lists