Message-ID: <Yj0QzhDAA3mz90ly@google.com>
Date: Thu, 24 Mar 2022 17:46:06 -0700
From: Minchan Kim <minchan@...nel.org>
To: Charan Teja Kalla <quic_charante@...cinc.com>
Cc: Michal Hocko <mhocko@...e.com>, akpm@...ux-foundation.org,
surenb@...gle.com, vbabka@...e.cz, rientjes@...gle.com,
nadav.amit@...il.com, edgararriaga@...gle.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Johannes Weiner <hannes@...xchg.org>
Subject: Re: [PATCH 2/2] mm: madvise: return exact bytes advised with
 process_madvise under error

On Thu, Mar 24, 2022 at 09:15:57PM +0530, Charan Teja Kalla wrote:
> Thanks Michal for the inputs.
>
> On 3/24/2022 6:44 PM, Michal Hocko wrote:
> > On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote:
> >> From: Charan Teja Reddy <quic_charante@...cinc.com>
> >>
> >> The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
> >> process_madvise") fixes the issue to return number of bytes that are
> >> successfully advised before hitting error with iovec elements
> >> processing. But, when the user passed unmapped ranges in iovec, the
> >> syscall ignores these holes and continues processing and returns ENOMEM
> >> in the end, which is the same as the madvise semantic. This is a problem
> >> for vector processing, where the user may want to know exactly how many
> >> bytes were processed in an iovec element to make better decisions in
> >> user space. In the ENOMEM case, we processed all bytes in an iovec element
> >> but still returned an error, which leaves the user unsure whether the
> >> advice succeeded or failed.
> > Do you have any specific example where the initial semantic is really
> > problematic or is this mostly a theoretical problem you have found when
> > reading the code?
> >
> >
> >> As an example, consider the ranges below passed by the user in struct
> >> iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and
> >> iovec3(ranges: vma4). The current implementation fully advises
> >> iovec1 and iovec2 but returns only the iovec1 range as the number of
> >> processed bytes. The user may then repeat the processing of iovec2,
> >> which is already processed, and that call returns ENOMEM. The user may
> >> then skip iovec2 and start processing from iovec3. Because of the wrong
> >> processed-bytes return value, iovec2 is processed twice.
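
To make the retry scenario above concrete, here is a minimal user-space sketch of how a caller might map a returned byte count back onto its iovec array to decide where to resume. The helper name first_unprocessed_iov() is hypothetical; it is not part of the patch or of any library, and it assumes the return value counts bytes from the start of the vector.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (an illustration, not taken from the patch):
 * given the number of bytes a process_madvise() call reported as advised,
 * return the index of the first iovec element that was not fully covered
 * and store the offset already covered inside that element in *offset.
 * Returns iovcnt (with *offset == 0) when every element was covered.
 */
static size_t first_unprocessed_iov(const struct iovec *iov, size_t iovcnt,
                                    size_t bytes_advised, size_t *offset)
{
    size_t i;

    assert(iov != NULL && offset != NULL);

    for (i = 0; i < iovcnt; i++) {
        /* Stop at the first element the byte count does not fully cover. */
        if (bytes_advised < iov[i].iov_len)
            break;
        bytes_advised -= iov[i].iov_len;
    }

    *offset = (i < iovcnt) ? bytes_advised : 0;
    return i;
}
```

With only whole-element counts (the behaviour described for the current implementation), the caller resumes at the start of iovec2 and re-advises it. With an exact partial count, the offset would let the caller skip the bytes already advised inside iovec2.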
> > I think you should be much more specific why this is actually a problem.
> > This would surely be less optimal but is this a correctness issue?
> >
>
> Yes, this is a problem found when reading the code, but IMO we can
> easily expect an invalid vma/hole in the passed range because we are
> operating on another process's VMAs. More than solving the problem of being
> less optimal, this can be seen as helping the user make better policy
> decisions with this syscall. Without it, a poorer policy decision from the
> user is merely suboptimal (i.e. issuing the syscall again on the already
> processed range).
>
> Having said that, at present I don't have any reports/unit tests showing
> that the existing semantic is really problematic.
>
> > [...]
> >> +	vma = find_vma_prev(mm, start, &prev);
> >> +	if (vma && start > vma->vm_start)
> >> +		prev = vma;
> >> +
> >> +	blk_start_plug(&plug);
> >> +	for (;;) {
> >> +		/*
> >> +		 * If it hits an unmapped address range in [start, end),
> >> +		 * stop processing and return ENOMEM.
> >> +		 */
> >> +		if (!vma || start < vma->vm_start) {
> >> +			error = -ENOMEM;
> >> +			goto out;
> >> +		}
> >> +
> >> +		tmp = vma->vm_end;
> >> +		if (end < tmp)
> >> +			tmp = end;
> >> +
> >> +		error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
> >> +		if (error)
> >> +			goto out;
> >> +		tmp_bytes_advised += tmp - start;
> >> +		start = tmp;
> >> +		if (prev && start < prev->vm_end)
> >> +			start = prev->vm_end;
> >> +		if (start >= end)
> >> +			goto out;
> >> +		if (prev)
> >> +			vma = prev->vm_next;
> >> +		else
> >> +			vma = find_vma(mm, start);
> >> +	}
> >> +out:
> >> +	/*
> >> +	 * partial_bytes_advised may contain non-zero bytes indicating
> >> +	 * the number of bytes advised before failure. Holds zero in case
> >> +	 * of success.
> >> +	 */
> >> +	*partial_bytes_advised = error ? tmp_bytes_advised : 0;
> > Although this looks like a fix I am not sure it is future proof.
> > madvise_vma_behavior doesn't report which part of the range has been
> > really processed. I do not think that currently supported madvise modes
> > for process_madvise support an early break out with return to the
> > userspace (madvise_cold_or_pageout_pte_range bails on fatal signals for
EINVAL due to can_madv_lru_vma since it encountered VM_PFNMAP, which is not
rare on Android. A user process could filter them out by looking at
/proc/pid/smaps properly, but that is too expensive.
One idea for filtering them out from /proc/<pid>/maps is to check for shared
flags such as rw-s or ---s (even though it's not accurate, it would work
effectively).
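
A rough sketch of that filtering idea, assuming the usual "start-end perms offset dev inode path" layout of a /proc/<pid>/maps line. The helper name maps_line_is_shared() is hypothetical, and, as noted above, the shared flag is only a heuristic for spotting VM_PFNMAP-like mappings, not an accurate test:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (illustration only): parse one line of
 * /proc/<pid>/maps and report whether the mapping is shared, i.e. whether
 * the fourth permission character is 's' (as in "rw-s" or "---s").
 * On a successful parse, *start and *end receive the mapping's range.
 * Returns false for private mappings and for lines that do not parse.
 */
static bool maps_line_is_shared(const char *line,
                                unsigned long *start, unsigned long *end)
{
    char perms[5] = { 0 };

    assert(line != NULL && start != NULL && end != NULL);

    /* Only the first three fields are needed: "start-end perms ..." */
    if (sscanf(line, "%lx-%lx %4s", start, end, perms) != 3)
        return false;

    return perms[3] == 's';
}
```

A caller building the iovec array for process_madvise could read the maps file line by line and skip every range for which this returns true, at the cost of also skipping ordinary shared mappings.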
> > example) but this can change in the future and then you are back to
> > "imprecise" return value problem. Yes, this is a theoretical problem
>
> I agree with the "imprecise" return value problem when processing a
> VMA range. Yes, if it is decided to return the exact processed value from
> madvise_vma_behavior(), this code may need maintenance too.
>
> > but so, it sounds, is the problem you are trying to fix, IMHO. I think it
> > would be better to live with imprecise return values reporting rather
> > than aiming for perfection which would be fragile and add a future
> > maintenance burden.
Actually, I don't think the maintenance cost would be that big.
Having said that, I agree the patch should justify with numbers how
painful the current behavior is, since this is more of an optimization.
Thanks.