[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABb0KFE6Y=a5DQKjy3vKeP9YURwri3JHNKTCnN7PzOPOxr9SKQ@mail.gmail.com>
Date: Tue, 11 Apr 2023 11:29:44 +0200
From: Michał Mirosław <emmir@...gle.com>
To: Muhammad Usama Anjum <usama.anjum@...labora.com>
Cc: Mike Rapoport <rppt@...nel.org>, Peter Xu <peterx@...hat.com>,
David Hildenbrand <david@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrei Vagin <avagin@...il.com>,
Danylo Mocherniuk <mdanylo@...gle.com>,
Paul Gofman <pgofman@...eweavers.com>,
Cyrill Gorcunov <gorcunov@...il.com>,
Nadav Amit <namit@...are.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Shuah Khan <shuah@...nel.org>,
Christian Brauner <brauner@...nel.org>,
Yang Shi <shy828301@...il.com>,
Vlastimil Babka <vbabka@...e.cz>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Yun Zhou <yun.zhou@...driver.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Alex Sierra <alex.sierra@....com>,
Matthew Wilcox <willy@...radead.org>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
"Gustavo A . R . Silva" <gustavoars@...nel.org>,
Dan Williams <dan.j.williams@...el.com>,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
Greg KH <gregkh@...uxfoundation.org>, kernel@...labora.com
Subject: Re: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and
optionally clear info about PTEs
On Fri, 7 Apr 2023 at 13:11, Muhammad Usama Anjum
<usama.anjum@...labora.com> wrote:
> On 4/7/23 3:14 PM, Michał Mirosław wrote:
> > On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
> > <usama.anjum@...labora.com> wrote:
> >> On 4/7/23 12:34 PM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
> >>> <usama.anjum@...labora.com> wrote:
> >>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
> >>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> >>>>> <usama.anjum@...labora.com> wrote:
> > [...]
> >>>>>>>> + /*
> >>>>>>>> + * Allocate smaller buffer to get output from inside the page walk
> >>>>>>>> + * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >>>>>>>> + * we want to return output to user in compact form where no two
> >>>>>>>> + * consecutive regions should be continuous and have the same flags.
> >>>>>>>> + * So store the latest element in p.cur between different walks and
> >>>>>>>> + * store the p.cur at the end of the walk to the user buffer.
> >>>>>>>> + */
> >>>>>>>> + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >>>>>>>> + GFP_KERNEL);
> >>>>>>>> + if (!p.vec)
> >>>>>>>> + return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> + walk_start = walk_end = start;
> >>>>>>>> + while (walk_end < end && !ret) {
> >>>>>>>
> >>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
> >>>>>>> error will be lost) - is it intended?
> >>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
> >>>>>> there was more memory to walk over. We don't treat this error. So when
> >>>>>> buffer gets full, we stop walking over further as user buffer has gotten
> >>>>>> full and return as success.
> >>>>>
> >>>>> Thanks. What's the difference between -ENOSPC and
> >>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> >>>>> flow).
> >>>> -ENOSPC --> user buffer has been filled completely
> >>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
> >>>> still have more space
> >>>
> >>> What is the difference in code behaviour when those two cases are
> >>> compared? (I'd expect none.)
> >> There is difference:
> >> We add data to user buffer. If it succeeds with return code 0, we engage
> >> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
> >> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
> >> to the user buffer.
> >
> > Thanks! I see it now. I see a few more corner cases here:
> > 1. If we did engage WP but fail to copy the vector we return -EFAULT
> > but the WP is already engaged. I'm not sure this is something worth
> > guarding against, but documenting that would be helpful I think.
> Sure.
>
> > 2. If uffd_wp_range() fails, but we have already processed pages
> > earlier, we should treat the error like ENOSPC and back out the failed
> > range (the earier changes would be lost otherwise).
> Backing out is easier to do for hugepages. But for normal pages, we'll have
> to write some code to find where the current data was added (in cur or in
> vec) and back out from that. I'll have to write some more code to avoid the
> side-effects as well.
If I read the code correctly, the last page should always be in `cur`
and on failure only a single page is needed to be backed out. Did I
miss something?
> But aren't we going over-engineering here? Error occurred and we are trying
> to keep the previously generated correct data and returning successfully
> still to the user? I don't think we should do this. An error is error. We
> should return the error simply even if the memory flags would get lost. We
> don't know what caused the error in uffd_wp_range(). Under normal
> situation, we there shouldn't have had error.
In this case it means that on (intermittent) allocation error we get
inconsistent or non-deterministic results. I wouldn't want to be the
one debugging this later - I'd prefer either the syscall be
"exception-safe" (give consistent and predictable output) or kill the
process.
Best Regards
Michał Mirosław
Powered by blists - more mailing lists