linux-kernel - Re: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABb0KFE6Y=a5DQKjy3vKeP9YURwri3JHNKTCnN7PzOPOxr9SKQ@mail.gmail.com>
Date:   Tue, 11 Apr 2023 11:29:44 +0200
From:   Michał Mirosław <emmir@...gle.com>
To:     Muhammad Usama Anjum <usama.anjum@...labora.com>
Cc:     Mike Rapoport <rppt@...nel.org>, Peter Xu <peterx@...hat.com>,
        David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrei Vagin <avagin@...il.com>,
        Danylo Mocherniuk <mdanylo@...gle.com>,
        Paul Gofman <pgofman@...eweavers.com>,
        Cyrill Gorcunov <gorcunov@...il.com>,
        Nadav Amit <namit@...are.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Shuah Khan <shuah@...nel.org>,
        Christian Brauner <brauner@...nel.org>,
        Yang Shi <shy828301@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Yun Zhou <yun.zhou@...driver.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Alex Sierra <alex.sierra@....com>,
        Matthew Wilcox <willy@...radead.org>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        "Gustavo A . R . Silva" <gustavoars@...nel.org>,
        Dan Williams <dan.j.williams@...el.com>,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
        Greg KH <gregkh@...uxfoundation.org>, kernel@...labora.com
Subject: Re: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and
 optionally clear info about PTEs

On Fri, 7 Apr 2023 at 13:11, Muhammad Usama Anjum
<usama.anjum@...labora.com> wrote:
> On 4/7/23 3:14 PM, Michał Mirosław wrote:
> > On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
> > <usama.anjum@...labora.com> wrote:
> >> On 4/7/23 12:34 PM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
> >>> <usama.anjum@...labora.com> wrote:
> >>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
> >>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> >>>>> <usama.anjum@...labora.com> wrote:
> > [...]
> >>>>>>>> +       /*
> >>>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
> >>>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >>>>>>>> +        * we want to return output to user in compact form where no two
> >>>>>>>> +        * consecutive regions should be continuous and have the same flags.
> >>>>>>>> +        * So store the latest element in p.cur between different walks and
> >>>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
> >>>>>>>> +        */
> >>>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >>>>>>>> +                             GFP_KERNEL);
> >>>>>>>> +       if (!p.vec)
> >>>>>>>> +               return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +       walk_start = walk_end = start;
> >>>>>>>> +       while (walk_end < end && !ret) {
> >>>>>>>
> >>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
> >>>>>>> error will be lost) - is it intended?
> >>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
> >>>>>> there was more memory to walk over. We don't treat this error. So when
> >>>>>> buffer gets full, we stop walking over further as user buffer has gotten
> >>>>>> full and return as success.
> >>>>>
> >>>>> Thanks. What's the difference between -ENOSPC and
> >>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> >>>>> flow).
> >>>> -ENOSPC --> user buffer has been filled completely
> >>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
> >>>>                             still have more space
> >>>
> >>> What is the difference in code behaviour when those two cases are
> >>> compared? (I'd expect none.)
> >> There is difference:
> >> We add data to user buffer. If it succeeds with return code 0, we engage
> >> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
> >> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
> >> to the user buffer.
> >
> > Thanks! I see it now. I see a few more corner cases here:
> > 1. If we did engage WP but fail to copy the vector we return -EFAULT
> > but the WP is already engaged. I'm not sure this is something worth
> > guarding against, but documenting that would be helpful I think.
> Sure.
>
> > 2. If uffd_wp_range() fails, but we have already processed pages
> > earlier, we should treat the error like ENOSPC and back out the failed
> > range (the earier changes would be lost otherwise).
> Backing out is easier to do for hugepages. But for normal pages, we'll have
> to write some code to find where the current data was added (in cur or in
> vec) and back out from that. I'll have to write some more code to avoid the
> side-effects as well.

If I read the code correctly, the last page should always be in `cur`
and on failure only a single page is needed to be backed out. Did I
miss something?

> But aren't we going over-engineering here? Error occurred and we are trying
> to keep the previously generated correct data and returning successfully
> still to the user? I don't think we should do this. An error is error. We
> should return the error simply even if the memory flags would get lost. We
> don't know what caused the error in uffd_wp_range(). Under normal
> situation, we there shouldn't have had error.

In this case it means that on (intermittent) allocation error we get
inconsistent or non-deterministic results. I wouldn't want to be the
one debugging this later - I'd prefer either the syscall be
"exception-safe" (give consistent and predictable output) or kill the
process.

Best Regards
Michał Mirosław