linux-kernel - Re: [PATCH v18 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABb0KFFaXgJD99pWfx3MC+qrq5jUaPis_kZo6U8yL_8xdp0GJA@mail.gmail.com>
Date:   Wed, 14 Jun 2023 20:00:18 +0200
From:   Michał Mirosław <emmir@...gle.com>
To:     Muhammad Usama Anjum <usama.anjum@...labora.com>
Cc:     Peter Xu <peterx@...hat.com>, David Hildenbrand <david@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrei Vagin <avagin@...il.com>,
        Danylo Mocherniuk <mdanylo@...gle.com>,
        Paul Gofman <pgofman@...eweavers.com>,
        Cyrill Gorcunov <gorcunov@...il.com>,
        Mike Rapoport <rppt@...nel.org>, Nadav Amit <namit@...are.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Shuah Khan <shuah@...nel.org>,
        Christian Brauner <brauner@...nel.org>,
        Yang Shi <shy828301@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Yun Zhou <yun.zhou@...driver.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Alex Sierra <alex.sierra@....com>,
        Matthew Wilcox <willy@...radead.org>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        "Gustavo A . R . Silva" <gustavoars@...nel.org>,
        Dan Williams <dan.j.williams@...el.com>,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-kselftest@...r.kernel.org,
        Greg KH <gregkh@...uxfoundation.org>, kernel@...labora.com
Subject: Re: [PATCH v18 2/5] fs/proc/task_mmu: Implement IOCTL to get and
 optionally clear info about PTEs

(A quick reply to answer open questions in case they help the next version.)

On Wed, 14 Jun 2023 at 19:10, Muhammad Usama Anjum
<usama.anjum@...labora.com> wrote:
> On 6/14/23 8:14 PM, Michał Mirosław wrote:
> > On Wed, 14 Jun 2023 at 15:46, Muhammad Usama Anjum
> > <usama.anjum@...labora.com> wrote:
> >>
> >> On 6/14/23 3:36 AM, Michał Mirosław wrote:
> >>> On Tue, 13 Jun 2023 at 12:29, Muhammad Usama Anjum
> >>> <usama.anjum@...labora.com> wrote:
[...]
> >>>> +       if (cur_buf->bitmap == bitmap &&
> >>>> +           cur_buf->start + cur_buf->len * PAGE_SIZE == addr) {
> >>>> +               cur_buf->len += n_pages;
> >>>> +               p->found_pages += n_pages;
> >>>> +       } else {
> >>>> +               if (cur_buf->len && p->vec_buf_index >= p->vec_buf_len)
> >>>> +                       return -ENOMEM;
> >>>
> >>> Shouldn't this be -ENOSPC? -ENOMEM usually signifies that the kernel
> >>> ran out of memory when allocating, not that there is no space in a
> >>> user-provided buffer.
> >> There are 3 kinds of return values here:
> >> * PM_SCAN_FOUND_MAX_PAGES (1) ---> max_pages have been found. Abort the
> >> page walk from next entry
> >> * 0 ---> continue the page walk
> >> * -ENOMEM --> Abort the page walk from current entry, user buffer is full
> >> which is not error, but only a stop signal. This -ENOMEM is just
> >> differentiater from (1). This -ENOMEM is for internal use and isn't
> >> returned to user.
> >
> > But why ENOSPC is not good here? I was used before, I think.
> -ENOSPC is being returned in form of true error from
> pagemap_scan_hugetlb_entry(). So I'd to remove -ENOSPC from here as it
> wasn't true error here, it was only a way to abort the walk immediately.
> I'm liking the following erturn code from here now:
>
> #define PM_SCAN_BUFFER_FULL     (-256)

I guess this will be reworked anyway, but I'd prefer this didn't need
custom errors etc. If we agree to decoupling the selection and GET
output, it could be:

bool is_interesting_page(p, flags); // this one does the
required/anyof/excluded match
size_t output_range(p, start, len, flags); // this one fills the
output vector and returns how many pages were fit

In this setup, `is_interesting_page() && (n_out = output_range()) <
n_pages` means this is the final range, no more will fit. And if
`n_out == 0` then no pages fit and no WP is needed (no other special
cases).

> >>> For flags name: PM_REQUIRE_WRITE_ACCESS?
> >>> Or Is it intended to be checked only if doing WP (as the current name
> >>> suggests) and so it would be redundant as WP currently requires
> >>> `p->required_mask = PAGE_IS_WRITTEN`?
> >> This is intended to indicate that if userfaultfd is needed. If
> >> PAGE_IS_WRITTEN is mentioned in any of mask, we need to check if
> >> userfaultfd has been initialized for this memory. I'll rename to
> >> PM_SCAN_REQUIRE_UFFD.
> >
> > Why do we need that check? Wouldn't `is_written = false` work for vmas
> > not registered via uffd?
> UFFD_FEATURE_WP_ASYNC and UNPOPULATED needs to be set on the memory region
> for it to report correct written values on the memory region. Without UFFD
> WP ASYNC and UNPOUPULATED defined on the memory, we consider UFFD_WP state
> undefined. If user hasn't initialized memory with UFFD, he has no right to
> set is_written = false.

How about calculating `is_written = is_uffd_registered() &&
is_uffd_wp()`? This would enable a user to apply GET+WP for the whole
address space of a process regardless of whether all of it is
registered.

> > While here, I wonder if we really need to fail the call if there are
> > unknown bits in those masks set: if this bit set is expanded with
> > another category flags, a newer userspace run on older kernel would
> > get EINVAL even if the "treat unknown as 0" be what it requires.
> > There is no simple way in the API to discover what bits the kernel
> > supports. We could allow a no-op (no WP nor GET) call to help with
> > that and then rejecting unknown bits would make sense.
> I've not seen any examples of this. But I've seen examples of returning
> error if kernel doesn't support a feature. Each new feature comes with a
> kernel version, greater than this version support this feature. If user is
> trying to use advanced feature which isn't present in a kernel, we should
> return error and not proceed to confuse the user/kernel. In fact if we look
> at userfaultfd_api(), we return error immediately if feature has some bit
> set which kernel doesn't support.

I think we should have a way of detecting the supported flags if we
don't want a forward compatibility policy for flags here. Maybe it
would be enough to allow all the no-op combinations for this purpose?

> >>> [...]
> >>>> --- a/include/uapi/linux/fs.h
> >>>> +++ b/include/uapi/linux/fs.h
> >>>> +/*
> >>>> + * struct page_region - Page region with bitmap flags
> >>>> + * @start:     Start of the region
> >>>> + * @len:       Length of the region in pages
> >>>> + * bitmap:     Bits sets for the region
> >>>
> >>> '@' is missing for the third field. BTW, maybe we can call it
> >>> something like `flags` or `category` (something that hints at the
> >>> meaning of the value instead of its data representation).
> >> The deification of this struct says, "with bitmap flags". Bitmap was a
> >> different name. I'll update it to flags.
> >
> > From the implementation and our discussions I guess the
> > `bitmap`/`flags` field is holding a set of matching categories: a bit
> > value 1 = pages are in this category, value 0 = pages are not in this
> > category.
> >
> >>>> +/*
> >>>> + * struct pm_scan_arg - Pagemap ioctl argument
> >>>> + * @size:              Size of the structure
> >>>> + * @flags:             Flags for the IOCTL
> >>>> + * @start:             Starting address of the region
> >>>> + * @len:               Length of the region (All the pages in this length are included)
> >>>
> >>> Maybe `scan_start`, `scan_len` - so that there is a better distinction
> >>> from the structure's `size` field?
> >> As start and len already communicate the meaning. We are making things more
> >> verbose.
> >
> > We are describing (in the name) only that it is a range, but not of
> > what or what purpose. That information is only in the docstring, but
> > it is harder to get by someone just reading the code.
> Agreed. But I'm using same names, start and len which mincore (a historic
> syscall) is using. I've followed mincore here.

mincore() doesn't take parameters as a struct, but as three positional
arguments (whose names don't matter nor appear at call point) - I
wouldn't take it as a precedent for structure field naming.

> >>>> + * @vec:               Address of page_region struct array for output
> >>>> + * @vec_len:           Length of the page_region struct array
> >>>> + * @max_pages:         Optional max return pages
> >>>> + * @required_mask:     Required mask - All of these bits have to be set in the PTE
> >>>> + * @anyof_mask:                Any mask - Any of these bits are set in the PTE
> >>>> + * @excluded_mask:     Exclude mask - None of these bits are set in the PTE
> >>>> + * @return_mask:       Bits that are to be reported in page_region
> >>>> + */
> >>>
> >>> I skipped most of the page walk implementation as maybe the comments
> >>> above could make it simpler. Reading this patch and the documentation
> >>> I still feel confused about how the filtering/limiting parameters
> >> I'm really sad to hear this. I've been working on making this series from
> >> so many revisions. I was hopping that it would make complete sense to
> >> reviewers and later to users.
> >>
> >> What do you think is missing which is restricting these patches getting
> >> accepted to upstream?
> >>
> >>> should affect GET, WP and WP+GET. Should they limit the pages walked
> >>> (and WP-ed)? Or only the GET's output? How about GET+WP case?
> >> The address range needs to be walked until max pages pages are found, user
> >> buffer is full or whole range is walked. If the page will be added to user
> >> buffer or not depends on the selection criteria (*masks). There is no
> >> difference in case of walk for GET, WP and GET+WP. Only that WP doesn't
> >> take any user buffer and just WPs the whole region.
> >
> > Ok, then this intent (if I understand correctly) does not entirely
> > match the implementation. Let's split up the conditions:
> >
> > 1. The address range needs to be walked until max pages pages are found
> >
> > current implementation: the address range is walked until max pages
> > matching masks (incl. return_mask) are reported by GET (or until end
> > of range if GET is not requested).
> > Maybe we need to describe what "found" means here?
> Found means all the pages which are found to be fulfilling the masks and we
> have added it to the user buffer. I can add the comment on top of
> pagemap_scan_private struct? But I don't think that it is difficult to
> understand the meaning of found_pages and also we compare it with max_pages
> which makes things very easy to understand.

After fixing `return_mask` and the selection/action split I think
"pages found" might work - as now the count will be exactly what pages
match the required/anyof/excluded criteria.

> > 2. user buffer is full
> > Matches implementation except in GET+WP edge cases.
> I'm not sure which edge case you are referring to? Probably for hugetlb
> error return case?

Yes, that one.

Best Regards
Michał Mirosław