linux-kernel - Re: [PATCH] perf: map pages in advance

Message-ID: <03f654d5-424a-4d23-828e-323aff46fa61@lucifer.local>
Date: Fri, 29 Nov 2024 12:48:41 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Namhyung Kim <namhyung@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
        Adrian Hunter <adrian.hunter@...el.com>,
        Kan Liang <kan.liang@...ux.intel.com>,
        linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH] perf: map pages in advance

Will reply inline also but to be clear - we should differentiate between
ongoing discussion about how best to tackle these things going forward
vs. whether _this patch_ is OK :)

I don't think you're objecting to the patch as such, just disappointed
about VM_PFNMAP and wanting to discuss this more generally?

As I say below, we can't use VM_MIXEDMAP as it's broken for our case (we
have to use page->mapping if there's a struct page), so I think _right now_
this is the only sane solution.

On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
> On 28.11.24 15:23, Lorenzo Stoakes wrote:
> > On Thu, Nov 28, 2024 at 02:37:17PM +0100, David Hildenbrand wrote:
> > > On 28.11.24 14:20, Lorenzo Stoakes wrote:
> > > > On Thu, Nov 28, 2024 at 02:08:27PM +0100, David Hildenbrand wrote:
> > > > > On 28.11.24 12:37, Lorenzo Stoakes wrote:
> > [snip]
> > > > > > It makes sense semantically to establish a PFN map too - we are managing
> > > > > > the pages internally and so it is appropriate to mark this as a special
> > > > > > mapping.
> > > > >
> > > > > It's rather sad seeing more PFNMAP users where PFNMAP is not really required
> > > > > (-> this is struct page backed).
> > > > >
> > > > > Especially having to perform several independent remap_pfn_range() calls
> > > > > rather looks like yet another workaround ...
> > > > >
> > > > > Would we be able to achieve something comparable with vm_insert_pages(), to
> > > > > just map them in advance?
> > > >
> > > > Well, that's the thing, we can't use VM_MIXEDMAP as vm_insert_pages() and
> > > > friends all refer to vma->vm_page_prot, which is not yet _correctly_
> > > > established at the point of the f_op->mmap() hook being invoked :)
> > >
> > > So all you want is a vm_insert_pages() variant where we can pass in the
> > > vm_page_prot?
> >
> > Hmm, looking into the code I don't think VM_MIXEDMAP is correct after all.
> >
> > We don't want these pages touched at all, we manage them ourselves, and
> > VM_MIXEDMAP, unless mapping memory mapped I/O pages, will treat them as such.
> >
> > For instance, vm_insert_page() -> insert_page() -> insert_page_into_pte_locked()
> > acts as if this is a folio, manipulating the ref count and invoking
> > folio_add_file_rmap_pte() - which we emphatically do not want.
>
> Right, but that should be independent of what you want to achieve in this
> series, or am I wrong?
>

No, we can't implement things this way if we use VM_MIXEDMAP, as
pfn_mkwrite() will no longer work (we have a page now) and page_mkwrite()
will require the same page->mapping hack as we had before, which breaks the
whole thing.

Also, this is memory that makes absolutely no sense being reference/map
counted/placed in the rmap. It was worked around in the previous code by
pinning the pages on fault, but this is ugly and unnecessary.

The whole issue we saw here arose because we tried to treat this memory as
if it were not kernel memory but instead just standard
refcounted/mapcounted memory (yes to below about needing a specific way of
doing this with e.g. memdesc :)). Previously the pages were being pinned,
presumably to avoid migration, compaction etc., but this isn't something we
want to be worrying about...

This change will break GUP for the range, but I don't see anywhere that
needs GUP for these mappings and the memory is now specifically pinned.

So TL;DR - it's VM_PFNMAP or I have to find a completely different way of
solving this problem afaict.

> vm_insert_page()/vm_insert_pages() is our mechanism to install kernel
> allocations into the page tables. (note "kernel allocations", not "kernel
> memory", which might or might not have "struct pages")
>
> There is the bigger question how we could convert all users to either (a)
> not refcount + mapcount (and we discussed a separate memdesc type for that)
> (b) still refcount (similarly, likely separate memdesc).

I think we definitely need the ability to differentiate between:

1. 'I allocated pages, but I want them to be treated like userland memory'

2. (maybe) 'I allocated pages, and own them, pin them for me (maybe) and do
   refcounts/mapcounts, but I will still manage them'

3. Yes there are RAM allocations, but do not touch them at all, I am
   managing them privately.

4. This is a pure PFN mapping of memory-mapped I/O pages.

5. Mix of the above?

I mean roughly speaking :)
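
To make the list above a bit more concrete, here is a rough userspace sketch
(not kernel code) of how such categories could be told apart. Every name in
it (`model_mem_kind`, `model_mm_touches_counts()` and so on) is made up for
illustration, and the assumption that pure memory-mapped I/O has no struct
page is a simplification. Category 5 (a mix) is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical names: one value per category in the list above. */
enum model_mem_kind {
	MODEL_USER_LIKE,	/* 1. kernel-allocated, treated like userland memory */
	MODEL_OWNED_COUNTED,	/* 2. driver-owned, but refcounted/mapcounted by mm */
	MODEL_PRIVATE_RAM,	/* 3. RAM with a struct page that mm must not touch */
	MODEL_PURE_PFN,		/* 4. memory-mapped I/O (assumed: no struct page) */
};

/* Should core mm manipulate the refcount/mapcount for this kind of memory? */
static bool model_mm_touches_counts(enum model_mem_kind kind)
{
	return kind == MODEL_USER_LIKE || kind == MODEL_OWNED_COUNTED;
}

/* Is there a struct page behind the PFN at all (simplifying assumption)? */
static bool model_has_struct_page(enum model_mem_kind kind)
{
	return kind != MODEL_PURE_PFN;
}
```

In this model the perf ring buffer would be MODEL_PRIVATE_RAM: a struct page
exists, but the refcount and mapcount should be left alone.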

We also have nuances, like the issue this patch fixes.

>
> But that will be a problem to be solved by all similar drivers.
>
> Slapping in a remap_pfn_range() + VM_PFNMAP now, in the absence of having
> solved the bigger problem there sounds quite suboptimal to me.

See above, sadly it is the only way to solve the problem.

Note though that perf has a _very specific_ requirement for 1st page r/w,
rest r/o. I think most drivers will not have this requirement.
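
As a small userspace model of that requirement (again not kernel code): every
page starts out read-only, and a simulated write fault may only upgrade
page 0. The names `model_ring`, `model_ring_map()` and `model_write_fault()`
are hypothetical stand-ins; in the kernel this decision would sit behind
vm_ops->pfn_mkwrite().

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PAGES 64

/* Outcome of a simulated write fault (hypothetical constants). */
enum model_fault { MODEL_FAULT_OK, MODEL_FAULT_SIGBUS };

/* Hypothetical stand-in for a mapped ring buffer. */
struct model_ring {
	size_t nr_pages;
	bool writable[MODEL_MAX_PAGES];	/* per-page "mapped writable" bit */
};

/* Map all pages read-only up front. */
static int model_ring_map(struct model_ring *rb, size_t nr_pages)
{
	if (nr_pages == 0 || nr_pages > MODEL_MAX_PAGES)
		return -1;
	rb->nr_pages = nr_pages;
	for (size_t i = 0; i < nr_pages; i++)
		rb->writable[i] = false;
	return 0;
}

/* Simulated write-notify hook: only the first page may become writable. */
static enum model_fault model_write_fault(struct model_ring *rb, size_t pgoff)
{
	if (pgoff >= rb->nr_pages || pgoff != 0)
		return MODEL_FAULT_SIGBUS;
	rb->writable[0] = true;
	return MODEL_FAULT_OK;
}
```

A write to page 0 succeeds and flips its writable bit; a write to any other
page is refused and the page stays read-only.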

> remap_pfn_range() saw sufficient abuse already, and the way we hacked in
> VM_PAT handling in there really makes it something we don't want to reuse as
> is when trying to come up with a clean way to map kernel allocations. I
> strongly assume that some of the remap_pfn_range() users we currently have
> do actually deal with kernel allocations as well, and likely they should all
> get converted to something better once we have it.

Yeah I think all this is a bloody mess and is crying out for a much clearer
way to abstract things.

>
>
> So, isn't this something to just solve independently of what you are
> actually trying to achieve in this series (page->index and page->mapping)?

Again, VM_PFNMAP is necessary here for this to work (sadly). This other
discussion we're having is kind of separate though :)

>
> [...]
>
>
> > > >
> > > > We set the field in __mmap_new_vma(), _but_ importantly, we defer the
> > > > writenotify check to __mmap_complete() (set in vma_set_page_prot()) - so if we
> > > > were to try to map using VM_MIXEDMAP in the f_op->mmap() hook, we'd get
> > > > read/write mappings, which is emphatically not what we want - we want them
> > > > read-only mapped, and for vm_ops->pfn_mkwrite() to be called so we can make the
> > > > first page read/write and the rest read-only.
> > > >
> > > > It's this requirement that means this is really the only way to do this as far
> > > > as I can tell.
> > > >
> > > > It is appropriate and correct that this is either a VM_PFNMAP or VM_MIXEDMAP
> > > > mapping, as the pages reference kernel-allocated memory and are managed by perf,
> > > > not put on any LRU, etc.
> > > >
> > > > It sucks to have to loop like this and it feels like a workaround, which makes
> > > > me wonder if we need a new interface to better allow this stuff on mmap...
> > > >
> > > > In any case I think this is the most sensible solution currently available that
> > > > avoids the pre-existing situation of pretending the pages are folios but
> > > > somewhat abusing the interface to allow page_mkwrite() to work correctly by
> > > > setting page->index, mapping.
> > >
> > > Yes, that page->index stuff is nasty.
> >
> > It's the ->mapping that is more of the issue I think, as that _has_ to be set in
> > the original version, I can't actually see why index _must_ be set, there should
> > be no case in which rmap is used on the page, so possibly was a mistake, but
> > both fields are going from struct page so both must be eliminated :)
>
> :) Yes.
>
> > > > The alternative to this would be to folio-fy, but these are emphatically _not_
> > > > folios, that is, userland memory managed as userland memory; it's a mapping onto
> > > > kernel memory exposed to userspace.
> > >
> > > Yes, we should even move away from folios completely in the future for
> > > vm_insert_page().
> >
> > Well, isn't VM_MIXEDMAP intended specifically so you can mix normal user pages
> > that live in the LRU and have an rmap etc. etc. with PFN mappings to I/O mapped
> > memory? :) so then that's folios + raw PFN's.
>
> VM_MIXEDMAP was abused over the years for all kinds of stuff. I consider
> this rather a current "side effect" of using vm_insert_pages() than
> something we'll need in the long term (below).
>
> >
> > >
> > > >
> > > > It feels like probably VM_MIXEDMAP is a better way of doing it, but you'd need
> > > > to expose an interface that doesn't assume the VMA is already fully set
> > > > up... but I think one for a future series perhaps.
> > >
> > > If the solution to your problem is as easy as making vm_insert_pages() pass
> > > something else than vma->vm_page_prot to insert_pages(), then I think we
> > > should go for that. Like ... vm_insert_pages_prot().
> >
> > Sadly no for reasons above.
>
> Is the reason "refcount+mapcount"? Then it might be a problem better tackled
> separately as raised above. Sorry if I missed another point.

See above, vm_normal_page() cannot return a struct page * or otherwise we
end up having to invoke do_page_mkwrite() which interprets a missing
folio->mapping as requiring a retry.

We could hack it to lock the folio in the page_mkwrite() hook but that's
just horrible and worse than using VM_PFNMAP. Plus we're folio-fying by the
back door at that point.
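
Roughly, and heavily simplified, the decision I am describing looks like the
userspace model below. It is not the real vm_normal_page() or write-fault
code (which handle many more cases), and all the `model_*` names are invented
for illustration. It only encodes two points from this thread: a raw,
non-COWed VM_PFNMAP mapping is always "special", and a "normal" page without
a ->mapping ends up being retried.

```c
#include <stdbool.h>

/* Hypothetical summary of the state a write fault would look at. */
struct model_pte_ctx {
	bool vma_pfnmap;	/* VMA has VM_PFNMAP set */
	bool cowed;		/* page was COWed inside the PFN map */
	bool has_struct_page;	/* PFN is backed by a struct page */
	bool mapping_set;	/* ->mapping is non-NULL */
};

enum model_wp_path {
	MODEL_WP_PFN_MKWRITE,		/* special mapping: pfn_mkwrite() path */
	MODEL_WP_PAGE_MKWRITE,		/* normal page: page_mkwrite() path */
	MODEL_WP_PAGE_MKWRITE_RETRY,	/* normal page, missing ->mapping */
};

/* Simplified stand-in for vm_normal_page(): true means "normal" mapping. */
static bool model_is_normal_page(const struct model_pte_ctx *c)
{
	if (!c->has_struct_page)
		return false;
	if (c->vma_pfnmap && !c->cowed)
		return false;	/* raw VM_PFNMAP is always special */
	return true;
}

/* Which path would a write fault on a read-only shared PTE take? */
static enum model_wp_path model_wp_fault(const struct model_pte_ctx *c)
{
	if (!model_is_normal_page(c))
		return MODEL_WP_PFN_MKWRITE;
	if (!c->mapping_set)
		return MODEL_WP_PAGE_MKWRITE_RETRY;
	return MODEL_WP_PAGE_MKWRITE;
}
```

With struct page backed memory and no VM_PFNMAP, leaving ->mapping unset
lands in the retry case, which is why the old code had to set it.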

This is not helped by the fact we allocate non-compound higher order pages
(if not using vmalloc) in the perf code :)

We are in an ugly situation, so it's a question of drinking the least
horrible pint of beer here.

>
> >
> > >
> > > Observe how we already have vmf_insert_pfn() vs. vmf_insert_pfn_prot(). But
> > > yes, in an ideal world we'd avoid having temporarily messed up
> > > vma->vm_page_prot. So we'd then document clearly how vm_insert_pages_prot()
> > > may be used.
> >
> > I think the thing with the delay in setting vma->vm_page_prot properly is
> > that we have a chicken and egg scenario (oh so often the case in mmap_region()
> > logic...) in that the mmap hook might change some of these flags which changes
> > what that function will do...
>
> Yes, that's ugly.
>
> >
> > I was discussing with Liam recently how perhaps we should see how feasible it is
> > to do away with this hook and replace it with something where drivers specify
> > which VMA flags they want to set _ahead of time_, since this really is the only
> > thing they should be changing other than vma->vm_private_data.
>
> Yes.
>
> >
> > Then we could possibly have a hook _only_ for assigning vma->vm_private_data to
> > allow for any driver-specific init logic and doing mappings, and hey presto we
> > have made things vastly saner. Could perhaps pass a const struct vm_area_struct
> > * to make this clear...
> >
> > But I may be missing some weird corner cases (hey probably am) or being too
> > optimistic :>)
>
> It's certainly one area we should be cleaning up ...
>
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> > I wonder if we need a new interface then for 'pages which we don't want touched
> > but do have a struct page' that is more expressed by the interface than
> > remap_pfn_range() expresses.
> >
> > I mean from the comment around vm_normal_page():
> >
> >   * "Special" mappings do not wish to be associated with a "struct page" (either
> >   * it doesn't exist, or it exists but they don't want to touch it). In this
> >   * case, NULL is returned here. "Normal" mappings do have a struct page.
> >
> > ...
> >
> >   * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
> >   * special mapping (even if there are underlying and valid "struct pages").
> >   * COWed pages of a VM_PFNMAP are always normal.
> >
> > So there's precedent for us just putting pages we allocate/manage ourselves in
> > a VM_PFNMAP.
> >
> > So I guess this interface would be something like:
> >
> > 	int remap_kernel_pages(struct vm_area_struct *vma, unsigned long addr,
> > 			       struct page **pages, unsigned long size,
> > 			       pgprot_t prot);
> >
>
>
> Well, I think we simply will want vm_insert_pages_prot() that stops treating
> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
>
> We could start that work right now by making some user (iouring,
> ring_buffer) set a new page->_type, and checking that in
> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
> and the mapcount.
>
> Because then, we can just make all the relevant drivers set the type, refuse
> in vm_insert_pages_prot() anything that doesn't have the type set, and
> refuse in vm_normal_page() any pages with this memdesc.
>
> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
> these things will stop working, I hope that is not a problem.
>
>
> There is one question I have had for a long time: maybe we *do* want to
> refcount these kernel allocations. When refcounting them, it's impossible
> that we might free them in our driver while some reference is still lurking
> somewhere in some page table of a process. I would hope that this is being
> taken care of differently. (e.g., VMA lifetime)
>
>
> But again, I'd hope this is something we can sort out independent of this
> series.

Yes, in a way you wonder if _everything_ should be refcounted and _nothing_
'manually' controlled by drivers which __free_pages() unconditionally at
the end.

This would avoid uaf's and such for still-mapped memory.

Here we're cleaning up on VMA close and doing our own reference counting
there so we're relatively safe...
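
As a minimal userspace sketch of that idea (not the perf code itself): the
driver keeps its own count of live mappings and only releases the pages when
the last one is closed. `model_buf`, `model_vma_open()` and
`model_vma_close()` are hypothetical names, and the plain int counter is a
simplification (real code would need atomics or locking).

```c
#include <stdbool.h>

/* Hypothetical driver-owned buffer with its own mapping count. */
struct model_buf {
	int mmap_count;	/* number of VMAs currently mapping the buffer */
	bool freed;	/* set once the backing pages have been released */
};

/* Called when a VMA starts referring to the buffer (mmap, fork, split). */
static void model_vma_open(struct model_buf *b)
{
	b->mmap_count++;
}

/*
 * Called when a VMA goes away. Returns true if this close released the
 * pages, i.e. it dropped the last mapping.
 */
static bool model_vma_close(struct model_buf *b)
{
	if (b->mmap_count <= 0)
		return false;	/* unbalanced close, nothing to do */
	if (--b->mmap_count > 0)
		return false;	/* still mapped elsewhere */
	b->freed = true;
	return true;
}
```

The pages cannot be freed while any VMA still maps them, which is the
property the core mm would otherwise get from refcounting.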

Anyway I agree with you that we should have something that explicitly
describes what is desired like presumably a memdesc would.

>
> --
> Cheers,
>
> David / dhildenb
>