linux-kernel - Re: [PATCH] perf: map pages in advance

Message-ID: <84fed269-3f82-47f7-89cb-671fcee5a23a@redhat.com>
Date: Fri, 29 Nov 2024 13:12:57 +0100
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 Arnaldo Carvalho de Melo <acme@...nel.org>,
 Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
 Adrian Hunter <adrian.hunter@...el.com>,
 Kan Liang <kan.liang@...ux.intel.com>, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH] perf: map pages in advance

On 28.11.24 15:23, Lorenzo Stoakes wrote:
> On Thu, Nov 28, 2024 at 02:37:17PM +0100, David Hildenbrand wrote:
>> On 28.11.24 14:20, Lorenzo Stoakes wrote:
>>> On Thu, Nov 28, 2024 at 02:08:27PM +0100, David Hildenbrand wrote:
>>>> On 28.11.24 12:37, Lorenzo Stoakes wrote:
> [snip]
>>>>> It makes sense semantically to establish a PFN map too - we are managing
>>>>> the pages internally and so it is appropriate to mark this as a special
>>>>> mapping.
>>>>
>>>> It's rather sad seeing more PFNMAP users where PFNMAP is not really required
>>>> (-> this is struct page backed).
>>>>
>>>> Especially having to perform several independent remap_pfn_range() calls
>>>> rather looks like yet another workaround ...
>>>>
>>>> Would we be able to achieve something comparable with vm_insert_pages(), to
>>>> just map them in advance?
>>>
>>> Well, that's the thing, we can't use VM_MIXEDMAP as vm_insert_pages() and
>>> friends all refer to vma->vm_page_prot which is not yet _correctly_ established at
>>> the point of the f_op->mmap() hook being invoked :)
>>
>> So all you want is a vm_insert_pages() variant where we can pass in the
>> vm_page_prot?
> 
> Hmm, looking into the code I don't think VM_MIXEDMAP is correct after all.
> 
> We don't want these pages touched at all, we manage them ourselves, and
> VM_MIXEDMAP, unless mapping memory mapped I/O pages, will treat them as such.
> 
> For instance, vm_insert_page() -> insert_page() -> insert_page_into_pte_locked()
> acts as if this is a folio, manipulating the ref count and invoking
> folio_add_file_rmap_pte() - which we emphatically do not want.

Right, but that should be independent of what you want to achieve in 
this series, or am I wrong?

vm_insert_page()/vm_insert_pages() is our mechanism to install kernel 
allocations into the page tables. (note "kernel allocations", not 
"kernel memory", which might or might not have "struct pages")

There is the bigger question of how we could convert all users to either 
(a) not refcount + mapcount (and we discussed a separate memdesc type 
for that), or (b) still refcount (similarly, likely a separate memdesc).

But that will be a problem to be solved by all similar drivers.

Slapping in a remap_pfn_range() + VM_PFNMAP now, in the absence of 
having solved the bigger problem there sounds quite suboptimal to me. 
remap_pfn_range() saw sufficient abuse already, and the way we hacked in 
VM_PAT handling in there really makes it something we don't want to 
reuse as is when trying to come up with a clean way to map kernel 
allocations. I strongly assume that some of the remap_pfn_range() users 
we currently have do actually deal with kernel allocations as well, and 
likely they should all get converted to something better once we have it.


So, isn't this something to just solve independently of what you are 
actually trying to achieve in this series (page->index and page->mapping)?

[...]


>>>
>>> We set the field in __mmap_new_vma(), _but_ importantly, we defer the
>>> writenotify check to __mmap_complete() (set in vma_set_page_prot()) - so if we
>>> were to try to map using VM_MIXEDMAP in the f_op->mmap() hook, we'd get
>>> read/write mappings, which is emphatically not what we want - we want them
>>> read-only mapped, and for vm_ops->pfn_mkwrite() to be called so we can make the
>>> first page read/write and the rest read-only.
>>>
>>> It's this requirement that means this is really the only way to do this as far
>>> as I can tell.
>>>
>>> It is appropriate and correct that this is either a VM_PFNMAP or VM_MIXEDMAP
>>> mapping, as the pages reference kernel-allocated memory and are managed by perf,
>>> not put on any LRU, etc.
>>>
>>> It sucks to have to loop like this and it feels like a workaround, which makes
>>> me wonder if we need a new interface to better allow this stuff on mmap...
>>>
>>> In any case I think this is the most sensible solution currently available that
>>> avoids the pre-existing situation of pretending the pages are folios but
>>> somewhat abusing the interface to allow page_mkwrite() to work correctly by
>>> setting page->index, mapping.
>>
>> Yes, that page->index stuff is nasty.
> 
> It's the ->mapping that is more of the issue I think, as that _has_ to be set in
> the original version, I can't actually see why index _must_ be set, there should
> be no case in which rmap is used on the page, so possibly was a mistake, but
> both fields are going from struct page so both must be eliminated :)

:) Yes.

>>> The alternative to this would be to folio-fy, but these are emphatically _not_
>>> folios, that is userland memory managed as userland memory, it's a mapping onto
>>> kernel memory exposed to userspace.
>>
>> Yes, we should even move away from folios completely in the future for
>> vm_insert_page().
> 
> Well, isn't VM_MIXEDMAP intended specifically so you can mix normal user pages
> that live in the LRU and have an rmap etc. etc. with PFN mappings to I/O mapped
> memory? :) so then that's folios + raw PFN's.

VM_MIXEDMAP was abused over the years for all kinds of stuff. I consider 
this rather a current "side effect" of using vm_insert_pages() than 
something we'll need in the long term (below).

> 
>>
>>>
>>> It feels like probably VM_MIXEDMAP is a better way of doing it, but you'd need
>>> to expose an interface that doesn't assume the VMA is already fully set
>>> up... but I think one for a future series perhaps.
>>
>> If the solution to your problem is as easy as making vm_insert_pages() pass
>> something else than vma->vm_page_prot to insert_pages(), then I think we
>> should go for that. Like ... vm_insert_pages_prot().
> 
> Sadly no for reasons above.

Is the reason "refcount+mapcount"? Then it might be a problem better 
tackled separately as raised above. Sorry if I missed another point.

> 
>>
>> Observe how we already have vmf_insert_pfn() vs. vmf_insert_pfn_prot(). But
>> yes, in an ideal world we'd avoid having temporarily messed up
>> vma->vm_page_prot. So we'd then document clearly how vm_insert_pages_prot()
>> may be used.
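
To illustrate the pattern: a rough userspace model, not kernel code. All of the model_* names below are made up for this sketch, and vm_insert_pages_prot() itself does not exist yet. The plain variant just forwards vma->vm_page_prot to the _prot variant, so a caller that cannot trust vm_page_prot yet can pass an explicit protection instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these types or functions are kernel APIs. */
typedef struct { bool write; } model_pgprot_t;

struct model_vma {
	model_pgprot_t vm_page_prot;  /* may not be final during the mmap hook */
	model_pgprot_t installed[8];  /* protection recorded per mapped slot */
	size_t nr_installed;
};

/* _prot variant: the caller supplies the protection explicitly. */
static int model_insert_pages_prot(struct model_vma *vma, size_t nr,
				   model_pgprot_t prot)
{
	assert(vma != NULL);
	if (vma->nr_installed + nr > 8)
		return -1;  /* no room left in this toy VMA */
	for (size_t i = 0; i < nr; i++)
		vma->installed[vma->nr_installed++] = prot;
	return 0;
}

/* Plain variant: thin wrapper that forwards vma->vm_page_prot. */
static int model_insert_pages(struct model_vma *vma, size_t nr)
{
	return model_insert_pages_prot(vma, nr, vma->vm_page_prot);
}
```

With this split, a driver could map read-only from its mmap hook by passing a read-only protection, regardless of what vma->vm_page_prot holds at that point.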
> 
> I think the thing with the delay in setting vma->vm_page_prot properly is that
> we have a chicken and egg scenario (oh so often the case in mmap_region()
> logic...) in that the mmap hook might change some of these flags which changes
> what that function will do...

Yes, that's ugly.

> 
> I was discussing with Liam recently how perhaps we should see how feasible it is
> to do away with this hook and replace it with something where drivers specify
> which VMA flags they want to set _ahead of time_, since this really is the only
> thing they should be changing other than vma->vm_private_data.

Yes.

> 
> Then we could possibly have a hook _only_ for assigning vma->vm_private_data to
> allow for any driver-specific init logic and doing mappings, and hey presto we
> have made things vastly saner. Could perhaps pass a const struct vm_area_struct
> * to make this clear...
> 
> But I may be missing some weird corner cases (hey probably am) or being too
> optimistic :>)

It's certainly one area we should be cleaning up ...

> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> I wonder if we need a new interface then for 'pages which we don't want touched
> but do have a struct page' that is more expressed by the interface than
> remap_pfn_range() expresses.
> 
> I mean from the comment around vm_normal_page():
> 
>   * "Special" mappings do not wish to be associated with a "struct page" (either
>   * it doesn't exist, or it exists but they don't want to touch it). In this
>   * case, NULL is returned here. "Normal" mappings do have a struct page.
> 
> ...
> 
>   * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
>   * special mapping (even if there are underlying and valid "struct pages").
>   * COWed pages of a VM_PFNMAP are always normal.
> 
> So there's precedent for us just putting pages we allocate/manage ourselves in
> a VM_PFNMAP.
> 
> So I guess this interface would be something like:
> 
> 	int remap_kernel_pages(struct vm_area_struct *vma, unsigned long addr,
> 			       struct page **pages, unsigned long size,
> 			       pgprot_t prot);
> 


Well, I think we simply will want vm_insert_pages_prot() that stops 
treating these things like folios :) . *likely*  we'd want a distinct 
memdesc/type.

We could start that work right now by making some user (io_uring, 
ring_buffer) set a new page->_type, and checking that in 
vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the 
refcount and the mapcount.

Because then, we can just make all the relevant drivers set the type, 
refuse in vm_insert_pages_prot() anything that doesn't have the type 
set, and refuse in vm_normal_page() any pages with this memdesc.
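
Very roughly, and only as a userspace model (the model_* names and the type enum are invented for illustration; the real page->_type / memdesc layout is not settled), the two checks could look like this: insertion accepts only pages carrying the type and leaves refcount and mapcount alone, and the "normal page" lookup reports nothing for them.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: hypothetical names, not the kernel's struct page. */
enum model_page_type { MODEL_PAGE_NORMAL, MODEL_PAGE_KERNEL_ALLOC };

struct model_page {
	enum model_page_type type;
	int refcount;
	int mapcount;
};

struct model_pte {
	struct model_page *page;  /* NULL when nothing is mapped */
};

/* Insert: accept only typed pages; refcount and mapcount stay untouched. */
static int model_insert_typed_page(struct model_pte *pte,
				   struct model_page *page)
{
	assert(pte != NULL && page != NULL);
	if (page->type != MODEL_PAGE_KERNEL_ALLOC)
		return -1;  /* refuse anything without the type set */
	pte->page = page;
	return 0;
}

/* Lookup: typed pages are treated as special, so no normal page is returned. */
static struct model_page *model_normal_page(const struct model_pte *pte)
{
	if (pte->page == NULL || pte->page->type == MODEL_PAGE_KERNEL_ALLOC)
		return NULL;
	return pte->page;
}
```

Anything that goes through the lookup (GUP and friends) would then simply not see these pages.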

Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of 
these things will stop working, I hope that is not a problem.


There is one question I have had for a long time: maybe we *do* want 
to refcount these kernel allocations. When refcounting them, it's 
impossible for our driver to free them while some reference is still 
lurking somewhere in some page table of a process. I would 
hope that this is being taken care of differently (e.g., VMA lifetime).
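
As a toy illustration of the refcounting argument (userspace model, hypothetical names, no relation to the actual struct page refcount helpers): if every page table mapping holds a reference, the driver dropping its own reference cannot free the buffer while a mapping is still around.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: a buffer that is freed when the last reference goes. */
struct model_buf {
	int refcount;
	bool freed;
};

/* Take a reference, e.g. when the buffer is mapped into a page table. */
static void model_get(struct model_buf *buf)
{
	assert(!buf->freed);
	buf->refcount++;
}

/* Drop a reference; the buffer is only freed once the count reaches zero. */
static void model_put(struct model_buf *buf)
{
	assert(buf->refcount > 0);
	if (--buf->refcount == 0)
		buf->freed = true;
}
```

Without refcounting, the same guarantee has to come from somewhere else, such as the driver tying the allocation to the VMA lifetime.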


But again, I'd hope this is something we can sort out independent of 
this series.

-- 
Cheers,

David / dhildenb