linux-kernel - Re: [PATCH] perf: map pages in advance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f637de06-41de-44be-8e1f-6a5d429e0ec9@redhat.com>
Date: Fri, 29 Nov 2024 15:09:49 +0100
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 Arnaldo Carvalho de Melo <acme@...nel.org>,
 Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
 Adrian Hunter <adrian.hunter@...el.com>,
 Kan Liang <kan.liang@...ux.intel.com>, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH] perf: map pages in advance

>>> I just tested it and write() works fine, it uses uaccess afaict as part of the
>>> lib/iov_iter.c code:
>>>
>>> generic_perform_write()
>>> -> copy_folio_from_iter_atomic()
>>> -> copy_page_from_iter_atomic()
>>> -> __copy_from_iter()
>>> -> copy_from_user_iter()
>>> -> raw_copy_from_user()
>>> -> copy_user_generic()
>>> -> [uaccess asm]
>>>
>>
>> Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP.
>>
>> Ptrace access and friends should also no longer work on these pages, likely
>> that's tolerable.
> 
> Yeah Peter can interject if not, but I'd be _very_ surprised if anybody expects
> to be able to ptrace perf counter mappings in another process (it'd be kind of
> totally insane to do that anyway since it's a ring buffer that needs special
> handing anyway).

I think so as well. Disallowing GUP has some side effects, like not 
getting these pages included in a coredump etc ... at least I hope 
nobody uses O_DIRECT on them.

> 
>>
>>>>>
>>>>> If we can't do pfnmap, and we definitely can't do mixedmap (because it's
>>>>> basically entirely equivalent in every way to just faulting in the pages as
>>>>> before and requires the same hacks) then I will have to go back to the drawing
>>>>> board or somehow change the faulting code.
>>>>>
>>>>> This really sucks.
>>>>>
>>>>> I'm not quite sure I even understand why we don't allow GUP used _just for
>>>>> pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
>>>>> whatever mapped it will maintain the lifetime.
>>>>>
>>>>> What a mess...
>>>>
>>>> Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
>>>> raised my refcounting comment earlier) just read recent:
>>>>
>>>> commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
>>>> Author: Linus Torvalds <torvalds@...ux-foundation.org>
>>>> Date:   Wed Sep 11 17:11:23 2024 -0700
>>>>
>>>>       mm: avoid leaving partial pfn mappings around in error case
>>>>       As Jann points out, PFN mappings are special, because unlike normal
>>>>       memory mappings, there is no lifetime information associated with the
>>>>       mapping - it is just a raw mapping of PFNs with no reference counting of
>>>>       a 'struct page'.
>>>>
>>>
>>> I'm _very_ aware of this, having worked extensively on things around this kind
>>> of issue recently (was a little bit involved with the above one too :), and
>>> explicitly zap on error in this patch to ensure we leave no broken code paths.
>>>
>>> I agree it's horrible, but we need to have a way of mapping this memory without
>>> having to 'trick' the faulting mechanism to behave correctly.
>>
>> What's completely "surprising" to me is, if there is no page_mkwrite, but
>> the VMA is writable, then just map the PTE writable ...
> 
> I've had a number of surprises on this journey :)
> 
> To make sure I understand what you mean do you mean the whole:
> 
> do_wp_page()
> -> wp_page_shared()
> -> if (page_mkwrite) { ... } else { wp_page_reuse(); }
> -> wp_page_reuse()
> -> maybe_mkwrite() [hey VM_WRITE is set, so yes make writable!]
> 
> Path?

Yes. I can see how it might be relevant (mprotect(PROT_READ) + 
mprotect(READ|WRITE)), but it's a bit counter-intuitive ... to default 
to "writable".

[...]

>>
> 
> So overall - given all the above and the fact that the alternatives are _even
> worse_ (having to spuriously folio lock if that'll even work, totally
> unnecessary and semantically incorrect folio-fication or a complete rework of
> mapping) - while you might be sicked by this use of the VM_PFNMAP - are you ok
> with the patch, all things considered? :)

It hurts my eyes, and feels like a step into the wrong direction. But I 
don't really see how to do it differently right now.

Can we add a comment/TODO in there that using remap_pfn_range this to 
map a kernel allocation really is suboptimal and we should switch to 
something better once we figured the details out?


BTW, I think the whole reason vm_insert_pages() touches the mapcount is 
because we didn't really have a way to identify during unmap that we 
must not adjust it either. A page->type will be able to sort that out 
(while still allowing to refcount them for now).

Actually, the whole thing is on my todo list, because messing with the 
RMAP with vm_insert_pages() doesn't make any sense (whereby the refcount 
might!). See the TODO I added in __folio_rmap_sanity_checks().

-- 
Cheers,

David / dhildenb