lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <be33a685-f6e0-41b0-ba3b-d1d7c2d743b8@lucifer.local>
Date: Fri, 29 Nov 2024 14:35:36 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Namhyung Kim <namhyung@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
        Adrian Hunter <adrian.hunter@...el.com>,
        Kan Liang <kan.liang@...ux.intel.com>,
        linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH] perf: map pages in advance

On Fri, Nov 29, 2024 at 03:09:49PM +0100, David Hildenbrand wrote:
> > > > I just tested it and write() works fine, it uses uaccess afaict as part of the
> > > > lib/iov_iter.c code:
> > > >
> > > > generic_perform_write()
> > > > -> copy_folio_from_iter_atomic()
> > > > -> copy_page_from_iter_atomic()
> > > > -> __copy_from_iter()
> > > > -> copy_from_user_iter()
> > > > -> raw_copy_from_user()
> > > > -> copy_user_generic()
> > > > -> [uaccess asm]
> > > >
> > >
> > > Ah yes. O_DIRECT is the problematic bit I suspect, which will use GUP.
> > >
> > > Ptrace access and friends should also no longer work on these pages, likely
> > > that's tolerable.
> >
> > Yeah Peter can interject if not, but I'd be _very_ surprised if anybody expects
> > to be able to ptrace perf counter mappings in another process (it'd be kind of
> > totally insane to do that anyway since it's a ring buffer that needs special
> > handing anyway).
>
> I think so as well. Disallowing GUP has some side effects, like not getting
> these pages included in a coredump etc ... at least I hope nobody uses
> O_DIRECT on them.

We set VM_DONTDUMP anyway (set by remap_pfn_range() also) so this part won't be
a problem, and I can't see anybody using O_DIRECT on them, sensibly.

>
> >
> > >
> > > > > >
> > > > > > If we can't do pfnmap, and we definitely can't do mixedmap (because it's
> > > > > > basically entirely equivalent in every way to just faulting in the pages as
> > > > > > before and requires the same hacks) then I will have to go back to the drawing
> > > > > > board or somehow change the faulting code.
> > > > > >
> > > > > > This really sucks.
> > > > > >
> > > > > > I'm not quite sure I even understand why we don't allow GUP used _just for
> > > > > > pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
> > > > > > whatever mapped it will maintain the lifetime.
> > > > > >
> > > > > > What a mess...
> > > > >
> > > > > Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
> > > > > raised my refcounting comment earlier) just read recent:
> > > > >
> > > > > commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
> > > > > Author: Linus Torvalds <torvalds@...ux-foundation.org>
> > > > > Date:   Wed Sep 11 17:11:23 2024 -0700
> > > > >
> > > > >       mm: avoid leaving partial pfn mappings around in error case
> > > > >       As Jann points out, PFN mappings are special, because unlike normal
> > > > >       memory mappings, there is no lifetime information associated with the
> > > > >       mapping - it is just a raw mapping of PFNs with no reference counting of
> > > > >       a 'struct page'.
> > > > >
> > > >
> > > > I'm _very_ aware of this, having worked extensively on things around this kind
> > > > of issue recently (was a little bit involved with the above one too :), and
> > > > explicitly zap on error in this patch to ensure we leave no broken code paths.
> > > >
> > > > I agree it's horrible, but we need to have a way of mapping this memory without
> > > > having to 'trick' the faulting mechanism to behave correctly.
> > >
> > > What's completely "surprising" to me is, if there is no page_mkwrite, but
> > > the VMA is writable, then just map the PTE writable ...
> >
> > I've had a number of surprises on this journey :)
> >
> > To make sure I understand what you mean do you mean the whole:
> >
> > do_wp_page()
> > -> wp_page_shared()
> > -> if (page_mkwrite) { ... } else { wp_page_reuse(); }
> > -> wp_page_reuse()
> > -> maybe_mkwrite() [hey VM_WRITE is set, so yes make writable!]
> >
> > Path?
>
> Yes. I can see how it might be relevant (mprotect(PROT_READ) +
> mprotect(READ|WRITE)), but it's a bit counter-intuitive ... to default to
> "writable".

Yeah, no this whole thing very much resembles a labyrinth with a minotaur at the
end, whose name is VM_PFNMAP :>)

>
> [...]
>
> > >
> >
> > So overall - given all the above and the fact that the alternatives are _even
> > worse_ (having to spuriously folio lock if that'll even work, totally
> > unnecessary and semantically incorrect folio-fication or a complete rework of
> > mapping) - while you might be sicked by this use of the VM_PFNMAP - are you ok
> > with the patch, all things considered? :)
>
> It hurts my eyes, and feels like a step into the wrong direction. But I
> don't really see how to do it differently right now.

Yeah I agree totally.

>
> Can we add a comment/TODO in there that using remap_pfn_range this to map a
> kernel allocation really is suboptimal and we should switch to something
> better once we figured the details out?

Sure, let me do a v2 then to make Peter's life easier, can do the nommu fixup in
there too.

>
>
> BTW, I think the whole reason vm_insert_pages() touches the mapcount is
> because we didn't really have a way to identify during unmap that we must
> not adjust it either. A page->type will be able to sort that out (while
> still allowing to refcount them for now).

Yeah. I think not being able to differentiate between different cases is the
problem here...

>
> Actually, the whole thing is on my todo list, because messing with the RMAP
> with vm_insert_pages() doesn't make any sense (whereby the refcount might!).
> See the TODO I added in __folio_rmap_sanity_checks().

How long is your TODO list I wonder? :)) I imagine it requires a huge page to
map at this point..

>
> --
> Cheers,
>
> David / dhildenb
>

Cool will respin and send out shortly with an appropriately 'forgive us father
for we have sinned' comment.

Thanks for taking the time to discuss this! Much appreciated.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ