lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6795cc9a-6230-431f-b089-7909f7bc4f30@redhat.com>
Date: Fri, 29 Nov 2024 14:24:24 +0100
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 Arnaldo Carvalho de Melo <acme@...nel.org>,
 Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
 Adrian Hunter <adrian.hunter@...el.com>,
 Kan Liang <kan.liang@...ux.intel.com>, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH] perf: map pages in advance

On 29.11.24 14:19, Lorenzo Stoakes wrote:
> On Fri, Nov 29, 2024 at 02:12:23PM +0100, David Hildenbrand wrote:
>> On 29.11.24 14:02, Lorenzo Stoakes wrote:
>>> On Fri, Nov 29, 2024 at 01:59:01PM +0100, David Hildenbrand wrote:
>>>> On 29.11.24 13:55, Lorenzo Stoakes wrote:
>>>>> On Fri, Nov 29, 2024 at 01:45:42PM +0100, David Hildenbrand wrote:
>>>>>> On 29.11.24 13:26, Peter Zijlstra wrote:
>>>>>>> On Fri, Nov 29, 2024 at 01:12:57PM +0100, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> Well, I think we simply will want vm_insert_pages_prot() that stops treating
>>>>>>>> these things like folios :) . *likely*  we'd want a distinct memdesc/type.
>>>>>>>>
>>>>>>>> We could start that work right now by making some user (iouring,
>>>>>>>> ring_buffer) set a new page->_type, and checking that in
>>>>>>>> vm_insert_pages_prot() + vm_normal_page(). If set, don't touch the refcount
>>>>>>>> and the mapcount.
>>>>>>>>
>>>>>>>> Because then, we can just make all the relevant drivers set the type, refuse
>>>>>>>> in vm_insert_pages_prot() anything that doesn't have the type set, and
>>>>>>>> refuse in vm_normal_page() any pages with this memdesc.
>>>>>>>>
>>>>>>>> Maybe we'd have to teach CoW to copy from such pages, maybe not. GUP of
>>>>>>>> these things will stop working, I hope that is not a problem.
>>>>>>>
>>>>>>> Well... perf-tool likes to call write() upon these pages in order to
>>>>>>> write out the data from the mmap() into a file.
>>>>>
>>>>> I'm confused about what you mean, write() using the fd should work fine, how
>>>>> would they interact with the mmap? I mean be making a silly mistake here
>>>>
>>>> write() to file from the mmap()'ed address range to *some* file.
>>>>
>>>
>>> Yeah sorry my brain melted down briefly, for some reason was thinking of read()
>>> writing into the buffer...
>>>
>>>> This will GUP the pages you inserted.
>>>>
>>>> GUP does not work on PFNMAP.
>>>
>>> Well it _does_ if struct page **pages is set to NULL :)
>>
>> Hm? :)
>>
>> check_vma_flags() unconditionally refuses VM_PFNMAP.
> 
> Ha, funny with my name all over git blame there... ok yup missed this, the
> vm_normal_page() == NULL stuff must but for mixed map (and those other weird
> cases I think you can get0...
> 
> Well good. Where is write() invoking GUP? I'm kind of surprised it's not just
> using uaccess?
> 
> One thing to note is I did run all the perf tests with no issues whatsoever. You
> would _think_ this would have come up...
> 
> I'm editing some test code to explicitly write() from the buffer anyway to see.
> 
> If we can't do pfnmap, and we definitely can't do mixedmap (because it's
> basically entirely equivalent in every way to just faulting in the pages as
> before and requires the same hacks) then I will have to go back to the drawing
> board or somehow change the faulting code.
> 
> This really sucks.
> 
> I'm not quite sure I even understand why we don't allow GUP used _just for
> pinning_ on VM_PFNMAP when it is -in effect- already pinned on assumption
> whatever mapped it will maintain the lifetime.
> 
> What a mess...

Because VM_PFNMAP is dangerous as hell. To get a feeling for that (and also why I
raised my refcounting comment earlier) just read recent:

commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3
Author: Linus Torvalds <torvalds@...ux-foundation.org>
Date:   Wed Sep 11 17:11:23 2024 -0700

     mm: avoid leaving partial pfn mappings around in error case
     
     As Jann points out, PFN mappings are special, because unlike normal
     memory mappings, there is no lifetime information associated with the
     mapping - it is just a raw mapping of PFNs with no reference counting of
     a 'struct page'.


GUP relies on the refcount. In a PFNMAP you don't have any way to make sure the
driver won't go down, free the page, to have it used by someone else while IO is still
happening to that page.

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ