linux-kernel - Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <341cf567-468e-b5cf-6813-b26c49e851b5@nvidia.com>
Date:   Sun, 19 Dec 2021 00:01:59 -0800
From:   John Hubbard <jhubbard@...dia.com>
To:     Nadav Amit <namit@...are.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        David Hildenbrand <david@...hat.com>
Cc:     Jason Gunthorpe <jgg@...dia.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Yang Shi <shy828301@...il.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        Donald Dutile <ddutile@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
        Linux-MM <linux-mm@...ck.org>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via
 FAULT_FLAG_UNSHARE (!hugetlb)

On 12/18/21 22:02, Nadav Amit wrote:
> 
>> On Dec 18, 2021, at 4:35 PM, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
>>
>> (I have only ever seen the kernel side of uffd, not the actual user
>> side, so I'm not sure about the use patterns).
> 
> I use it in a very fine granularity, and I suspect QEMU and CRIU do so
> too.
> 
>>
>> That said, your suggestion of a shadow sw page table bit thing would
>> also work. And it would solve some problems we have in core areas
>> (notably "page_special()" which right now has that
>> ARCH_HAS_PTE_SPECIAL thing).
>>
>> It would make it really easy to have that "this page table entry is
>> pinned" flag too.
> 
> I found my old messy code for the software-PTE thing.
> 
> I see that eventually I decided to hold a pointer to the “extra PTEs”
> of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> pages approach but this code is long gone. ]
> 
> My rationale was that:
> 
> 1. It does not bound you to have the same size for PTE and “extra-PTE”
> 2. The PMD-page struct is anyhow hot (since you acquired the PTL)
> 3. Allocating “extra-PTE” dynamically does not require to rewire the
>     page-tables, which requires a TLB flush.
> 
> I think there is a place to hold a pointer in the PMD-page-struct
> (_pt_pad_1, we just need to keep the lowest bit clear so the kernel
> won’t mistaken it to be a compound page).
> 
> I still don’t know what exactly you have in mind for making use
> out of it for the COW issue. Keeping a pin-count (which requires
> internal API changes for unpin_user_page() and friends?) or having
> “was ever pinned” sticky bit? And then changing
> page_needs_cow_for_dma() to look at the PTE so copy_present_pte()
> would break the COW eagerly?
> 
> Anyhow, I can clean it up and send (although it is rather simple
> and I ignored many thing, such as THP, remap, etc), but I am not
> sure I have the time now to fully address the COW problem. I will
> wait for Monday for David’s response.
> 

Hi Nadav,

A couple of thoughts about this part of the design:

a) The PMD-page-struct approach won't help as much, because (assuming
that we're using it in an attempt to get a true, perfect pin count), you
are combining the pin counts of a PMD's worth of pages. OTOH...maybe
that actually *is* OK, assuming you don't overflow--except that you can
only answer the "is it dma-pinned?" question at a PMD level. That's a
contradiction of your stated desire above to have very granular control.

Also, because of not having bit 0 available in page._pt_pad_1, I think
the count would have to be implemented as adding and subtracting 2,
instead of 1 (in order to keep the value even), further reducing the
counter range.

b) If, instead, you used your older 2-adjacent pages approach, then
Linus' comment makes more sense here: we could use the additional struct
page to hold an exact pin count, per page. That way, you can actually
build a wrapper function such as:

     page_really_is_dma_pinned()

...and/or simply get a stronger "maybe" for page_maybe_dma_pinned().

Furthermore, this direction is extensible and supports solving other "I
am out of space in struct page" problems, at the cost of more pages, of
course.

As an aside, I find it instructive that we're talking about this
approach, instead of extending struct page. The lesson I'm taking away
is: allocating more space for some cases (2x pages) is better than
having *all* struct pages be larger than they are now.

Anyway, the pin count implementation would look somewhat like the
existing hpage_pincount, which similarly has ample space for a separate,
exact pin count. In other words, this sort of thing (mostly-pseudo
code):


diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..646761388025 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -938,6 +938,16 @@ static inline bool hpage_pincount_available(struct page *page)
  	return PageCompound(page) && compound_order(page) > 1;
  }

+static inline bool shadow_page_pincount_available(struct page *page)
+{
+	/*
+	 * TODO: Nadav: connect this up with the shadow page table
+	 * implementation, and return an appropriate answer.
+	 */
+
+	return false; // hardcoded for now, for compile testing
+}
+
  static inline int head_compound_pincount(struct page *head)
  {
  	return atomic_read(compound_pincount_ptr(head));
@@ -950,6 +960,13 @@ static inline int compound_pincount(struct page *page)
  	return head_compound_pincount(page);
  }

+static inline int shadow_page_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!shadow_page_pincount_available(page), page);
+
+	return atomic_read(shadow_page_pincount_ptr(page));
+}
+
  static inline void set_compound_order(struct page *page, unsigned int order)
  {
  	page[1].compound_order = order;
@@ -1326,6 +1343,9 @@ static inline bool page_maybe_dma_pinned(struct page *page)
  	if (hpage_pincount_available(page))
  		return compound_pincount(page) > 0;

+	if (shadow_page_pincount_available(page))
+		return shadow_page_pincount(page) > 0;
+
  	/*
  	 * page_ref_count() is signed. If that refcount overflows, then
  	 * page_ref_count() returns a negative value, and callers will avoid

c) The "was it ever pinned" sticky bit is not a workable concept, at the
struct page level. A counter is required, in order to allow pages to
live out their normal lives to their fullest potential. The only time we
even temporarily got away with this kind of stickiness was at a higher
level, and only per-process, not per-physical-page. Processes come and
go, but the struct pages are more or less forever, so once you mark one
sticky like this, it's out of play.

thanks,
-- 
John Hubbard
NVIDIA