linux-kernel - Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45806a5a-65c2-67ce-fc92-dc8c2144d766@nvidia.com>
Date:   Sun, 10 Jan 2021 23:26:57 -0800
From:   John Hubbard <jhubbard@...dia.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        "Minchan Kim" <minchan@...nel.org>, Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Hugh Dickins <hughd@...gle.com>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Matthew Wilcox <willy@...radead.org>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Kees Cook <keescook@...omium.org>,
        Leon Romanovsky <leonro@...dia.com>,
        Jason Gunthorpe <jgg@...pe.ca>, Jan Kara <jack@...e.cz>,
        Kirill Tkhai <ktkhai@...tuozzo.com>,
        Nadav Amit <nadav.amit@...il.com>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

On 1/10/21 11:30 AM, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds
> <torvalds@...ux-foundation.org> wrote:
>
> Just having a bit in the page flags for "I already made this
> exclusive, and fork() will keep it so" is I feel the best option. In a
> way, "page is writable" right now _is_ that bit. By definition, if you
> have a writable page in an anonymous mapping, that is an exclusive
> user.
> 
> But because "writable" has these interactions with other operations,
> it would be better if it was a harder bit than that "maybe_pinned()",
> though. It would be lovely if a regular non-pinning write-GUP just
> always set it, for example.
> 
> "maybe_pinned()" is good enough for the fork() case, which is the one
> that matters for long-term pinning. But it's admittedly not perfect.
> 

There is at least one way to improve this part of it--maybe.

IMHO, a lot of the bits in page _refcount are still being wasted (even
after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
there are many callers of gup/pup per page. If anyone points out that
that is wrong, then the rest of this falls apart, but...if we were to
make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever
simultaneously allowed on a given page", then several things become
possible:

1) "maybe dma pinned" could become "very likely dma-pinned", by allowing
perhaps 23 bits for normal page refcounts (leaving just 8 bits for dma
pins), thus all but ensuring no aliasing between normal refcounts and
dma pins. The name change below to "likely" is not actually necessary,
I'm just using it to make the point clear:


diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..28f7f64e888f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,7 @@ static inline void put_page(struct page *page)
   * get_user_pages and page_mkclean and other calls that race to set up page
   * table entries.
   */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 23)

  void unpin_user_page(struct page *page);
  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
@@ -1274,7 +1274,7 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
   * @Return:    True, if it is likely that the page has been "dma-pinned".
   *             False, if the page is definitely not dma-pinned.
   */
-static inline bool page_maybe_dma_pinned(struct page *page)
+static inline bool page_likely_dma_pinned(struct page *page)
  {
         if (hpage_pincount_available(page))
                 return compound_pincount(page) > 0;


2) Doing (1) allows, arguably, failing mprotect calls if
page_likely_dma_pinned() returns true, because the level of confidence
is much higher now.

3) An additional counter, for normal gup (FOLL_GET) is possible: divide
up the top 8 bits into two separate counters of just 4 bits each. Only
allow either FOLL_GET or FOLL_PIN (and btw, I'm now mentally equating
FOLL_PIN with FOLL_LONGTERM), not both, on a given page. Like this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..ce5af27e3057 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1241,7 +1241,8 @@ static inline void put_page(struct page *page)
   * get_user_pages and page_mkclean and other calls that race to set up page
   * table entries.
   */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define DMA_PIN_COUNTING_BIAS (1U << 23)
+#define GUP_PIN_COUNTING_BIAS (1U << 27)

  void unpin_user_page(struct page *page);
  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

So:

FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount.
These are long term pins for dma.

FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount.
These are not long term pins.

Doing (3) would require yet another release call variant:
unpin_user_pages() would need to take a flag to say which refcount to
release: FOLL_GET or FOLL_PIN. However, I think that's relatively easy
(compared to the original effort of teasing apart generic put_page()
calls, into unpin_user_pages() calls). We already have all the
unpin_user_pages() calls in place, and so it's "merely" a matter of
adding a flag to 74 call sites.

The real question is whether we can get away with supporting only a very
low count of FOLL_PIN and FOLL_GET pages. Can we?


thanks,
-- 
John Hubbard
NVIDIA