[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210115183721.GG4605@ziepe.ca>
Date: Fri, 15 Jan 2021 14:37:21 -0400
From: Jason Gunthorpe <jgg@...pe.ca>
To: David Hildenbrand <david@...hat.com>
Cc: Andrea Arcangeli <aarcange@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Yu Zhao <yuzhao@...gle.com>,
Andy Lutomirski <luto@...nel.org>,
Peter Xu <peterx@...hat.com>,
Pavel Emelyanov <xemul@...nvz.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...ux.vnet.ibm.com>,
Minchan Kim <minchan@...nel.org>,
Will Deacon <will@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>,
"Kirill A. Shutemov" <kirill@...temov.name>,
Matthew Wilcox <willy@...radead.org>,
Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
Kees Cook <keescook@...omium.org>,
John Hubbard <jhubbard@...dia.com>,
Leon Romanovsky <leonro@...dia.com>, Jan Kara <jack@...e.cz>,
Kirill Tkhai <ktkhai@...tuozzo.com>,
Nadav Amit <nadav.amit@...il.com>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
On Fri, Jan 15, 2021 at 09:59:23AM +0100, David Hildenbrand wrote:
> AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g.,
> passthrough of a PCI device) can essentially be corrupted by "echo 4 >
> /proc/[pid]/clear_refs".
I've been told when doing migration with RDMA the VM's memory also
ends up pinned, and then it does the stuff of #4. So it deliberately
does clear_refs(4) on RDMA pinned memory and requires no COW. This is
now a real world uABI break, unfortunately.
> 7) There is no easy way to detect if a page really was pinned: we might
> have false positives. Further, there is no way to distinguish if it was
> pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking
> we most probably would need more counters, which we cannot fit into
> struct page. (AFAIU, for huge pages it's easier).
I think this is the real issue. We can only store so much information,
so we have to decide which things work and which things are broken. So
far someone hasn't presented a way to record everything at least..
> However, AFAIU, even being able to detect if (and how) a page was pinned
> would not completely help to solve the puzzle.
At least for COW reuuse, uf we assign labels to every page user, and
imagine we can track everything, I think we get this list:
- # of ptes referencing the page (mapcount?)
- # of page * pointer references that don't touch data
(ie the speculative page cache ref)
- # of DMA/CPU readers
- # of DMA/CPU writers
- # of long term data accesses
- # of other reader/writers (specifically process incoherent reader/writers,
not "DMA with the CPU" like vmsplice/iouring)
Maybe there are more? This is what I've understood so far from
this thread?
Today's kernel makes the COW reuse decision as:
# ptes == 1 &&
# refs == 0 &&
# DMA readers == 0 &&
# DMA writers == 0 &&
# of longterm == 0 &&
# other reader/writers == 0
(in essence this is what _refcount == 1 is saying, I think)
>From a GUP perspective I think the useful property is "a physical page
under GUP is not indirectly removed from the mm_struct that pinned
it". This is the idea that the process CPU page table and the ongoing
DMA remain synchronized. This is a generalized statement from the
clear_refs(4) and fork() regressions.
Therefore, COW should not copy a page just because it is under GUP, it
breaks the idea directly.
We've also said speculative #refs should not cause COW. Removing both
of those gets us to the COW reuse decision as:
# ptes == 1 &&
# other reader/writers == 0
And I think where Linus is coming from is '# ptes' (eg mapcount) alone
is not right because there are other relavent reader/writers too. (I'm
not sure what these are, has someone pointed at one?)
So, we have 64 bits for _refcount and _mapcount and we currently encode
things as:
- # ptes (_mapcount)
- # page pointers + (low bits of _refcount)
# DMA reader + writers +
# other reader/writers +
# ptes # We incr both _mapcount and_refcount?
- # long term data acesses (high bits of _refcount
If we move '# other reader/writers' to _mapcount (maybe with a shift),
does it help?
We also talked about GUP as meaning wrprotect == 0, but we could also
change that to the idea that GUP means COW will always re-use, eg
'#ptes == 1 && # other reader/writers == 0'.
This gives some definition what mprotect(PROT_READ) means to pages
under DMA (though I still think PROT_READ of pages under DMA write is
weird)
> 8) We have a vmsplice security issue that has to be fixed by touching
> the code in question. A forked child process can read memory content of
> its parent, which was modified by the parent after fork. AFAIU, the fix
> will further lock us in into the direction of the code we are heading.
No, vmsplice is just wrong. vmsplice has to do
FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE for read only access to pages if
userspace controls the duration of the pin.
There are other bad bugs, like permanently locking
DAX/CMA/ZONE_MIGRATE memory if the above pattern is not used.
There was some debate over alternatives, but for a backport security
fix it has to be above. AFAIK.
Jason
Powered by blists - more mailing lists