linux-kernel - Re: [PATCH 0/2] page_count can't be used to decide when wp_page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210111143041.GI504133@ziepe.ca>
Date:   Mon, 11 Jan 2021 10:30:41 -0400
From:   Jason Gunthorpe <jgg@...pe.ca>
To:     Andrea Arcangeli <aarcange@...hat.com>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Kees Cook <keescook@...omium.org>,
        John Hubbard <jhubbard@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>, Jan Kara <jack@...e.cz>,
        Kirill Tkhai <ktkhai@...tuozzo.com>
Subject: Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

On Fri, Jan 08, 2021 at 09:50:08PM -0500, Andrea Arcangeli wrote:

> For all those that aren't using mmu notifier and that rely solely on
> page pins, they still require privilege, except they do through /dev/
> permissions.

It is normal that the dev nodes are a+rw so it doesn't really require
privilege in any real sense.

> Actually the mmu notifier doesn't strictly require pins, it only
> requires GUP. All users tend to use FOLL_GET just as a safety
> precaution (I already tried to optimize away the two atomics per GUP,
> but we were naked by the KVM maintainer that didn't want to take the
> risk, I would have, but it's a fair point indeed, obviously it's safer
> with the pin plus the mmu notifier, two is safer than one).

I'm not sure what holding the pin will do to reduce risk?

If you get into a situation where you are stuffing a page into the
SMMU that is not in the CPU's MMU then everything is lost. Holding a
pin while carrying a page from the CPU page table to the SMMU just
ensures that page isn't freed until it is installed, but once
installed you are back to being broken.
 
> I'm not sure how any copy-user could obviate a secondary MMU mapping,
> mappings and copies are mutually exclusive. Any copy would be breaking
> memory coherency in this environment.

Because most places need to copy from user to stable kernel memory
before processing data under user control. You can't just cast a user
controlled pointer to a kstruct and use it - that is very likely a
security bug.

Still, the general version is something like kmap:

  map = user_map_setup(user_ptr, length)
  kptr = user_map_enter(map)
  [use kptr]
  user_map_leave(map, kptr)

And inside it could use mmu notifiers, or gup, or whatever.

user_map_setup() would register the notifier and user_map_enter()
would validate the cache'd page pointer and block cached invalidation
until user_map_leave().

> The primary concern with the mmu notifier in io_uring is the
> take_all_locks latency.

Just enabling mmu_notifier takes a performance hit on the entire
process too, it is not such a simple decision.. We'd need benchmarks
against a database or scientific application to see how negative the
notifier actually becomes.

> The problem with the mmu notifier as an universal solution, for
> example is that it can't wait for I/O completion of O_DIRECT since it
> has no clue where the put_page is to wait for it, otherwise we could
> avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be
> completed before paging or anything can unmap the page under I/O from
> the pagetable.

GPU is already doing something like this, waiting in a notifier
invalidate callback for DMA to finish before allowing invalidate to
complete.

It is horrendously complicated and I'm not sure blocking invalidate
for a long time is actually much better for the MM..

> I see the incompatibility you describe as problem we have today, in
> the present, and that will fade with time.
> 
> Reminds me when we had >4G of RAM and 32bit devices doing DMA. How
> many 32bit devices are there now?

I'm not so sure anymore. A few years ago OpenCAPI and PCI PRI seemed
like good things, but now with experience they carry pretty bad
performance hits to use them. Lots of places are skipping them.

CXL offers another chance at this, so we'll see again in another 5
years or so if it works out. It is not any easy problem to solve from
a HW perspective.

> We're not talking here about any random PCI device, we're talking here
> about very special and very advanced devices that need to have "long
> term" GUP pins in order to operate, not the usual nvme/gigabit device
> where GUP pins are never long term.

Beyond RDMA, netdev's XDP uses FOLL_LONGTERM, so do various video
devices, lots of things related to virtualization like vfio, vdpa and
vhost. I think this is a bit defeatist to say it doesn't matter. If
anything as time goes on it seems to be growing, not shrinking
currently.

> The point is that if you do echo ... >/proc/self/clear_refs on your
> pid, that has any FOLL_LONGTERM on its mm, it'll just cause your
> device driver to go out of sync with the mm. It'll see the old pages,
> before the spurious COWs. The CPU will use new pages (the spurious
> COWs).

But if you do that then clear-refs isn't going to work they way it
thought either - this first needs some explanation for how clear_refs
is supposed to work when DMA WRITE is active on the page.

I'd certainly say causing a loss of synchrony is not acceptable, so if
we keep Linus's version of COW then clear_refs has to not write
protect pages under DMA.

> > secondary-mmu drivers using mmu notifier should not trigger this logic
> > and should not restrict write protect.
> 
> That's a great point. I didn't think the mmu notifier will invalidate
> the secondary MMU and ultimately issue a GUP after the wp_copy_page to
> keep it in sync.

It had better, or mmu notifiers are broken, right?

> The funny thing that doesn't make sense is that wp_copy_page will only
> be invoked because the PIN was left by KVM on the page for that extra
> safety I was talking about earlier.

Yes, with the COW change if kvm cares about this inefficiency it
should not have the unnecessary pin.

> You clearly contemplate the existance of a read mode, long term. That
> is also completely compatible with wrprotection. 

We talked about a read mode, but we didn't flesh it out. It is not
unconditionally compatible with wrprotect - most likely you still
can't write protect a page under READ DMA because when you eventually
take the COW there will be ambiguous situations that will break the
synchrony.

Jason