linux-kernel - Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <9c3ba92e-9e36-75a9-9572-a08694048c1d@redhat.com>
Date:   Fri, 17 Dec 2021 21:45:12 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        John Hubbard <jhubbard@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Yang Shi <shy828301@...il.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Nadav Amit <namit@...are.com>, Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        Donald Dutile <ddutile@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
        Linux-MM <linux-mm@...ck.org>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via
 FAULT_FLAG_UNSHARE (!hugetlb)

On 17.12.21 20:04, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand <david@...hat.com> wrote:
>>
>> + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
>> + * set) and then unmaps the target page, we have:
>> + *
>> + * * page has mapcount == 1 and refcount > 1
> 

Hi Linus,

> All these games with mapcount makes me think this is still broken.
> 
> mapcount has been a horribly broken thing in the past, and I'm not
> convinced it's not a broken thing now.

It all started when Jann detected the security issue in GUP – and this
patch set is fixing the problem exactly there, in GUP itself. Are you
aware of COW issues regarding the mapcount if we would remove GUP from
the equation? My point is that COW without GUP works just perfectly
fine, but I’ll be happy to learn about other cases I was ignoring so far.

Unfortunately, page_count() is even more unreliable, and the issues
we're just detecting (see the link in the cover letter: memory
corruptions inside user space -- e.g., lost DMA writes) are even worse
than what we had before -- IMHO of course.

> 
>> +       vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
>> +       if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
>> +           page_mapcount(vmf->page) > 1) {
> 
> What keeps the mapcount stable in here?

So, we're reading an atomic value here. It’s read via atomic_read for
regular pages, and the THP mapcount case has also been made atomic (as
lockless as page_count) in patch #5.

If a page is mapped exactly once, page_mapcount(page) == 1 and there is
nothing to do.

If the page is mapped more than once, page_mapcount(page) > 1 and we
would have to trigger unsharing. And it’s true that the value is
unstable in this case, but we really only care about page_mapcount(page)
> 1 vs. page_mapcount(page) == 1. In this respect, there is no
difference from the instability of the page_count and the mapcount – we
still only care if it’s >1 or == 1.

So the only case we could care about is concurrent additional mappings
that can increment the mapcount -- which can only happen due to
concurrent fork. So if we're reading page_mapcount(page) == 1 the only
way we can get page_mapcount(page) > 1 is due to fork(). But we're
holding the mmap_lock in read mode during faults and fork requires the
mmap_lock in write mode.

> 
> And I still believe that the whole notion that "COW should use
> mapcount" is pure and utter garbage.
> 
> If we are doing a COW, we need an *exclusive* access to the page. That
> is not mapcount, that is the page ref.

I thought about this a lot, because initially I had the same opinion.

But really, we don't care about any speculative references (pagecache,
migration, daemon, pagevec, ...) or any short-term "I just want to grab
this reference real quick such that the page can't get freed" references.

All we care about are GUP references, and we attack that problem at the
root by triggering unsharing exactly at the point where GUP comes into play.

So IMHO GUP is the problem and needs unsharing either:
* On write access to a shared anonymous page, which is just COW as we
know it.
* On read access to a shared anonymous page, which is what we’re
proposing in this patch set.

So as soon as GUP comes into play, even if only pinning R/O, we have to
trigger unsharing . Doing so enforces the invariant that it is
impossible to take a GUP pin on an anonymous page with a mapcount > 1.
In turn, the COW does not need to worry about the GUP after fork()
security issue anymore and it can focus on doing optimally the COW
faults as if GUP just wouldn’t exist.

-- 
Thanks,

David / dhildenb