lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <329e01f6-813a-9212-97c9-9440894dcf2c@redhat.com>
Date:   Tue, 1 Mar 2022 09:24:41 +0100
From:   David Hildenbrand <david@...hat.com>
To:     linux-kernel@...r.kernel.org
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        David Rientjes <rientjes@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        John Hubbard <jhubbard@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Yang Shi <shy828301@...il.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Nadav Amit <namit@...are.com>, Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        Donald Dutile <ddutile@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
        Liang Zhang <zhangliang5@...wei.com>,
        Pedro Gomes <pedrodemargomes@...il.com>,
        Oded Gabbay <oded.gabbay@...il.com>, linux-mm@...ck.org,
        Khalid Aziz <khalid.aziz@...cle.com>
Subject: Re: [PATCH RFC 00/13] mm: COW fixes part 2: reliable GUP pins of
 anonymous pages

On 24.02.22 13:26, David Hildenbrand wrote:
> This series is the result of the discussion on the previous approach [2].
> More information on the general COW issues can be found there. It is based
> on [1], which resides in -mm and -next:
> 	[PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for
> 	THP and swap
> 
> I keep the latest state, including some hacky selftest on:
> 	https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2
> 
> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
> on an anonymous page and COW logic fails to detect exclusivity of the page
> to then replacing the anonymous page by a copy in the page table: The
> GUP pin lost synchronicity with the pages mapped into the page tables.
> 
> This issue, including other related COW issues, has been summarized in [3]
> under 3):
> "
>   3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
> 
>   page_maybe_dma_pinned() is used to check if a page may be pinned for
>   DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
>   tolerable, false negatives are problematic: pages that are pinned for
>   DMA must not be added to the swapcache. If it happens, the (now pinned)
>   page could be faulted back from the swapcache into page tables
>   read-only. Future write-access would detect the pinning and COW the
>   page, losing synchronicity. For the interested reader, this is nicely
>   documented in feb889fb40fa ("mm: don't put pinned pages into the swap
>   cache").
> 
>   Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
>   cases and can result in a violation of the documented semantics:
>   giving false negatives because of the race.
> 
>   There are cases where we call it without properly taking a per-process
>   sequence lock, turning the usage of page_maybe_dma_pinned() racy. While
>   one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
>   handle, there is especially one rmap case (shrink_page_list) that's hard
>   to fix: in the rmap world, we're not limited to a single process.
> 
>   The shrink_page_list() issue is really subtle. If we race with
>   someone pinning a page, we can trigger the same issue as in the FOLL_GET
>   case. See the detail section at the end of this mail on a discussion how
>   bad this can bite us with VFIO or other FOLL_PIN user.
> 
>   It's harder to reproduce, but I managed to modify the O_DIRECT
>   reproducer to use io_uring fixed buffers [15] instead, which ends up
>   using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
>   similarly trigger a loss of synchronicity and consequently a memory
>   corruption.
> 
>   Again, the root issue is that a write-fault on a page that has
>   additional references results in a COW and thereby a loss of
>   synchronicity and consequently a memory corruption if two parties
>   believe they are referencing the same page.
> "
> 
> This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable,
> especially also taking care of concurrent pinning via GUP-fast,
> for example, also fully fixing an issue reported regarding NUMA
> balancing [4] recently. While doing that, it further reduces "unnecessary
> COWs", especially when we don't fork()/KSM and don't swapout, and fixes the
> COW security for hugetlb for FOLL_PIN.
> 
> In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
> anonymous page is exclusive. Exclusive anonymous pages that are mapped
> R/O can directly be mapped R/W by the COW logic in the write fault handler.
> Exclusive anonymous pages that want to be shared (fork(), KSM) first have
> to mark a mapped anonymous page shared -- which will fail if there are
> GUP pins on the page. GUP is only allowed to take a pin on anonymous pages
> that is exclusive. The PT lock is the primary mechanism to synchronize
> modifications of PG_anon_exclusive. GUP-fast is synchronized either via the
> src_mm->write_protect_seq or via clear/invalidate+flush of the relevant
> page table entry.
> 
> Special care has to be taken about swap, migration, and THPs (whereby a
> PMD-mapping can be converted to a PTE mapping and we have to track
> information for subpages). Besides these, we let the rmap code handle most
> magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE
> logic as part of our previous approach [2], however, it's now 100% mapcount
> free and I further simplified it a bit.
> 
>   #1 is a fix
>   #3-#7 are mostly rmap preparations for PG_anon_exclusive handling
>   #8 introduces PG_anon_exclusive
>   #9 uses PG_anon_exclusive and make R/W pins of anonymous pages
>    reliable
>   #10 is a preparation for reliable R/O pins
>   #11 and #12 is reused/modified GUP-triggered unsharing for R/O GUP pins
>    make R/O pins of anonymous pages reliable
>   #13 adds sanity check when (un)pinning anonymous pages
> 
> I'm not proud about patch #8, suggestions welcome. Patch #9 contains
> excessive explanations and the main logic for R/W pins. #11 and #12
> resemble what we proposed in the previous approach [2]. I consider the
> general approach of #13 very nice and helpful, and I remember Linus even
> envisioning something like that for finding BUGs, although we might want to
> implement the sanity checks eventually differently
> 
> It passes my growing set of tests for "wrong COW" and "missed COW",
> including the ones in [30 -- I'd really appreciate some experienced eyes
> to take a close look at corner cases. Only tested on x86_64, testing with
> CONT-mapped hugetlb pages on arm64 might be interesting.
> 
> Once we converted relevant users of FOLL_GET (e.g., O_DIRECT) to FOLL_PIN,
> the issue described in [3] under 2) will be fixed as well. Further, once
> that's in place we can streamline our COW logic for hugetlb to rely on
> page_count() as well and fix any possible COW security issues.

Hi,

I did excessive tests on aarch64 with CONT hugetlb pages yesterday and
didn't find any surprises, at least not in these changes. [1]

While thinking on how to get O_DIRECT fixed immediately, I realized the
following:
(1) This series turns FOLL_PIN on anonymous pages completely reliable,
    for any case of GUP+fork (GUP before fork, GUP during fork, GUP
    after fork in child/parent), which is pretty nice IMHO.
(2) For O_DIRECT and friends (FOLL_GET) we primarily care about fixing
    short-term FOLL_WRITE *without* fork(), which are the memory
    corruptions we're experiencing. Long-term FOLL_GET (meaning, even
    staying reliable after fork()) already has a bad smell to it.

Especially, (2) never worked reliably with fork() involved. I came to
the conclusion that this series pretty much fixes (2) already *except*
the swapout case. I might be wrong, but I think it should be possible to
handle that as well, meaning that: a FOLL_GET|FOLL_WRITE on an anonymous
page will be reliable as long as we don't fork().

I'm planning on sending a part3 to cover that, so we don't have to wait
for the FOLL_PIN conversion to get our O_DIRECT reproducers fixed. Stay
tuned.

If there isn't any more feedback, I'll be sending out a v1 soonish.


[1]
https://lkml.kernel.org/r/811c5c8e-b3a2-85d2-049c-717f17c3a03a@redhat.com


-- 
Thanks,

David / dhildenb

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ