Date:   Sat, 19 Mar 2022 12:17:16 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Jason Gunthorpe <jgg@...dia.com>
Cc:     linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        David Rientjes <rientjes@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        John Hubbard <jhubbard@...dia.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Yang Shi <shy828301@...il.com>,
        "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Nadav Amit <namit@...are.com>, Rik van Riel <riel@...riel.com>,
        Roman Gushchin <guro@...com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Peter Xu <peterx@...hat.com>,
        Donald Dutile <ddutile@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Oleg Nesterov <oleg@...hat.com>, Jan Kara <jack@...e.cz>,
        Liang Zhang <zhangliang5@...wei.com>,
        Pedro Gomes <pedrodemargomes@...il.com>,
        Oded Gabbay <oded.gabbay@...il.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Paul Mackerras <paulus@...ba.org>,
        Heiko Carstens <hca@...ux.ibm.com>,
        Vasily Gorbik <gor@...ux.ibm.com>,
        Alexander Gordeev <agordeev@...ux.ibm.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, linux-mm@...ck.org,
        x86@...nel.org, linux-arm-kernel@...ts.infradead.org,
        linuxppc-dev@...ts.ozlabs.org, linux-s390@...r.kernel.org
Subject: Re: [PATCH v1 0/7] mm: COW fixes part 3: reliable GUP R/W FOLL_GET of
 anonymous pages

On 19.03.22 00:48, Jason Gunthorpe wrote:
> On Tue, Mar 15, 2022 at 03:18:30PM +0100, David Hildenbrand wrote:
>> This is just the natural follow-up of part 2, that will also further
>> reduce "wrong COW" on the swapin path, for example, when we cannot remove
>> a page from the swapcache due to concurrent writeback, or if we have two
>> threads faulting on the same swapped-out page. Fixing O_DIRECT is just a
>> nice side-product :)

Hi Jason,

thanks for the review!

> 
> I know I would benefit alot from a description of the swap specific
> issue a bit more. Most of this message talks about clear_refs which I
> do understand a bit better.

Patch #1 contains some additional information. In general, it's the same
issue as with any other mechanism that could get the page mapped R/O
while there is a FOLL_GET | FOLL_WRITE reference to it --  for example,
DMA to that page as happens with our O_DIRECT reproducer.

Part 2 essentially fixed the other cases (i.e., clear_refs), but the
remaining swapout+refault from swapcache case is handled in this series.

> 
> Is this talking about what happens after a page gets swapped back in?
> eg the exclusive bit is missing when the page is recreated?

Right, try_to_unmap() was the last remaining case where we'd have lost
the exclusivity information -- it wasn't required for reliable GUP pins
in part 2.

Here is what happens without PG_anon_exclusive:

1. The application uses parts of an anonymous base page for direct I/O,
let's assume the first 512 bytes of the page.

fd = open(filename, O_DIRECT| ...);
pread(fd, page, 512, 0);

O_DIRECT will take a FOLL_GET|FOLL_WRITE reference on the page.

2. Reclaim kicks in and wants to swap out the page -- mm/vmscan.c

shrink_page_list() first adds the page to the swapcache and then unmaps
it via try_to_unmap().

After the page was successfully unmapped, pageout() will start
triggering writeback but will realize that there are additional
references on the page (via is_page_cache_freeable()) and fail.

3. The application uses unrelated parts of the page for other purposes
while the DMA has not completed yet, e.g., by doing a simple

page[4095]++;

The read access will fault in the page readable from the swap cache in
do_swap_page(). The write access will trigger our COW fault handler. As
we have an additional reference on the page, we will create a copy and
map it into our page table. At this point, the page table and the GUP
reference are out of sync.

4. O_DIRECT completes

The DMA read targets the old page, which is no longer referenced in the
page tables. For the application, it looks like the pread() never
happened, as the data landed in a page it can no longer see.
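
For reference, the userspace side of steps 1-3 could be sketched roughly
as below. This is only an illustration, not the actual reproducer: all
helper names are made up, and MADV_PAGEOUT is an assumption for nudging
reclaim (the steps above only say that reclaim kicks in). The helpers are
meant to be called from different threads; hitting the race requires
steps 2 and 3 to happen while the DMA from step 1 is still in flight.

```c
#define _GNU_SOURCE		/* glibc exposes O_DIRECT only with this */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map one private anonymous page. mmap() memory is
 * page-aligned, which satisfies typical O_DIRECT buffer alignment rules. */
static unsigned char *alloc_anon_page(size_t page_size)
{
	void *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : (unsigned char *)p;
}

/* Step 1: O_DIRECT read of the first 512 bytes of the file into the start
 * of the page. Returns the number of bytes read, or -1 on error. Devices
 * with larger logical blocks may reject a 512-byte direct read. */
static ssize_t direct_read_512(const char *filename, unsigned char *page)
{
#ifdef O_DIRECT
	int fd = open(filename, O_RDONLY | O_DIRECT);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, page, 512, 0);
	close(fd);
	return n;
#else
	(void)filename;
	(void)page;
	return -1;	/* O_DIRECT not available in this build */
#endif
}

/* Step 2 (assumption): hint the kernel to reclaim the page. Needs swap to
 * have any effect; returns -1 where MADV_PAGEOUT is unavailable. */
static int request_pageout(unsigned char *page, size_t page_size)
{
#ifdef MADV_PAGEOUT
	return madvise(page, page_size, MADV_PAGEOUT);
#else
	(void)page;
	(void)page_size;
	return -1;
#endif
}

/* Step 3: modify a byte unrelated to the 512 bytes targeted by the DMA.
 * The read-modify-write first faults the page in readable, then triggers
 * a write fault. */
static void touch_unrelated_byte(unsigned char *page, size_t page_size)
{
	page[page_size - 1]++;
}
```

After the read completes, checking whether the first 512 bytes of the
page match the file content tells whether the DMA data was lost.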


With PG_anon_exclusive from series part 2, we don't remember exclusivity
information in try_to_unmap() yet. do_swap_page() cannot restore it as
it has to assume the page is possibly shared.

With this series, we remember exclusivity information in try_to_unmap()
in the SWP PTE. do_swap_page() can restore it. Consequently, our COW
fault handler won't create a wrong copy and we won't go out of sync
between GUP and the page mapped into the page table.
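
To make the difference concrete, here is a toy userspace model of the
four steps. It is a sketch under simplifying assumptions, not kernel
code: struct toy_page, struct toy_pte, and the toy_* functions are made
up for illustration, the refcount only tracks the mapping, swapcache, and
GUP references, and the write-fault rule is reduced to "reuse if marked
exclusive or if we hold the only reference, copy otherwise".

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Toy stand-in for struct page; NOT the kernel's layout. */
struct toy_page {
	int refcount;		/* mapping + swapcache + GUP references */
	bool anon_exclusive;	/* models PG_anon_exclusive */
	unsigned char data[TOY_PAGE_SIZE];
};

/* Toy stand-in for a PTE that is either present or a swap entry. */
struct toy_pte {
	struct toy_page *page;	/* NULL while unmapped (swap PTE) */
	bool writable;
	bool swp_exclusive;	/* exclusivity remembered in the swap PTE */
};

/* Write fault on a page that is currently mapped read-only. */
static void toy_wp_fault(struct toy_pte *pte, struct toy_page *copy)
{
	struct toy_page *old = pte->page;

	if (old->anon_exclusive || old->refcount == 1) {
		pte->writable = true;		/* reuse the page in place */
		return;
	}
	/* Possibly shared: copy and map the copy instead. */
	memcpy(copy->data, old->data, TOY_PAGE_SIZE);
	copy->refcount = 1;
	copy->anon_exclusive = true;
	old->refcount--;			/* mapping reference moves on */
	pte->page = copy;
	pte->writable = true;
}

/* Runs steps 1-4. Returns true if the DMA data ends up visible through
 * the page table. remember_exclusive selects whether unmapping records
 * exclusivity in the swap PTE. */
static bool toy_dma_visible(bool remember_exclusive)
{
	struct toy_page orig = { .refcount = 1, .anon_exclusive = true };
	struct toy_page copy = { 0 };
	struct toy_pte pte = { .page = &orig, .writable = true };
	struct toy_page *dma_target;

	/* 1. O_DIRECT takes a FOLL_GET|FOLL_WRITE reference. */
	orig.refcount++;
	dma_target = &orig;

	/* 2. Reclaim: add to the swapcache, then unmap. Writeback fails
	 * due to the extra reference, so the page stays in the swapcache. */
	orig.refcount++;
	pte.swp_exclusive = remember_exclusive && orig.anon_exclusive;
	orig.anon_exclusive = false;
	pte.page = NULL;
	pte.writable = false;
	orig.refcount--;

	/* 3. page[4095]++: read fault maps the swapcache page read-only,
	 * then the write triggers the write-fault handler. */
	pte.page = &orig;
	orig.refcount++;
	orig.anon_exclusive = pte.swp_exclusive;
	toy_wp_fault(&pte, &copy);
	pte.page->data[TOY_PAGE_SIZE - 1]++;

	/* 4. The DMA completes into the page GUP handed out. */
	memset(dma_target->data, 0xAB, 512);
	dma_target->refcount--;

	return pte.page->data[0] == 0xAB;
}
```

In this model, toy_dma_visible(false) returns false (the write fault
copies, and the DMA data lands in the page that is no longer mapped),
while toy_dma_visible(true) returns true (the page is reused in place).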


Hope that helps!

-- 
Thanks,

David / dhildenb
