linux-kernel - Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <X/prosulFrEoNnoF@redhat.com>
Date:   Sat, 9 Jan 2021 21:51:14 -0500
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Hugh Dickins <hughd@...gle.com>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Matthew Wilcox <willy@...radead.org>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Kees Cook <keescook@...omium.org>,
        John Hubbard <jhubbard@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>,
        Jason Gunthorpe <jgg@...pe.ca>, Jan Kara <jack@...e.cz>,
        Kirill Tkhai <ktkhai@...tuozzo.com>,
        Nadav Amit <nadav.amit@...il.com>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

Hello Linus,

On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!is_cow_mapping(vma->vm_flags))
> +		return false;
> +	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> +		return false;
> +	page = vm_normal_page(vma, addr, pte);
> +	if (!page)
> +		return false;
> +	if (page_mapcount(page) != 1)
> +		return false;
> +	return page_maybe_dma_pinned(page);
> +}

I just don't see the simplification coming from
09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking
page_mapcount above as an optimization, to me it looks much simpler to
check it in a single place, in do_wp_page, that in addition of
optimizing away the superfluous copy, would optimize away the above
complexity as well.

And I won't comment if it's actually safe to skip random pages or
not. All I know is for mprotect and uffd-wp, definitely the above
approach wouldn't work.

In addition I dislike has_pinned and FOLL_PIN.

has_pinned        450 include/linux/mm_types.h 		 * for instance during page table copying for fork().
has_pinned       1021 kernel/fork.c    	atomic64_set(&mm->pinned_vm, 0);
has_pinned       1239 mm/gup.c         		atomic_set(&mm->has_pinned, 1);
has_pinned       2579 mm/gup.c         		atomic_set(&current->mm->has_pinned, 1);
has_pinned       1099 mm/huge_memory.c 		     atomic_read(&src_mm->has_pinned) &&
has_pinned       1213 mm/huge_memory.c 		     atomic_read(&src_mm->has_pinned) &&
has_pinned        819 mm/memory.c      	if (likely(!atomic_read(&src_mm->has_pinned)))

Are we spending 32bit in mm_struct atomic_t just to call atomic_set(1)
on it? Why isn't it a MMF_HAS_PINNED that already can be set
atomically under mmap_read_lock too? There's bit left free there, we
didn't run out yet to justify wasting another 31 bits. I hope I'm
overlooking something.

The existence of FOLL_LONGTERM is good and makes a difference at times
for writeback if it's on a MAP_SHARED, or it makes difference during
GUP to do a page_migrate before taking the pin, but for the whole rest
of the VM it's irrelevant if it's long or short term, so I'm also
concerned from what Jason mentioned about long term pins being treated
differently within the VM. That to me looks fundamentally as flawed as
introducing inaccuracies in do_wp_page, that even ignoring the
performance implications caused by the inaccuracy, simply prevent to
do some useful things.

I obviously agree all common workloads with no GUP pins and no
clear_refs and no uffd, are way more important to be optimal, but I
haven't seen a single benchmark showing the benefit of not taking the
PG_lock before a page copy or any other runtime benefit coming from
page_count in do_wp_page.

To the contrary now I see additional branches in fork and in various
other paths.

The only thing again where I see page_count provides a tangible
benefit, is the vmsplice attack from child to parent, but that has not
been fully fixed and if page_count is added to fix it in all COW
faults, it'll introduce extra inefficiency to the the very common
important workloads, not only to the special GUP/clear_refs/uffd-wp
workloads as your patch above shows.

Thanks,
Andrea