Message-ID: <20210115143058.GG27380@quack2.suse.cz>
Date:   Fri, 15 Jan 2021 15:30:58 +0100
From:   Jan Kara <jack@...e.cz>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Matthew Wilcox <willy@...radead.org>,
        Jason Gunthorpe <jgg@...pe.ca>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Yu Zhao <yuzhao@...gle.com>, Andy Lutomirski <luto@...nel.org>,
        Peter Xu <peterx@...hat.com>,
        Pavel Emelyanov <xemul@...nvz.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...ux.vnet.ibm.com>,
        Minchan Kim <minchan@...nel.org>,
        Will Deacon <will@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Hugh Dickins <hughd@...gle.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Kees Cook <keescook@...omium.org>,
        John Hubbard <jhubbard@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>, Jan Kara <jack@...e.cz>,
        Kirill Tkhai <ktkhai@...tuozzo.com>
Subject: Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

On Sat 09-01-21 11:46:46, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 11:33 AM Matthew Wilcox <willy@...radead.org> wrote:
> >
> > On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote:
> > > Side note, and not really related to UFFD, but the mmap_sem in
> > > general: I was at one point actually hoping that we could make the
> > > mmap_sem a spinlock, or at least make the rule be that we never do any
> > > IO under it. At which point a write lock hopefully really shouldn't be
> > > such a huge deal.
> >
> > There's a (small) group of us working towards that.  It has some
> > prerequisites, but where we're hoping to go currently:
> >
> >  - Replace the vma rbtree with a b-tree protected with a spinlock
> >  - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset
> >  - If we need to do I/O, take a refcount on the VMA
> >
> > After that, we can gradually move things out from mmap_sem protection
> > to just the vma tree spinlock, or whatever makes sense for them.  In a
> > very real way the mmap_sem is the MM layer's BKL.
> 
> Well, we could do the "no IO" part first, and keep the semaphore part.
> 
> Some people actually prefer a semaphore to a spinlock, because it
> doesn't end up causing preemption issues.
> 
> As long as you don't do IO (or memory allocations) under a semaphore
> (ok, in this case it's a rwsem, same difference), it might even be
> preferable to keep it as a semaphore rather than as a spinlock.
> 
> So it doesn't necessarily have to go all the way - we _could_ just try
> something like "when taking the mmap_sem, set a thread flag" and then
> have a "warn if doing allocations or IO under that flag".
> 
> And since this is about performance, not some hard requirement, it
> might not even matter if we catch all cases.  If we fix it so that any
> regular load on most normal filesystems never see the warning, we'd
> already be golden.

Honestly, I'd *love* it if a filesystem could be guaranteed that ->fault and
->mkwrite callbacks do not happen under mmap_sem (or if at least the fs were
free to drop mmap_sem when it finds the page is not already cached /
prepared for writing). For filesystems the locking of the page fault path is
really painful because the lock ordering wrt mmap_sem is exactly the opposite
of the read / write path: the read & write paths must be designed so that
mmap_sem can be taken inside them to copy user data, while the fault path may
run entirely under mmap_sem. As a result this has been a long-term source of
deadlocks, stale data exposure issues, and filesystem corruption issues due
to insufficient locking in multiple filesystems.
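
To make the inversion concrete, here is a minimal user-space sketch of a
lock-order tracker, loosely in the spirit of what lockdep checks. The names
are hypothetical: FS_LOCK stands in for whatever lock a given filesystem
holds across its read / write path, and the two model_*_path() helpers only
record acquisition order; they take no real locks.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_class { FS_LOCK, MMAP_SEM, NR_LOCK_CLASSES };

/* seen_before[a][b] is true once some path acquired b while holding a. */
static bool seen_before[NR_LOCK_CLASSES][NR_LOCK_CLASSES];

/*
 * Record that 'inner' was acquired while 'outer' was held.
 * Returns true if the opposite order has also been recorded, i.e. the
 * two paths together form an ABBA deadlock candidate.
 */
bool record_lock_order(enum lock_class outer, enum lock_class inner)
{
	assert(outer != inner);
	seen_before[outer][inner] = true;
	return seen_before[inner][outer];
}

/* read / write: fs lock first, then copying user data may take mmap_sem. */
bool model_write_path(void)
{
	return record_lock_order(FS_LOCK, MMAP_SEM);
}

/* Page fault: mmap_sem first, then ->fault / ->mkwrite wants the fs lock. */
bool model_fault_path(void)
{
	return record_lock_order(MMAP_SEM, FS_LOCK);
}
```

Either path alone is consistent; the tracker only reports a problem once both
orders have been seen, which is the situation a filesystem has to lock its
way around today.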

But when I was looking at what it would take to achieve this several years
ago, fixing all GUP users to deal with mmap_sem being dropped during a
fault was a gigantic task because there were users of GUP relying on
mmap_sem being held for large code sections around the GUP call...

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR