linux-kernel - Re: [PATCH v4 2/3] mm/khugepaged: skip shmem with userfaultfd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Y/VB4pSCoJMQQn5o@x1n>
Date:   Tue, 21 Feb 2023 17:12:50 -0500
From:   Peter Xu <peterx@...hat.com>
To:     David Stevens <stevensd@...omium.org>
Cc:     linux-mm@...ck.org, Matthew Wilcox <willy@...radead.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Yang Shi <shy828301@...il.com>,
        David Hildenbrand <david@...hat.com>,
        Hugh Dickins <hughd@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 2/3] mm/khugepaged: skip shmem with userfaultfd

On Fri, Feb 17, 2023 at 05:54:38PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@...omium.org>
> 
> Make sure that collapse_file respects any userfaultfds registered with
> MODE_MISSING. If userspace has any such userfaultfds registered, then
> for any page which it knows to be missing, it may expect a
> UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
> collapsing a shmem range would result in replacing an empty page with a
> THP, to avoid breaking userfaultfd.
> 
> Synchronization when checking for userfaultfds in collapse_file is
> tricky because the mmap locks can't be used to prevent races with the
> registration of new userfaultfds. Instead, we provide synchronization by
> ensuring that userspace cannot observe the fact that pages are missing
> before we check for userfaultfds. Although this allows registration of a
> userfaultfd to race with collapse_file, it ensures that userspace cannot
> observe any pages transition from missing to present after such a race
> occurs. This makes such a race indistinguishable to the collapse
> occurring immediately before the userfaultfd registration.
> 
> The first step to provide this synchronization is to stop filling gaps
> during the loop iterating over the target range, since the page cache
> lock can be dropped during that loop. The second step is to fill the
> gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> time, to avoid races with accesses to the page cache that only take the
> RCU read lock.
> 
> The fact that we don't fill holes during the initial iteration means
> that collapse_file now has to handle faults occurring during the
> collapse. This is done by re-validating the number of missing pages
> after acquiring the page cache lock for the final time.
> 
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> return EBUSY if there are any missing pages (instead of succeeding on
> shmem and returning EINVAL on anonymous memory). There is also now a
> window during MADV_COLLAPSE where a fault on a missing page will cause
> the syscall to fail with EAGAIN.
> 
> The fact that intermediate page cache state can no longer be observed
> before the rollback of a failed collapse is also technically a
> userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> is exceedingly unlikely that anything relies on being able to observe
> that transient state.
> 
> Signed-off-by: David Stevens <stevensd@...omium.org>

It'll be great to have another eye looking, but... AFAICT this works for
us.  Thanks David, this is better than what I suggested.

Acked-by: Peter Xu <peterx@...hat.com>

-- 
Peter Xu